I recently authored a post over at the Exchange Team Blog about Exchange ESE and Windows Disk Timeout values.
Whilst writing that article and especially researching how ESE timeouts have evolved, I began thinking about the history of Exchange availability and how the product has evolved over my time working with it…
Back when I began in enterprise infrastructure (1997) the name of the game was server uptime, not necessarily service uptime. What was interesting back then was that our servers were monitored for availability, but we didn't have the technology to record service availability very well. We didn't really have many options to improve our service availability with Exchange 5.5 – I used to pack out my servers with resilient components such as redundant network cards, power supplies and RAID controllers to ensure that they would keep running for as long as possible, but if the server did go down our only option was to bring it back up as quickly as possible. My first concern during an outage was always the health of my 30GB EDB files – I knew that if one of these was damaged or the storage was unrecoverable then I was in for a minimum 10 hour restore from tape. I also knew that my manager would be pretty unimpressed when I delivered this news, so I would go to elaborate lengths to try and avoid it.
Towards the end of my time with Exchange 5.5 we were encouraged (by Microsoft) to look at clustering technology. I duly did this in my test lab (with real physical servers and a SCSI “Y” cable!) and I have to admit that I was pretty impressed initially. With clustering I would be able to tolerate an entire server failure and be back online in a few minutes. This was a huge step forward.
It was only when I started to write my business case to justify my new hardware requirements that I realised that over the previous two years all of my serious outages had come from storage issues (RAID controller firmware, cache failure, multiple HDD failure). I was hoping to use previous outage details to justify my request for new Exchange clusters, but the reality was that we had only had one real server failure (main board), and I was able to replace that fairly quickly with one from the test lab and have it back online in 2 hours. All of the serious service outages had come from storage-related failures where I had to revert to tape restoration. I realised right at this point that clustering based on a shared storage model probably wasn't going to give me what I needed.
A few years later I was going through the design phase for Exchange 2003. My primary goal for the new platform was to improve our service levels, which I was informed by our service delivery manager were fairly steady at 99% for messaging. I was still very aware at this point that to do this I needed to improve my restore times from tape. Some progress had been made with the switch to DLT drives, but my EDB files were growing at an incredible rate and it only took 18 months before my restore times were back at 10 hours per server again. I re-visited clustering during this timeframe but came to the same conclusion as before: it was just adding complexity and wasn't going to materially affect my service availability. Instead I chose some fancy SAN technology which allowed me to use snapshots to recover my EDB files in minutes rather than hours. This technology also allowed me to mirror my data off-site. Management were duly impressed with my solution, until they realised how much it was going to cost! Still, they eventually implemented the design and service levels did improve somewhat (although not as much as I had hoped).
What I had totally neglected during this design was that my new storage technology was a lot more complicated than my old RAID controllers were. I quickly discovered that it was very easy to break things on a monumental scale. One wrong command given to my clever new storage technology would be sufficient to stop the entire messaging service. This realisation very quickly led me to adopt the “if it isn't broke, don't fix it” approach to maintaining service. This brought its own issues though, and I was faced with an annual update task to bring the messaging servers back in line with our enterprise standards. This was always a disaster and would entail almost a day's worth of downtime while we tried to get the magical combination of OS hotfixes, drivers, firmware and SAN revision right. Then, once we had a shiny rack of twinkly green lights we would leave it alone for another year…
Exchange 2007 seemed to address all of my previous concerns. I was now working for Microsoft so I was getting to see lots of customers, all struggling with the same basic issue I had previously experienced as a customer, i.e. how do we recover service quickly in the event of a database problem? Exchange 2007 seemed to solve this elegantly with the introduction of CCR clusters. I loved the simplicity of this solution – just copy all of the changes from the active server to the passive server and then fail over in the event of a problem. It meant that even in a serious failure scenario we could just bring up our passive copy and be back online in minutes! The solution was out-of-box and so there were no supportability issues either! Customers loved CCR and I thought it was the best Exchange feature ever developed.
Over time though I started to see issues with the CCR model. The most frustrating was that we had to fail the whole server over to the passive node. This meant that if we had an isolated storage failure on the active node we had a difficult decision to make… leave the users on the failed database offline until a maintenance window was available, or interrupt service for everyone on the server and fail its workload over to the passive node. Not a great position to be in, and often the decision would depend on “who” was on the failed database and how important they were, rather than anything more scientific!
Exchange 2010 took the CCR model and addressed many of the issues reported by customers. Now we could fail over individual databases between server nodes. This was a huge step forward and meant that many customers were now actually hitting 99.9% availability. Provided that best practices were followed, it took an unusual situation to take Exchange 2010 offline for any significant period of time.
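To put those percentages in context, here is a small sketch of my own (Python, nothing to do with Exchange itself) that converts an availability target into an annual downtime budget. The function name and the 365-day year are assumptions made purely for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year, measured around the clock


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (100 - availability_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9):
        hours = allowed_downtime_hours(target)
        print(f"{target}% availability allows {hours:.2f} hours of downtime per year")
```

At 99% the budget is roughly 87.6 hours a year; at 99.9% it is roughly 8.76 hours. In other words, a single 10 hour restore from tape would use up more than an entire year's allowance at 99.9%.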
Over the past couple of years I have noticed some interesting service outages for my customers though. If the server experiences a clean failure of a storage component, the database hosted on that storage will simply move to an alternate copy and service will be resumed. However, if the storage device does not fail cleanly and instead just begins responding slowly or intermittently, the database will not simply fail over. This is quite common with JBOD or where the storage controller suffers an unusual failure, such as overheating or memory corruption.
Now we return to the present day (well, earlier this year). I was in a storage design meeting with a customer and their core storage vendors. I began talking about changes in Exchange 2010 SP1 and how Exchange would force a BSOD (bugcheck) on a server if we didn't hear back from a LUN within 4 minutes. At this point I could hear a sharp intake of breath from pretty much everyone involved. The feeling in the room was that forcing a server to crash and reboot was insane!
I began to question this behaviour myself and started to think back to some of the failures I had seen and whether this behaviour would have helped. I have to admit that forcing a server to blue screen does seem pretty extreme; however, upon reflection I came to the conclusion that I quite liked this behaviour. Given that we have multiple independent copies of our database in a DAG, would I rather a workload remained on a server with a storage I/O problem (4 minutes is a long time to get a response back from your storage!) or that it was moved to another copy? Well, obviously I want it to be moved, and actually I'm not even sure I want to wait 4 minutes!
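The reasoning above can be sketched in a few lines of Python. To be clear, this is my own toy illustration of the idea and not how Exchange implements it: the `PendingIo` type, both function names and the returned action strings are all hypothetical. The only figure taken from the discussion is the 4 minute threshold.

```python
from dataclasses import dataclass
from typing import List

HUNG_IO_THRESHOLD_SECONDS = 4 * 60  # the 4 minute figure discussed above


@dataclass
class PendingIo:
    """A hypothetical record of an I/O request that has not completed yet."""
    lun: str
    issued_at: float  # seconds on a monotonic clock


def find_hung_io(pending: List[PendingIo], now: float,
                 threshold: float = HUNG_IO_THRESHOLD_SECONDS) -> List[PendingIo]:
    """Return the requests that have been outstanding for at least `threshold` seconds."""
    return [io for io in pending if now - io.issued_at >= threshold]


def decide_action(pending: List[PendingIo], now: float, has_healthy_copy: bool) -> str:
    """Pick an action: carry on, move the workload, or call a human."""
    if not find_hung_io(pending, now):
        return "continue"
    # With another healthy copy available, moving the workload beats waiting.
    return "fail_over" if has_healthy_copy else "alert_operator"


if __name__ == "__main__":
    pending = [PendingIo("lun0", issued_at=0.0), PendingIo("lun1", issued_at=200.0)]
    print(decide_action(pending, now=240.0, has_healthy_copy=True))
```

The point of the sketch is that a slow or intermittent device never reports a failure by itself, so the only way to catch it is to put a clock on outstanding requests and act when the clock runs out.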
For me the decision is pretty obvious. If I have multiple independent copies of my databases, I want Exchange to switch over intelligently if it detects a problem with the currently active copy. It may seem counterintuitive to crash and reboot a server hosting an active service to improve service availability, but the alternative is to leave the service running on a wounded server until a human being comes along and does the same thing. Frequently in my support days I would arrive in the datacentre to find a hung server with a black screen. I would try RDP access, maybe trigger a remote reboot via RPC, but all too frequently once a server gets into this state you need to press and hold the big red button on the front to get it to come back up and begin troubleshooting the event logs. For as long as the server is in this hung state, the service is unavailable. With a triggered bugcheck, someone still needs to troubleshoot the root cause, but at least the service is only interrupted for a few minutes rather than a few hours…