I have the pleasure of working for the Exchange Critical Situation team in North Carolina, so quite a few Disaster Recovery cases end up on my phone line. The number one reason they end up there is that the company just has not planned for a failure. We all assume that nothing is going to happen to us (I am the same way: I live in hurricane country and yet I don’t have a hurricane kit or plan), but with email, and thus Exchange, being mission critical to most businesses running it today, it is not an assumption any reasonable IT administrator can afford to make.
With that in mind I would like to share some basic things that everyone who is running Exchange needs to do in order to be prepared. While a lot of this is not "brand new" knowledge, it does address many of the things that I see go wrong in the cases that I work on. DR Happens - Will you be ready?
Service Level Agreement
Here the common scenario we run into is that people have not thought about what is important if the Exchange server has a failure. I will be on the phone working with a customer and we determine they need to do a restore from backup. At this point (especially with 2003) there are a few ways we can go about doing that. What is important to your company will help to determine the best way to do the restore.
Not having this information decided in advance leads to long conversations with management to get the decisions made. This can drastically slow down the pace of recovery. In some rare cases I have ended up spending more time talking about what we could do than we spent actually doing it.
What you need to have decided ahead of time are a few simple questions:
1) Which one is more important to my users: Restoration of Mail Flow or Recovery of Historical Data?
2) How long can we afford to be down without any Mail Flow?
3) How long can we afford to be down with no Historical Data Recovered?
4) If Historical Data is our top priority at what point does Mail Flow become more important and vice versa?
These four questions will help to define your options and what you can and cannot do in order to restore Exchange to the functional level that you desire in the minimal amount of time. Having these decided in general is the first step to having a smooth disaster recovery.
This is another common situation that I run into. We are doing a restore from tape of a 120 GB storage group and suddenly someone realizes that the restore is going to take another 18 hours to finish, and it is 11pm right now. So it is going to cut into the business day, and that can’t be allowed to happen. Now we end up in a panic situation where people are willing to try any crazy scheme they can think of to get it back up before the morning.
This situation almost always comes about because people plan their database size based on their disk size and not the limitations of their Backup and Restore plan. Database size should be determined almost solely by your SLA and your backup and restore speed. This will ensure that when something goes wrong you will be able to get everything back up and running in a predictable and timely manner.
So what you need to do with database size is work it backwards. Determine how long you can be without Historical Data. Then determine how fast you can restore from tape. Take the first number, subtract some padding for troubleshooting when the failure is discovered and some padding for log file replay after the restore is done, and multiply what is left by your restore speed. That tells you how large your databases can be.
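As a rough illustration of working the size backwards, here is a minimal Python sketch. The function name and all of the numbers are made up for the example; plug in the SLA window, tape restore speed, and padding that you measure in your own environment.

```python
def max_database_size_gb(sla_hours, restore_rate_gb_per_hour,
                         troubleshooting_hours, log_replay_hours):
    """Largest database that can be restored inside the SLA window.

    All inputs are your own measured or agreed values; nothing here
    comes from Exchange or from any backup product.
    """
    # Time left for the actual tape restore once the padding is set aside.
    restore_window = sla_hours - troubleshooting_hours - log_replay_hours
    if restore_window <= 0:
        raise ValueError("padding leaves no time for the restore itself")
    return restore_window * restore_rate_gb_per_hour


# Hypothetical numbers: 8-hour SLA, 20 GB/hour from tape,
# 1 hour to troubleshoot, 1 hour of log replay.
print(max_database_size_gb(8, 20, 1, 1))  # 120
```

If the number that comes out is smaller than the databases you have today, that is the signal to split them up or to speed up the restore path before a failure forces the issue.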
You also need to figure out if that number will hold when you have to restore a whole storage group of 5 databases, or a whole server of 20 databases. In most cases you will probably want an SLA for each of those three situations, since it clearly will take more time to restore 5 or 20 databases than it will to restore one.
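To see how the three situations compare, a small Python sketch like the one below can help. It assumes the databases are restored one after another from a single tape drive and that log replay takes about the same time for each database; both are simplifying assumptions, and the function name and figures are hypothetical.

```python
def restore_hours(database_count, database_size_gb, restore_rate_gb_per_hour,
                  troubleshooting_hours, log_replay_hours_per_db):
    """Estimated hours to restore a number of equally sized databases.

    Assumes a sequential restore from one tape drive (an assumption for
    this sketch, not a statement about any particular backup product).
    """
    data_hours = database_count * database_size_gb / restore_rate_gb_per_hour
    replay_hours = database_count * log_replay_hours_per_db
    return troubleshooting_hours + data_hours + replay_hours


# Hypothetical numbers: 20 GB databases, 20 GB/hour from tape,
# 1 hour of troubleshooting, 30 minutes of log replay per database.
for label, count in [("one database", 1), ("storage group", 5), ("whole server", 20)]:
    print(label, restore_hours(count, 20, 20, 1, 0.5))
```

Comparing each estimate with the SLA you have agreed for that situation shows quickly whether the single-database number still holds for a storage group or a full server.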
Now let us say that you have been diligent and you have your SLA written out and you have your Databases at a reasonable size; you are all prepared, right? Wrong. When it is time to do a disaster recovery, mistakes are measured in hours. Checking the wrong box in your backup software can cost you your entire SLA window. Plus, while you are offline you don’t want to spend 30 minutes reading the directions for your Backup Software, or calling your backup vendor to figure out how to get the restore off of tape. You need to already be at least basically familiar with the restore process.
What you need to do is practice as if your Exchange server had failed. We call this process running a Fire Drill. You should run an Exchange Fire Drill at least once a Quarter to keep everyone up to date on how the restore process works and how to perform it.
To run a Fire Drill you should set up a server (a beefy workstation will do) with sufficient drive space to accommodate the Exchange database from at least one of your servers. You would then set it up on its own network with its own Domain Controller (if you are not testing full server restore then this can be a new domain). Install Exchange and your backup software on the server and make sure you can get access to the data on tape.
Now you are ready to go. Come in the next morning and declare “The Exchange server/Storage Group/Database (whichever you want to practice) just went down. We need to get it back up and running, and we have “X” hours to do so.” That X hours should be the time from your SLA that you have laid out beforehand. Also make sure that you have management involvement so that you can concentrate on doing the restore just as if the Exchange server was actually down.
Write a Cheat Sheet
Now you have gone through the process of doing a Fire Drill and you learned what worked and what didn’t. You have figured out all of the little check boxes and the fact that you have to keep the intern away from the tape drive power button. Take all of that knowledge and make yourself a cheat sheet for next time.
This cheat sheet should contain an outline of the steps and processes that you need to go through in order to do your planned restore. It should include reminders of the little steps that you found are easy to miss. If possible you should also include screen shots of all of the settings you need to have to do the restore on your backup software. This cheat sheet will basically become your Restore Bible when it comes time for the real thing.
Practice some more
Last but not least, you need to bring that cheat sheet out on a regular basis and practice with it. Make sure your organization is doing an Exchange Fire Drill at least once a quarter. Make sure that not just the Exchange guy is there for that; he should have a backup, in case he is on vacation, who can use the cheat sheet if necessary. After each of these practice sessions go back over the cheat sheet and make sure nothing needs to be updated.
If you do these basic, simple things you will be more prepared for when an Exchange Disaster does happen. This should help your disaster recovery go smoothly with the minimum amount of down time. With Disaster Recovery, mistakes are measured in hours, so it pays to be prepared.
Exchange Server 2003 Disaster Recovery Operations Guide
Worksheet: Disaster Recovery Preparation for Exchange Server 2003
Preview: Exchange Server 2003 Disaster Recovery Planning Guide
- Matthew Byrd