Have you ever tried to restore a server? What about a Production server? How about in the middle of the night? It never goes smoothly. Your cellphone never stops ringing. Often, the only thing that gets the server recovered is your own ingenuity and rock-star efforts. Let’s spend some cycles and try to get ahead of this.
As PFEs, one of our major roles and responsibilities is to help our customers recognize “the gaps” and address them proactively. After an eye-opening conference call discussing recovery plans, or the lack thereof, I felt even more compelled to write a post with some DR considerations. Hopefully, this will stir some thoughts and discussions (and ACTIONS!) around the matter of recovery.
Recovery can be defined as (among other things):
In our World of IT, we could be doing any or all of these actions during what we often refer to as “Disaster Recovery,” or “DR.”
The need to recover could come from a natural or man-made disaster or other large-scale event.
It could come from a rogue admin or disgruntled employee. Often, it comes from an IT Pro making an innocent mistake, whether small or large-scale.
Consider the statement: We do full backups of the ‘whole’ server, so in order to recover after an outage, we would simply do a full recovery of the box and be done.
Many times, a ‘full’ server backup doesn’t capture key files, such as files that are in use. DBs, transaction logs, application exe files, etc., are often not backed up by backup jobs running with default settings or without special agents. We usually don’t realize this until we’re in dire straits. Or, perhaps, there is a Scheduled Task that is supposed to pause/quiesce the app/DB so the backup can get a copy of the proper flat file(s). However, the Task isn’t being monitored and it hasn’t run for 9 months (since the service account got locked out and we’re not monitoring it with SCOM). Also, since that last backup 9 months ago, the app owner has upgraded the app two versions.
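To make the "hasn't run for 9 months" scenario harder to miss, here is a minimal sketch of a freshness check, assuming your backup job drops its output files into a single folder. The function names, the one-day threshold, and the folder layout are all hypothetical; this is an illustration, not a replacement for proper monitoring.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup_age_days(backup_dir: str, now: Optional[float] = None) -> Optional[float]:
    """Return the age in days of the newest file in backup_dir, or None if there are no files."""
    now = time.time() if now is None else now
    # Only look at regular files directly inside the folder (hypothetical layout).
    mtimes = [p.stat().st_mtime for p in Path(backup_dir).iterdir() if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 86400.0


def backup_is_stale(backup_dir: str, max_age_days: float = 1.0,
                    now: Optional[float] = None) -> bool:
    """Treat a folder with no backups, or only old ones, as stale."""
    age = newest_backup_age_days(backup_dir, now)
    return age is None or age > max_age_days
```

Something like `backup_is_stale(r"D:\Backups\AppDB")` (a made-up path) could run daily and raise an alert when it returns `True`. Note that a recent timestamp only shows that a file was written, not that the file is restorable.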
Consider the statement: We test recovery of our systems at the annual/recurring DR exercise/effort/meeting (you do have one of those, don’t you?).
However, as a “year in the life” passes for a system or server, it gets patched and service packed, its drivers get updated, its settings change (or drift), etc. Sometimes, the steps that enabled you to recover the system during the last DR exercise no longer work and the recovery suffers an epic failure.
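One way to see how far a server has moved since the last DR exercise is to record a simple snapshot at exercise time and compare it later. The sketch below assumes you have already collected both snapshots as name/value pairs; the function name, the keys, and the values are invented for illustration.

```python
from typing import Dict, List, Tuple


def find_drift(baseline: Dict[str, str], current: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (item, baseline_value, current_value) for every item that differs."""
    missing = "<absent>"
    drift = []
    # Walk the union of keys so added and removed items are reported too.
    for key in sorted(set(baseline) | set(current)):
        old = baseline.get(key, missing)
        new = current.get(key, missing)
        if old != new:
            drift.append((key, old, new))
    return drift


# Hypothetical snapshots: one taken at the last DR exercise, one taken today.
at_last_dr_test = {"os_patch_level": "2023-01", "nic_driver": "1.4", "app_version": "5.0"}
today = {"os_patch_level": "2023-11", "nic_driver": "1.4", "app_version": "7.0", "new_agent": "2.1"}

for item, old, new in find_drift(at_last_dr_test, today):
    print(f"{item}: {old} -> {new}")
```

A non-empty result doesn't mean recovery will fail. It is a prompt to re-run the documented recovery steps before you need them.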
BE PREPARED – as much as you can. Like many things, DR is always a work in progress and always changing as our systems evolve, get patched, updated or otherwise changed. Be vigilant! Be disciplined! Add Recovery to your normal work routine so it doesn't catch you off-guard. Consider recovery before a system is even deployed. Make sure it is part of the design. Test the recovery design prior to deployment and again at regular intervals.
One tip is to add recovery testing to your own day-to-day work items.
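As one small, routine test, you could restore a single file to a scratch location and confirm it matches a known-good copy. This is a minimal sketch using SHA-256 digests; the function names are hypothetical. It assumes the file has not legitimately changed since the backup, so for live data you would compare against a digest recorded at backup time instead.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large backups don't have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_path: str, restored_path: str) -> bool:
    """True when the restored file is byte-for-byte identical to the original."""
    return sha256_of(original_path) == sha256_of(restored_path)
```

A matching digest only proves that one file came back intact. It doesn't prove the application will start, so treat it as a quick sanity check between full recovery tests.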
Now for a few DR pointers. Much of this is obvious and self-evident. It is painful, though, how often we neglect or forget the obvious.
Document. Document. DOCUMENT!
Hopefully, the information here reminds you of DR, gets you thinking about DR, brings up an idea or two about DR, or even stirs you to set up some Outlook appointments.
Now, take action and be at least a little better prepared.