Disk Image Backups and Multi-Master Databases (or: how to avoid early retirement)


Hi folks, Ned here again. We published a KB a while back around the dangers of using virtualized snapshots with DFSR:

Distributed File System Replication (DFSR) no longer replicates files after restoring a virtualized server's snapshot

Customers have asked me some follow-up questions, which I address today. Not because the KB is missing info (it's flawless, I wrote it ;-P) but because they were now nervous about their DCs and backups. With good reason, it turns out.

Today I discuss the risks of restoring an entire disk image of a multi-master server. In practical Windows OS terms, this refers to Domain Controllers, servers running DFSR, or servers running FRS; the latter two servers might be member servers or also DCs. All of them use databases to interchange files or objects with no single server being the only originator of data.

The Dangerous Way to Backup Multi-Master Servers

  • Backing up only a virtualized multi-master server's VHD file from outside the running OS. For example, running Windows Server Backup or DPM on a Hyper-V host machine and backing up all the guest VHD files. This includes full volume backups of the Hyper-V host.
  • Backing up only a multi-master server's disk image from outside the running OS. For example, running a SAN disk block-based backup that captures the server's disk partitions as raw data blocks, and does not run a VSS-based backup within the running server OS.

Note: It is ok to take these kinds of outside backups as long as you are also getting a backup that runs within the running multi-master guest computers. Naturally, this internal backup requirement makes the outside backup redundant though.
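The rule in the list and note above can be written down as a tiny check. This is only a sketch of the reasoning, not a tool: the `BackupPlan` class and its field names are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupPlan:
    """Hypothetical description of how one multi-master server is backed up."""
    host_level_vhd: bool = False   # VHD file captured from outside the guest (e.g. on the Hyper-V host)
    san_block_image: bool = False  # raw disk blocks captured by a SAN
    in_guest_vss: bool = False     # VSS-based backup run inside the running server OS


def has_outside_backup(plan: BackupPlan) -> bool:
    """True if any backup is taken from outside the running OS."""
    return plan.host_level_vhd or plan.san_block_image


def is_dangerous(plan: BackupPlan) -> bool:
    """Dangerous means relying ONLY on outside backups, or having no backup at all."""
    return not plan.in_guest_vss


def outside_backup_is_redundant(plan: BackupPlan) -> bool:
    """An outside backup is tolerable, but redundant, once an in-guest backup exists."""
    return plan.in_guest_vss and has_outside_backup(plan)
```

For example, `is_dangerous(BackupPlan(host_level_vhd=True))` is true, while adding `in_guest_vss=True` to the same plan makes it safe and marks the host-level backup as redundant.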

What happens

What's the big deal? Haven't you read somewhere that we recommend VSS full disk backups?

Yes and no. And no. And furthermore, no.

Starting in Windows Server 2008, we incorporated special VSS writer and Hyper-V integration components to prevent insidiously difficult-to-fix USN issues that came from restoring domain controllers as "files". Rather than simply chop a DC off at the knees with USN Rollback protection, the AD developers had a clever idea: the integration components tell the guest OS that the server is a restored backup and reset its invocation ID.

After restore, you'll see this Directory Services 1109 event when the DC boots up:

[Image: Directory Services event 1109]

This only prevents a problem; it's not the actual solution. The DC immediately replicates inbound from a partner and discards all of its local differences that came from the restored "backup". Anything created on that DC that had not yet replicated outbound is lost forever. It's quite like these "oh crap" steps we have here for the truly desperate who are fighting snapshot USN rollbacks: much better than nothing.

Now things get crummy:

  • This VSS+Hyper-V behavior only works if you back up running Windows Server 2008 and 2008 R2 DC guests. If backed up while turned off, the restore will activate USN rollback protection as noted in KB875495 (events 2095, 1113, 1115, 2103) and trash AD on that DC.
  • Windows Server 2008 and 2008 R2 only implement this protection as part of Hyper-V integration components, so third party full disk image restores or other virtualization products have to implement it themselves. They may not, which leads to USN rollback protection as noted in KB875495 (events 2095, 1113, 1115, 2103) and trashed AD on that DC.
  • Windows Server 2003 DCs do not have this restore capability even as part of Hyper-V. Restoring their VHD as a file immediately invokes USN rollback protection as noted in KB875495 (events 2095, 1113, 1115, 2103), again leading to trashed AD on that DC.
  • DFSR (for SYSVOL or otherwise) does not have this restore capability in any OS version. Restoring a DFSR server's VHD file or disk image leads to the same database destruction as noted in KB2517913 (events 2212, 2104, 2004, 2106).
  • FRS (for SYSVOL or otherwise) does not have this restore capability in any OS version. Restoring an FRS server's VHD file or disk image does not stop FRS replication for new files. However, all subfolders under the FRS-replicated folder (such as SYSVOL) - along with their file and folder contents - disappear from the server. This deletion will not replicate outbound, but if you add a new DC and use this restored server as a source DC, the new DC will have inconsistent data. There is no indication of the issue in the event logs. Files created in those subfolders on working servers will not replicate to this server, nor will their parent folders. To repair the issue, perform a "D2 burflag" operation on the restored server for all FRS replicas, as described in KB290762.
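If you are triaging a server after an image restore, the event IDs listed above can be turned into a small lookup. This is a rough sketch only: the IDs and KB numbers come from this article, while the function name and the condition labels are my own. It also assumes you have already pulled the IDs from the right event logs, since event IDs are only meaningful within their own log and source. Remember that the FRS case logs nothing at all, so an empty result does not mean the server is healthy.

```python
# Condition labels are illustrative; event IDs and KB numbers are from the article.
INVOCATION_ID_RESET = "AD invocation ID reset after restore (data not yet replicated out is lost)"
USN_ROLLBACK = "AD USN rollback protection triggered (KB875495)"
DFSR_DATABASE_LOSS = "DFSR database failure after image restore (KB2517913)"

EVENT_TO_CONDITION = {1109: INVOCATION_ID_RESET}
EVENT_TO_CONDITION.update({e: USN_ROLLBACK for e in (2095, 1113, 1115, 2103)})
EVENT_TO_CONDITION.update({e: DFSR_DATABASE_LOSS for e in (2212, 2104, 2004, 2106)})


def diagnose(event_ids):
    """Return the distinct restore-related conditions implied by the given event IDs.

    Unknown IDs are ignored. An empty list is NOT proof of health:
    a restored FRS server shows no events for this problem.
    """
    found = {EVENT_TO_CONDITION[e] for e in event_ids if e in EVENT_TO_CONDITION}
    return sorted(found)


if __name__ == "__main__":
    for condition in diagnose([1113, 2212, 7036]):
        print(condition)
```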

Multi-master databases are some of the most complex software in the world, and one-size-fits-all backup and restore solutions are not appropriate for them.

The Safe Way to Backup Multi-Master Servers

When dealing with any Windows server that hosts a multi-master database, the safest method is taking a full/incremental (and specifically including System State) backup using VSS within the running operating system itself. System State backs up all aspects of a DC (including SYSVOL on DFSR and FRS), but does not include custom DFSR or FRS replicated folders, which is why we recommend full/incremental backups for all the volumes. This goes for virtualized guests or physical servers. Avoid relying solely on techniques that involve backing up the entire server as a single virtualized guest VHD file or backing up the raw disk image of that server. As I've shown above, this makes the backups easier, but you are making the restore much harder.
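As a concrete illustration of an in-guest backup, here is a minimal sketch that assembles a Windows Server Backup command line. It assumes `wbadmin` on Windows Server 2008 R2 or later, where `start backup` accepts `-systemState`; the target drive and volume letters are placeholders, and the helper name is hypothetical. The function only builds the argument list; it does not run anything.

```python
def build_in_guest_backup_command(target, data_volumes=()):
    """Build a wbadmin command for a VSS full backup that includes System State.

    target       -- backup destination, e.g. "E:" (placeholder)
    data_volumes -- extra volumes holding custom DFSR/FRS replicated folders
    """
    cmd = ["wbadmin", "start", "backup", f"-backupTarget:{target}"]
    if data_volumes:
        # System State does not cover custom replicated folders,
        # so their volumes are included explicitly.
        cmd.append("-include:" + ",".join(data_volumes))
    cmd += ["-allCritical", "-systemState", "-vssFull", "-quiet"]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_in_guest_backup_command("E:", ["D:"])))
```

Run the resulting command inside the running guest (or physical server) from an elevated prompt, never on the host against the guest's VHD files.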

And when it gets to game time, the restore is what keeps you employed: your boss doesn't care how easy you made your life with backups that don’t work.

Final thoughts

Beware any vendor that claims they can do zero-impact server restores like those that I mentioned in the "Dangerous" section. Make them prove that they can restore a single domain controller in a two-DC domain, where you created new users and group policies after the backup, without any issues. Don't take the word of some salesman: make them demonstrate my scenario above. You don’t want to build your backup plans around something that doesn’t work as advertised.

Our fearless writers are banging away on TechNet as I write this to ensure we're not giving out any misleading info around virtualized server backups and restores. If you find any articles that look scary, please feel free to send us an email and I'll see to the edits.

Until next time.

- Ned "one of these servers is not like the other" Pyle

  • Thanks for this Ned, this is highly useful information. I've got a Server 2008 R2 Hyper-V guest that serves as our primary DFSR hub and has VHDs for each folder target/replication group. Couple questions though:

    1. If I understand correctly, we shouldn't be using any tool to back up these VHDs from the host (using Symantec Backup Exec, DPM, etc.), correct? Is there a recommendation to back up the guest OS without backing up the VHD, so in the event of an OS failure, we can restore without pooching DFSR?

    2. You stated "It is ok to take these kinds of outside backups as long as you are also getting a backup that runs within the running multi-master guest computers." How would this scenario be safer in a restore process? Would it be safe to restore the VHD backup, and then the file-level backup overtop of that? I just don't understand how this 'becomes' safe.

    If that does work, we'd still do the redundant process since a full backup & restore of our 1.5TB VHD takes much less time than a file level backup & restore from inside the guest. We could restore our weekly full VHD + any file-level differentials in short order.

  • Hi Jadus01,

    1. Most backup software allows you to exclude files by path or extension (Windows Server Backup does, for example), so you could block them that way if you were inclined.

    2. If you were NOT so inclined in #1 above and wanted to back up the VHDs as files, #2 still saves you. If you back up *inside* a guest OS (i.e. the guest OS is running backup software and is online), any restore of the VHD file is not terminal because you could then run the guest's own restore internal to the guest. If you had 100 VHDs on a host and only 2 of them were DCs, this might make sense, as 98 of your backups can be restored with the full disk and the two DCs can be fixed with their internal backups after the fact.

    We want you to treat the multi-master guest OSes as 'real' computers, not as VHD files. Full VHD backups simply don't work on them like they do with a simple file server or web server or app server; multi-master is too complex and interrelated. Plus even in the "good" case where we have USN rollback protection in place, your "restore" is losing all changed data. It's more like an amputation that saves the patient's life but costs him a hand.

    Now: if you had all the multi-master related guests on the same host, you could restore a full host and not see any problems. But if you're doing that, you're breaking a principle of availability - all your eggs are in one basket. So for different reasons, that's just as dangerous.

    Thanks, and continue to let me know if I'm not making sense. I want to make sure everybody gets this and understands what happens if you restore full VHDs.

  • That does make sense. Ultimately I'm just trying to determine if we need to modify our backup strategy for our DFSR server.

    So if we lose our DFSR member guest in Hyper-V, and we restore the C_guest.vhd (OS) and D_files.vhd (1.5 TB of files + DFSR database), and then boot up that guest disconnected from the network, and then do a further restore of a backup taken inside the guest before it failed, then when we connect it back to the network, it should begin communicating with the other DFSR members properly?

    Will it come back online as non-primary, so that it receives changes made on other members while it was down, even if the downed member was the hub?

    I think where my current confusion lies is in that second "inside guest" restore; you specify that the backup should be "full (and specifically including System State) backup using VSS". Does that mean incremental backups of the DFSR folder target aren't valid in the disaster recovery scenario? If we're backing up using UNC path to the folder target on the guest with a backup agent, is that going to be sufficient to get the DFSR member back online?

    If not, it almost sounds like it'd be easier to restore the VHDs as a new guest, and rejoin the replication group as a new member, with the 1.5TB VHD acting as pre-seeded data for an initial replication.

  • Ah I see what you mean - incrementals are totally valid as well, as long as you also have the full of course. There's no expectation that anyone needs to always run Full (capital F) backups every time, just that the backups fully cover all the files. I modified the article to be clearer about this.

    There is an easier solution after restoring a full VHD but I am loath to talk about it publicly as it is so often abused and used incorrectly. I am going to anyway though:

    A D2 burflag in FRS resets the database. If you were to delete the DFSR databases on all the volumes after the restore, you would fix the problem. This causes a complete non-authoritative sync and fixes everything up just like how DCs do when their invocation ID is reset. Now, the reason I don't like to bring this up is because you are now potentially syncing many terabytes of data, and you are definitely LOSING all conflicts (meaning that data that was newer on this server is arbitrarily deleted forever). It's not a great solution. You can't mark the restore authoritative either when the database is deleted; you'd instead have to do (as you mentioned) a database deletion as well as recreate your RG and set this server primary all over again.

    I would also make mention that DFSR performance is going to be a lot better with a pass-through disk than a VHD file; it's very IO dependent. And if you are using pass-through disks you have nothing to worry about. Especially since DFSR keeps a database on every volume that is replicating files.

  • > Windows Server 2008 and 2008 R2 only implement this protection as part of Hyper-V integration components so third party full disk image restores or other virtualization products have to implement it themselves.

    I've asked before and I'll ask it again :) Do we document how we did this and how they are supposed to do this? I mean, is there any special sauce, or do we just need to call the NTDS VSS writer at the end of the restore and make it aware that the restore took place?

  • It's only documented internally at this point, AFAIK. I'm sure if a backup vendor asked us how to do this we'd explain it to them though. Ultimately, you're just setting:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters

    Database restored from backup = 1
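
For readers who want to see what that value looks like in practice, here is a minimal sketch that renders it as `.reg` file text. It assumes the value is a REG_DWORD, which the reply above does not state, and the helper name is hypothetical. It only produces text and touches no registry; treat it as an illustration of the key and value, not a supported restore procedure.

```python
# Key path and value name are taken from the reply above.
NTDS_PARAMETERS_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters"
VALUE_NAME = "Database restored from backup"


def render_reg_file(value=1):
    """Return .reg file text setting the value (assumed here to be a DWORD)."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{NTDS_PARAMETERS_KEY}]",
        f'"{VALUE_NAME}"=dword:{value:08x}',
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_reg_file())
```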

  • Hi Ned,

    Does this also apply to scenarios where you have done a P2V or V2V of a server either using VMM 2008 R2, or using a more manual method such as Disk2VHD?

    Cheers

    Janson

  • Yes, the method that got you to a virtualized DC doesn't matter. The issue isn't specific to virtualization either - if you have a SAN that performs "volume snapshots" of physical servers you could run into these scenarios as well. Anything that makes an entire server time travel backwards as a single image is at risk.