DC’s and VM’s – Avoiding the Do-Over

Hello everyone, Mark from DS again. With more and more companies running virtualization software such as Microsoft Virtual Server, Windows Server 2008 Hyper-V, or VMware in their environments, you may end up in the following situation, which I recently worked on:

1) The customer wanted to roll back one of the DC’s in his test environment to “back out” of some recent changes. This was a single-domain forest that consisted of two Domain Controllers, both running Windows 2003 SP2.

2) Virtual Machine snapshots were being taken instead of normal system state backups.

3) They restored one of the DC’s from one of the snapshots.

4) Replication was broken.

Replication symptoms consisted of the following:

1) The Netlogon service was in a paused state.

2) The Directory Service event log contained a replication error with Source NTDS Replication and Event ID 2095.

3) The Directory Service event log also contained two warnings with Source NTDS General and Event ID’s 1113 and 1115.

Here are samples of the Directory Service event log events with the description of the event.

Event Type: Error
Event Source: NTDS Replication
Event Category: Replication
Event ID: 2095
Date:
Time:
User:
Computer:
Description: During an Active Directory replication request, the local domain controller (DC) identified a remote DC which has received replication data from the local DC using already-acknowledged USN tracking numbers. Because the remote DC believes it is has a more up-to-date Active Directory database than the local DC, the remote DC will not apply future changes to its copy of the Active Directory database or replicate them to its direct and transitive replication partners that originate from this local DC. If not resolved immediately, this scenario will result in inconsistencies in the Active Directory databases of this source DC and one or more direct and transitive replication partners. Specifically the consistency of users, computers and trust relationships, their passwords, security groups, security group memberships and other Active Directory configuration data may vary, affecting the ability to log on, find objects of interest and perform other critical operations. To determine if this misconfiguration exists, query this event ID using
http://support.microsoft.com or contact your Microsoft product support. The most probable cause of this situation is the improper restore of Active Directory on the local domain controller. User Actions: If this situation occurred because of an improper or unintended restore, forcibly demote the DC.

Event Type: Warning
Event Source: NTDS General
Event Category: Replication
Event ID: 1113
Date:
Time:
User:
Computer:
Description: Inbound replication has been disabled by the user.

Event Type: Warning
Event Source: NTDS General
Event Category: Replication
Event ID: 1115
Date:
Time: 
User: 
Computer: 
Description: Outbound replication has been disabled by the user.

If you run the command repadmin /options <The DC Name>, you can verify that inbound and outbound replication are disabled. The output will look similar to this:

Current DC Options: IS_GC DISABLE_INBOUND_REPL DISABLE_OUTBOUND_REPL
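
If you need to check this on more than one DC, the flags in that line are easy to test for in a script. The following is only a minimal Python sketch, assuming you have already captured the repadmin output as text. The helper names parse_dc_options and replication_disabled are hypothetical, and the only output format assumed is the single "Current DC Options:" line shown above.

```python
def parse_dc_options(output):
    """Return the option flags found on a 'Current DC Options:' line as a set."""
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Current DC Options:"):
            # Everything after the first colon is a space-separated flag list.
            return set(line.split(":", 1)[1].split())
    return set()


def replication_disabled(output):
    """True when both inbound and outbound replication are flagged as disabled."""
    flags = parse_dc_options(output)
    return {"DISABLE_INBOUND_REPL", "DISABLE_OUTBOUND_REPL"} <= flags


# Sample text copied from the output line shown in this article.
sample = "Current DC Options: IS_GC DISABLE_INBOUND_REPL DISABLE_OUTBOUND_REPL"
print(sorted(parse_dc_options(sample)))
print(replication_disabled(sample))  # True
```

A DC that reports both flags together with event 2095 matches the symptoms described above. The flags alone only tell you that replication is disabled, not why.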

With more and more companies using virtualization to replace physical hardware, especially in test environments, I believe we are going to see more issues such as this one. This can also happen when you convert physical hardware to virtual machines, which we refer to as “PtoV” (physical to virtual).

First we need to understand some basic background information regarding Active Directory (AD) replication. Domain Controllers (DC’s) use Update Sequence Numbers (USN’s) to track the updates that need to be replicated between replication partners. Every time a change is made to the data in the directory, the USN is incremented to indicate that a change was made. For each directory partition that a DC stores, USN’s are used to track the latest updates that the DC has received from each source replication partner. Each DC also keeps a table that records the highest USN it knows of for every other DC that stores a replica of that directory partition. Finally, each DC has a value on its NTDS Settings object called an invocation ID. This value is used to identify its version of its local AD database.

There are two values that use USN’s during the replication process: the up-to-dateness vector and the high water mark. The up-to-dateness vector is a value that the destination DC maintains for tracking the originating updates that it has received from its source DC’s. When the destination DC requests updates for a directory partition, it supplies its up-to-dateness vector to the source DC, which can use that value to reduce the set of attributes it needs to send to the destination DC. The source DC sends its own up-to-dateness vector to the destination DC once the replication cycle has completed. The high water mark is a value that the destination DC maintains to keep track of the latest change it has received from a specific source DC for an object in a specific directory partition. This value prevents the source DC from sending changes that the destination DC has already applied.

The invocation ID is a GUID value that identifies the directory database running on a DC and is maintained separately from the identity of the server object. The server object identity never changes, but the identity of the directory database (the invocation ID) changes when a system state is restored by using the Microsoft API’s. Every domain controller keeps track of the directory databases on its source replication partners. Both the up-to-dateness vector and the high water mark refer to the invocation ID so that other DC’s know which copy of the AD database the replication is coming from.
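
To make the two filters a little more concrete, here is a toy Python model. It is a deliberately simplified sketch and not how AD is implemented: the ToyDC and Change classes, the string invocation IDs, the USN’s starting at 0, and the third DC are all assumptions made for illustration. It only shows the idea that the high water mark skips changes already received from a given source, while the up-to-dateness vector skips originating updates that already arrived by another path.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    origin_id: str   # invocation ID of the DC where the change originated
    origin_usn: int  # USN the originating DC assigned to the change
    local_usn: int   # USN under which the holding DC stored its copy
    what: str


@dataclass
class ToyDC:
    invocation_id: str
    usn: int = 0
    changes: list = field(default_factory=list)
    high_water_mark: dict = field(default_factory=dict)  # source ID -> last USN seen from that source
    utd_vector: dict = field(default_factory=dict)       # origin ID -> highest originating USN applied

    def originate(self, what):
        """Make an originating change on this DC."""
        self.usn += 1
        self.changes.append(Change(self.invocation_id, self.usn, self.usn, what))

    def pull_from(self, source):
        """Request changes from a source DC and return what was applied."""
        hwm = self.high_water_mark.get(source.invocation_id, 0)
        utd = dict(self.utd_vector)
        utd[self.invocation_id] = self.usn  # never re-request our own updates
        applied = []
        for c in source.changes:
            if c.local_usn <= hwm:
                continue  # high water mark: already received from this source
            if c.origin_usn <= utd.get(c.origin_id, 0):
                continue  # up-to-dateness vector: already received another way
            self.usn += 1
            self.changes.append(Change(c.origin_id, c.origin_usn, self.usn, c.what))
            self.utd_vector[c.origin_id] = max(
                self.utd_vector.get(c.origin_id, 0), c.origin_usn
            )
            applied.append(c.what)
        self.high_water_mark[source.invocation_id] = source.usn
        return applied


dc1, dc2, dc3 = ToyDC("guid-dc1"), ToyDC("guid-dc2"), ToyDC("guid-dc3")
dc1.originate("add user")
first = dc2.pull_from(dc1)      # DC2 receives the new user
again = dc2.pull_from(dc1)      # nothing new: filtered by the high water mark
direct = dc3.pull_from(dc1)     # DC3 receives the new user from DC1
via_dc2 = dc3.pull_from(dc2)    # nothing new: filtered by the up-to-dateness vector
print(first, again, direct, via_dc2)
```

Note that both tables are keyed by invocation ID. That detail matters later, when a restored database keeps its old invocation ID.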

I know this can be confusing so let’s add some graphics that may help to understand this better. Let’s say we have two DC’s, DC1 and DC2. Both of these DC’s are running as Virtual Machines on a host machine running your favorite Virtualization Software. For all intents and purposes we are assuming that replication is working fine and both of the DC’s are up to date on replication. Before we start, we take a “snapshot” of DC1. As we can see below we add a new user “Jeff Smith” on DC1. The USN is incremented from 4710 to 4711 on DC1.

image

Now we replicate the new user to DC2. DC1 notifies DC2 that it has changes to replicate. DC2 then requests the changes and sends DC1 what it thinks DC1’s high water mark is. In this case DC2 thinks that value is 4710, so that is what it sends. When they are done replicating, DC1 sends DC2 its up-to-dateness vector so that DC2 has the new value.

image

Now let’s suppose that other changes in the environment are occurring and replicating as they should. “Jeff” logs on and changes his password, and DC2 is the DC where the change takes place. This increments the USN on DC2 from 2452 to 2453.

image

Next we replicate that password change over to DC1. DC2 tells DC1 that it has changes it needs to get. DC1 will send DC2 what it thinks DC2’s USN is, in this case DC1 thinks DC2 is at 2452.

image

Once they are done replicating, DC1’s USN will be 5040 and DC2 will know that DC1 is at 5040. DC2 will be at 2453 and DC1 will know that value as well.

Now you want to roll that one DC back. You apply the snapshot to the DC as a restore procedure. When this happens, the invocation ID remains the same and the USN’s are “rolled back” to the time the snapshot was taken. When the replication process starts and the “snapshot” DC requests changes from its source DC, it sends the old up-to-dateness vector to the source DC. The source DC compares this value with the value it holds in its own table for the destination DC and sees that they are different: the value sent is lower than the value the source DC has recorded. The response that the source DC sends back basically tells the destination DC that its database is out of date. When this happens, built-in protection makes the destination DC take measures so that it does not replicate with other DC’s. This is referred to as a “USN rollback” situation.
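
The core of the problem can be shown with a few lines of Python. This is a toy sketch only: the ToyDC class, the replicate and rollback_suspected helpers, and the invocation ID strings are hypothetical, and the comparison is a simplification modeled on the wording of event 2095 (a partner has already acknowledged USN’s higher than the ones the local database now holds). The real detection logic in AD is more involved. The USN values are the ones from the walkthrough above.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyDC:
    name: str
    invocation_id: str
    usn: int
    # invocation ID -> highest USN this DC has acknowledged from that database
    acknowledged: dict = field(default_factory=dict)


def replicate(source, dest):
    """Destination records the highest USN it has received from the source database."""
    dest.acknowledged[source.invocation_id] = source.usn


def rollback_suspected(local, partner):
    """True if the partner has acknowledged USN's the local database has not reached."""
    return partner.acknowledged.get(local.invocation_id, 0) > local.usn


dc1 = ToyDC("DC1", "guid-dc1", 4710)
dc2 = ToyDC("DC2", "guid-dc2", 2452)

snapshot = copy.deepcopy(dc1)     # VM snapshot taken at USN 4710
dc1.usn += 1                      # "Jeff Smith" is added: 4710 -> 4711
replicate(dc1, dc2)               # DC2 now acknowledges 4711 for guid-dc1
healthy = rollback_suspected(dc1, dc2)

# Snapshot applied: USN's go back to 4710, invocation ID stays the same.
rolled_back = copy.deepcopy(snapshot)
bad_restore = rollback_suspected(rolled_back, dc2)

# A supported system state restore resets the invocation ID, so partners
# treat the restored database as a new source instead of comparing old USN's.
restored = copy.deepcopy(snapshot)
restored.invocation_id = "guid-dc1-after-restore"
good_restore = rollback_suspected(restored, dc2)

print(healthy, bad_restore, good_restore)  # False True False
```

The last case is the reason a proper system state restore is safe and a snapshot is not: the new invocation ID tells the partners to stop trusting what they had recorded about the old database.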

The USN rollback protection takes the following measures:

1) Pause the Netlogon service.

2) Disable the inbound and outbound replication.

To correct this situation we need to do the following on the DC that has the roll back issue.

1) Forcefully demote the DC by running dcpromo /forceremoval. This removes AD from the server without attempting to replicate any changes off it. Once the demotion is done and you reboot the server, it will be a standalone server in a workgroup.

2) Run a metadata cleanup of the DC that was demoted per KB article 216498 on one of the replication partners.

3) If the demoted server held any of the FSMO (Flexible Single Master Operations) roles then use the KB article 255504 to seize the roles to another DC.

4) Once replication has occurred end to end in your environment, you can rejoin the demoted server to the domain and then promote it to a DC.

To prevent this from happening adhere to the following best practices:

1) Do not use imaging software to take an image of the DC.

2) Do not take or apply snapshots of the DC.

3) Do not shut the Virtual Machine down and simply copy the virtual disk as a backup.

4) If you have the ability to “discard changes” as you do if you are running “Virtual Server 2005 R2”, do not enable this type of setting on a DC Virtual Machine.

5) Use NTBACKUP.EXE, WBADMIN.EXE, or any third party software that is available as long as it is certified to be AD-compatible to take system state backups.

6) Only restore a system state to the DC or restore a full backup.

References:

875495 How to detect and recover from a USN rollback in Windows Server 2003

http://support.microsoft.com/default.aspx?scid=kb;EN-US;875495

Appendix A: Virtualized Domain Controllers and Replication Issues

http://technet.microsoft.com/en-us/library/dd348479.aspx

Backup and Restore Considerations for Virtualized Domain Controllers

http://technet.microsoft.com/en-us/library/dd363545.aspx

 

- Mark Ramey

  • Good article.

    Yeah, seeing this popping up in forums a lot recently.  I'll just link to this article from now on in my reply to those posts :)

  • Important reference, bookmarked :-)

  • All - great info.  However, in MSFT's VS 2005 whitepaper about running DCs on VMs, there is mention of a regchange (notice I didn't say 'reghack') that prevents USN rollback:

    1. Using the previous .vhd, start the domain controller in Directory Services Restore mode.

    2. In a registry editor, if the entry DSA Previous Restore Count under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters is visible, make a note of the value. If the entry is not visible, assume a value of 0. Do not add the entry.

    3. Add the registry entry Database restored from backup under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters

    Data type: REG_DWORD

    Value=1

    Is that no longer valid?

  • That is not supported, because it is being used to circumvent a proper backup. It's hacking. Using snapshots with any virtualization technology and a DC is 100% unsupported, always.

    That article was clearly written back when people were cowboying virtualization because they had no idea what they were doing, and the author clearly did not either. 5 years ago, sounds about right. Don't use that article and expect to be supported if you have DC issues afterwards.

  • So this technet article is all wrong?! It applies to Server 2008

    "To restore a previous version of a virtual domain controller VHD without system state data backup"

    http://technet.microsoft.com/en-us/library/dd363545(WS.10).aspx

  • How is that article wrong? It specifically says:

    "Do not use the Snapshot feature as a backup to restore a virtual machine that was configured as a domain controller. "

  • It also says:

    "If you do not have a system state data backup that predates the virtual machine failure, you can use a previous VHD file to restore a domain controller that is running on a virtual machine"

    If the VHD file has never been started in normal mode it looks to me like this article says it's ok to do it as long as you set the "DSA Previous Restore Count" to 0 in DSRM.

  • Uggghhh...

    Don't follow that direction until you hear back from me here. I am tracking that down to get some more info on what would happen here if there was more than one DC in the domain. And if you have that, why are you restoring the server?

    This looks like untested, unsupported, ancient and naive documentation from 6 years ago around Virtual Server.

    Use system state backups.

  • I agree with you that system state backup is the bullet proof way, but there are many MVP blogs and forums out there saying the "DSA Previous Restore Count" method is an "alternative" to system state. My guess is that they have read the technet article.

    Looking forward to hear back from you and what you find out :)

  • From chatting with one of the PQPM's here, we in Support fought to have that documentation removed, and lost. I cannot vouch for its supportability in any way I'm afraid, nor have I been able to get a developer to vouch for it. I am still asking around though.

  • I hope this didn't cause you any delay in the "Friday Mail Sack" ;)

  • Nah, but plenty of other stuff is... :-/

    So, back to your question. I was able to dig up the right folks and get some clarification. I plan on having this article edited for clarity, but:

    1. As long as you never boot the hyper-v snapshot until after you’ve set the ‘dsa restoring from backup’ key, then you’re good and supported (this was tested in Win2008 R2). However, if you ever accidentally boot the hyper-v snapshot before you’ve set the key, then you’re in a USN rollback scenario.

    2. Step 12 is baloney. If the value is not present or correct, you cannot start over with this VHD. You must have another snapshot to restore or have made a copy of this image before you started all these steps.

    3. And finally - the reason the article starts with "Do not use the Snapshot feature as a backup to restore a virtual machine that was configured as a domain controller" but then goes on to give steps is for the absolute last resort, last gasp, "OMG we're all gonna die man" scenarios where your system state backups are not working. The SS backups are still the mechanism you should be using, and the snapshots should never, ever be done in lieu of system state backups. That's why this article is hard to find, but USN rollback articles are easy to find - we want people using system state backups.

  • Hi Ned and many, many thanks for the clarifications on this.

    Could you please take a minute and review what I've written on general backup theory in TN Wiki? What I tried to do there is (among other similar goals) to explain the “USN Rollback” feature in “Simple English” language, avoiding any technical details and still explicitly giving all necessary warnings and support “do and don't”s.

    The article is located at http://social.technet.microsoft.com/wiki/contents/articles/backup-and-restore-special-considerations.aspx

    Now I am going to write another article there in the same style that talks specifically about VM backup and restore (I'm a VM MVP after all). But before that I want to make sure that I'm OK with all application-specific points.