“Lag site” or “hot site” (aka delayed replication) for Active Directory Disaster Recovery support

Hi, Gary from Directory Services here and I’m going to talk today about the concept of “lag sites” or “hot sites” as a recovery strategy. I recently had a case where the customer asked if the replication interval for a site link could be set higher than 10,080 minutes (7 days). The quick answer was that Active Directory only supports values from 15 up to 10,080 minutes and the schedule is based on a week. If the replinterval attribute on the site link is manually set to something lower than 15 it will use the default of 15. If it is set to something higher than 10,080, it will be ignored and 10,080 will be used.
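The clamping behavior described above can be written down as a small function. This is only an illustrative sketch of the rule as stated in this post; the function and constant names are made up for the example and are not part of any Active Directory API.

```python
# Bounds taken from the text above; the names are hypothetical.
MIN_REPL_INTERVAL = 15       # minutes
MAX_REPL_INTERVAL = 10_080   # minutes (7 days * 24 hours * 60 minutes)


def effective_repl_interval(configured_minutes: int) -> int:
    """Return the interval that would actually be used for a configured value.

    Models only the clamping described in this post: values below 15 fall
    back to 15, and values above 10,080 are treated as 10,080.
    """
    if configured_minutes < MIN_REPL_INTERVAL:
        return MIN_REPL_INTERVAL
    if configured_minutes > MAX_REPL_INTERVAL:
        return MAX_REPL_INTERVAL
    return configured_minutes


# A two-week value (20,160 minutes) would be treated as one week.
print(effective_repl_interval(20_160))
```

So a customer who sets a two-week interval on the site link would still see replication at the one-week ceiling.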

But the underlying question kept coming back to the recommendation of a latent “lag site”.

First let me give a quick definition of a lag site or hot site and its general intended purpose. A lag site is just an Active Directory site that is configured with a replication schedule of one, two or maybe three days out of the week. That way it will have data that would be intentionally out-of-date as of the last successful inbound replication. It is sometimes used as a quick way to recover accidentally deleted objects without having to resort to finding the most recent successful backup within the tombstone lifetime of the domain that has the data.
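To make the "within the tombstone lifetime" constraint concrete, here is a minimal sketch that picks the newest backup still younger than the tombstone lifetime. The function names are hypothetical, and the tombstone lifetime is passed in as a parameter because it varies by forest (60 or 180 days are common values, but check your own).

```python
from datetime import datetime, timedelta


def backup_is_usable(backup_taken: datetime, now: datetime,
                     tombstone_lifetime_days: int) -> bool:
    """True if the backup is younger than the tombstone lifetime."""
    age = now - backup_taken
    return timedelta(0) <= age < timedelta(days=tombstone_lifetime_days)


def newest_usable_backup(backups, now, tombstone_lifetime_days):
    """Return the most recent usable backup time, or None if there is none."""
    usable = [b for b in backups
              if backup_is_usable(b, now, tombstone_lifetime_days)]
    return max(usable) if usable else None


# Example with made-up dates and an assumed 180-day tombstone lifetime.
now = datetime(2024, 6, 1)
backups = [datetime(2023, 1, 1), datetime(2024, 5, 1), datetime(2024, 5, 20)]
print(newest_usable_backup(backups, now, 180))
```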

This sounds like a decent idea, in theory. However, Microsoft Support does not recommend a lag site as a disaster recovery strategy. Servicing products such as hotfixes and service packs do not recognize the quasi-offline DC state, and monitoring software may also detect the state of a lag site DC as malfunctioning and attempt to re-enable it (or tell an unwitting administrator to do so). Microsoft makes no guarantees that the servicing and monitoring products would not re-enable Netlogon and KDC services in a lag site. In addition, other Microsoft products, such as Exchange Server, are not designed to operate in a lag site and they may not function properly with lag site DCs.

The following lists some reasons why lag sites should not be relied upon as a disaster recovery strategy, especially in lieu of proper Active Directory System State backups:

Lag sites are not guaranteed to be intact in a disaster:

  • If the disaster is not discovered before replication occurs, the problem is replicated to the lag site, and the lag site cannot be used to undo the disaster. A lag site typically needs to be three days latent in order to cover situations that occur during the weekend, when visibility is low. However, this means that you are actually forced to ‘lose’ more changes than you would with a reliable daily backup of the domain controllers.
  • Thus, the administrator must act immediately when a disaster occurs: inbound and outbound replications must be disabled and repadmin /force must be forbidden.
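The timing problem in the two points above can be sketched as a simple check: the lag site is only useful if replication was stopped before the first scheduled inbound replication that follows the disaster. This is a simplified model that assumes replication happens at a fixed interval from a known last replication time; the function names are hypothetical and a real schedule may be less regular.

```python
from datetime import datetime, timedelta


def next_replication_after(moment: datetime, last_replication: datetime,
                           interval: timedelta) -> datetime:
    """First scheduled replication strictly after `moment`.

    Assumes `moment` is not earlier than `last_replication`.
    """
    elapsed = moment - last_replication
    cycles = elapsed // interval + 1  # whole intervals elapsed, plus one
    return last_replication + cycles * interval


def lag_site_still_intact(disaster: datetime, replication_stopped: datetime,
                          last_replication: datetime,
                          interval: timedelta) -> bool:
    """True if replication was stopped before the disaster could replicate in."""
    if disaster < last_replication:
        # The bad change predates the last replication, so it is already there.
        return False
    deadline = next_replication_after(disaster, last_replication, interval)
    return replication_stopped < deadline


# Example with made-up times: a three-day interval, last replication on 1 Jan.
last = datetime(2024, 1, 1, 0, 0)
interval = timedelta(days=3)
disaster = datetime(2024, 1, 2, 10, 0)
print(lag_site_still_intact(disaster, datetime(2024, 1, 3, 9, 0), last, interval))
print(lag_site_still_intact(disaster, datetime(2024, 1, 4, 8, 0), last, interval))
```

In this example the next replication is due on 4 January at midnight, so stopping replication on 3 January preserves the lag site, while reacting on the morning of 4 January is too late.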

Replicating from a lag site might have unrecoverable consequences:

  • Since a lag site contains out-of-date data, using it as a replication source may result in data loss depending on the amount of latency between the disaster and the last replication to the lag site.
  • If something goes wrong during recovery from a lag site, a forest recovery might be required in order to rollback the changes.

Lag sites pose security threats to the corporate environment:

  • For example, when an employee is fired from the company, his/her account is immediately deleted (or disabled) from Active Directory, but the account might still be left behind in the lag site. If the lag site domain controllers allow logons, this could potentially lead to unauthorized users with access to corporate resources during the lag site replication delay “window”.

Careful consideration must be put in configuring and deploying lag sites:

  • An Administrator needs to decide the number of lag sites to deploy in a forest. The more domains that have lag sites, the more likely one can recover from a replicated disaster. However, this would also mean increased hardware and maintenance costs.
  • An Administrator needs to decide the amount of latency to introduce. The shorter the latency, the more up-to-date and useful the data would be in the lag site. However, this would also mean that administrators must act quickly to stop replication to the lag site when a disaster occurs.

The above list is not exhaustive, and there could be other unseen problems with deploying lag sites as a disaster recovery strategy. It has always been strongly recommended that the best way to prepare for disasters such as mass deletions, mass password changes, etc. is to back up domain controllers daily and verify these backups regularly through test restorations.

Finally, keep in mind that testing your disaster recovery routine is vital both prior to beginning to rely on that routine in case of failure as well as once you begin to use it as your recovery strategy. Surprise is never good when a disaster strikes.

Here are some links to Microsoft recommended recovery steps and practices:

840001 How to restore deleted user accounts and their group memberships in Active Directory - http://support.microsoft.com/kb/840001

Useful shelf life of a system-state backup of Active Directory - http://support.microsoft.com/kb/216993

Managing Active Directory Backup and Restore - http://technet2.microsoft.com/windowsserver/en/library/5d683eeb-e76c-46e9-92f4-fcb2a10f955f1033.mspx

Step-by-Step Guide for Windows Server 2008 AD DS Backup and Recovery - http://technet.microsoft.com/en-us/library/cc771290.aspx

Active Directory Backup and Restore in Windows Server 2008 - http://technet.microsoft.com/en-us/magazine/cc462796.aspx

- Gary Mudgett

  • Thanks for the info - finally some meat on the bones around lag sites.

  • All I read here is that you need to know what you are doing and should have a clear design and operational model in mind if you use lag sites. I would argue you should know what you are doing and have a good plan if you are responsible for AD at all.

    Everything you list here can be covered if you have knowledgeable informed admins. If you have an unwitting admin, you already have a problem, maybe the lag site will help you catch the problem and eradicate it before something serious happens.

    Even the repadmin /force can be stopped dead in its tracks if necessary. The methods may or may not be supported by PSS but it doesn't mean they don't work just fine. Lots of things in the real world aren't supported by PSS... Yet.... that work just fine.

    Point: Lag sites are not guaranteed to be intact in a disaster:

    CounterPoint: Ditto for backups. You should hopefully (again not guaranteed if your admins are the idiots that couldn't properly run a lag site) have enough backups over time to go back far enough but if the issue occurred pre-TSL, you are SOL either way. A friend of mine presented to the DS PG an awesome way to poison the backups several years ago that was completely undetectable under normal circumstances. That hole has since been plugged because of that conversation but if MSFT now guarantees that backups will be intact in the event of a disaster, I would like to see that guarantee in writing. If not, this point is moot with the understanding that Lag Sites are not a COMPLETE DR solution, but could be PART of an overall solution.

    Point: Replicating from lag site might have unrecoverable consequences

    Counterpoint: And restoring from tape is also using out of date data correct? Do you have the same concern there??? Logic says you should. Also doing a schema update can do the same, should we not do those either? This is the same scare tactics used by folks in the early days of AD to warn them off from doing Schema changes. We quickly learned that if we know what we are doing and use proper precautions and procedures we will be fine. I especially enjoyed the "may have to do a forest recovery..." bit. Had that been presented to me in a meeting with MSFT in front of a customer, I likely would have been unable to control my chuckling.

    Point: Lag sites pose security threats to the corporate environment

    Counterpoint: This one gave me a good chuckle too. Ever hear of normal slow convergence across a large enterprise? Ever hear of Kerberos Tickets? At what point did Kerberos start validating if a currently unexpired ticket was tied to a disabled or deleted userid? Yes there *could* be additional issues if auth is possible through the lag site, but this is simply a design and operational criteria to take into account for lag sites as well as normal overall convergence of data churn. It could be a bad thing that happens when repl gets plugged or when a site is normally more latent than other sites or with "official" lag sites or if someone adjusts kerb ticket configuration settings. It isn't an "oh my god the sky is falling don't do lag sites because of this" item.

    Point: Careful consideration must be put in configuring and deploying lag sites:

    Counterpoint: Of course. Careful consideration must be put in configuring and deploying ANY site as well as ANY domain or ANY forest or ANY domain controller.

    You likely should have stopped with your post after stating that one week is the hardcoded upper limit on normal replication schedules. The rest of this was unhelpful and again reminded me of all of the Schema Updates are bad scare tactics that went around for the initial years of AD.

    If you wanted, you could have stated that Lag Sites need to be properly planned. They need to be properly managed. They aren't a complete DR plan but they can be part of an overall DR plan that is used for various scenarios along with tombstone reanimation, Snapshot data recovery in Windows Server 2008, and god forbid tape recovery. As a personal point of interest, I would much rather restore objects out of a lag site than from a backup file. I trust the lag sites more than I trust the backup/restore process.

    Going forward, please don't give advice based on misinformation, little information, or just plain "let's scare em" type scenarios.

    The wrap-up is that a lag site is simply a site that replicates on a longer convergence frequency than "normal sites". Possibly up to a week out of convergence. This is a fully supported configuration by MSFT. It just isn't supported as your sole Disaster Recovery solution. And it shouldn't be because it isn't a full Disaster Recovery solution.

     joe (www.joeware.net)

  • Hello Gary - good post that should allow plenty of discussions on this topic. I would say the most important statement you are making indirectly with this blog, is that the implementation of lag-sites is GENERALLY SUPPORTED.  Not recommended, but supported.

    And I totally agree that they shouldn't be leveraged and implemented by people that don't know what they're doing. Lag-Sites by no means replace normal domain controller backups (and periodic recovery testing) - they are merely one of many options to increase the speed of object recovery in case those should be required. Regardless of the processes being used, the AD administrator must know what he or she is doing when backing up and restoring objects in AD. This is especially the case in multi-domain AD forests, where by design AD lacks the capability to completely recover objects including all potential cross-domain links, which can cause a lot of pain for AD administrators.  

    Lag-sites can help reduce this pain, if the administrator knows how to leverage them correctly. Here are a few thoughts on your arguments against Lag-Sites:

    1 - Lag Sites are not guaranteed to be intact in a disaster

    Yep, that's one of the reasons why you certainly still need normal domain controller backups. However, the majority of "accidental" deletions are typically detected very fast. And the whole point of implementing lag-sites is to be able to react quickly and to be able to recover objects quickly without the need to first recover a DC from backup (or, to do it right in a multi-domain environment, recover a DC from every domain from backup). In a lag-site, all you have to do is boot the respective DC into Directory Services Restore Mode (DSRM) and run the authoritative object restores directly. There is additional work to do to fully recover cross-domain links such as memberships in local groups in another domain of the forest, but leveraging lag-site DCs from those domains (for example to check what the group memberships of a given user should be) gives admins a clear advantage over the need to first restore DCs from every domain to be able to recover those cross domain links. Naturally, other methods can also be used to ensure backup and recoverability of those links, but only relying on normal domain controller backups is actually a bad thing (in multi-domain forests).

    2 - Replicating from lag site might have unrecoverable consequences

    I do not see any value at all in this argument. The whole process of object recovery in AD relies on the use of "out-of-date data". If I first have to reboot the DC into DSRM and then recover the database to a previous version from my DC backups, I'm doing nothing else: I'm putting "out-of-date data" on the DC so that I can increase the version number on the respective objects I want to recover using the "authoritative restore" method with NTDSUTIL. The same thing is done with DCs in lag-sites, with the big difference being the time that it takes for me to get to the point that allows me to perform the auth restore: it's much faster on a lag-site DC since I don't first need to recover the DC's system state from the backup (which depending on the size of the AD database can take a long time - and this time has even increased quite a bit in Win2008 due to the changes of the underlying backup mechanisms).

    As such the risk you are stating to scare your readers "If something goes wrong during recovery from a lag site, a forest recovery might be required in order to rollback the changes" should fully apply the same way when restoring objects on a DC that was first recovered from backup - clearly in both cases admins can do stupid things, but this risk is not higher for lag-sites.

    3 - Lag sites pose security threats to the corporate environment

    As you write "If the lag site domain controllers allow logons...", this would indeed be a risk and goes back to my initial statement that admins leveraging lag-sites need to know what they're doing. If they do _and_ they monitor the lag-site DCs to ensure they stay configured appropriately, then I would say this risk is mitigated.

    4 - Careful consideration must be put in configuring and deploying lag sites

    Fully agree - and this is not necessarily a downside for lag-sites either. Careful consideration is required to plan any stable backup and recovery method for AD. Especially in multi-domain AD forests, administrators that don't sufficiently understand the AD replication and object linking model and its implications for object recovery will find themselves in a situation of potentially not being able to fully recover an object to its previous state when only relying on domain controller backups. Lag-Sites can certainly help to ease this pain.

    Clearly, for many admins that don't want to or don't care to dig down into the AD internals, Lag-Sites should not be recommended and instead the use of third party tools should be considered for AD backup and recovery support.

    I think that this blog entry - especially since it references Win2008 AD backup/recovery links - should also highlight some important changes in Win2008 with respect to recoverability of objects in AD. And I don't mean the new VSS snapshot support for AD including the capability of mounting a "previous version" of the AD database, which can also be leveraged to support the recovery of objects.

    I actually mean the fact that the AD service on Win2008 DCs can be stopped and restarted.  While it's not supported to restore the AD database in "NTDS stopped" state, I have confirmation that it IS SUPPORTED to perform an auth-restore of objects in this state.

    The question that I don't have an answer to yet is: how long is it supported to have a DC run in "NTDS stopped" state?  

    Clearly, this feature hasn't been designed to be used for lag-sites, but is there anything that speaks AGAINST using it in this fashion? Why shouldn't I be able to leave the service stopped for three days, and only start it up periodically to replicate with its partners - and if required, use it to first auth-restore objects in AD...  i.e. the new "Win2008 Lag-Site feature" ;-)

    Would be great to get your feedback on my comment - especially on the last question.  Cheers, Guido

  • Hi Joe,

    Great comments!

    To add a few thoughts here (as Gary is out for a few days; I'll let him reply in depth when he returns).

    The lag site is *not* a fully supported scenario. That is the point of this post. If you call me and my team here and ask for advice on how to best configure a lag site, we will tell you the same. 'Supported' has a very specific meaning when you talk to our product group and us - it means we exhaustively test the scenario: this is not done for lag sites. It's also why if you read our technet documentation you will not find a guide to creating lag sites.

    The other main point that Gary was trying to reach is that we have found in Support that many thousands of customers have been using Lag Sites *exclusively*. They don't use, maintain, or test their systemstate backup systems - then we work tons of cases each year where they thought that their lag sites would save them, and they did not. So this wasn't directly pulled from Gary's behind - we have 10 years of 3rd tier support cases evidence to back it up.

    And your main point is well taken - you probably will not have good backups or a good disaster recovery strategy if you're not doing your job as an admin.

    (PS: love your webpage, tools, and general AD passion)

    - Ned

  • Hi Guido,

    I'm noodling on your points, but for your direct question:

    "The question that I don't have an answer to yet is: how long is it supported to have a DC run in "NTDS stopped" state?  "

    Tombstone Lifetime. Exactly the same as how long it is supported to have a DC turned off, basically.

  • Hey Ned, glad you enjoy the utilities/site/etc. :)

    So which part of the lag site concept isn't supported?

    My understanding from speaking to various folks around MS within PSS and the PG is that what isn't supported is that a lag site be used as the sole DR recovery mechanism. Again, I fully agree with that. That is an insane position to put yourself into.

    Anyway, let's break it down to some of the various components that may or may not be used in any given lag site configuration...

    * Delayed replication sites are supported.

    * Auth restoring objects on any arbitrary DC in a domain is supported.

    * Disabling registration of domain SRV record specific DNS entries pointing to a given site is supported

    * Disabling replication entirely (or shutting DCs down) for periods not exceeding the forest TSL on a given DC or every DC in a site is supported

    I have been involved in various situations where PSS has indicated one or more of each of those be done for a given situation. Heck anyone who has been on a call with a customer and PSS in a major accidental deletion incident has likely heard "has the deletion replicated to all DCs in the domain?" and if not that is followed by "stop replication to that DC immediately and let's restore the objects from there". I have heard a multitude of stories from the PG that started that way. Every time that is done it is acknowledgement of the concept of the lag site.

    Will PSS help someone set up a lag site if someone asks for that specific thing. Sounds like no and I can understand the reticence to do so unless you have a thorough understanding of the overall DR plan/process for a given customer. Will PSS help a customer set up a site to replicate on a schedule that is measured in days instead of hours or minutes... Absolutely, I have talked to customers who have been walked through the process by PSS. Will PSS help a customer auth restore an object from any arbitrary DC? Absolutely, have seen it with my own eyes. Ditto for the other items.

    What seems to be the issue PSS has is the intent behind the uses of these features in the technology, not the use of the features themselves.

    The comment that "many thousands of customers" have been using lag sites exclusively scares me. That would seem to me that someone at MSFT isn't getting the concepts of how backup/restore works in AD out there very well. I am also just surprised to hear that number. I work in a very large services org for my full time job and have dealt with many large customers over the years and have seen very few instances of lag sites that I wasn't involved in some way in setting up. Smaller companies never seem that interested due to the hardware and OS licensing investment.

    Not to bust your chops but I think the 10 years of cases is a bit of an exaggeration Ned. We are on the 7th year of truly popular use of AD (though some of us had it in large scale Fortune x if not Fortune xx production as early as 99 or 2000) and lag sites didn't really start catching mainstream attention until several years into AD being in production. Some of us picked up on the idea that a latent (non-converged) site (which is what those of us who were publicly discussing it called it initially) could be used for this type of recovery but the people talking about it were people who could work it out on their own and also understood the repercussions. I recall the first time I heard the "lag site" moniker was at one of the DEC conferences four or five years ago at which point the concept started to explode.

    Anyway, people do a lot of stupid things in their production ADs. Lag sites are a relatively painless and innocuous item. I am far more worried and have seen far more issues with DC virtualization than lag sites though I do recommend lag sites be running on virtual machines when I recommend lag sites. ;)  And yes, I do officially recommend them to companies. I also give them the caveats of when it is and isn't good to use and make sure they fully realize it is a mitigator, not a total DR solution.

    Let's face it, setting up a lag site isn't rocket science. If someone can't work it out themselves, they likely shouldn't be doing it for a variety of reasons. Being who I am I would also go as far as to say they probably shouldn't be running AD at all but that's just me. No one who has to call PSS to ask how it should be set up, should be doing it.

      joe

  • "My understanding from speaking to various folks around MS within PSS and the PG is that what isn't supported is that a lag site be used as the sole DR recovery mechanism. Again, I fully agree with that. That is an insane position to put yourself into."

    That's the key unsupported config - while a multi-master replication system naturally has plenty of unavoidable lag sites, we don't want you to use that as your disaster mechanism exclusively (either intentionally or not). You're right, we even reference in a KB article how to use a latent site for an authrestore operation - the key point being that we didn't assume it was intentional, just that it was a naturally latent site.

    "Will PSS help someone set up a lag site if someone asks for that specific thing. Sounds like no and I can understand the reticence to do so unless you have a thorough understanding of the overall DR plan/process for a given customer. Will PSS help a customer set up a site to replicate on a schedule that is measured in days instead of hours or minutes... Absolutely, I have talked to customers who have been walked through the process by PSS. Will PSS help a customer auth restore an object from any arbitrary DC? Absolutely, have seen it with my own eyes. Ditto for the other items. "

    100% correct - there's 'Officially Supported' and there's 'PSS will bend over backwards to help any customer because we want to do what's right to fix your problem Supported'. :) That's the mantra I preach to my engineers, at least.

    "The comment that "many thousands of customers" have been using lag sites exclusively scares me. That would seem to me that someone at MSFT isn't getting the concepts of how backup/restore works in AD out there very well. I am also just surprised to hear that number. I work in a very large services org for my full time job and have dealt with many large customers over the years and have seen very few instances of lag sites that I wasn't involved in some way in setting up. Smaller companies never seem that interested due to the hardware and OS licensing investment. "

    We actually see it just as often in smaller businesses as larger ones. They have a remote site (with a set-aside DC in its own logical site) that they use in lieu of backups. It often becomes apparent that they did this because they previously had AD data lost and had *no backups* - this was their solution. Crummy.

    "Not to bust your chops but I think the 10 years of cases is a bit of an exaggeration Ned"

    I probably rounded up. :) It's closer to 7-8 years where it's been common to see, but limited usage was around even in the 2000 beta.

    - Ned

  • Thanks Ned for the details on support for AD stopped state ("Tombstone Lifetime. Exactly the same as how long it is supported to have a DC turned off, basically") - which makes sense.

    And clearly my question didn't really expect a DC to be in this mode for this long - however, using AD stopped state would allow me to "protect" this DC from forced replication much better than I could do so today with Win2000/2003 - and it would also be rather secure as a DC running AD in stopped state can't be used directly to authenticate users.

    So using a "Lag Site" - or should it now be called a "Stopped Site" :) - is certainly worth considering for a Win2008 deployment.

    Again, I'm not saying that this would be the "only" AD DR solution, but one of the pieces in the puzzle for  a complete solution.

    /Guido

  • I'd agree - definitely part of the puzzle. :)

    Does everyone here now know that we have publicly started talking about AD Recycle Bin? It will make this whole discussion elementary.

    - Ned

  • yep - very much aware of the upcoming AD Recycle Bin feature and glad we can talk about it now publicly. But it'll still be a few years out until companies will have reached the Windows 2008 R2 Forest Functional level, which I understand is a requirement to leverage this very cool new feature.

    For large companies this is easily 1-2+ years post release of R2. So 2-3+ years to go.  And in this time they still need a recovery solution - potentially using Lag Sites ;-)

    /Guido

  • The recycle bin will be a great tool to recover from delete/mass delete scenarios with all objects and their attributes intact. On the other hand it won't help you in any way if someone mistakenly modifies attributes while still retaining the objects themselves.

    Leveraging AD snapshots to address the second scenario is one way to go. One example of this approach can be found here: http://lindstrom.nullsession.com/?page_id=11