Troubleshooting replication with ReplDiag.exe [part 1 of 4]

10/13/2010

Hi, Rob here!

So you think there are lingering objects in your Active Directory forest and you need to get rid of them quickly!? If so, read on …

This blog post is the first in a 4 part series, so stay tuned over the next few months for the remaining 3 parts, which cover forest wide cleanup of lingering objects, cleaning one NC, and a few tips and tricks.

A brief bit of history and background…

For about a decade now, since the introduction of Windows 2000 and Active Directory, both CSS and PFE have been chasing and cleaning lingering objects. A lot of fiddling with ways to make the process easier has come to an end for now … Fear no more, ReplDiag is here!

ReplDiag, developed by Ken Brumfield with testing and tweaking done by numerous people in the DS space, is the result of both internal and customer involvement and work. Today it’s a well-known tool in the community for addressing the problem of lingering objects quickly and efficiently. As great as that is, the tool also has logic built in for troubleshooting replication issues.

Although there are great write ups of the phenomenon, the officially supported methods for attacking the problem are very complex and prone to errors, due to the complexity inherent in AD replication topologies. That is the reason ReplDiag was born. It’s here to help you! At the bottom of this write up, you’ll find information on the problem of lingering objects and ways to tackle the issue using other methods. Those methods aside, a few of us remain who are all about getting the job done quickly, efficiently and with less room for error. Brave, no… Smart, yes.

Currently, ReplDiag is available only via CodePlex. For those of you who don’t know CodePlex, it is Microsoft’s open source software depot, maintained by the community for the community! The disclaimer is here, and the philosophy remains just what it is: open source. This means a community of internal and external developers and testers helps maintain the repository. At CodePlex, you will find ReplDiag, as well as an array of other useful tools for everyday tweaking and fixing, free of charge.

Let’s dig in …

Before jumping in blindly, you’re going to want to troubleshoot first. Below are some approaches to how you can begin to use (and understand) the tool’s output.

ReplDiag is all about helping to diagnose replication topology stability issues, preventing them where possible, and remediating the consequences resulting from those issues. By design, ReplDiag ONLY looks at the forest as a whole, because to ensure the integrity of the directory, replication must be working across the entire forest. While it has often been brought up that this is not necessarily how many organizations are structured for administering AD, the administrative boundaries organizations impose DO NOT have anything to do with how the technology works. Thus, to keep the technology working, organizations need to figure out some way of getting along, at least well enough to keep replication working. This may also include punching through some firewalls for administrative purposes so that a holistic view of AD topology stability can be gained.

ReplDiag analyzes a number of different error states and classifies each according to whether or not it affects the stability of the AD infrastructure as a whole. Below is the list of potential errors identified, what they mean, and why they are important.

SRVR_ACCESS (Stability Impacting) – If a server is down or offline, replication into that box is failing. While in the short term this does not affect replication consistency, it does affect ReplDiag’s ability to build the complete topology and clean lingering objects, because data cannot be collected from the unavailable box.
LINK_FAILURE (Not Stability Impacting) – This means the link is failing for some reason. This is really no different from the data shown by “repadmin /showrepl” and thus doesn’t deserve additional illumination.
LINK_NEVER_SUCCEEDED_IN (Not Stability Impacting) – This indicates that this link has never completed its initial inbound replication cycle. One or more of these does not indicate a stability impacting issue unless ALL links for the naming context have not completed inbound replication, as identified in the next error described. It is important to note that inbound replication is single threaded, so use “repadmin /queue” to diagnose whether or not there is a backlog of inbound connections. Another common scenario here is firewall rules preventing connectivity to the desired servers.
NOTE: When querying a link in this state, the error code returned is “0” (ERROR_SUCCESS), so the problem can often be missed until someone notices that the “Last Successful Sync” timestamp is “(null)” in repadmin.
NC_NEVER_COMPLETED_INBOUND (Stability Impacting) – This instance of the Naming Context has no inbound links that have succeeded. This means that said NC on said server is getting NO updates even though no errors are being reported (see the previous item to understand why no errors are being reported). The consequence is that the data in the naming context is drifting from the rest of the environment. Since, as outlined previously, there is no data about the point in time when the NC was last updated, it is difficult to tell when the NC last received any updates, and particularly whether or not that was within the tombstone lifetime (TSL). On Windows Server 2003 or later, last update times can be further investigated via the “repadmin /showutdvec” command.
MISSING_INBOUND_REPL (Stability Impacting) – This NC has NO inbound replication connections. The consequences and challenges of this are identical to NC_NEVER_COMPLETED_INBOUND.
MISSING_OUTBOUND_REPL (Stability Impacting) – This is the same as MISSING_INBOUND_REPL except it means that no one is pulling from said instance of the naming context.
NC_NEVER_COMPLETED_OUTBOUND (Stability Impacting) – This is the same as NC_NEVER_COMPLETED_INBOUND except it means that no one is pulling from said instance of the naming context.
SITE_MISSING_INBOUND (Stability Impacting) – While replication may be occurring between all DCs in the site, there is no connection to get data into the specified site from another site. Essentially the site has become an island.
SITE_MISSING_OUTBOUND (Stability Impacting) – Same as SITE_MISSING_INBOUND except that data isn’t getting out of the site.
SINGLE_WRITABLE_INSTANCE_OF_NC (Not Stability Impacting) – Sometimes organizations end up with a single DC for a domain or, in the case of DNS and other Application partitions, one copy of the writable partition. While this is not a problem at this point in time, if that writable instance is lost, the only way to make changes to the data in that partition is lost. Fault Tolerance/DR strategies for this are important. This is really just to raise awareness, as it is sometimes difficult to keep track of things like this in large and complex multi-domain forests.
NO_WRITABLE_INSTANCES_OF_THIS_PARTITION (Stability Impacting) – Either this is the consequence of the previous error, or replication within the configuration NC is not converging. Mostly this has been observed because a domain was removed from the environment for one reason or another and the deletion of said domain has not propagated throughout the forest. This means that Global Catalogs have data from a domain that no longer exists.
NC_EXISTS_IN_ONE_SITE_ONLY (Not Stability Impacting) – This can be ignored for any environment where multi-site fault-tolerance is not required. For the rest, this just raises awareness as sometimes this is difficult to keep track of for larger and more complex AD deployments.
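
To make the relationship between the link-level and NC-level states above a bit more concrete, here is a minimal Python sketch. It is not ReplDiag’s code: the `InboundLink` record and its field names are hypothetical, invented purely for illustration. It shows why a status of 0 cannot be trusted on its own, and how several never-succeeded links roll up into NC_NEVER_COMPLETED_INBOUND.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class InboundLink:
    """Hypothetical record for one inbound replication link (not a ReplDiag type)."""
    source_dc: str
    last_status: int                  # 0 means ERROR_SUCCESS
    last_success: Optional[datetime]  # None models the "(null)" timestamp


def classify_link(link: InboundLink) -> str:
    """Return the link-level state, using the names from the list above."""
    if link.last_success is None:
        # The status may still be 0 here, which is why this case is easy to miss.
        # Assumption of this sketch: "never succeeded" wins over any error code.
        return "LINK_NEVER_SUCCEEDED_IN"
    if link.last_status != 0:
        return "LINK_FAILURE"
    return "OK"


def classify_nc(links: List[InboundLink]) -> str:
    """Roll the link states up to one instance of a naming context."""
    if not links:
        return "MISSING_INBOUND_REPL"
    if all(classify_link(link) == "LINK_NEVER_SUCCEEDED_IN" for link in links):
        return "NC_NEVER_COMPLETED_INBOUND"
    return "OK"


# Two links, both reporting status 0, neither has ever completed a sync.
links = [InboundLink("DC2", 0, None), InboundLink("DC3", 0, None)]
print(classify_nc(links))  # prints NC_NEVER_COMPLETED_INBOUND
```

Checking only `last_status` would report both links in the example as healthy, which is exactly the trap described in the NOTE above.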

As can be seen from the above, there are several error states that prevent data from getting through the whole organization, none of which show up just by looking for error codes in “repadmin /showrepl”. Specifically, any state where the absence of a link is what prevents replication cannot be detected just by looking for failures on existing links.
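
The “absence of a link” problem can be sketched in a few lines of Python. Again, this is only an illustration under stated assumptions and not how ReplDiag is implemented: the function name, the input shapes (a set of DCs hosting one NC, a DC-to-site map, and a list of `(source, destination)` links) and the sample server names are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def find_missing_links(
    hosts: Set[str],
    site_of: Dict[str, str],
    links: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Report states caused by links that do not exist, for one naming context.

    hosts:   DCs holding an instance of the NC (hypothetical input)
    site_of: site name for every DC in hosts
    links:   existing replication links as (source_dc, destination_dc)
    """
    findings = []
    has_inbound = {dst for _, dst in links}
    has_outbound = {src for src, _ in links}
    for dc in sorted(hosts):
        if dc not in has_inbound:
            findings.append(("MISSING_INBOUND_REPL", dc))
        if dc not in has_outbound:
            findings.append(("MISSING_OUTBOUND_REPL", dc))

    # A site is an island in one direction when no link crosses its boundary.
    cross_site = [(s, d) for s, d in links if site_of[s] != site_of[d]]
    sites = {site_of[dc] for dc in hosts}
    sites_in = {site_of[d] for _, d in cross_site}
    sites_out = {site_of[s] for s, _ in cross_site}
    if len(sites) > 1:
        for site in sorted(sites):
            if site not in sites_in:
                findings.append(("SITE_MISSING_INBOUND", site))
            if site not in sites_out:
                findings.append(("SITE_MISSING_OUTBOUND", site))
    return findings


# Every DC has inbound and outbound partners, and no link is failing,
# yet the only cross-site link runs from Branch to HQ.
site_of = {"DC1": "HQ", "DC2": "HQ", "DC3": "Branch", "DC4": "Branch"}
links = [("DC1", "DC2"), ("DC2", "DC1"), ("DC3", "DC4"), ("DC4", "DC3"), ("DC3", "DC1")]
print(find_missing_links(set(site_of), site_of, links))
# prints [('SITE_MISSING_INBOUND', 'Branch'), ('SITE_MISSING_OUTBOUND', 'HQ')]
```

Nothing in the sample data would show up as an error on an existing link; the findings come entirely from links that are not there.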

All of the above stability impacting issues need to be fixed before proceeding with any sort of remediation of replication divergence related issues; otherwise the issues will most likely recur.
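
As a rough illustration of that “fix first, clean later” rule, the sketch below encodes the classification from the list in this post as a lookup table and refuses to proceed while any stability impacting finding remains. The table values come straight from the list; the function names and the idea of passing findings around as plain strings are assumptions of this sketch, not part of ReplDiag.

```python
from typing import Iterable, List

# Classification exactly as given in the list of error states in this post.
STABILITY_IMPACTING = {
    "SRVR_ACCESS": True,
    "LINK_FAILURE": False,
    "LINK_NEVER_SUCCEEDED_IN": False,
    "NC_NEVER_COMPLETED_INBOUND": True,
    "MISSING_INBOUND_REPL": True,
    "MISSING_OUTBOUND_REPL": True,
    "NC_NEVER_COMPLETED_OUTBOUND": True,
    "SITE_MISSING_INBOUND": True,
    "SITE_MISSING_OUTBOUND": True,
    "SINGLE_WRITABLE_INSTANCE_OF_NC": False,
    "NO_WRITABLE_INSTANCES_OF_THIS_PARTITION": True,
    "NC_EXISTS_IN_ONE_SITE_ONLY": False,
}


def blocking_findings(findings: Iterable[str]) -> List[str]:
    """Return the findings that must be fixed before any cleanup (hypothetical helper)."""
    return [name for name in findings if STABILITY_IMPACTING[name]]


def safe_to_remediate(findings: Iterable[str]) -> bool:
    """True only when no stability impacting finding is left."""
    return not blocking_findings(findings)


found = ["LINK_FAILURE", "SITE_MISSING_INBOUND", "NC_EXISTS_IN_ONE_SITE_ONLY"]
print(blocking_findings(found))  # prints ['SITE_MISSING_INBOUND']
print(safe_to_remediate(found))  # prints False
```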

This concludes the first part of the series. See you next month for part 2 of the series on ReplDiag.
