• Cleaning lingering objects across the forest with ReplDiag.exe [Part 2 of 4]

     

    Hi, Rob here, with part 2 of 4 in the series on ReplDiag from CodePlex. Last month we looked at replication troubleshooting; now we can commence actual forest-wide cleanup of lingering objects. Note the phrase "forest wide": we will look at cleaning a single partition (or NC, "naming context") in part 4 of the series.

    Before we get started, please note that Enterprise Administrator privileges are required to perform cleanup. Also, the forest must be running Windows 2003 or later on all DCs, and they must all be online. Finally, the .NET Framework v2.0 SP1 must be installed on the machine the cleanup will be run from; that machine does not have to be a domain controller (running from one is fine, but not required).

    Let’s dig in… a simple ReplDiag /? to display the syntax is a good place to start:

    -------------------------------------------------------------------------------------------------------------------------------------------------------------------

    Version:  2.0.3397.24022

    Command Line Options:  ReplDiag [/Save] [/CheckForStableReplTopology] [/RemoveLingeringObjects] [/ImportData:<FileName.XML>] [/ShowTestCases] [/OverrideDefaultReferenceDC:"dc=namingcontext,dc=com":domainController.namingcontext.com]

    /UseRobustDCLocation -Query each and every DC for a list of DCs in the forest.  Ensures replication instability does not cause any to be missed.
    /Save -Save out the data from the current environment to XML.  File is named "ReplicationData.xml" and is located in the current directory.
    /ImportData -Import the XML that was saved during a prior execution of this utility.  Run one of the other options to do something with the data.
    /ShowTestCases -Show detail about test cases.

    Lingering Object Cleanup:
    /RemoveLingeringObjects -Use the current forest topology to clean all the NCs in the forest. WILL NOT CLEAN WINDOWS 2000 SYSTEMS!!!
    /AdvisoryMode -Check for lingering objects only, do not clean. Must be used with /RemoveLingeringObjects.
    /OverrideDefaultReferenceDC -Specify the reference DC for a naming context when removing lingering objects; can be used multiple times for different NCs. Only functional if using /RemoveLingeringObjects.
    /OutputRepadminCommandLineSyntax -Output the command line syntax for repadmin. Only active in conjunction with /RemoveLingeringObjects.

    Example syntax:
    ReplDiag /Save
    - Collect the AD replication topology from the environment and save it.
    ReplDiag /ImportData:"ReplicationData.xml"
    - Load in previously collected data and check replication status.
    ReplDiag /RemoveLingeringObjects /OverrideDefaultReferenceDC:"cn=Configuration,dc=contoso,dc=com":dc1.contoso.com /OverrideDefaultReferenceDC:"dc=contoso,dc=com":dc2.contoso.com

    -------------------------------------------------------------------------------------------------------------------------------------------------------------------

    You will notice that ReplDiag currently has several switches. The most important for now are /RemoveLingeringObjects and /AdvisoryMode. The remaining switches enable more advanced functionality, to be discussed in parts 3 and 4 of this multi-part blog.

    The output will look similar to this:

    C:\ReplDiag>ReplDiag.exe /removelingeringobjects /advisorymode
    Replication topology analyzer.  Written by kenbrumf@microsoft.com
    Version:  2.0.3397.24022
    Command Line Switch:  /removelingeringobjects
    Command Line Switch:  /advisorymode

    Enumerating Forest:  contoso.com
            Forest Functional Level:  Windows 2000
    Enumerating Domain:  child.contoso.com  - Found 1 DCs.
            Domain Functional Level:  Windows 2003
    Enumerating Domain:  contoso.com        - Found 1 DCs.
            Domain Functional Level:  Windows 2003
    Enumerating Domain:  fabrikam.net
    Data collection duration:  0 seconds

    Number Complete,Status,Server Name,Naming Context,Reference DC,Duration,Error Code,Error Message
    1,Success,phl-dc-01.child.contoso.com,"cn=configuration,dc=contoso,dc=com",{14f1ed08-a30a-4400-9851-8745da037289},0h:0m:0s,0,
    Reference NCs cleaned in 0h:0m:0s.  Cleaning everything else against reference NCs.
    2,Success,ny-dc-01.contoso.com,"dc=child,dc=contoso,dc=com",{2f075a9d-c8ae-4ab0-8fab-38c53d542414},0h:0m:0s,0,
    3,Success,ny-dc-01.contoso.com,"cn=configuration,dc=contoso,dc=com",{2f075a9d-c8ae-4ab0-8fab-38c53d542414},0h:0m:0s,0,
    4,Success,phl-dc-01.child.contoso.com,"dc=contoso,dc=com",{14f1ed08-a30a-4400-9851-8745da037289},0h:0m:0s,0,
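    Since the second section of the output is CSV, it is easy to post-process. Here is a minimal sketch, assuming you redirected the run to a file; the sample rows below, including the failure row and its error message, are made up for illustration:

    ```shell
    # Filter the CSV section of an advisory-mode run for rows that did not
    # succeed. Sample data is illustrative; in practice, redirect the tool:
    #   ReplDiag.exe /removelingeringobjects /advisorymode > run.log
    cat > run.log <<'EOF'
    1,Success,phl-dc-01.child.contoso.com,"cn=configuration,dc=contoso,dc=com",{14f1ed08-a30a-4400-9851-8745da037289},0h:0m:0s,0,
    2,Failure,ny-dc-02.contoso.com,"dc=contoso,dc=com",{2f075a9d-c8ae-4ab0-8fab-38c53d542414},0h:0m:1s,5,Access is denied.
    EOF

    # Numbered result rows whose Status field (column 2) is not "Success".
    awk -F',' '$1 ~ /^[0-9]+$/ && $2 != "Success"' run.log
    ```

    This keeps only rows that need attention, which saves a lot of scrolling in larger forests.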

    The tool behaves in exactly the same way as using repadmin /removelingeringobjects, which calls DsReplicaVerifyObjects. The logic used to build the topology to clean the environment is exactly the same as the steps outlined in Glenn LeCheminant’s blog here.

    Often it is asked how long this will take to run. Unfortunately, that depends on the number of writable partition instances, the size of the partitions, and the speed of the links connecting the DCs. The tool optimizes this as much as possible by multi-threading the cleanup actions; this is an additional advantage of using ReplDiag over doing the work manually, as it cleans multiple partitions at one time. Because the reference DC needs to be scrubbed first, the tool initially runs one thread per partition to clean the reference DC. Once that is complete for ALL partitions, it runs multiple threads concurrently for all partitions against all DCs until everything is cleaned.

    Each DC will log a series of NTDS Replication event IDs that tell you what the lingering object cleanup API (the DsReplicaVerifyObjects function) is doing. This is also why the tool does not support Windows 2000: the API was not yet available in that now-unsupported operating system. We recommend upgrading, even if you do have lingering objects, prior to enabling strict replication, to help clean up the entire forest.

    In this particular run, Event IDs 1938 and 1942 were logged, reporting what took place. You will have to scour the event logs with your favorite tool to aggregate the data and determine where objects were found, alongside the second section of the output, which reports the status, server name, NC, reference DC used, duration of the check, and any error codes or messages for each NC on each DC scrubbed. In large environments the number of lingering objects cleaned may be significant and can overrun the Directory Service logs; if collecting this information is important, it is a good idea to increase the size of the logs beforehand.
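    One way to do that aggregation: export the Directory Service log from each DC (for example with wevtutil) and filter for the cleanup event IDs. The sketch below works on a made-up CSV export; the event messages shown are paraphrased, not the exact text Windows logs:

    ```shell
    # Filter an exported Directory Service event log for the lingering object
    # cleanup events discussed above (1938 and 1942 in this run).
    cat > ds-events.csv <<'EOF'
    ny-dc-01,1938,Advisory-mode lingering object removal started.
    ny-dc-01,2087,Unrelated replication event.
    ny-dc-01,1942,Advisory-mode lingering object removal finished.
    EOF

    # Keep only the cleanup-related event IDs (column 2).
    awk -F',' '$2 == 1938 || $2 == 1942' ds-events.csv
    ```

    Run the same filter over the export from every DC and you have a quick forest-wide picture of where cleanup ran.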

    Now that we have kicked off a “safe pass”, known as an advisory mode run, we can begin the actual cleanup by simply removing the /advisorymode switch. The process is similar to, but much more automated and simpler than, using the repadmin command to perform the work. However, there may be cases where a bit of further interaction by the administrator is necessary, hence the additional switches and perhaps even a few hidden ones too. We’ll discuss more advanced use of the tool in part 3 of 4 in this series.

    So at this point, you may now proceed to test replication again. This can be done using ReplDiag as described in part 1 of this series.

    Next month, we’ll take a look at part 3 in the series: “Why does ReplDiag error out with the message that the topology isn’t stable?”

    See you soon!

  • Troubleshooting replication with ReplDiag.exe [part 1 of 4]

     

    Hi, Rob here!

    So you think there are lingering objects in your Active Directory forest and you need to get rid of them quickly!? If so, read on …

    This blog post is a 4-part series, so stay tuned over the next few months for the remaining 3 parts, including forest-wide cleanup of lingering objects, cleaning one NC, and a few tips and tricks.

    A brief bit of history and background…

    For about a decade now, since the introduction of Windows 2000 and Active Directory, both CSS and PFE have been chasing and cleaning lingering objects. The fiddling with ways to make the process easier has come to an end, for now at least … fear no more, ReplDiag is here!

    ReplDiag, developed by Ken Brumfield with testing and tweaking by numerous people in the DS space, is the result of both internal and customer involvement and work. To date, it is a well-known tool in the community for addressing the problem of lingering objects quickly and efficiently. Beyond cleanup, the tool also has logic built in for troubleshooting replication issues.

    Although there are great write-ups of the phenomenon, the officially supported methods of attacking the problem are complex and prone to error due to the complexity inherent in AD replication topologies. That is why ReplDiag was born: it’s here to help you! At the bottom of this write-up, you’ll find information on the problem of lingering objects and other ways to tackle the issue. Despite the latter, a few of us remain who are all about getting the job done quickly, efficiently, and with less room for error. Brave? No. Smart? Yes.

    Currently, ReplDiag is available only via CodePlex. For those of you who don’t know what CodePlex is: it is Microsoft’s open source software depot, maintained by the community for the community! The disclaimer is here, and the philosophy remains just what it is, open source. This means a community of internal and external developers and testers helps maintain the repository. On CodePlex you will find ReplDiag, as well as an array of other useful tools for everyday tweaking and fixing, free of charge.

    Let’s dig in …

    So now, before jumping in blindly, you’re going to want to troubleshoot first, and below are some approaches on how you can begin to utilize (and understand) the tool’s output.

    ReplDiag is all about helping to diagnose replication topology stability issues, preventing them where possible, and remediating the consequences resulting from those issues. By design, ReplDiag ONLY looks at the forest as a whole, because to ensure the integrity of the directory, replication must be working across the entire forest. While it has often been pointed out that this is not necessarily how many organizations are structured for administering AD, the administrative boundaries organizations impose DO NOT change how the technology works. Thus, to keep the technology working, organizations need to figure out some way of getting along, at least well enough to keep replication working. This may also include punching through some firewalls for administrative purposes so that a holistic view of AD topology stability can be gained.

    ReplDiag analyzes for a number of different error states and classifies each by whether or not it affects the stability of the AD infrastructure as a whole. Below is the list of potential errors identified, what they mean, and why they are important.

     

    • SRVR_ACCESS (Stability impacting) – If a server is down or offline, replication into that box is failing. While in the short term this does not affect replication consistency, when using the features of ReplDiag to clean lingering objects the ability to build the complete topology and clean is affected because the box is unavailable to collect data from.
    • LINK_FAILURE (Not Stability Impacting) – This means the link is failing for some reason. This is really no different than the data from “repadmin /showrepl” and thus doesn’t deserve additional illumination.
    • LINK_NEVER_SUCCEEDED_IN (Not Stability Impacting) – This indicates that this link has never completed its initial inbound replication cycle. One or more of these does not indicate a stability-impacting issue unless ALL links for the naming context have not completed inbound replication, as identified in the next error described. It is important to note that inbound replication is single threaded; as such, use “repadmin /queue” to diagnose whether or not there is a backlog of inbound connections. Another common scenario here is firewall rules preventing connectivity to the desired servers.
      NOTE: When querying a link in this state the error code returned is “0” (ERROR_SUCCESS), and this can often be missed until someone notices that the “Last Successful Sync” timestamp is “(null)” in repadmin.
    • NC_NEVER_COMPLETED_INBOUND (Stability Impacting) – This instance of the naming context has no inbound links that have ever succeeded. This means that said NC on said server is getting NO updates even though no errors are being reported (see the previous bullet to understand why no errors are being reported). The consequence is that the data in the naming context is drifting from the rest of the environment. Since, as outlined previously, there is no data about the point in time when the NC was last updated, it is difficult to tell when the NC last received any updates and, in particular, whether or not that was within TSL. On Windows 2003 or later, last update times can be further investigated via the “repadmin /showutdvec” command.
    • MISSING_INBOUND_REPL (Stability Impacting) – This NC has NO inbound replication connections. The consequences and challenges of this are identical to NC_NEVER_COMPLETED_INBOUND.
    • MISSING_OUTBOUND_REPL (Stability Impacting) – This is the same as MISSING_INBOUND_REPL except it means that no one is pulling from said instance of the naming context.
    • NC_NEVER_COMPLETED_OUTBOUND (Stability Impacting) – This is the same as NC_NEVER_COMPLETED_INBOUND except it means that no one is pulling from said instance of the naming context.
    • SITE_MISSING_INBOUND (Stability Impacting) – While replication may be occurring between all DCs in the site, there is no connection to get data into the specified site from another site. Essentially the site has become an island.
    • SITE_MISSING_OUTBOUND (Stability Impacting) – Same as SITE_MISSING_INBOUND except that data isn’t getting out of the site.
    • SINGLE_WRITABLE_INSTANCE_OF_NC (Not Stability Impacting) – Sometimes organizations end up with a single DC for a domain or, in the case of DNS and other application partitions, one copy of the writable partition. While this is not a problem at that point in time, if that writable instance is lost, the only way to make changes to the data in that partition is lost with it. Fault tolerance/DR strategies for this are important. This is really just to raise awareness, as it is sometimes difficult in large and complex multi-domain forests to keep track of things like this.
    • NO_WRITABLE_INSTANCES_OF_THIS_PARTITION (Stability Impacting) – Either the consequence of the previous error, or replication within the configuration NC is not converging. Mostly this has been observed because a domain was removed from the environment for one reason or another and the deletion of said domain has not propagated throughout the forest. This means that Global Catalogs have data from a domain that no longer exists.
    • NC_EXISTS_IN_ONE_SITE_ONLY (Not Stability Impacting) – This can be ignored for any environment where multi-site fault-tolerance is not required. For the rest, this just raises awareness as sometimes this is difficult to keep track of for larger and more complex AD deployments.

    As can be seen from the above, there are several error states that can prevent data from getting through the whole organization, none of which show up just by looking for error codes in “repadmin /showrepl”. Specifically, anything where the absence of a link is what is preventing replication cannot be detected just by looking for failures on existing links.

    All of the above stability-impacting issues need to be fixed before proceeding with any sort of remediation of replication divergence issues; otherwise the issues will most likely recur.
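    The stability classifications above can be captured in a small lookup, which is handy when post-processing saved output. This helper is purely illustrative, based on the list above, and is not part of ReplDiag:

    ```shell
    # Classify a ReplDiag error state as stability impacting or not,
    # per the list above. Illustrative helper only.
    classify() {
      case "$1" in
        SRVR_ACCESS|NC_NEVER_COMPLETED_INBOUND|NC_NEVER_COMPLETED_OUTBOUND|\
        MISSING_INBOUND_REPL|MISSING_OUTBOUND_REPL|\
        SITE_MISSING_INBOUND|SITE_MISSING_OUTBOUND|\
        NO_WRITABLE_INSTANCES_OF_THIS_PARTITION)
          echo "stability impacting" ;;
        LINK_FAILURE|LINK_NEVER_SUCCEEDED_IN|\
        SINGLE_WRITABLE_INSTANCE_OF_NC|NC_EXISTS_IN_ONE_SITE_ONLY)
          echo "not stability impacting" ;;
        *)
          echo "unknown state" ;;
      esac
    }

    classify SRVR_ACCESS     # prints: stability impacting
    classify LINK_FAILURE    # prints: not stability impacting
    ```

    With this in hand, a grep over collected results quickly separates what blocks remediation from what is merely worth knowing about.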

    This concludes the 1st in the series. See you next month for part 2 of the series on ReplDiag.

  • Why does ReplDiag.exe error out with the message that the topology isn’t stable? [Part 3 of 4]

    Hi, Rob here, fresh for 2011. Apologies for the late post; it’s a new year, and I’m coming off vacation and getting back into the swing of things. Hey, it’s CES this week too, so things are busy and we’re all dreaming about the new gadgets! Be sure to check out some of our announcements here. Alright, let’s start with the 3rd part in the 4-part series. Last month we looked at cleaning lingering objects across the entire forest. But wait! What if you didn’t get that far? What if the topology was reported to be unstable? What now? To be clear, not all topologies with lingering objects are unstable, but yours may be, so let’s take a look at the definition first.

    Unstable vs. stable

    Dictionary definition: sta·ble (there are many meanings, some of which I’ve excluded purely for comedy reasons, as in “stable administrator”)

    –adjective, -bler, -blest.

    1. not likely to fall or give way, as a structure, support, foundation, etc.; firm; steady.

    2. able or likely to continue or last; firmly established; enduring or permanent: a stable government.

    3. resistant to sudden change or deterioration: A stable economy is the aim of every government.

    4. steadfast; not wavering or changeable, as in character or purpose; dependable.

    If we were to put a percentage on the number of stable environments out there, I’d say about 90% are stable in my experience. But what does ReplDiag actually look for?

    When we talk about an environment that is stable what we are looking for is one where replication of an object from any DC to any other DC that may host the object (this includes if it is resident in a Read-Only Global Catalog partition) can occur within TSL. There are several broken replication scenarios which may cause this degraded state to occur. As we move into the discussion, keep in mind that all replication is pull based and the topology is built on a per Naming Context basis. So when we talk about stability here, one NC could be stable (i.e. the Europe domain) and one NC could be unstable (i.e. the North America domain).

    As a result of these concerns, the topology has to be stabilized and time given to allow replication to converge. If this doesn’t happen, there are 3 consequences: current replication issues will continue to cause inconsistent views of the directory, new lingering objects will continue to be generated, and we can’t validate good versus bad data. Thus, cleaning existing lingering objects is an effort in futility until the replication is fixed so that notifications of deletions can converge across the DCs going forward.

    · Scenario 1 – The DC has no inbound replication connections for a given NC. This means the DC has no peers to pull its updates from, so it will get neither new objects, updates to existing objects, nor deletions of objects. This has to be fixed by homing the administration tools (i.e. the Sites and Services MSC) to the DC and adding a connection to another DC.
    Note: This is probably the second most common problem scenario behind no replication across site boundaries. Though this usually happens when there is one instance of a partition in a site and site connectivity isn’t set up properly.

    · Scenario 2 – There are no outbound connections for a given NC. This means that no other DC in the environment is pulling changes from said DC. This has to be fixed by homing the administration tools to ANY other DC in the environment that is replicating properly and setting up a connection to said DC. As before, this usually happens in the scenario where there is one instance of the partition in a site and site connectivity isn’t set up properly.

    · Scenario 3 – There is no inbound replication in to a site for a given NC. This is very similar to Scenario 1, with the exception that if there are multiple DCs in a site they may be replicating with each other for a specified partition, but are not sharing that data with any DCs outside of the site. The fix is very similar to the fix for Scenario 1, but any DC in said site can have a connection to a DC outside the site.
    Note: This is probably the most common trouble scenario, in part because it includes all of Scenario 1 where there is one DC in the site. This is entirely due to site connectivity configuration issues.

    · Scenario 4 – There are no outbound connections from a site for a given NC. Just like Scenario 3, all DCs in a site may be replicating with each other, but none of that data is being shared with DCs outside of the site. The fix is the same as for Scenario 2.

    · Scenario 5 – No writeable instances of a partition exist. This can happen in scenarios where a domain or application partition is deleted and the changes to the replication topology never converge. Thus one or more Global Catalogs are advertising the partition and the data within. This implicitly means that the configuration partition does not have a stable replication topology and is a victim of that instability. To fix, investigate Scenarios 1 through 4 for the configuration partition only and allow replication to converge.

    · Scenario 6 – While a certain DC may have connections to peer DCs, if none of these connections have ever completed successfully, the partition may be in an inconsistent state. In this scenario, we don’t have data readily available (without starting to look deep into the metadata) to determine when this partition was last synchronized. Regardless of whether or not this may be a new problem, this is flagged as stability impacting as it is currently in a degraded state and needs to be reviewed.
    Fixes: This is a little more complex to fix, as the reason all the connections are reporting as failed needs to be investigated. Often times this is related to firewall rules not being configured properly but it could also simply be due to the fact that this is a newly introduced DC that has not fully replicated due to either database size and/or network bandwidth.

    · Scenario 7 – NC never completed outbound. This is very similar to Scenario 6, except that no other DC has been able to pull data from said source DC. This is usually related to firewalls.

    · Scenario 8 – Server inaccessible. If a box is down, it isn’t replicating. While this is not generally a problem, when it comes to cleaning lingering objects it has a major impact. Reviewing the strategy on Glenn’s blog, a comparison of all systems in the forest is necessary to ensure the infrastructure is as clean as possible. If a box is offline, it cannot be compared to its peers and thus lingering objects may be left in the forest. For this reason, cleaning of lingering objects is blocked until all boxes can be contacted.

    So that about wraps things up… Looking forward to the final post (later this month) in the series, Part 4 of the ReplDiag breakdown “Can I clean one partition at a time with Repldiag, and other tips…”. Also, if you felt I have missed anything, please let me know.

    One big ask: I’d also like to know how you’ve used the tool, your successes with it, and any other experiences or feedback on ReplDiag. Suggestions are always welcome, but those should be sent to me or directly to Ken.

  • Can I clean one partition at a time with ReplDiag, and other tips [Part 4 of 4]

    Hi, Rob here again. As we conclude our 4-part series on the ReplDiag tool, there’s always one more trick up the author’s sleeve, and that is cleaning one NC at a time! We’ll also explore hidden switches and, naturally, give you the latest update on where the tool is going. First, an “NC”, or naming context, for those of you who have not figured it out yet, is a partition in the NTDS.DIT database of Active Directory. The basic breakdown is here. Think of it as logically segmented sections of the database where different data is stored: the schema, configuration, domain, and application (usually DNS) partitions are segmented into logical storage chunks in the database. When lingering objects exist, whether in the Deleted Objects container or in a global catalog’s read-only (RO) NCs/partitions, it may be necessary to perform cleanup on just that partition, so here’s how to do just that.

    ReplDiag has a command line switch that allows the user to output the equivalent repadmin.exe syntax and run lingering object cleanups using the officially supported Microsoft tool, for those who have support concerns around open source.

    As stated earlier in this series, the author wrote this tool to address the forest as a whole; however, there are scenarios where more granular, per-naming-context control is useful. Until the author adds that functionality to the tool, here is how to do it:

    Command Line Syntax (using a combination of redirection and multiple commands, some basic command line tools):

    ReplDiag /removelingeringobjects /OutputRepadminCommandLineSyntax | find /I "cn=configuration,dc=contoso,dc=com" > cleanConfigNc.cmd & cleanConfigNc.cmd & del /q cleanConfigNc.cmd

    Note: just change the portion between the quotation marks with the name of the naming context required.

    Note: One of the advantages of ReplDiag is that it cleans everything in a multithreaded fashion where possible to improve performance. This is lost when repadmin.exe is used in the above batch file scenario.
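    If more than one NC needs its own script, the cmd one-liner above generalizes to a loop. Here it is as a POSIX shell sketch over made-up sample output (on a Windows box you would express the same loop in cmd or PowerShell); the quoted-NC match mirrors find /I:

    ```shell
    # Split generated repadmin commands into one script per naming context.
    # Input lines are illustrative of /OutputRepadminCommandLineSyntax output.
    cat > allcmds.txt <<'EOF'
    repadmin /removelingeringobjects ny-dc-01.contoso.com 2f075a9d-c8ae-4ab0-8fab-38c53d542414 "cn=configuration,dc=contoso,dc=com"
    repadmin /removelingeringobjects phl-dc-01.child.contoso.com 14f1ed08-a30a-4400-9851-8745da037289 "dc=contoso,dc=com"
    EOF

    for nc in "cn=configuration,dc=contoso,dc=com" "dc=contoso,dc=com"; do
      # Derive a filesystem-safe script name from the NC.
      out="clean_$(printf '%s' "$nc" | tr -cd '[:alnum:]').cmd"
      # Match the quoted NC so "dc=contoso,dc=com" does not also catch
      # the configuration NC line; -i mirrors find /I.
      grep -iF "\"$nc\"" allcmds.txt > "$out"
    done
    ls clean_*.cmd
    ```

    Each resulting .cmd file then cleans exactly one naming context, at the cost of the multithreading noted above.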

    Now that you’re feeling in control, let’s dive into some hidden and other tips to get things wrapped up.

    /UseRobustDCLocation – In some environments replication is so broken that knowledge of all DCs has not converged across the forest. To address replication stability, this at a minimum needs to be addressed. To reduce the number of data-collection iterations and get a clear picture of the whole environment, this switch contacts each DC and gets the list of all DCs known to it. These lists are then aggregated and compared to produce a consolidated list of all known DCs, and then the collection of current replication data proceeds. Because each known DC must be queried, environmental analysis time increases with the size of the environment.

    /OverrideDefaultReferenceDC – By default, the tool picks as the reference DC the DC with the most inbound connections under the assumption that this is a centrally located hub DC. If these criteria are incorrect for the environment, a reference DC per naming context can be designated.

    /Save – The current state of replication, and all associated data, can be saved out to an XML file for reference and to transfer the state elsewhere. This is very useful for sending the state into Microsoft support or to compare before/after states. It saves the data with the filename “ReplicationData.XML”.

    /ImportData – Loads the data from a previous “/Save” and performs analysis of the data for topology issues.

    Hidden workaround for cleaning environments that can’t be made stable – There are certain scenarios where an environment can’t be made stable, for example large enterprises with lots of poor-quality links to esoteric locations around the globe. There are consequences to not talking to all the DCs, and without a full understanding of these the author didn’t want people to simply bypass the stability validation without understanding the consequences of their actions.

    Finally … so what’s next in the life of this great tool? What does the future look like? In Ken’s own words, here is what we are hearing down the pipeline:

    There is no timeline on many of these items given other demands on my time, but I’m looking to do some work with collecting some more detailed replication data including Up-To-Dateness and High-Watermark data. I haven’t done any work on testing and validating this with RODCs yet and as customers begin to adopt this, I really need to put some time into that. The irony is that my challenge isn’t writing the code, but the time and resources necessary to setup the lab to properly test this.

    There are some areas of the code I’m not entirely happy with that have some trouble in certain scenarios in collecting the data and I hope to make these a little more robust. There are some folks asking for the per NC lingering object clean up and that may eventually appear as well as initiating Garbage Collection prior to cleaning the NC to reduce log spam. The other items are the priority especially since there is a workaround that Rob tells me he is going to include in his blog.

    Also, check out my other tools on the same CodePlex project. I have to balance the support and feature requests for those tools with those of ReplDiag. Though, I’m happy to work with volunteers who are passionate about adding features or functionality.

    Woohoo! We’re done! Thank you for reading our 4-month series on lingering object cleanup and ReplDiag. Keep in mind that the topic has been addressed several times before in some capacity, so I’d like to buy a drink for those who have been down this road before: you know who you are!

  • R2 BH Selection Process, everything you wanted to know …

    Rob here, spring is in the air and we have some yard cleaning to do. But before we get to the hard stuff, let’s check out something useful Windows 2008 R2 brings us: Bridgehead Load Balancing.

    For those of you who do not know: Windows 2008 R2 has brand-new logic to effectively load balance connection objects in the hub. In larger deployments of Active Directory, this feature helps evenly distribute connections between the hub and branch offices. From a performance perspective, this also frees up processing cycles for whatever else the domain controller may be doing. Ultimately it reduces the uneven, “everything goes to server X” phenomenon that could be observed prior to Windows 2008 R2.

    To start, let’s review the basic question: What is a Bridgehead server and what is its role in Inter-Site replication?

    A Bridgehead server is a Domain Controller designated to perform site-to-site replication. Basically a bridgehead is a point where an Active Directory replication connection leaves or enters a site. By default, the Inter-Site Topology Generator (ISTG) automatically designates which servers act as Bridgehead servers.

    Here’s a basic diagram that shows a single-domain replication topology – 1 Hub site and 5 branch sites.

    [Diagram: a single hub site connected to 5 branch sites, with every DC acting as a bridgehead]

    Obviously it’s an extremely simple example, but it illustrates the basic role of the Bridgehead server – to perform replication between sites. With every DC performing site-to-site replication, the ISTG has an easy job. It designates that every server is a bridgehead.

    Take notice once again that each DC builds an inbound connection from the bridgehead server in the connected site. The Hub DC builds 1 inbound connection from each of the 5 branch DCs, and each branch DC builds 1 inbound connection from the Hub DC, for a total of 10 connection objects.

    For some environments, perhaps yours, what happens when things get more complicated? How does ISTG bridgehead selection work when it has multiple servers to choose from?

    Prior to Windows 2008 R2 – the typical results were:

    • Hub site inbound connections were NOT load balanced evenly

    • Branch sites inbound connections were load balanced evenly

    • In large branch office scenarios, 1 Hub DC carried >50% of all inbound connections (Potential bridgehead overload scenario)

    • After adding additional Hub DCs, only Branch RODCs rebalanced; Branch RWDCs ignored the new DC

    • ADLB utility (AD Load Balance) frequently used to rebalance large site designs

    Using the Windows 2008 or prior logic, the replication topology could look similar to the following:

    [Diagram: pre-R2 topology, 10 branch sites and 2 hub DCs; branch inbound connections balanced, but hub inbound connections skewed 7-to-3 onto the left hub DC]

    As you can see, Branch inbound connections are evenly balanced but the Hub Inbound connections are not. If the left Hub DC fails, changes from 7 of the 10 branch offices will not replicate to the Hub site until the connections are rebuilt around the failed DC.

    So we can already see the minor cracks in the topology generation for this type of scenario. So what happens when we add an additional Hub DC to the mix?

    [Diagram: an additional hub DC added under pre-R2 logic; Branch RWDCs keep their existing connections and ignore the new hub DC]

    Notice that the new Hub DC is completely ignored by the Branch RWDCs. To get the Branch DCs to recognize the new Hub DC, you have to delete all the inbound connection objects on the RWDCs and kick off the KCC to generate a new topology. If present, Branch RODCs would rebalance their inbound connections to use the new DC.

    Fast forward to today, we find ourselves with the improved logic:

    • Improved replication algorithm for branch office topologies

    • All Hub and Branch site inbound connections load balance evenly (Both RODCs and RWDCS)

    • Adding additional DCs to Hub site causes an automatic rebalance of connections across all Hub DCs (Both RODCs & RWDCS)

    • ADLB no longer needed for large environments – the topology is automatically recalculated on DC changes (adds, deletes, moves)

    Using the new 2008 R2 algorithm could generate a topology as seen below:

    [Diagram: Windows 2008 R2 topology; inbound connections evenly distributed between the two hub DCs, 5 branches each]

    As you can see, the connections are evenly distributed between the two Hub DCs. If the left Hub DC fails, only 5 branch DCs will be affected, as opposed to 7 in the pre-R2 scenario.
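    The arithmetic behind the smaller failure impact is simple (this is a simplification for illustration; the actual KCC algorithm weighs more than an even split):

    ```shell
    # With B branch sites and H hub DCs, an even balance gives each hub DC
    # roughly ceil(B/H) inbound connections, so losing one hub DC affects
    # only that many branches, instead of a majority piling onto one DC.
    B=10   # branch sites
    H=2    # hub DCs
    per_hub=$(( (B + H - 1) / H ))   # integer ceiling of B/H
    echo "inbound connections per hub DC: $per_hub"
    ```

    With a third hub DC the same formula drops the per-hub load to 4, which is why adding hub DCs under the R2 logic immediately shrinks the blast radius of any single hub failure.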

    So with that new logic, let’s see how adding a new Hub DC works:

    [Diagram: a new hub DC added under R2 logic; both hub and branch inbound connections rebalance automatically across all hub DCs]

    Upon addition of the new Hub DC, all (both Hub Inbound and Branch Inbound) connections rebalance automatically to use the new DC.

    So how do you get this new functionality/replication logic?

    · Begin by installing at least two 2008 R2 DCs – starting with the Hub site DCs first.

    · Adding more 2008 R2 DCs improves the overall load-balancing, with the best results found in a pure 2008 R2 environment.

    · Windows 2008 R2 Forest or Domain functional mode is not required – just R2 DCs!

    More details on the new Bridgehead selection process can be found here:

    http://technet.microsoft.com/en-us/library/bridgehead_server_selection(WS.10).aspx

    Thanks to Brian Mulford (Boston Premier Field Engineering) for the fancy graphics and additional commentary to my blog post.