So you need to upgrade your AD domains and forests to W2K8 or W2K8 R2 DCs to take advantage of cool new features, but you're concerned about what's going to break once you begin introducing new DCs.
Fortunately, lots of documentation is available regarding changes in the OS, which should help customers determine up front what may break. I previously blogged on this topic: http://blogs.technet.com/b/glennl/archive/2009/08/21/w2k3-to-w2k8-active-directory-upgrade-considerations.aspx. Do review these resources, as they describe known issues you should evaluate for applicability in your environment.
First and foremost, we encourage you to test your applications and their coexistence with DCs running W2K8 or W2K8 R2 in a lab environment. You should also engage the vendors of your applications to ensure they have validated their applications running on servers in domains with W2K8 or W2K8 R2 DCs. See our Upgrading AD Domains TechNet portal for official guidance on planning and executing domain upgrades: http://technet.microsoft.com/en-us/library/cc731188(WS.10).aspx
You may be saying to yourself, how can I test all my applications when I don't have a complete inventory, or the staff or facilities (a robust lab) to perform adequate testing? Unfortunately, this is a big problem for IT shops. If you cannot, or don't plan to, thoroughly test all your applications in a lab environment prior to upgrading, then this blog is for you.
As you read this, keep in mind the intent is not to help you identify application compatibility issues; in fact, this approach may hinder those efforts. The intent is to offer a methodology for a controlled deployment and consumption of up-level DC resources. This approach is not for everyone, and if your applications communicate over the standard protocols (NETAPI, LDAP, and NSPI), there shouldn't be any issues beyond the new constraints and known compatibility issues already documented in my blog and elsewhere.
The idea behind the following approach is to allow you to promote a new W2K8/W2K8R2 DC into your W2K3 domain and keep it temporarily hidden/isolated from computers and applications. This will give you the ability to methodically and systematically perform application testing over time to identify and rectify application compatibility issues. Again, if you don't have a thorough inventory of the applications leveraging DC resources, then this approach arguably provides no value.
The focus of this strategy is the introduction of the new DC itself, not the preparatory steps required prior to deploying the W2K8 DC. We have guidance on testing Schema extensions here.
Hiding and isolating DCs involves a number of strategies; the more of them you perform, the better hidden the DC is from the general population of computers and applications.
1) Build a new logical AD site with a site link to the datacenter site. Configure the subnet for this site to be the IP address of the yet-to-be-promoted W2K8 DC, e.g., 192.168.14.231/32. Yes, you can use a /32 (single-host) mask to associate a subnet with a site in AD.
2) Add DnsAvoidRegisterRecords to the registry of the yet-to-be-promoted W2K8 server. The only records you absolutely must register are the host record and the CNAME record; all the rest exist so clients can find DCs. So, if you don't register them, you effectively hide the DC from the most common method of discovery. You will want to register site-specific SRV records so the yet-to-be-promoted DC registers records for its temporary site. **Important:** Many customers have process issues that result in subnets not being properly associated with sites in AD. The affected clients must then fall back to the siteless SRV records to find DCs. Preventing this server from registering those records lets you control when it 'advertises' its resources through DNS registration.
3) Don't point the yet-to-be-promoted W2K8 server to WINS. A less common way DCs are discovered is through a DC's registration of a WINS <1C> record. Don't point the server to a WINS server and this record will not get registered after promotion.
4) Consider putting the DC onto a separate VLAN. This may be overkill, but it will prevent the DC from being discovered through subnet broadcasts.
5) Prepare your forest and domain using ADPREP, then promote your first W2K8 DC into the temporary site. Since the DC will register its host record and cname record and site specific records only, it is discoverable by the other DCs for AD/SYSVOL replication purposes, but is otherwise hidden/isolated from clients/servers/applications on your network.
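Steps 2 and 5 above can be sketched with commands like the following. The registry path is the standard Netlogon parameters key, but the list of record mnemonics is an assumption reproduced from memory; verify it against Microsoft's DC Locator DNS records documentation before use:

```shell
REM --- Step 2 sketch: on the yet-to-be-promoted server, suppress the
REM --- siteless (generic) locator records; host, CNAME, and site-specific
REM --- SRV records will still register. Mnemonic names are assumptions.
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters" ^
  /v DnsAvoidRegisterRecords /t REG_MULTI_SZ /f ^
  /d "Ldap\0LdapIpAddress\0Gc\0GcIpAddress\0GenericGc\0Kdc\0Dc\0DcByGuid\0Rfc1510Kdc\0Rfc1510UdpKdc\0Rfc1510Kpwd\0Rfc1510UdpKpwd"

REM --- Step 5 sketch: prepare the forest and domain from the W2K8 media
REM --- (run /forestprep on the Schema Master, /domainprep on each domain's
REM --- Infrastructure Master; /rodcprep only if RODCs are ever planned).
adprep /forestprep
adprep /domainprep /gpprep
adprep /rodcprep
```

With the registry value in place before promotion, the new DC advertises only to its replication partners and its temporary site.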
Systematically test computers/application usage and coexistence with W2K8 DCs.
1) For applications running on Windows that find DCs through DCLocator, move the application servers into the temporary site....during a maintenance window of course.
a) Add SiteName string value to netlogon\parameters registry key on the application servers and set it to the temporary site name. SiteName overrides DynamicSiteName written by the dclocator algorithm. Basically you are telling the computer what site it belongs to without having to change/create subnet configuration in AD.
b) change the secure channel of the application server to the W2K8 DC using nltest /sc_reset:domain\dcname
c) wait until Kerberos tickets expire, or reboot the application server, then have the application owner perform functionality testing.
1) now if the scenario is more complex...client connects to application, which impersonates client to access resources on backend servers, then you will want to do a,b,c on client and backend systems to make the testing as realistic as possible.
2) For LDAP applications running on Windows that use the domain A record to find a DC, add a host file entry on the application server pointing the domain A record to the W2K8 DC
a) wait until kerberos tickets expire, or reboot the application server, then have the application owner perform functionality testing.
3) For LDAP applications not running on Windows, identify the mechanism they use to find a DC/LDAP server...probably configured in the application itself…then provide it with the DC A record or domain A record, or SRV record to be used to find the W2K8 DC.
a) execute on a test matrix to ensure application functionality.
4) General authentication and ticket processing through the W2K8 DC. Work with business unit managers (aka..guinea pigs) to put their machines into the temporary site (SiteName reg value) and have them perform their normal business functions for a while....tests their machine and locally installed apps ability to use the new DC for auth and queries.
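Steps 1(a) through 1(c) above can be sketched as follows; the site name and DC name are placeholders, not values from this post:

```shell
REM (a) Pin the application server to the temporary site. "TempW2K8Site"
REM     is a placeholder for your temporary site name.
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters" ^
  /v SiteName /t REG_SZ /d "TempW2K8Site" /f

REM (b) Re-point the secure channel at the new W2K8 DC (placeholder names).
nltest /sc_reset:DOM1\W2K8DC01

REM     Confirm which DC now holds the secure channel.
nltest /sc_query:DOM1

REM (c) Reboot (or wait for Kerberos tickets to expire), then have the
REM     application owner perform functionality testing.
```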
Remove Isolation and 'introduce' the W2K8 DC to the masses
Once you are satisfied that you have done adequate testing and have resolved app compat issues or created GPOs to relax DC settings to work with app compat issues, it is time to remove the isolation strategies.
1) remove SiteName from any systems it was added to.
2) remove the host file entry from any systems it was added to.
3) remove the DnsAvoidRegisterRecords value on the W2K8 DC.
4) point the DC to WINS if you have WINS deployed.
5) re-IP the DC if it was put on an isolated VLAN
6) move the DC into the production site and retire the temporary site/sitelink/subnet
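The registry-based portions of the rollback (steps 1 and 3) can be sketched as:

```shell
REM Step 1: on each machine pinned to the temporary site, remove the override.
reg delete "HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters" ^
  /v SiteName /f

REM Step 3: on the W2K8 DC, allow full SRV record registration again, then
REM restart Netlogon so the records register.
reg delete "HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters" ^
  /v DnsAvoidRegisterRecords /f
net stop netlogon && net start netlogon
```

Steps 2, 4, 5, and 6 (hosts file edits, WINS client settings, re-IP, and site moves) are environment-specific and done through their usual tools.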
I have collected some upgrade considerations from a couple of colleagues of mine and have been sharing them on our internal technical DLs as the question comes up. I have gotten positive feedback on the notes and have been encouraged to post them. So, here they are, though the real thanks go out to my colleagues Tom and Arren. Further guidance on AD upgrades has been released to TechNet. The document, currently titled "Microsoft Product Support Quick Start to Adding Windows Server 2008 or Windows Server 2008 R2 Domain Controllers to Existing Domains," can be found here: http://technet.microsoft.com/en-us/library/ee522994(WS.10).aspx
Here are some of the problems customers may run into when upgrading a W2K3 AD deployment to a W2K8 and/or W2K8 R2 AD deployment:
b. For DNS Servers
c. For DCs running on hyper-V & VMWARE
a. install a UPS
b. brief all admins on the risks of USN rollbacks caused by restoring snapshots on DC role guests. Review http://technet.microsoft.com/en-us/library/dd363553(WS.10).aspx
c. P2V conversions should be done in offline mode. If converting multiple DCs in the same forest, all need to be offline at the same time.
d. Disaster Avoidance & Recovery
a. Enable delete protection on OU containers
b. Enable system state backups
c. If using a 3rd-party backup product, test system state restores, and keep an alternate backup (such as Windows Server Backup) so that PSS can restore when the 3rd-party product fails to restore
9. ADMIN STUFF
10. RECYCLE BIN STUFF
a. With Identity Lifecycle Manager (ILM), including Feature Pack 1 (FP1), the Management Agent for Active Directory is not supported with the Recycle Bin feature. KB2018683
I recently came across a DR scenario a colleague worked on that I thought should be documented so administrators can adequately plan for this or prevent it from occurring in the first place. This scenario can only happen in a forest that started life as a W2K forest and was upgraded to W2K3 FFL.
This is fairly long and may be better suited for 2-3 blogs, but I decided to cover this in one blog.
Key acronyms in this blog
LVR - Link Value Replication
Legacy - an attribute value without individual replication metadata
Present - a value with additional replication metadata attached
Absent - a deleted value with additional metadata attached
TSL - tombstone lifetime
The restore problem
The deep technical problem description: Authoritative restores of objects, to recover the forward link portion of linked attributes, are not sufficient to return the forward link attribute contents to their former state. This problem can only occur in forests that existed at W2K FFL and were upgraded to W2K3 FFL. When I say existed at W2K FFL, I really mean there are objects with forward links populated prior to the upgrade to W2K3 FFL.
The easier to grasp practical example problem description: Authoritative restores of group objects to recover removed group memberships are not sufficient for specific environments that have upgraded their forests to W2K3 FFL.
Note: Technically this restore problem applies to any LVR enabled attribute. See the next section for a description of LVR.
Note: The removed membership does not involve deleting the member objects, only removing them from a group's membership.
What is Link Value Replication?
Rather than reproduce what is already documented here, I will briefly summarize LVR replication.
LVR replication changes the smallest unit of replication for multi-valued linked attributes (the ones with distinguished name values) to a single value. This change has a few benefits: replication carries only the changed values instead of the entire attribute, concurrent membership updates made on different DCs no longer overwrite one another, and the practical limit of roughly 5,000 values per update on a multi-valued attribute is lifted.
The accidental modification and the recovery goal
Reminder: This scenario exists for any LVR enabled attribute. The example will focus on group membership.
An AD administrator has removed users from a set of groups (users not deleted, only removed from groups). The administrator realizes some of these group membership changes should not have occurred. The administrator's goal is to return these group memberships to their prior state.
How should an administrator recover from this situation?
Well, the following is one way to recover from this situation. It assumes the backed-up system and the restore-to system are running W2K3 SP1. Perform a system state restore to alternate hardware isolated from production. The members were not deleted, so isolation prevents loss of data on the user accounts, like passwords changed after the backup was taken.
Authoritatively restore the affected users rather than the modified groups. The idea is to use the LDF output created by NTDSUTIL to restore the group memberships. Transport the LDF output to the production environment and import it to recover.
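A sketch of that flow, assuming W2K3 SP1's NTDSUTIL behavior of emitting an LDF file of back-links during authoritative restore (the OU path and file name are illustrative, not from this post):

```shell
REM On the isolated, restored DC (booted into Directory Services Restore
REM Mode), authoritatively restore the affected users; on W2K3 SP1 this
REM also writes an ar_*.ldf file containing their back-links, i.e. the
REM group memberships to recover.
ntdsutil "authoritative restore" "restore subtree OU=userstore,DC=dom1,DC=root" quit quit

REM Copy the generated .ldf to production and import it there (file name
REM is illustrative):
ldifde -i -f ar_links_dom1.root.ldf
```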
Reminder: I kept this example simple in that only users are members of groups. Of course, other groups as well as computers can be members as well.
Prevent this scenario from ever happening
This recovery certainly seems like a lot of work when it can be prevented in the first place.
Preventing this type of DR situation would require all LVR enabled attributes to have all of their values converted from legacy to present.
This can be accomplished by removing all the values and re-adding them.
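One hedged way to script that remove-and-re-add conversion is to generate an LDIF with two change records per group: one deleting the values, one adding them back. A minimal sketch (the group DN and members are illustrative, and removing then re-adding memberships has side effects between the two imports, so plan any real conversion carefully):

```python
# Sketch: emit an LDIF that removes and then re-adds each member value.
# After the W2K3 FFL upgrade, re-added values are written as LVR
# "present" entries with their own replication metadata.

def legacy_to_present_ldif(group_dn, members):
    # First change record: delete the existing (legacy) values.
    rec_del = [f"dn: {group_dn}", "changetype: modify"]
    for m in members:
        rec_del += ["delete: member", f"member: {m}", "-"]
    # Second change record: add the values back as LVR entries.
    rec_add = [f"dn: {group_dn}", "changetype: modify"]
    for m in members:
        rec_add += ["add: member", f"member: {m}", "-"]
    return "\n".join(rec_del) + "\n\n" + "\n".join(rec_add) + "\n"

print(legacy_to_present_ldif(
    "CN=globalgroup,OU=groups,DC=dom1,DC=root",
    ["CN=user2,OU=userstore,DC=dom1,DC=root",
     "CN=user3,OU=userstore,DC=dom1,DC=root"],
))
```

The output can then be applied with ldifde -i, one group at a time, during a change window.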
My favorite part…the diagnosis of why authoritative restore of group object is not sufficient
Environment:
W2K3 2 DC forest with a group having 9 members (9 legacy and 1 LVR enabled absent entry)
(present and absent are LVR enabled entries, legacy are not LVR enabled members)
W2k3entr2-vm11
W2k3entr2-vm12
Replication metadata from each DC on the relevant group object prior to accidental modification
C:\>repadmin /showobjmeta w2k3entr2-vm11 cn=globalgroup,ou=groups,dc=dom1,dc=root
12 entries.
Loc.USN Originating DC Org.USN Org.Time/Date Ver Attribute
======= =============== ========= ============= === =========
14062 Default-First-Site-Name\W2K3ENTR2-VM11 14062 2008-01-17 14:19:58 10 member
13977 Default-First-Site-Name\W2K3ENTR2-VM11 13977 2008-01-17 14:14:55 1 instanceType
13977 Default-First-Site-Name\W2K3ENTR2-VM11 13977 2008-01-17 14:14:55 1 whenCreated
-----snip-----
10 entries.
Type Attribute Last Mod Time Originating DC Loc.USN Org.USN Ver
======= ============ ============= ================= ======= ======= ===
Distinguished Name
=============================
ABSENT member 2008-01-17 14:45:23 Default-First-Site-Name\W2K3ENTR2-VM11 14151 14151 1
CN=user1,OU=userstore,DC=dom1,DC=root
LEGACY member
CN=user2,OU=userstore,DC=dom1,DC=root
CN=user3,OU=userstore,DC=dom1,DC=root
C:\>repadmin /showobjmeta w2k3entr2-vm12 cn=globalgroup,ou=groups,dc=dom1,dc=root
12428 Default-First-Site-Name\W2K3ENTR2-VM11 14062 2008-01-17 14:19:58 10 member
12352 Default-First-Site-Name\W2K3ENTR2-VM11 13977 2008-01-17 14:14:55 1 instanceType
12352 Default-First-Site-Name\W2K3ENTR2-VM11 13977 2008-01-17 14:14:55 1 whenCreated
ABSENT member 2008-01-17 14:45:23 Default-First-Site-Name\W2K3ENTR2-VM11 12487 14151 1
----snip----
Notice above there are 10 entries associated with the member attribute of DOM1\GLOBALGROUP. The last 7 were snipped for the sake of brevity.
Important: In DSA.msc or LDP.exe or your favorite LDAP reader, you should see only 9 values…the 9 members of this group. The ABSENT entry is similar to a tombstoned object in that it records the knowledge of a removed value in an LVR-enabled attribute, and it will be garbage collected after TSL.
System state backup taken of w2k3entr2-vm12
The accidental modification of data in AD
One user is removed from the group (user object not deleted)
13977 Default-First-Site-Name\W2K3ENTR2-VM11 13977 2008-01-17 14:14:55 1 objectClass
13977 Default-First-Site-Name\W2K3ENTR2-VM11 13977 2008-01-17 14:14:55 1 cn
ABSENT member 2008-01-17 15:13:00 Default-First-Site-Name\W2K3ENTR2-VM11 14178 14178 1
12352 Default-First-Site-Name\W2K3ENTR2-VM11 13977 2008-01-17 14:14:55 1 objectClass
12352 Default-First-Site-Name\W2K3ENTR2-VM12 12352 2008-01-17 14:15:11 1 cn
ABSENT member 2008-01-17 15:13:00 Default-First-Site-Name\W2K3ENTR2-VM11 12511 14178 1
Notice that CN=user2,OU=userstore,DC=dom1,DC=root used to be LEGACY, and is now ABSENT. Also notice that the member attribute still has a version of 10. This is visible evidence of the new LVR code in action. The replication metadata on the member attribute is not touched…and therefore we no longer suffer the inefficient replication of the full multi-valued attribute. Furthermore, the LDAP transaction is now limited to the modified values instead of having to re-write the entire multi-valued attribute.
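As a toy, non-AD model of why the attribute-level version stays put, think of the member attribute as attribute-level metadata plus per-value entries. This sketch is purely conceptual (the class and names are invented for illustration, not real AD code):

```python
# Toy model: under LVR, removing a value creates an ABSENT per-value entry
# instead of bumping the attribute-level version.

class LvrAttribute:
    def __init__(self, version, legacy_values):
        self.version = version             # attribute-level metadata (pre-LVR)
        self.legacy = set(legacy_values)   # values without per-value metadata
        self.entries = {}                  # value -> "PRESENT" / "ABSENT"

    def remove(self, value):
        # LVR path: record the delete as its own ABSENT entry; the
        # attribute-level version is NOT touched.
        self.legacy.discard(value)
        self.entries[value] = "ABSENT"

    def effective_members(self):
        present = {v for v, s in self.entries.items() if s == "PRESENT"}
        return self.legacy | present

member = LvrAttribute(version=10, legacy_values=["user2", "user3"])
member.remove("user2")

print(member.version)                      # still 10
print(member.entries["user2"])             # ABSENT
print(sorted(member.effective_members()))  # ['user3']
```

This mirrors the repadmin output above: the member attribute stays at version 10 while user2 gains an ABSENT entry of its own.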
Auth restore of group on w2k3entr2-vm12
This object replicates to w2k3entr2-vm11 as shown below.
13 entries.
14209 Default-First-Site-Name\W2K3ENTR2-VM12 16385 2008-01-17 15:26:08 100001 objectClass
14208 Default-First-Site-Name\W2K3ENTR2-VM11 14208 2008-01-17 15:47:09 2 cn
14209 Default-First-Site-Name\W2K3ENTR2-VM12 16385 2008-01-17 15:26:08 100010 member <--
Notice the member attribute version has increased by 100,000 and replicated to w2k3entr2-vm11. Completely expected behavior for an auth restore.
14209 Default-First-Site-Name\W2K3ENTR2-VM12 16385 2008-01-17 15:26:08 100001 instanceType
ABSENT member 2008-01-17 15:26:08 Default-First-Site-Name\W2K3ENTR2-VM12 14212 16386 100001
CN=user2,OU=userstore,DC=dom1,DC=root <-- Notice this entry is still at version 1, so the auth restore did not touch it. This is because the entry did not exist in the restored database as an LVR entry; it was a legacy entry at the time of backup. So auth restore did not return the group's effective (legacy + LVR) membership to its prior state.
DOM1\GLOBALGROUP on w2k3entr2-vm12 after auth restore but before first inbound replication cycle of the domain partition
16385 Default-First-Site-Name\W2K3ENTR2-VM12 16385 2008-01-17 15:26:08 100001 objectClass
16385 Default-First-Site-Name\W2K3ENTR2-VM12 16385 2008-01-17 15:26:08 100001 cn
16385 Default-First-Site-Name\W2K3ENTR2-VM12 16385 2008-01-17 15:26:08 100010 member
16385 Default-First-Site-Name\W2K3ENTR2-VM12 16385 2008-01-17 15:26:08 100001 instanceType
ABSENT member 2008-01-17 15:26:08 Default-First-Site-Name\W2K3ENTR2-VM12 16386 16386 100001
LEGACY member <-- Things look the same here as before the backup, except the member version above is 100,000 greater. So changes appear to be rolled back to the prior state…until you look below.
w2k3entr2-vm11 receives inbound replication from W2K3entr2-vm12
ABSENT member 2008-01-17 15:13:00 Default-First-Site-Name\W2K3ENTR2-VM11 16449 14178 1
CN=user2,OU=userstore,DC=dom1,DC=root <-- this entry replicated to w2k3entr2-vm12 just as it had prior to the restore.
----snip-----
So, auth restore of the modified group did not recover the group's effective membership to the prior state.
Corrected the membership by importing the NTDSUTIL-generated LDIF in the production domain. See "How should an administrator recover from this situation?" above.
14209 Default-First-Site-Name\W2K3ENTR2-VM12 16385 2008-01-17 15:26:08 100010 member
PRESENT member 2008-01-17 16:10:37 Default-First-Site-Name\W2K3ENTR2-VM12 14251 16466 2
PRESENT member 2008-01-17 16:10:37 Default-First-Site-Name\W2K3ENTR2-VM12 16466 16466 2
Note: What was LEGACY prior to the accidental delete is now PRESENT. This strategy achieves the administrator's goal of putting the group membership back where it was.
This is part 2 of my lingering object blog series. The purpose of this blog is to help customers with Windows 2000 DCs make informed decisions on how to tackle this problem at a forest-wide scope. For the sake of brevity, please review my first blog on this topic for the "Alphabet soup," "What are Lingering Objects," and "Do you have Lingering Objects in your Forest" sections.
REPADMIN /REMOVELINGERINGOBJECTS will not work in W2K environments
First, let's explain why repadmin /removelingeringobjects will not work if the source or target of that operation is running W2K Server. The /rlo call leverages server-side code that actually performs the comparison and cleanup work. This code was added in W2K3 Server and does not exist on W2K Server. So, any W2K3 DCs in your forest? The strategy in my other blog can be leveraged for those systems. For W2K systems, one or more of the strategies below must be added to the overall plan of attack.
What about lingering in the WR partition?
How do we handle getting consistency in the WR replica set (WR domain partition and configuration partition)? Recall from my previous blog that REPADMIN can get W2K3 WR DCs consistent with other W2K3 WR DCs for a given NC, but does not address lingering objects in the WR as compared to the RO. Well, with W2K DCs, none of the methods below address lingering objects in a WR NC when compared to other DCs hosting that NC as WR, or when compared to DCs hosting that NC as RO…except building a new forest, and the ldifde/replfix method.
Note: lingering objects in the WR compared to other WR DCs for an NC are uncommon, and rare compared to lingering objects on DCs hosting an RO of the NC.
Options to clean a forest when W2K DCs exist
Build a new forest and migrate.
Pros:
Guaranteed success, as a new forest built using W2K3 servers is set to strict replication consistency right out of the gate.
TSL is 180 days, which makes the forest more tolerant of the replication outages that result in lingering objects.
Cons:
Impractical and expensive
UnGC and re-GC
Pros:
Can potentially clean all lingering objects from the RO environment if done methodically and systematically.
Cons:
Risk of sourcing from a *dirty* partition containing lingering objects is high without a carefully thought-out plan of attack.
Potentially huge network utilization hit, depending on connectivity during the exercise.
No GC available in site (assuming single GC site) during process.
Time consuming due to the NC tear down behavior on W2K DCs. Can be mitigated.
Does not address configuration or NDNCs.
This approach can take one of two forms and carries a basic assumption: the writable replica set for each domain NC is consistent. This assumption is dangerous, as it is certainly possible (albeit rarer) for lingering objects to exist in a WR partition when compared to other WR DCs for the same partition. Let's go with the dangerous assumption for the moment. The two approaches for this strategy are:
1. I'm cringing as I type this...unGC all GCs in the forest so that no GCs are left (all lingering objects in the RO environment are destroyed), then systematically promote new GCs. Yes, the cure may be more painful than the disease with this approach. I mainly present it here to be thorough and don't realistically think any organization would choose it.
2. Systematically and methodically unGC a few GCs at a time. The actual strategy used will differ based on individual IT org needs. The following is an example of a systematic and methodical approach that minimizes risk to operations and risk of sourcing in lingering objects onto the newly promoted GCs.
a.) Create logical AD maintenance site as a temporary site for use during the cleanup process. Create and configure site link connectivity to representative hub site.
b.) Add a representative DC from each domain in the forest. Allow automatic connection objects to be created or manually create them from another site.
c.) This site should have the inter site KCC disabled to remove the risk of the GC promotion creating connections from other GCs in the forest.
d.) Move a few GCs into the maintenance site (be sure to consider the authentication and LDAP needs of the site each GC just left during the maintenance window).
e.) The moved GCs should be isolated so they are not being hit by LDAP consumers over 3268. Prevent the registration of generic siteless SRV records for the duration of the process.
f.) unGC the boxes: REPADMIN /OPTIONS <GC-FQDN> -IS_GC. Either wait for the process to complete, evidenced by DS event ID 1660 for each partition, or speed up the teardown process.
g.) re-GC the boxes: REPADMIN /OPTIONS <GC-FQDN> +IS_GC. This will cause each to build inbound connections from the DCs in the maintenance site, therefore sourcing its data from writable DCs only for each domain NC in the forest.
h.) Move the GCs back to their production sites.
i.) Repeat d-h for all GCs in the forest.
j.) Retire the maintenance site.
This isolation strategy is important, because without it the promotion process can build connections from RO source partners which may themselves have lingering objects. The key tenets to keep in mind when planning a systematic and methodical cleanup are:
· Maintain business continuity for functions and applications that depend on GC lookups.
· Strict control of which systems GC promotion sources NC data from.
There are certainly other ways to achieve strict control besides moving servers into dedicated maintenance sites, and IT orgs may elect to leverage a different strategy or a combination of strategies to meet their needs.
Rehost all RO partitions on all GCs
Pros:
Can be systematically performed to spread the bandwidth consumption out over time.
Only sources from WR NC so no risk of sourcing from *dirty* GCs.
Can target and clean one NC at a time
Can clean application partitions.
Cons:
Port 3268 remains responsive, which can produce irregular and unexpected query and authentication results during the rehost operation. This can be mitigated by putting these systems into maintenance mode (logically isolate them from production through the use of a maintenance AD site, and control SRV record and <1C> record registration).
Does not address configuration NC.
More labor intensive than unGC reGC.
This approach also assumes that the writable replica set for each domain NC and application NC is consistent; a dangerous assumption. This approach must be systematically and methodically planned and carried out to ensure business continuity during the exercise and strict control over which systems NC data is sourced from. The following is an example of a systematic and methodical approach that minimizes risk to operations and the risk of sourcing lingering objects onto the newly cleaned GCs.
a.) Prevent the GC from being used by consumers of GC services. There are many strategies here, like moving the GC to a maintenance site, preventing the GC from registering GC-specific SRV records, and having those records removed from DNS.
b.) Clean the GC by re-hosting all RO partitions and application NCs.
Example GC with 3 RO NCs (A,B,C) and 2 app NCs (D,E).
REPADMIN /REHOST <GCFQDN> <LDAPDN of NC A> <good source DC writable for NC A>
REPADMIN /REHOST <GCFQDN> <LDAPDN of NC B> <good source DC writable for NC B>
REPADMIN /REHOST <GCFQDN> <LDAPDN of NC C> <good source DC writable for NC C>
REPADMIN /REHOST <GCFQDN> <LDAPDN of NC D> <good source DC writable for NC D> /APPLICATION
REPADMIN /REHOST <GCFQDN> <LDAPDN of NC E> <good source DC writable for NC E> /APPLICATION
c.) return rehosted GC to production.
d.) repeat a-c for all GCs in the forest.
Ldifde dumps, replfix.exe compares, and ldifde imports that call the removelingeringobject operational attribute to selectively clean all lingering objects found.
I am purposely leaving out the gory details of a systematic and thorough approach in this blog since working with MS support is required for this method. Strategically, it will be similar to the repadmin /rlo strategy in my first blog.
Pros:
Targeted cleanup of lingering objects only where they exist.
Reports on lingering objects in writable partitions.
Bandwidth consumption to copy LDIFDE dumps across the network less than other options (but still can be significant).
No GC downtime
Cons:
Labor intensive: a large number of LDIFDE dumps and comparisons of every partition from every DC, using the same strategy outlined in my first blog.
Extensive batch-file creation needed to automate the process as much as possible.
Not really scalable, as the volume of data to manage quickly becomes unwieldy as the size of the forest to clean increases.
Must work with MS support
So what if you have a mix of W2K3 and W2K?
Keep the following things in mind as you review a plan of attack in a mixed environment.
· Consider the business continuity risk and cost of lingering objects existing while W2K DCs remain in the forest.
§ Did you answer yes to the "Do you have Lingering Objects in your Forest" question in my first blog on this topic?
§ Have you ever experienced any of the common symptoms associated with lingering objects?
§ How soon will all W2K DCs be retired and does it make more sense to postpone a forest wide cleanup until all DCs are running W2K3?
· Use strategies that minimize business impact.
§ Use repadmin /removelingeringobjects for all W2K3 DCs/GCs deployed.
§ Leverage Microsoft PSS support to assist with the planning and execution.
§ Review the pros and cons above to isolate which method makes the most sense. Perhaps more than one method makes sense.
· Use strategies that minimize cost.
§ Hopefully you have gathered that a full scale forest wide lingering object cleanup exercise is no trivial matter.
§ The more complex the plan of attack, the longer and more costly it will be to execute on.
· A phased approach carries the risk that a just-cleaned GC is re-contaminated by lingering object reanimation occurring elsewhere in the environment.
§ This can be tackled by monitoring just-cleaned systems for 1388 events in the Directory Service event log. If 1388s are logged after a box is cleaned and before the forest is completely cleaned, then a second pass against that box is in order.
§ This can be avoided by setting each box to Strict Replication Consistency as soon as it is cleaned. This must be thought through carefully because of the OS quarantine behavior of halting replication of the partition if an inbound replication request for a lingering object is detected.
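As a simple aid for the 1388 monitoring suggested above, a script can count 1388 entries in a saved plain-text export of the Directory Service event log. The "Event ID: NNNN" field layout is an assumption about the export format; adjust the pattern to match yours:

```python
import re

# Count 1388 (lingering object reanimated) events in a plain-text export of
# the Directory Service event log. The "Event ID: NNNN" field layout is an
# assumption about the export format; adjust the pattern to match yours.
def count_1388_events(export_text):
    return len(re.findall(r"Event ID:\s*1388\b", export_text))
```

A non-zero count on a box cleaned earlier in the phased rollout is the signal that a second cleanup pass is needed there.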
So, you want to clean up your forest of lingering objects before you set your forest to strict?
Good choice! This little database inconsistency can cause big business continuity issues. A change to strict replication consistency while lingering objects still exist in the forest can result in replication outages, which themselves can cause big business continuity issues.
Alphabet soup in this blog:
TSL = tombstone lifetime
DC = domain controller
GC = global catalog server
W2K = Windows 2000 Server
W2K3 = Windows Server 2003
IFM = install from media
USN = update sequence number
GUID = globally unique identifier
FQDN = fully qualified domain name
WR = writable
RO = read only
DN = distinguished name
NC = naming context (aka partition)
NDNC = non-domain NC
RPC = remote procedure call
Nwr = # of writable DCs
Nro = # of read only DCs
What are lingering objects?
Lingering objects are objects that exist on one or more DCs but do not exist on other DCs hosting the same partition. They may be introduced in any partition except the schema. They are essentially object delete operations that did not successfully replicate to one or more DCs/GCs that host the partition of the deleted object. Eventually the tombstoned (deleted) object is garbage collected, which destroys all knowledge of the delete and purges the object from the database. The questions in the next section cover the mechanisms through which they can be introduced.
Do you have lingering objects in your forest?
If you answer any of the following questions with a YES, then lingering objects may exist in your forest.
Has any DC (or any one or more partitions on a DC) ever failed to receive inbound replication for more than the tombstone lifetime (TSL) configured on the forest? (60 days default for forests that started with W2K. 180 days default if the first DC in a forest is W2K3 SP1)
Has any DC been successfully restored using a backup that was older than TSL?
Has a DC ever been promoted using IFM media that was older than TSL?
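If you track the last successful inbound replication per partition (for example, from repadmin /showrepl output), the TSL comparison behind the first question is trivial. A minimal sketch, with the TSL value as an input since it varies by forest:

```python
from datetime import date, timedelta

# Flag a DC/partition at risk of lingering objects: has it gone longer than
# TSL without successful inbound replication? TSL defaults to 60 days
# (forest created with W2K) or 180 days (first DC promoted on W2K3 SP1).
def past_tsl(last_inbound_success, today, tsl_days):
    return (today - last_inbound_success) > timedelta(days=tsl_days)
```

The same comparison applies to the backup and IFM questions: substitute the backup or media creation date for the last successful replication date.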
There are other types of database consistency problems beyond the above that will be treated as lingering objects by the OS quarantine logic when Strict Replication Consistency is enforced.
So how do you clean a forest of lingering objects?
There are a few methods available. This blog will cover using repadmin.exe /removelingeringobjects. The following steps assume all DCs are running W2K3. I plan to write a future blog on other methods that can be used when W2K DCs are in the mix.
The command to clean out lingering objects looks like the following.
repadmin /removelingeringobjects <targetDCFQDN> <sourceDCguid> <partitionLDAPdn>
It specifies a target DC by FQDN, a source DC by GUID, and the NC to be cleaned. The target DC is cleaned using the source DC as the reference for the comparison. The reference DC must always be writable for the partition being cleaned; the target DC may be WR or RO.
It can be run in advisory mode to have the DC log an event identifying each lingering object without removing anything.
repadmin /removelingeringobjects <targetDCFQDN> <sourceDCguid> <partitionLDAPdn> /ADVISORY_MODE
This command must be run 2(Nwr-1) times to clean the writable DCs for an NC. For NCs that also have RO copies (all domain NCs), it must be run (Nro) more times.
Total runs: configuration and NDNCs need 2(N-1) per NC; domain NCs need 2(Nwr-1)+(Nro) per NC. N = # of DCs hosting the partition.
An example forest of 10 GCs, 5 domain NCs (2 writable DCs each), and 6 application partitions (ForestDnsZones hosted on all 10 DCs, and a DomainDnsZones in each domain hosted on each DC in its respective domain) will require 96 executions of repadmin.
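The arithmetic behind that figure of 96 can be sanity-checked with a few lines:

```python
# Sanity-check the 96-execution figure: per NC, writable copies need
# 2*(Nwr-1) runs of repadmin /removelingeringobjects; read-only copies
# need Nro more.
def runs_for_nc(n_writable, n_readonly=0):
    return 2 * (n_writable - 1) + n_readonly

total = (runs_for_nc(10)           # configuration NC: writable on all 10 DCs
         + runs_for_nc(10)         # ForestDnsZones: hosted on all 10 DCs
         + 5 * runs_for_nc(2)      # 5 DomainDnsZones NCs, 2 DCs each
         + 5 * runs_for_nc(2, 8))  # 5 domain NCs: 2 writable + 8 RO GC copies
print(total)  # -> 96
```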
Consider the following illustration that explains how the above methodology is the most efficient and thorough approach possible with repadmin /removelingeringobjects.
DC1,2,3,4 all host a writable copy of domain A. DC5,6,7,8,9,10 host a read only copy of domain A.
DC1 will be chosen as an initial target for this illustration. DC1 may be clean or dirty with respect to lingering objects.
1) Clean a target DC.
DC1 is now clean as compared to DC2,3,4.
DC1 now becomes the source to be used to clean DC2,3,4
2) Clean remaining DCs using the target in 1) above as the source DC.
DC2,3,4 are now clean with respect to DC1. This approach makes DC1,2,3,4 consistent with each other.
At this point any writable DC for domain A can be used as a source to clean the DCs hosting a read only copy of domain A.
DC1 will be chosen as the source DC for cleaning the DCs hosting read only copies of domain A.
3) Clean all DCs hosting a read only copy of domain A.
At this point all DCs hosting a read only copy of domain A are consistent with each other and are consistent* with the writable DCs for domain A.
* The abandoned delete scenario is not addressed by the above method. There is no in-box method to discover, report on, and remove objects that are lingering in the writable copy as compared to the read-only copy. Working with Microsoft PSS is currently necessary to leverage an internal tool that compares LDIFDE.exe dumps and reports on lingering objects in the writable partition.
So, how do you apply the above methodology to your forest?
Simple! Of course, you must have RPC connectivity between each source and target identified in the repadmin command.
Apply steps 1 & 2 for all non domain partitions. This means the configuration partition and all application partitions.
Apply steps 1 & 2 & 3 for all domain partitions.
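Putting steps 1-3 together, the full command list for a single NC can be generated mechanically. A sketch with placeholder DC identifiers (in a real run the target is given by FQDN and the source by its DSA object GUID, as the command syntax above requires):

```python
# Generate the ordered repadmin command lines for one NC, following the
# clean-the-target-first strategy above. DC identifiers are placeholders:
# in a real run the target is given by FQDN and the source by its DSA
# object GUID.
def cleanup_commands(nc_dn, writable, read_only=()):
    cmds = []
    first, *rest = writable
    for src in rest:       # step 1: clean the initial target against every peer
        cmds.append(f"repadmin /removelingeringobjects {first} {src} {nc_dn}")
    for tgt in rest:       # step 2: clean remaining writables from the clean DC
        cmds.append(f"repadmin /removelingeringobjects {tgt} {first} {nc_dn}")
    for tgt in read_only:  # step 3: clean the read-only copies
        cmds.append(f"repadmin /removelingeringobjects {tgt} {first} {nc_dn}")
    return cmds
```

For the domain A illustration above (DC1-4 writable, DC5-10 read only), this yields the expected 2(4-1)+6 = 12 command lines, in the order the steps prescribe.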
*** Note ***
There is a tool available that calls the same API used by repadmin /rlo (namely DsReplicaVerifyObjects, http://msdn.microsoft.com/en-us/library/ms676035(VS.85).aspx ) and automates the above process of cleaning all NCs in a forest with a single command line: repldiag.exe http://www.codeplex.com/ActiveDirectoryUtils/Release/ProjectReleases.aspx?ReleaseId=13664
What default logging of the process is provided during the exercise?
Every target DC will log details about the cleaning exercise such as a start event, an event for each lingering object purged, and a finish event summarizing the number of lingering objects removed.
The following is an example of the start of a clean cycle on a particular NC.
Event Type: Information
Event Source: NTDS Replication
Event Category: Replication
Event ID: 1937
Date: 11/8/2007
Time: 1:38:23 PM
User: TAILSPINTOYS\Administrator
Computer: W2K3ENTR2-VM3
Description:
Active Directory has begun the removal of lingering objects on the local domain controller. All objects on this domain controller will have their existence verified on the following source domain controller.
Source domain controller: 150efcda-20b4-4f1f-9b48-705665bfc095._msdcs.tailspintoys.com
Objects that have been deleted and garbage collected on the source domain controller yet still exist on this domain controller will be deleted. Subsequent event log entries will list all deleted objects.
Note: This is worth repeating. "Objects that have been deleted and garbage collected on the source domain controller yet still exist on this domain controller will be deleted."
If you run the same cleanup command multiple times, you may see 1945 events referencing already-deleted objects that were cleaned simply because they had been garbage collected on the source DC used in the clean command. This is of no concern, as those objects would have been purged on the next run of the garbage collection process anyway. This is more likely in larger, more dynamic environments.
Next are the events specifying the objects deemed lingering that were deleted. There will be one for every object deleted, so be sure the Directory Service event log is sufficiently large to hold all of these events for reporting, and so that other unrelated events are not lost to a full event log.
Event Type: Warning
Event Source: NTDS Replication
Event Category: Replication
Event ID: 1945
Date: 11/8/2007
Time: 1:38:52 PM
User: TAILSPINTOYS\Administrator
Computer: W2K3ENTR2-VM3
Description:
Active Directory will remove the following lingering object on the local domain controller because it had been deleted and garbage collected on the source domain controller without being deleted on this domain controller.
Object: CN=retail1003,OU=retail,DC=tailspintoys,DC=com
Object GUID: 5e83e965-f802-4d7a-8372-d35a43820515
Source domain controller: 150efcda-20b4-4f1f-9b48-705665bfc095._msdcs.tailspintoys.com
Finally, there is a summary event detailing the number of lingering objects deleted on the server.
Event Type: Information
Event Source: NTDS Replication
Event Category: Replication
Event ID: 1939
Date: 11/8/2007
Time: 1:38:52 PM
User: TAILSPINTOYS\Administrator
Computer: W2K3ENTR2-VM3
Description:
Active Directory has completed the removal of lingering objects on the local domain controller. All objects on this domain controller have had their existence verified on the following source domain controller.
Source domain controller: 150efcda-20b4-4f1f-9b48-705665bfc095._msdcs.tailspintoys.com
Number of objects deleted: 16
Objects that were deleted and garbage collected on the source domain controller yet existed on the local domain controller were deleted from the local domain controller. Past event log entries list these deleted objects.
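For reporting, the per-object 1945 events can be scraped out of an exported Directory Service event log. A minimal sketch, assuming a plain-text export in which each event carries "Object:" and "Object GUID:" fields as in the sample above (each DN on a single line):

```python
import re

# Scrape the per-object 1945 events out of an exported Directory Service
# event log for reporting. Assumes a plain-text export in which each event
# carries "Object:" and "Object GUID:" fields, with each DN on one line.
EVENT_1945 = re.compile(
    r"Object:\s*(?P<dn>\S.*?)\s*Object GUID:\s*(?P<guid>[0-9a-fA-F-]{36})")

def removed_objects(export_text):
    """Return (DN, GUID) pairs for every 1945 removal entry found."""
    return [(m.group("dn"), m.group("guid"))
            for m in EVENT_1945.finditer(export_text)]
```

Collecting these per target DC, alongside the 1939 summary counts, gives a simple record of exactly what was removed during the cleanup exercise.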