Troubleshooting Active Directory Replication

[Today's post comes to us courtesy of Lavinder Kumar.]

Though uncommon, there are times when we see an SBS network with multiple domain controllers placed in multiple sites. This is mainly done to keep LDAP queries local to each site and reduce bandwidth consumption across site links. Often, though, replication breaks, causing various problems at one or more of these sites. So what should you do when you see replication problems? We try to answer that question in the rest of this article.

Active Directory replication depends essentially on four components:

1. Name Resolution

2. Remote Procedure Call

3. Kerberos

4. The AD JET Database (ntds.dit)

We will cover each of these components in turn to provide a step-by-step approach to troubleshooting Active Directory replication.

Before you start troubleshooting replication issues, you might want to install some tools on the server, so that no time is wasted downloading them later:

1. Support Tools for Windows Server 2003: https://www.microsoft.com/downloads/details.aspx?FamilyId=6EC50B78-8BE1-4E81-B3BE-4E7AC4F0912D&displaylang=en

2. PortQry.exe:

https://www.microsoft.com/downloads/details.aspx?familyid=89811747-c74b-4638-a2d5-ac828bdc6983&displaylang=en

3. RPCDump.exe:

https://www.microsoft.com/downloads/details.aspx?FamilyID=9d467a69-57ff-4ae7-96ee-b18c4790cffd&displaylang=en

4. KList.exe: This tool is a part of the Windows Server 2003 Resource Kit. The resource kit can be downloaded from https://www.microsoft.com/downloads/details.aspx?familyid=9D467A69-57FF-4AE7-96EE-B18C4790CFFD&displaylang=en

Once the support tools are installed on the server, you can use the repadmin tool with the “/showreps” switch to find out whether replication is working or broken. KB 232072 provides a good overview of using repadmin to initiate and troubleshoot the replication process.

Running repadmin /showreps on a domain controller will show all replication partners that this domain controller has, along with a summary of the last few replication attempts with these partners. It will also show the result of these replication attempts.

The output of the repadmin /showreps command on my SBS server is shown below:

Default-First-Site-Name\SBS-LAVINK

DC Options: IS_GC

Site Options: (none)

DC object GUID: 7b0f4476-e134-406f-a742-05aa06a2c974

DC invocationID: 7b0f4476-e134-406f-a742-05aa06a2c974

==== INBOUND NEIGHBORS ======================================

DC=SBS,DC=LOCAL

    Default-First-Site-Name\WIN2K3 via RPC

        DC object GUID: d6234c9f-30c3-4cd3-9172-a4a0ea8ef571

        Last attempt @ 2008-03-24 15:28:29 was successful.

CN=Configuration,DC=SBS,DC=LOCAL

    Default-First-Site-Name\WIN2K3 via RPC

        DC object GUID: d6234c9f-30c3-4cd3-9172-a4a0ea8ef571

        Last attempt @ 2008-03-24 15:28:28 was successful.

CN=Schema,CN=Configuration,DC=SBS,DC=LOCAL

    Default-First-Site-Name\WIN2K3 via RPC

        DC object GUID: d6234c9f-30c3-4cd3-9172-a4a0ea8ef571

        Last attempt @ 2008-03-24 15:28:29 was successful.

*************************************************************************************

From the output above, we will be using the “DC object GUID” information for further troubleshooting.
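If you need to check several partners, it can help to pull the GUIDs and results out of the output with a script. The snippet below is a minimal sketch in Python, assuming the output is laid out like the sample above; the function name parse_showreps and the dictionary keys are ours, not part of repadmin, and any attempt that is not reported as "was successful" is simply flagged as not OK.

```python
import re

# Shortened copy of the sample output shown above.
SAMPLE = """\
==== INBOUND NEIGHBORS ======================================
DC=SBS,DC=LOCAL
    Default-First-Site-Name\\WIN2K3 via RPC
        DC object GUID: d6234c9f-30c3-4cd3-9172-a4a0ea8ef571
        Last attempt @ 2008-03-24 15:28:29 was successful.
CN=Configuration,DC=SBS,DC=LOCAL
    Default-First-Site-Name\\WIN2K3 via RPC
        DC object GUID: d6234c9f-30c3-4cd3-9172-a4a0ea8ef571
        Last attempt @ 2008-03-24 15:28:28 was successful.
"""


def parse_showreps(text):
    """Return one dict per inbound neighbor entry found in the text."""
    entries = []
    context = None      # naming context currently being read
    current = None      # neighbor entry currently being filled in
    in_inbound = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("===="):
            in_inbound = "INBOUND NEIGHBORS" in line
            continue
        if not in_inbound:
            continue
        if raw[0] not in " \t":
            # Unindented lines inside the section name the naming context.
            context = line
            current = None
            continue
        m = re.match(r"(?P<site>[^\\]+)\\(?P<dc>\S+) via (?P<transport>\S+)$", line)
        if m:
            current = {"context": context, **m.groupdict(),
                       "guid": None, "last_attempt": None, "ok": None}
            entries.append(current)
            continue
        if current is None:
            continue
        m = re.match(r"DC object GUID: ([0-9a-fA-F-]{36})$", line)
        if m:
            current["guid"] = m.group(1).lower()
            continue
        m = re.match(r"Last attempt @ (\S+ \S+) (.*)$", line)
        if m:
            current["last_attempt"] = m.group(1)
            current["ok"] = m.group(2).startswith("was successful")
    return entries


for entry in parse_showreps(SAMPLE):
    print(entry["context"], entry["dc"], entry["guid"], entry["ok"])
```

Feeding it the saved output of repadmin /showreps gives you a quick list of the partners whose GUIDs you need for the name resolution checks.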

Name Resolution

Replication depends on working name resolution. Each domain controller should be able to resolve the _msdcs record of the other domain controllers. The _msdcs record can be formed using the DC object GUID from the repadmin output. Alternatively, you can look it up in the DNS console. The screenshot below shows the _msdcs records in the DNS console:

[Screenshot: _msdcs records in the DNS console]

Once you have the _msdcs record or the DC object GUID for the replicating partner, you can use nslookup to check name resolution.

Checking Name Resolution with NSLOOKUP:

1. Open the command prompt.

2. Type in the command below:

NSLOOKUP <GUID._msdcs.DnsForestName>

Note: With the setup we are using, the command will translate to:

NSLOOKUP d6234c9f-30c3-4cd3-9172-a4a0ea8ef571._msdcs.sbs.local

3. If NSLOOKUP resolves to the correct IP address of the replication partner, DNS is working.

For the repadmin output above, working name resolution will show the results below:

NSLOOKUP d6234c9f-30c3-4cd3-9172-a4a0ea8ef571._msdcs.sbs.local

Server: sbs-lavink.sbs.local

Address: 192.168.0.1

Name: win2k3.sbs.local

Address: 192.168.0.2

Aliases: d6234c9f-30c3-4cd3-9172-a4a0ea8ef571._msdcs.sbs.local

Note: Make sure that the DNS server answering the NSLOOKUP query is the correct DNS server, and that the IP address returned belongs to the correct server.

If the IP address returned is not the correct IP address, then the DNS server does not have the correct record for the replication partner. If this is the case, you need to configure the correct records. You can do this from the DNS console.
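The record name is easy to mistype by hand, so here is a small Python sketch that builds it from the DC object GUID and the forest name and compares the resolved address with the one you expect. The function names are our own, and the lookup goes through the operating system's resolver rather than NSLOOKUP, so a cached or hosts-file entry could change the answer.

```python
import socket
import uuid


def msdcs_record(dc_guid, forest_dns_name):
    """Build <GUID>._msdcs.<DnsForestName> from its two parts."""
    guid = str(uuid.UUID(dc_guid))  # validates the GUID and lowercases it
    return f"{guid}._msdcs.{forest_dns_name.strip('.').lower()}"


def resolves_to(name, expected_ip, resolver=socket.gethostbyname_ex):
    """Return True if `name` resolves to `expected_ip`, False otherwise."""
    try:
        _, _, addresses = resolver(name)
    except OSError:
        # Lookup failures (unknown host, no DNS server) count as "not resolved".
        return False
    return expected_ip in addresses


record = msdcs_record("d6234c9f-30c3-4cd3-9172-a4a0ea8ef571", "sbs.local")
print(record)
# On the network from this article you would then check:
# resolves_to(record, "192.168.0.2")
```

The resolver parameter is there only so the check can be exercised without a live DNS server.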

Once you are sure name resolution is working, you can proceed to RPC.

Remote Procedure Call

Now that you are sure name resolution is working, you need to make sure that the systems in question can communicate using Remote Procedure Call (RPC). When checking RPC, confirm two things:

1. Verify that the two replication partners can communicate with each other on port 135.

2. Verify that the Directory Replication Service on the replication partner is in a listening state.

A simple test to confirm both servers are listening on port 135 would be to telnet from one to the other on that port. Alternatively, you can use portqry.exe or rpcdump.exe to check the ports.
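If telnet is not handy, the same connection test can be sketched in a few lines of Python. This is only an illustration with a helper name of our own choosing; like telnet, it confirms that a TCP connection to port 135 can be opened, not that the Directory Replication Service itself is listening.

```python
import socket


def tcp_port_open(host, port=135, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or the name did not resolve.
        return False


# Example (host name taken from the repadmin output in this article):
# print(tcp_port_open("win2k3.sbs.local"))
```

Run it from each replication partner against the other, since a firewall may block only one direction.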

KB 310456 provides good information on troubleshooting AD replication using portqry.exe. To use RPCDump.exe to verify communication, use the command below:

rpcdump /s <partner_dc> /v /i >endpoints.txt

where <partner_dc> is the DC with which replication is failing.

If the first line of endpoints.txt shows a failure, there is a problem with port 135 between the two replication partners. A good result for RPCDump.exe is shown below:

ProtSeq:ncacn_ip_tcp

Endpoint:1025

NetOpt:

Annotation:MS NT Directory DRS Interface

IsListening:YES

StringBinding:ncacn_ip_tcp:192.168.0.2[1025]

UUID:e3514235-4b06-11d1-ab04-00c04fc2dcd2

ComTimeOutValue:RPC_C_BINDING_DEFAULT_TIMEOUT

VersMajor 4 VersMinor 0

If the IsListening result is YES, then port 135 and the port used by the Active Directory Replication Service are accessible. In the output above, the Active Directory Replication Service is listening on port 1025. If the endpoints.txt file does not show any issue, you can proceed to the next step.
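The endpoints.txt file can be long, so a short script can pick out the entry that matters. The Python sketch below assumes each endpoint is written as "Key:Value" lines starting with ProtSeq, as in the sample above, and looks for the UUID shown next to the "MS NT Directory DRS Interface" annotation; the function names are ours.

```python
# UUID shown in the sample for the "MS NT Directory DRS Interface" annotation.
DRS_UUID = "e3514235-4b06-11d1-ab04-00c04fc2dcd2"

SAMPLE = """\
ProtSeq:ncacn_ip_tcp
Endpoint:1025
NetOpt:
Annotation:MS NT Directory DRS Interface
IsListening:YES
StringBinding:ncacn_ip_tcp:192.168.0.2[1025]
UUID:e3514235-4b06-11d1-ab04-00c04fc2dcd2
ComTimeOutValue:RPC_C_BINDING_DEFAULT_TIMEOUT
VersMajor 4 VersMinor 0
"""


def parse_endpoints(text):
    """Split rpcdump-style output into one dict per endpoint record."""
    records, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue  # blank lines and lines without a key are ignored
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "ProtSeq":
            # Assumption: every endpoint record starts with its ProtSeq line.
            current = {}
            records.append(current)
        if current is not None:
            current[key] = value.strip()
    return records


def drs_listeners(records):
    """Return the records for the DRS interface that report IsListening:YES."""
    return [r for r in records
            if r.get("UUID", "").lower() == DRS_UUID
            and r.get("IsListening", "").upper() == "YES"]


for rec in drs_listeners(parse_endpoints(SAMPLE)):
    print(rec["ProtSeq"], rec["Endpoint"], rec["StringBinding"])
```

An empty result from drs_listeners on a real endpoints.txt is a hint to look at the file by hand, not proof of a failure.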

Kerberos

Before you actually start checking Kerberos, you need to make sure that the time on both replication partners is synchronized. At a command prompt, run the commands below:

1. Net time

This will show you the time on the PDC. On an SBS network, this will be the time on the SBS server, since the SBS will be the PDC role holder.

2. If the time returned differs from the local time by more than 5 minutes, you need to synchronize the time between the two servers. On the non-SBS server, run the following command:

Net time \\sbs-lavink /set /y

The /set switch queries and synchronizes the time with the specified domain controller.
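To make the 5-minute rule concrete, here is a tiny Python sketch that compares two clock readings. The function names and the sample times are made up for illustration; in practice you would plug in the time reported by Net time and the local time of the other server.

```python
from datetime import datetime, timedelta

# The 5-minute offset mentioned above.
MAX_SKEW = timedelta(minutes=5)


def clock_skew(local_time, reference_time):
    """Absolute difference between the two clock readings."""
    return abs(local_time - reference_time)


def within_kerberos_tolerance(local_time, reference_time, max_skew=MAX_SKEW):
    """Return True if the two clocks are close enough for Kerberos."""
    return clock_skew(local_time, reference_time) <= max_skew


# Hypothetical readings: the PDC, and a partner running 7 minutes fast.
pdc = datetime(2008, 3, 24, 15, 28, 0)
partner = datetime(2008, 3, 24, 15, 35, 0)
print(clock_skew(partner, pdc), within_kerberos_tolerance(partner, pdc))
```

Both readings must be in the same time zone (or both in UTC) for the comparison to mean anything.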

Once you’re sure that the time on both machines is synchronized, proceed to check the userAccountControl value and the Kerberos trust. The procedure is described below:

1. Ensure the Kerberos Key Distribution Center service is running on all Domain Controllers.

2. Ensure Trust computer for delegation is checked in the Properties of the Domain Controller in Active Directory Users and Computers.

[Screenshot: Trust computer for delegation setting in the Domain Controller's Properties]

3. Using Adsiedit.msc from the Support Tools, confirm that the userAccountControl attribute is set to 532480.

4. Open Adsiedit.msc, and in the left pane, browse to the server object.

[Screenshot: server object in Adsiedit.msc]

5. Right-click the server object, open its properties, and check the value of the userAccountControl attribute.

[Screenshot: properties of the server object in Adsiedit.msc]
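The value 532480 is the sum of two userAccountControl flags: SERVER_TRUST_ACCOUNT (8192) and TRUSTED_FOR_DELEGATION (524288). If you find a different number in Adsiedit, the Python sketch below can help you see which flags are set. It lists only a handful of common flags, and the function name is our own.

```python
# A few common userAccountControl flags (not the complete list).
UAC_FLAGS = {
    0x0002: "ACCOUNTDISABLE",
    0x0200: "NORMAL_ACCOUNT",
    0x1000: "WORKSTATION_TRUST_ACCOUNT",
    0x2000: "SERVER_TRUST_ACCOUNT",
    0x10000: "DONT_EXPIRE_PASSWORD",
    0x80000: "TRUSTED_FOR_DELEGATION",
}

# The value a domain controller is expected to have, per the step above.
EXPECTED_DC_VALUE = 532480


def decode_uac(value):
    """Return (names of the known flags that are set, leftover unknown bits)."""
    names = [name for bit, name in sorted(UAC_FLAGS.items()) if value & bit]
    known = sum(bit for bit in UAC_FLAGS if value & bit)
    return names, value - known


print(decode_uac(EXPECTED_DC_VALUE))
print(decode_uac(4096))  # a member computer account, for comparison
```

A domain controller whose value decodes to WORKSTATION_TRUST_ACCOUNT, for example, is being treated as an ordinary member computer.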

Resetting the password and refreshing the Kerberos ticket

Once you’ve verified that the userAccountControl attribute is set correctly, and that Kerberos Trust is correctly configured, reset the Kerberos ticket on the problematic server. Follow the procedure below to do this:

1. Stop the KDC service on the downstream partner. The downstream partner is the partner that is pulling replication. Basically, this will be the failing DC in the repadmin /showreps output.

2. Purge the Kerberos ticket on the DC which is failing replication. Use klist.exe to purge the tickets.

From the command prompt, run:

Klist.exe purge

Type a “y” when prompted, and hit enter.

3. Reset the secure channel on the DC that is failing replication. You can use the netdom command to reset the secure channel. KB 325850 explains the use of the netdom command to reset the secure channel on a domain controller.

4. Access the PDC (the Small Business Server) from the DC that is failing replication, using the FQDN. This will request a new Kerberos ticket. To do this, from the command prompt, use the command below:

Net use \\sbs-lavink.sbs.local\ipc$

5. Force replication from the PDC using Active Directory Sites and Services.

Check Kerberos fragmentation between the replication partners

If resetting the secure channel and refreshing the Kerberos tickets still doesn’t help, there might be packet fragmentation between the two domain controllers, which can cause replication to fail. To test for packet fragmentation between the two servers, run the command below from the DC that is failing replication:

Ping hostname_PDC -f -l 1472

This command sends ICMP packets with a 1472-byte payload and the Don't Fragment flag set (-f), and waits for a reply. Together with the 8-byte ICMP header and the 20-byte IP header, that makes a 1500-byte packet, the standard Ethernet MTU. If a packet of that size has to be fragmented somewhere along the path, we won’t receive a reply from the PDC; ping typically reports “Packet needs to be fragmented but DF set.” instead. If the output shows fragmentation, you might want to troubleshoot the network for fragmentation. Information on troubleshooting packet fragmentation is available in KB 314825. If you still see problems with replication after following that article, force Kerberos to use TCP. The procedure is explained here:

1. Start the Registry Editor.

2. Expand to:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa\Kerberos\Parameters

If the Parameters key does not exist, create it.

3. Create the value below:

Value Name: MaxPacketSize

Value Type: DWORD

Value Data: 1

4. Save the registry value, and restart the server.

By default, Kerberos uses UDP, which is a connectionless protocol. If a packet is fragmented, there is no way to know whether it has been lost. The registry change we made above forces Kerberos to use TCP instead of UDP. Since TCP is a connection-oriented protocol and provides a means of retransmitting lost packets, we will be able to see where and how communication is failing in a Netmon trace. This information is also available in KB 244474.
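If you need to make the same change on more than one server, you may prefer a .reg file to typing the value by hand. The Python sketch below only generates the text of such a file for the key and value described above; the helper name reg_dword_file is ours, and you should review the result before importing it on any server.

```python
KERBEROS_PARAMETERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa\Kerberos\Parameters"
)


def reg_dword_file(key, name, value):
    """Return the text of a .reg file that sets one DWORD value."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("a DWORD must fit in 32 bits")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        f'"{name}"=dword:{value:08x}',  # DWORD data is written as 8 hex digits
        "",
    ]
    return "\r\n".join(lines)


print(reg_dword_file(KERBEROS_PARAMETERS_KEY, "MaxPacketSize", 1))
```

Importing the file has the same effect as steps 2 and 3, and the server still needs the restart from step 4.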

If network problems were causing the replication failure before you made these changes, replication should work now. If both Domain Controllers are communicating successfully and replication is still failing, we need to turn our attention to the Active Directory Jet Database for further troubleshooting.

NTDS.dit – The Active Directory Jet Database

Active Directory writes its information to NTDS.dit, which is a Jet Database. Along with the NTDS.dit file, AD makes use of transaction logs and check files to commit data into NTDS.dit. The default location for the database and related files is C:\Windows\NTDS.

Data is first written to the log files before being committed to the NTDS.dit file. Check files keep track of which logged changes have been committed to the database.

You might want to keep in mind the following pointers when checking the Jet database:

1. Is an antivirus scanning the Jet database?

The antivirus program should not scan the log files or the database. If it keeps a transaction log file open for scanning, Active Directory will be unable to write changes into the database. It is highly recommended that you set exclusions for the C:\Windows\NTDS folder.

2. Do you have missing log files?

The Jet database engine does not write changes directly to the database. Changes are first written to the transaction logs and then committed to the database. A check file is created to mark the transactions that have been successfully written to the database. This prevents data corruption in case of a power failure. After a power failure, Active Directory starts when the server is restarted and looks for the log files that have not yet been written to the database. If those log files are missing, the Jet database will not initialize successfully.
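A quick way to spot a hole in the log sequence is to list the file names and look for gaps in the numbering. The Python sketch below assumes the older log files are named edbXXXXX.log, where XXXXX is a five-digit hexadecimal sequence number, next to the current edb.log; both that naming pattern and the function name are assumptions on our part, so compare the result with what you actually see in the folder.

```python
import re

# Assumed naming pattern: edb00001.log, edb00002.log, ... edb0000A.log, ...
LOG_NAME = re.compile(r"^edb([0-9a-f]{5})\.log$", re.IGNORECASE)


def missing_log_generations(filenames):
    """Return the sequence numbers missing between the lowest and highest log."""
    matches = (LOG_NAME.match(name) for name in filenames)
    present = {int(m.group(1), 16) for m in matches if m}
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


# Hypothetical directory listing with one log missing from the sequence.
listing = ["ntds.dit", "edb.chk", "edb.log",
           "edb00009.log", "edb0000A.log", "edb0000C.log"]
print([f"edb{n:05X}.log" for n in missing_log_generations(listing)])
```

On a server you would pass in os.listdir(r"C:\Windows\NTDS"). A gap is only a hint that a log file may have been removed or quarantined; it does not by itself prove that the database is damaged.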

Our intention here is to help you eliminate points of failure when troubleshooting AD replication. This article only covers the basic troubleshooting concepts. If the steps above don't help, please call PSS for further assistance. If you have any questions, please let us know so we can answer them.