Troubleshooting KCC Event Log Errors

My name is David Everett and I’m a Support Escalation Engineer on the Directory Services Support team.

I’m going to discuss a recent trend I’ve seen where Active Directory Replication appears to be fine, but a single DC in one or more sites begins logging Knowledge Consistency Checker (KCC) Warning and Error events in the Directory Service event log. Sample events are included below.

For those not familiar with the KCC, it is a distributed application that runs on every domain controller. Each KCC creates connections between domain controllers, and together the KCCs form the replication topology. The KCC uses Active Directory data to determine where (from what source domain controller to what destination domain controller) to create these connections.

In some cases these errors are logged continuously. In others they are logged at regular intervals, clear on their own, and then reappear like clockwork. Typically other DCs in the same site(s), perhaps even in the whole forest, report no KCC errors at all. In some cases the DC logging these errors has a small number of connection objects compared with its peer DCs in the same site:

Event Type: Warning
Event Source: NTDS KCC
Event Category: (1)
Event ID: 1566
Date: 5/14/2008
Time: 1:51:23 PM
User: NT AUTHORITY\ANONYMOUS LOGON
Computer: DC1X
Description:
All domain controllers in the following site that can replicate the
directory partition over this transport are currently unavailable.

Site:
CN=SITEY,CN=Sites,CN=Configuration,DC=contoso,DC=com
Directory partition:
CN=Configuration,DC=contoso,DC=com
Transport:
CN=IP,CN=Inter-Site Transports,CN=Sites,CN=Configuration,DC=contoso,DC=com

-AND-

Event Type: Error
Event Source: NTDS KCC
Event Category: (1)
Event ID: 1311
Date: 5/14/2008
Time: 1:51:23 PM
User: NT AUTHORITY\ANONYMOUS LOGON
Computer: DC1X
Description:
The Knowledge Consistency Checker (KCC) has detected problems with the
following directory partition.

Directory partition:
CN=Configuration,DC=contoso,DC=com

There is insufficient site connectivity information in Active Directory
Sites and Services for the KCC to create a spanning tree replication topology.
Or, one or more domain controllers with this directory partition are unable
to replicate the directory partition information. This is probably due to
inaccessible domain controllers.

User Action
Use Active Directory Sites and Services to perform one of the following
actions:
- Publish sufficient site connectivity information so that the KCC can
determine a route by which this directory partition can reach this site. This is
the preferred option.
- Add a Connection object to a domain controller that contains the directory
partition in this site from a domain controller that contains the same
directory partition in another site.

If neither of the Active Directory Sites and Services tasks correct this
condition, see previous events logged by the KCC that identify the
inaccessible domain controllers.

In some cases this event is also seen; it suggests name resolution is working but a network port is blocked:

Event Type: Warning
Event Source: NTDS KCC
Event Category: (1)
Event ID: 1865
Date: 5/14/2008
Time: 1:51:23 PM
User: NT AUTHORITY\ANONYMOUS LOGON
Computer: DC1X
Description:
The Knowledge Consistency Checker (KCC) was unable to form a complete
spanning tree network topology. As a result, the following list of sites
cannot be reached from the local site.

Sites:
CN=SITEY,CN=Sites,CN=Configuration,DC=contoso,DC=com

If you encounter this issue, the DC logging the errors may be hosting the Intersite Topology Generator (ISTG) role for its site. The ISTG is responsible for maintaining all of the inter-site connection objects for the site. It polls each DC in its site for connection objects that have failed, and if the peer DCs report failures the ISTG logs these events to indicate that something is wrong with connectivity.

For those wondering what these events mean here is a quick rundown:

  • The 1311 event indicates the KCC couldn't connect up all the sites.
  • The 1566 event indicates the DC could not replicate from any server in the site identified in the event description.
  • When logged, the 1865 event contains secondary information about the failure to connect the sites and tells which sites are disconnected from the site where the KCC errors are occurring.

OK, I’ll get to the point and explain how to identify the root cause and correct this. These errors point to a topology or connectivity issue. Either there are not enough site links to connect all the sites or, more likely, network connectivity is failing between specific DCs.

If your network is not fully routed (fully routed meaning any DC in the forest can perform an RPC bind to every other DC in the forest), make certain Bridge all site links (BASL) is unchecked. If BASL is unchecked, Site Links and/or Site Link Bridges must be configured. Site Links and Site Link Bridges provide the KCC with the information it needs to build connections over existing network routes. If the network is fully routed and you have BASL checked, no change is needed.

Even where network routes exist, the ports Active Directory needs for replication must not be blocked.

This post assumes these errors continue to be logged even though the site listed in the 1566 event has been added to a site link object and the AD topology is correctly configured.

To locate the source of the KCC events and identify the root cause, you need to execute the following commands while the KCC events are being logged.

1) Identify the ISTG covering each site by running this command:

repadmin /istg

The output will list all sites in the forest and the ISTG for each site:

repadmin running command /istg against server localhost

Gathering topology from site Default-First-Site-Name (DC1.contoso.com):

                                   Site                                ISTG
================== =================
                                 SiteX                               DC1X
                                 SiteY                               DC1Y

NOTE: Determine from the output if the DC logging these events (DC1X) is the ISTG or not.
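If many sites are listed, the check in the NOTE above can be scripted. The following is a minimal Python sketch, assuming the text layout shown in the sample output (a separator row of `=` characters followed by one site/ISTG pair per line, with no spaces in either name). The function names `parse_istg` and `is_istg` are illustrative and not part of repadmin.

```python
# Illustrative helpers; parse_istg and is_istg are hypothetical names.
SAMPLE_ISTG_OUTPUT = """\
repadmin running command /istg against server localhost

Gathering topology from site Default-First-Site-Name (DC1.contoso.com):

                                   Site                                ISTG
================== =================
                                 SiteX                               DC1X
                                 SiteY                               DC1Y
"""


def parse_istg(output):
    """Return a dict mapping site name -> ISTG name from `repadmin /istg` text."""
    istg_by_site = {}
    in_table = False
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("="):
            # The separator row marks the start of the site/ISTG rows.
            in_table = True
            continue
        if in_table:
            parts = stripped.split()
            if len(parts) == 2:  # assumes names contain no spaces
                site, istg = parts
                istg_by_site[site] = istg
    return istg_by_site


def is_istg(output, dc_name):
    """True if dc_name is listed as the ISTG of any site (case-insensitive)."""
    wanted = dc_name.lower()
    return any(name.lower() == wanted for name in parse_istg(output).values())


print(parse_istg(SAMPLE_ISTG_OUTPUT))
print(is_istg(SAMPLE_ISTG_OUTPUT, "DC1X"))
```

In practice the text would come from redirecting `repadmin /istg` to a file and reading that file instead of the embedded sample.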

2) If the DC logging the events is the ISTG, any of the DCs in the same site as this ISTG could have connectivity issues to the site identified in the 1566 event. To identify which DC(s) are failing to replicate from that site, run repadmin /failcache against all DCs in the ISTG's site. For example, DC1X is logging the events and it is the ISTG for siteX. To identify which DCs in siteX are failing to replicate from siteY, run this command:

repadmin /failcache site:siteX >siteX-failcache.txt

The failcache output shows two DCs in siteX:

repadmin running command /failcache against server DC1X._msdcs.contoso.com

==== KCC CONNECTION FAILURES ===========================
(none)

==== KCC LINK FAILURES ===============================
SiteY\DC1Y
    DC object GUID: 7c2eb482-ad81-4ba7-891e-9b77814f7473
    No Failures.

repadmin running command /failcache against server DC2X._msdcs.contoso.com

==== KCC CONNECTION FAILURES ===========================
(none)

==== KCC LINK FAILURES ===============================
SiteY\DC1Y
    DC object GUID: 7c2eb482-ad81-4ba7-891e-9b77814f7473
    46 consecutive failures since 2008-08-12 22:14:39.
SiteZ\DC1Z
    DC object GUID: fh3h8bde-a928-466a-97b0-39a507acbe54
    No Failures.

The output above identifies DC2X as the Destination DC in siteX that is failing to inbound replicate from siteY. In some cases the DC name is not resolved and shows as a GUID-based name (s9hr423d-a477-4285-adc5-2644b5a170f0._msdcs.contoso.com). If the DC name is not resolved, determine the hostname of the Destination DC by pinging the fully qualified CNAME:

ping s9hr423d-a477-4285-adc5-2644b5a170f0._msdcs.contoso.com
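The same lookup can be done from a script. This is a minimal Python sketch, assuming the GUID-based name follows the `<GUID>._msdcs.<forest root>` pattern shown above and that the CNAME resolves to the DC's host record. The helper names are illustrative; `socket.gethostbyname_ex` is the standard library call that returns the canonical host name, and the `resolver` parameter exists only so the function can be exercised without a live DNS server.

```python
import socket


def dsa_guid_cname(dsa_guid, forest_root):
    """Build the GUID-based alias for a DC, e.g. <guid>._msdcs.contoso.com."""
    return f"{dsa_guid}._msdcs.{forest_root}"


def resolve_dc_hostname(dsa_guid, forest_root, resolver=socket.gethostbyname_ex):
    """Return the canonical host name the GUID-based CNAME points to.

    The default resolver raises socket.gaierror if the name cannot be
    resolved, which is itself a useful signal of a DNS problem.
    """
    canonical_name, _aliases, _addresses = resolver(
        dsa_guid_cname(dsa_guid, forest_root)
    )
    return canonical_name


# Example call (needs working DNS for the forest, so it is left commented out):
# resolve_dc_hostname("s9hr423d-a477-4285-adc5-2644b5a170f0", "contoso.com")
```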

NOTE: DC2X may or may not be logging Error events in its Directory Service event log the way DC1X, the ISTG, is.
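In a site with many DCs, the saved failcache file can be long. As a rough sketch, the Python below pulls out only the entries that report consecutive failures. It assumes the text layout shown in the sample output above and matches across whitespace, so it does not depend on exactly where the line breaks fall. It does not distinguish the CONNECTION FAILURES section from the LINK FAILURES section, and `failing_links` is an illustrative name, not a repadmin feature.

```python
import re

# Each per-DC report starts with this line; the capture is the target server.
SERVER_RE = re.compile(r"running command /failcache against server (\S+)")

# One source entry: Site\DC, its GUID, then either "No Failures." or a count.
ENTRY_RE = re.compile(
    r"(?P<source>\S+\\\S+)\s+"
    r"DC object GUID:\s*(?P<guid>\S+)\s+"
    r"(?:No Failures\.|"
    r"(?P<count>\d+) consecutive failures since (?P<since>[\d-]+ [\d:]+)\.)"
)

SAMPLE_FAILCACHE = r"""
repadmin running command /failcache against server DC1X._msdcs.contoso.com

==== KCC LINK FAILURES ===============================
SiteY\DC1Y
    DC object GUID: 7c2eb482-ad81-4ba7-891e-9b77814f7473
    No Failures.

repadmin running command /failcache against server DC2X._msdcs.contoso.com

==== KCC LINK FAILURES ===============================
SiteY\DC1Y
    DC object GUID: 7c2eb482-ad81-4ba7-891e-9b77814f7473
    46 consecutive failures since 2008-08-12 22:14:39.
"""


def failing_links(report):
    """Return (destination, source, failure_count, since) for failing entries."""
    results = []
    # re.split with one capture group yields [prefix, server1, body1, server2, body2, ...]
    parts = SERVER_RE.split(report)
    for server, body in zip(parts[1::2], parts[2::2]):
        for match in ENTRY_RE.finditer(body):
            if match.group("count"):
                results.append(
                    (server, match.group("source"),
                     int(match.group("count")), match.group("since"))
                )
    return results


for destination, source, count, since in failing_links(SAMPLE_FAILCACHE):
    print(f"{destination} <- {source}: {count} failures since {since}")
```

The destination reported for each failing entry is the DC to log on to in the next step.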

3) Log on to the Destination DC identified in the previous step and determine if RPC connectivity from the Destination DC to the Source DC (DC1Y) is working.

repadmin /bind DC1Y.contoso.com

  • If “repadmin /bind DC1Y” from the Destination DC succeeds:

Run “repadmin /showrepl <Destination DC>” and examine the output to determine if Active Directory Replication is blocked. The reason for replication failure should be identified in the output. Take the appropriate corrective action to get replication working.

  • If “repadmin /bind DC1Y” from the Destination DC fails:

Verify firewall rules are not interfering with connectivity between the Destination DC and the Source DC. If the port blockage between the Destination DC and the Source DC cannot be resolved, configure the other DCs in the site where the errors are logged to be Preferred Bridgeheads and force the KCC to build new connection objects with the Preferred Bridgeheads only.

NOTE: Running “repadmin /bind DC1Y” from the ISTG logging the KCC errors may reveal no connectivity issues to DC1Y in the remote site. As noted earlier, the ISTG is responsible for maintaining inter-site connectivity and may not be the DC having the problem. For this reason the command must be run from the Destination DC that repadmin /failcache identified as failing to inbound replicate.
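When the bind fails and a firewall is suspected, a plain TCP connection test from the Destination DC can help narrow things down. The Python sketch below checks whether a TCP port accepts connections; port 135 (the RPC endpoint mapper) and port 389 (LDAP) are used as examples. It is only a partial check: replication also uses dynamically assigned RPC ports that this sketch does not test, so a successful result does not prove replication traffic can pass. The function names are illustrative.

```python
import socket

# Example ports only; dynamically assigned RPC ports are not covered here.
EXAMPLE_PORTS = {135: "RPC endpoint mapper", 389: "LDAP"}


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_dc(host, ports=EXAMPLE_PORTS):
    """Return a dict mapping each port to True (reachable) or False."""
    return {port: tcp_port_open(host, port) for port in ports}


# Example call, run from the Destination DC (left commented out because it
# needs the real Source DC to be resolvable):
# print(check_dc("DC1Y.contoso.com"))
```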

A successful bind looks similar to this:

C:\>repadmin /bind DC1Y
Bind to DC1Y succeeded.
NTDSAPI V1 BindState, printing extended members.
    bindAddr: DC1Y
Extensions supported (cb=48):
    BASE                             : Yes
    ASYNCREPL                        : Yes
    REMOVEAPI                        : Yes
    MOVEREQ_V2                       : Yes
    GETCHG_COMPRESS                  : Yes
    DCINFO_V1                        : Yes
    RESTORE_USN_OPTIMIZATION         : Yes
    KCC_EXECUTE                      : Yes
    ADDENTRY_V2                      : Yes
    LINKED_VALUE_REPLICATION         : Yes
    DCINFO_V2                        : Yes
    INSTANCE_TYPE_NOT_REQ_ON_MOD     : Yes
    CRYPTO_BIND                      : Yes
    GET_REPL_INFO                    : Yes
    STRONG_ENCRYPTION                : Yes
    DCINFO_VFFFFFFFF                 : Yes
    TRANSITIVE_MEMBERSHIP            : Yes
    ADD_SID_HISTORY                  : Yes
    POST_BETA3                       : Yes
    GET_MEMBERSHIPS2                 : Yes
    GETCHGREQ_V6 (WHISTLER PREVIEW)  : Yes
    NONDOMAIN_NCS                    : Yes
    GETCHGREQ_V8 (WHISTLER BETA 1)   : Yes
    GETCHGREPLY_V5 (WHISTLER BETA 2) : Yes
    GETCHGREPLY_V6 (WHISTLER BETA 2) : Yes
    ADDENTRYREPLY_V3 (WHISTLER BETA 3): Yes
    GETCHGREPLY_V7 (WHISTLER BETA 3) : Yes
    VERIFY_OBJECT (WHISTLER BETA 3)  : Yes
    XPRESS_COMPRESSION               : Yes
    DRS_EXT_ADAM                     : No
Site GUID: stn45bf5-f33f-4d53-9b1b-e7a0371f9a3d
Repl epoch: 0
Forest GUID: idk4734-eeca-11d2-a5d8-00805f9f21f5
Security information on the binding is as follows:
    SPN Requested:  LDAP/DC1Y
    Authn Service:  9
    Authn Level:  6
    Authz Service:  0

4) If these events occur at specific times of the day or week and then resolve on their own, verify DNS Scavenging is not set too aggressively. Scavenging may be so aggressive that SRV, A, CNAME and other valid records are purged from DNS, causing name resolution between DCs to fail. If this is the behavior you are seeing, verify scavenging settings on these DNS zones:

  • _msdcs.forestroot.com
  • forestroot.com
  • Scavenging settings need to be checked on child domains if the Source or Destination DCs are in child domains.

Example: if Scavenging is set this way the outage will occur every 24 hours:

Non-refresh period: 8 hours
Refresh period: 8 hours
Scavenging period: 8 hours

To correct this, change the Refresh and Non-refresh periods to 1 day each and set the Scavenging period to 3 days. See Managing the aging and scavenging of server data on TechNet to configure these settings for the DNS Server and/or zones.
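The arithmetic behind the example can be sketched in a few lines of Python. A record becomes eligible for scavenging once the Non-refresh and Refresh periods have both elapsed since its timestamp was last updated. The 24-hour re-registration interval used below is an assumption inferred from the "every 24 hours" outage described above, not a value taken from your environment, and the model is simplified: it ignores when the scavenging run itself fires. The function names are illustrative.

```python
from datetime import timedelta


def stale_after(no_refresh, refresh):
    """Time after the last timestamp update at which a record can be scavenged."""
    return no_refresh + refresh


def at_risk(no_refresh, refresh, reregistration):
    """True if a record can go stale before the DC registers it again."""
    return stale_after(no_refresh, refresh) < reregistration


# Assumed re-registration interval (hypothetical; check your own DCs).
REREGISTRATION = timedelta(hours=24)

aggressive = at_risk(timedelta(hours=8), timedelta(hours=8), REREGISTRATION)
recommended = at_risk(timedelta(days=1), timedelta(days=1), REREGISTRATION)

print("8h/8h settings at risk:", aggressive)
print("1d/1d settings at risk:", recommended)
```

With the 8 hour settings the record is eligible for scavenging 16 hours after its last update, well before the assumed 24-hour re-registration, which matches the daily outage pattern. With 1 day each, the record is not eligible until 48 hours have passed.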

Hopefully this clears up the mysterious KCC errors on that one DC.

- David Everett