I was well and truly stumped a few months ago. I joke that once a year I am flat out wrong, and rarely do I have nothing to say on a subject. The 'once a year I may be flat out wrong' statement may be true simply because after 15 years in the IT industry I’ve learned to avoid letting broad definitive statements out of my mouth unless I am certain. I also rarely say something is impossible.
Too frequently in the past I’ve been proven wrong after such proclamations. Oh the embarrassment in the IT world if you are not accurate!
So after reviewing the issue below I was stumped, out of ideas, stymied, at a loss, and bewildered. Here’s the deal.
Our customer had recently rolled out their first Server 2008 domain controllers. They didn’t use DCPROMO “answer files”, custom builds, or images of the operating system, and the DCs were not installed as Core or Read Only DCs. Shortly after they were promoted, though, the DCs would respond oddly. For one thing, workstations were not getting services from these DCs for the most part (file access worked well, but LDAP binds would fail). The big symptom was that recently created objects in Active Directory were not being received by the new DCs via AD replication. These odd circumstances were only occurring with the 2008 DCs, and didn’t seem to happen immediately following promotion.
You can imagine the disappointment of the users whose accounts those were. There were also DNS Server events (yes, this server also hosted DNS, as do many DCs):
Log Name: DNS Server
Date: 9/16/2008 3:41:01 PM
Event ID: 4000
Task Category: None
The DNS server was unable to open Active Directory. This DNS server is configured to obtain and use information from the directory for this zone and is unable to load the zone without it. Check that the Active Directory is functioning properly and reload the zone. The event data is the error code.
Event ID: 4015
The DNS server has encountered a critical error from the Active Directory. Check that the Active Directory is functioning properly. The extended error debug information (which may be empty) is "000004DC: LdapErr: DSID-0C0906DD, comment: In order to perform this operation a successful bind must be completed on the connection., data 0, v1771". The event data contains the error.
Rather than speculate needlessly I asked the customer to run the Microsoft Support Diagnostic Tool (MSDT) against the problematic DCs. Here are some of the results. I’m including some tests that passed but should have failed if something catastrophic were going on with these DCs.
Testing server: Columbia\DC21
Starting test: Connectivity
* Active Directory LDAP Services Check
Determining IP4 connectivity
Determining IP6 connectivity
* Active Directory RPC Services Check
......................... DC21 passed test Connectivity
Starting test: Advertising
The DC DC21 is advertising itself as a DC and having a DS.
The DC DC21 is advertising as an LDAP server
The DC DC21 is advertising as having a writeable directory
The DC DC21 is advertising as a Key Distribution Center
The DC DC21 is advertising as a time server
The DS DC21 is advertising as a GC.
......................... DC21 passed test Advertising
So the above tests and errors showed that the DNS Server service couldn’t start because the Active Directory wasn’t running successfully, but the normal tests which show whether the Active Directory is working were claiming it was fine.
So I remotely connected and attempted to do an LDAP bind to the local DC using LDP.EXE but got this response:
Error: An LDAP lookup operation failed with the following error:
LDAP Error 49(0x31): Invalid Credentials
Server Win32 Error 2148074252(0x8009030c): The logon attempt failed
Extended Information: 8009030C: LdapErr: DSID-0C0904D1, comment: AcceptSecurityContext error, data 52e, v1771
Now that was really odd given that the DCDIAG tests above were successful. In most situations where a bind fails, the same error shows up in the diagnostic tests. But not in this one.
By stopping the Kerberos Key Distribution Center service and flushing the Kerberos tickets we were able to see in a network trace that DC21 was requesting a Kerberos ticket for the service LDAP/local. No DC registers this service principal name for itself, and when I checked in AD I was able to confirm that there was no SPN registered by that name in DC21’s servicePrincipalName attribute.
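If you want to pull the pieces out of an extended error string like the ones above, here is a minimal sketch in Python. The function name, the regular expression, and the one-entry lookup table are all hypothetical and written only for this post; they are not part of LDP.EXE or any Microsoft tool. The one fact it relies on is that the "data" field is a hexadecimal Win32 error code, and that 0x52e (1326) is ERROR_LOGON_FAILURE, which matches the "logon attempt failed" text in the output.

```python
import re

# Hypothetical, deliberately tiny lookup table: only the code seen in this post.
WIN32_NAMES = {
    0x52E: "ERROR_LOGON_FAILURE",  # decimal 1326
}

# Matches strings shaped like:
#   8009030C: LdapErr: DSID-0C0904D1, comment: <text>, data 52e, v1771
PATTERN = re.compile(
    r"(?P<code>[0-9A-Fa-f]{8}): LdapErr: (?P<dsid>DSID-[0-9A-Fa-f]+), "
    r"comment: (?P<comment>.*?), data (?P<data>[0-9A-Fa-f]+), "
    r"v(?P<version>[0-9A-Fa-f]+)"
)


def parse_ldap_extended_error(text):
    """Return the fields of an extended-error string as a dict, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    data = int(match.group("data"), 16)  # the data field is hexadecimal
    return {
        "code": match.group("code").upper(),
        "dsid": match.group("dsid"),
        "comment": match.group("comment"),
        "data": data,
        "data_name": WIN32_NAMES.get(data, "unknown"),
    }


if __name__ == "__main__":
    sample = ("8009030C: LdapErr: DSID-0C0904D1, "
              "comment: AcceptSecurityContext error, data 52e, v1771")
    print(parse_ldap_extended_error(sample))
```

Any code not in the table comes back as "unknown" rather than a guess, so extend the table only with values you have looked up.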
My debugging skills were enough to tell that the request and subsequent failure for a Kerberos service ticket using the SPN LDAP/local was probably the problem. But the question was why was it doing that and how could we make it stop? No configuration of the network interfaces had anything like that “local” thing, and there was no record in DNS of that unlikely name.
I was stumped. So I asked our Global Escalation Services folks to apply their stronger debugging skills to this issue. Joey Seifert from that team obliged (kudos to him for shedding light on this).
You’ll never guess what it was….or at least I didn’t.
This was caused by an entry in the %systemroot%\system32\drivers\etc\hosts file. That entry was “127.0.0.1 local localhost”. The 2008 DCs which were failing all had this entry in their HOSTS file, and it was munging the Kerberos SPN used in the ticket request. An expected, working entry would be slightly different: "127.0.0.1 localhost". As a result, the ticket request was unsuccessful and the DC could not allow its local service to bind to AD via LDAP since that ticket wasn't there.
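A quick way to spot this kind of entry is to scan the HOSTS file for loopback lines whose first name is something other than "localhost". The sketch below is a hypothetical helper in Python, not an official tool. It assumes that the first name listed after the address is the one that ends up in the SPN, which is what the failing entry above suggests but which this post does not prove in general. The fallback path of C:\Windows is also an assumption.

```python
import os

# Loopback addresses to inspect (IPv4 and IPv6).
LOOPBACK = {"127.0.0.1", "::1"}


def find_suspect_loopback_entries(hosts_text):
    """Return (line_number, address, names) for each loopback entry
    whose first listed name is not 'localhost'."""
    suspects = []
    for lineno, raw in enumerate(hosts_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if address in LOOPBACK and names and names[0].lower() != "localhost":
            suspects.append((lineno, address, names))
    return suspects


if __name__ == "__main__":
    # Assumed default location; adjust for your environment.
    root = os.environ.get("SystemRoot", r"C:\Windows")
    path = os.path.join(root, "system32", "drivers", "etc", "hosts")
    if os.path.exists(path):
        with open(path, encoding="utf-8", errors="replace") as handle:
            for lineno, address, names in find_suspect_loopback_entries(handle.read()):
                print(f"line {lineno}: {address} {' '.join(names)}")
```

Run against the entry from this case, it flags "127.0.0.1 local localhost" and stays quiet for "127.0.0.1 localhost".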
I've mentioned in the prior blog posts that DNS is important to Kerberos authentication. Here's a real life example.
The moral of this story? Never assume you know everything. Once you do you’ll never succeed or learn anything more.