I learn more about AD and other things every day, which is part of the fun of this job: learning how things work. This story does a good job of lending some understanding to something that can be tough to grasp: trust secure channels.
This story begins with a customer contacting us about a problem they were suddenly confronted with: some of the domain controllers in one domain would hang and become unresponsive. It was impossible to predict when this would happen, but when it did, that domain controller would no longer provide any client services until it was rebooted. Windows clients would have errors communicating with that DC for about 15 minutes (for things like group policy processing, user logon, or other authentication requests) and would then find another DC to service their requests. For applications which were specifically configured to point at the failing DC, however, the failure would continue until the server was rebooted. A quick glance in Task Manager on the affected domain controller showed that it did not seem very busy.
Additional symptoms were the events below on the clients that would otherwise be looking to that DC. These events can occur for a wide variety of reasons, so while they were definitely interesting, they did not narrow the playing field enough to pinpoint the problem.
Event ID : 5783
Raw Event ID : 5783
Category : The operation completed successfully.
Source : NETLOGON
Type : Error
Generated : 3/1/2008
Written : 3/1/2008
Machine : MemberSrvr9104
Message : The session setup to the Windows NT or Windows 2000 Domain Controller \\DC1.child1.haybuvtoys.com for the domain Haybuvchild1 is not responsive. The current RPC call from Netlogon on \\DC1 to \\DC1.child1.haybuvtoys.com has been cancelled.
Event ID : 5719
Raw Event ID : 5719
Generated : 3/1/2008
Written : 3/1/2008
Message : This computer was not able to set up a secure session with a domain controller in domain Haybuvchild1 due to the following:
This may lead to authentication problems. Make sure that this computer is connected to the network. If the problem persists, please contact your domain administrator.
If this computer is a domain controller for the specified domain, it sets up the secure session to the primary domain controller emulator in the specified domain. Otherwise, this computer sets up the secure session to any domain controller in the specified domain.
NOTE: The above events have a plethora of different possible causes.
For many of these scenarios we start by asking whether the issue develops over time, which would suggest a resource bottleneck of some kind. In this case there was no consistent time frame in which it would occur, and the fact that it could happen on any DC made that much less likely anyway. Due diligence with performance data taken over time on these DCs (link to some info on PERFWIZ) ruled it out.
Our next step was the sometimes dreaded memory dump. We won’t go into extreme detail here about examining what is in memory on a Windows server, or the various techniques and tools we use for that, but we do need to mention a little bit. A performance guru (which I am not) can use a full memory dump, usually in conjunction with Perfmon data gathered leading up to the dump, to look for bottlenecks, memory leaks and other resource contention issues. That was not our goal in getting a memory dump of this issue.
A not unheard of thing for a Directory Services person to do is to simply get a user mode dump of Lsass.exe as the server is hanging, if we can, and see what the various thread stacks are doing. This is not the most common thing we do, since we have a specialty team which is tasked with, and much more expert at, in-depth debugging, but it can be useful for a high level idea of what is happening. A simple way to get that Lsass.exe dump is with ADPlus.vbs, which is included with the Debugging Tools for Windows (a free download). Here’s an article on how to use ADPlus.vbs. You can open your .DMP file in Windbg.exe, which is also included in the Debugging Tools mentioned above.
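For example, a dump like that can be captured from a command prompt on the hanging DC with something like the following; this is a sketch, and the output folder here is just an example path:

```
:: Take a "hang mode" snapshot dump of Lsass.exe (the process keeps running afterward)
:: Run from the Debugging Tools for Windows folder; c:\dumps is an example output path
cscript adplus.vbs -hang -pn lsass.exe -o c:\dumps
```

Hang mode (`-hang`) snapshots the process as it is right now, which is exactly what you want for a hung service, as opposed to crash mode, which waits for an exception.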
Now, in my last post about my wife’s impromptu game interruption I mentioned how to look at a stack in memory using Process Explorer. That is a much simpler way to do the same thing we did in this scenario. It just doesn’t let you compare the different threads at one moment in time to look for trends, which can be useful in this type of situation. In other words, it doesn’t give you a comprehensive snapshot of all of the user mode threads at the time of the problem.
So in this case we dumped Lsass.exe several times and saw that, in every case where there was a hang on a domain controller, there was a Netlogon thread doing the same thing. Software running in memory moves very quickly, and a memory dump is like taking a quick picture of something in motion, so when you see the same thing in several “snapshots” it becomes much more interesting.
Incidentally, the simple command for viewing all of a dumped user mode process’s thread stacks is “~*k” (without the quotes). The only problem was that the thing this thread was doing seemed pretty harmless: just updating the list of trusted domains.
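As a sketch, you can open the dump and list every thread’s stack in one step from the command line; the dump path below is just an example:

```
:: -z opens a crash/hang dump file; -c runs a debugger command on startup
windbg -z c:\dumps\lsass.dmp -c "~*k"
```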
Having gleaned this from the .DMP file, it was immediately apparent that there was an easier way to see the same thing: Netlogon debug logging on the affected domain controller, the one with the hang condition. Sometimes we go very deep only to discover that we didn’t have to think quite so hard to achieve the same results. Case in point: entries similar to those below.
03/17 10:54:41 [CRITICAL] ACCOUNTING: NlDiscoverDc: Cannot find DC.
03/17 10:54:41 [CRITICAL] ACCOUNTING: NlSessionSetup: Session setup: cannot pick trusted DC
03/17 10:54:41 [MISC] Eventlog: 5719 (1) "ACCOUNTING" 0xc000005e c000005e ^...
03/17 10:54:41 [MISC] Didn't log event since it was already logged.
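If you want to gather this kind of logging yourself, verbose Netlogon debug logging can be switched on with Nltest; the output lands in %windir%\debug\netlogon.log. A minimal sketch:

```
:: Enable verbose Netlogon debug logging on the affected DC
nltest /dbflag:0x2080ffff

:: ...reproduce the problem, review %windir%\debug\netlogon.log, then turn it off
nltest /dbflag:0x0
```

On some older systems a restart of the Netlogon service may be needed before the new flag takes effect.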
So why are the above Netlogon.log entries interesting here? To understand that we need to consider how Netlogon works. I’m going to crib the following excellent explanation from someone who has a deeper understanding of it and laid it out very well.
The Netlogon service maintains a list of "server sessions", each of which represents a secure channel from a client to the DC. The server sessions are identified by the NetBIOS name of the client machine. Every member machine in a domain has a secure channel with one DC in its domain, and every domain controller has a secure channel with the PDC Emulator, as well as with a DC in each (directly) trusted domain. These are all stored in the Netlogon service as "client sessions", identified by the NetBIOS name of the target DC.
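Those secure channels to trusted domains can be inspected from the command line. As a quick sketch, using the child domain name from the events above as the example target:

```
:: Show the state of this machine's secure channel to the named trusted domain
nltest /sc_query:child1.haybuvtoys.com

:: Check the secure channel and re-establish it only if it is broken
nltest /sc_verify:child1.haybuvtoys.com
```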
The problem which can occur is that a member server and a domain controller in different domains end up with the same NetBIOS name. On a local network segment this is quickly detected, and wherever possible Windows has code that prohibits duplicate names, when noticed, from being used by different domain members. In the rare case where the conflict is not noticed, a domain controller (DC1 here) will store a client session for a DC in the trusted domain ACCOUNTING; let’s call that DC BIGDADDY. That session entry may later be unexpectedly “hijacked” by a secure channel update from a server in perhaps yet another domain in the forest which also happens to be named BIGDADDY, and that update causes a problem in the service while it tries to update its list of trusted secure channel partners. The most confusing aspect is that all of this problem behavior revolves around BIGDADDY, but the hang took place on DC1 as DC1 tried to keep track of its secure channel partners, one of which was the DC BIGDADDY in a trusted domain.
Clear as mud now, right?
This is a pretty unusual occurrence, so I don’t expect folks to read this and cry “Eureka!” as they see the solution to a problem they are experiencing. The value in discussing it here lies in the troubleshooting process, the tools and techniques that can be used, and maybe a good glimpse into how trusts work from DC to DC. We went from looking at event logs, to analyzing performance data from Perfmon on a hanging DC, to taking memory dumps, and finally to reviewing Netlogon debug logs to get a good understanding of the problem.
The solution in this case? Well, if you’re in this state, simply search through your forest, and any externally trusted domains and forests, for duplicate names, and rename the servers that share a name with a domain controller elsewhere in the environment. Better yet, maintain a company-wide process for computer naming and name re-use and you should never end up in this unlikely place.
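One quick way to hunt for duplicates within a single forest is to query the global catalog for the name in question. A sketch, using the example name BIGDADDY from above (externally trusted forests would need to be searched separately):

```
:: List every computer account in the forest whose name matches BIGDADDY
dsquery computer forestroot -name BIGDADDY
```

If that returns more than one distinguished name, you have found your duplicate.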