Mike Lagase

Saving the Exchange world one day at a time.....

The Case of the Mysterious Exchange Server Hang

The Case of the Mysterious Exchange Server Hang

  • Comments 11
  • Likes

Recently we had a case in which an Exchange 2003 server would hang and no longer accept any new RPC connections to the Information Store. The rest of the server seemed to be operating just fine, but it was the Store that was ultimately having the problems.

I took a look at the perfmon data that was provided and didn’t see anything out of the ordinary except for a small amount of RPC Operations taking place on the server. The server did look like it was processing data though, so this was quite intriguing to me now. I did notice that one DC had a number outstanding LDAP requests for an extended period of time as shown below.

image

We ended up taking some dumps of the Store, IIS processes, and LSASS to see what might be going on. The store and IIS dumps were not that interesting. Looking at the LSASS dumps was an eye opener. We saw that over 150 threads were hung up calling (SECUR32!CALLSPM) in to the Security Provider Manager (SPM). The beginning of the stacks were showing secur32!LsaAcceptSecurityContext calls which were mostly client authentication calls to the server. More info on the AcceptSecurityContext calls can be found here

There was almost 200 other threads that were calling netlogon!NlpUserValidateHigher which essentially means that we are trying to send a user validation request to a higher authority for authentication requests over the secure channel. Once we accept this validation request, we then attempt to connect to the DC over RPC to handle the request. Debug analysis can be found on Dave Goldman’s blog here.

By default, Netlogon only allows 2 concurrent API calls for these authentication requests which is controlled by a semaphore. If the 2 semaphore objects are tied up waiting for a response from the DC, all other requests will start queuing, thus having this mysterious hang affect on the Exchange Server. This was our problem since the debug analysis showed that we hit our maximum of 2 concurrent requests most likely to an overloaded DC, leaving a backlog of requests for authentication traffic. This request queue is controlled by the MaxConcurrentApi setting. Each request has a default timeout of 45 seconds, so if there were requests that were timing out, this is surely going to cause some delays for other users. On healthy servers with good network connectivity, these authentication requests are extremely fast.

At this point, we knew that we were tied up in authentication calls to DC’s, but we couldn’t find out what users were trying to logon which were taking the most amount of time. The debug information only shows a point in time. It is possible that a user could be trying to authenticate to a down-level domain across a slow WAN link, not sure at this time.

To move forward, we enabled Netlogon Debug logging per http://support.microsoft.com/kb/109626 and let the problem occur again.

We opened the netlogon.log file and started reviewing the information. Prior to the problem we can see that responses are returning in a timely manner. Notice the time intervals happen within the same second

07/31 11:36:11 [LOGON] SamLogon: Network logon of US\User1 from COMPUTER1 Entered
07/31 11:36:11 [LOGON] SamLogon: Network logon of US\User1 from COMPUTER1 Returns 0x0

07/31 11:36:11 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Entered
07/31 11:36:11 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Returns 0x0

07/31 11:36:11 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Entered
07/31 11:36:11 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Returns 0x0

As traffic increases, the response times are starting to get a little slower

07/31 11:53:56 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Entered
07/31 11:54:14 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Returns 0x0

07/31 11:53:57 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Entered
07/31 11:54:17 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Returns 0x0

Now we see a response time right at 45 second timeout below

07/31 11:57:02 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Entered
07/31 11:57:47 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Returns 0x0

Here is where our first netlogon timeout hit

07/31 11:57:03 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Entered
07/31 11:57:48 [CRITICAL] EXDOMAIN: NlAllocateClientApi timed out: 0 258
07/31 11:57:48 [CRITICAL] EXDOMAIN: NlpUserValidateHigher: Can't allocate Client API slot.
07/31 11:57:48 [SESSION] I_NetLogonGetAuthData called: (null) EXDOMAIN(Flags 0x1) 
07/31 11:57:48 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Returns 0xC000005E

0xC000005E = STATUS_NO_LOGON_SERVERS

Now we are seeing that we cannot allocate a Client API slow because the max request queue is busy servicing other requests

07/31 11:58:55 [CRITICAL]EXDOMAIN: NlAllocateClientApi timed out: 0 258
07/31 11:58:55 [CRITICAL] EXDOMAIN: NlpUserValidateHigher: Can't allocate Client API slot.

07/31 12:38:08 [CRITICAL] EXDOMAIN: NlAllocateClientApi timed out: 0 258
07/31 12:38:08 [CRITICAL] EXDOMAIN: NlpUserValidateHigher: Can't allocate Client API slot.

Now we get to an actual DC timeout error as shown below.

08/01 17:21:24 [CRITICAL] NlPrintRpcDebug: Couldn't get EEInfo for I_NetLogonSamLogonEx: 1761 (may be legitimate for 0xc0000064)
08/01 17:21:24 [CRITICAL] EXDOMAIN: NlFinishApiClientSession: timeout call to \\DC1.domain.com.  Count: 2
08/01 17:21:24 [CRITICAL] EXDOMAIN: NlFinishApiClientSession: dropping the session to \\DC1.domain.com
08/01 17:21:24 [CRITICAL] EXDOMAIN: NlSetStatusClientSession: Set connection status to c000005e

We can see clearly now that DC1 was having problems servicing authentication requests to this Exchange server. This does not always mean that the DC is overloaded, it could be a down level network trust that is really slow that is causing this problem, so additional investigation needs to be performed at this point. We just know that Exchange is the victim and the problem is elsewhere now.

Troubleshooting methodologies

So what can we do at this point?

We can test secure channels for different domains to see which domains might be failing. First we will need to obtain the DC in which the secure channel is currently formed on the Exchange server by running nlttest /sc_query:<domain> replacing domain with the domain name where the Exchange Server resides in.

Once that DC is found, you will then run a command similar to the following for each of the domains:

nltest /server: DC1   /sc_query:ASIA
nltest /server: DC1   /sc_query:EUROPE
nltest /server: DC1   /sc_query:US

This will help fish out any down level domains that could be causing authentication delays.

You can also enable netlogon debug logging on the DC’s to help understand the traffic patterns there.

Installing the Server Performance Advisor on the Windows 2003 DC’s or using the Active Directory Diagnostics Data Collector in the Windows 2008 Reliability and Performance monitor will help fish out any potential bottlenecks.

Take netmon captures and search for NetrLogonSamLogonEx entries for Netlogon requests

For Windows 2003 servers, you can install the following hotfix to help track these type issues faster. This hotfix adds new performance counters to help track access to these semaphores better. Windows 2008 servers already have this built in to the OS

New performance counters for Windows Server 2003 let you monitor the performance of Netlogon authentication
http://support.microsoft.com/default.aspx/kb/928576

The main ones you want to look at at the following

Semaphore Holders: How many threads on average are holding the client semaphore

This is the number of threads trying to get a netlogon session to a DC that are Blocked. Blocked could be locked open by a process, network down, etc. when semaphore waiters is non 0, some local process is waiting on lsass for a response and the lsass thread is blocked. This correlates to the MaxConcurrentApi setting

By default this value should be less than 2 at any given time. If values about 2 are sustained, then either the Exchange server or DCs are overloaded.

Average Semaphore Hold Time: The average wait time for a thread to acquire the semaphore

These values should normally be very quick. Longer hold times mean that a potential bottleneck is occurring.

Semaphore Waiters: The average number of waiters waiting on the semaphore.

This value should remain at 0 at all times. Short bursty spikes are OK to see as that simply means that we had a large amount of requests which were handled in a short period of time.

In some instances on heavily loaded servers, you may want to adjust MaxConcurrentApi to a value of 5 on both the Exchange Servers and DC’s to help widen the pipe or increase the amount of auth requests that can occur at any given time. Bumping this setting up may help alleviate this problem altogether, but could also prolong the issue due to some other underlying issue that has now been masked. It’s always best to understand where the problem is coming from before making any major changes such as this which may increase overall processor utilization on the Exchange server and your domain controllers.

In this instance, we set MaxConcurrentApi to 5 on the DC’s and Exchange Servers and this appears to have reduced the amount of occurrences of this problem.

This particular problem not only affects Exchange servers, but also affects other applications such as ISA server. More information on this can be found here

I hope this provides some insight in to some of the underlying dependency problems that you may seen in Exchange.

That is all for now.

Comments
  • Nice post Mike, way to go on this one.

  • Very interesting post! Are you allowed to share what the root cause of the problem was(e.g. slow network ling, etc.)? Christian

  • i hear that maxconcurrentapi is reated to ntlm. why is ntlm being used and not kerberos???

  • cnschindler - It appears to be network related, but setting MaxConcurrentAPI to 5 allows the server to continue operating when and if there are network problems going on. Instead of a 2 lane highway, we have a 5 lane highway now, so if 2 lanes of the highway are backed up, we still can process logon requests from the other lanes.

    slimwreck - If an authentication request comes in and is not for local auth, we will send the request through a secure channel to the DCs which is controlled via the MaxConcurrentAPI setting.

  • Server 2008R2 has the maxConcurrentAPI set to 10, and maybe something to consider. Other dimension to check would be if the domain controller machine that the machine was connected to was bad. Did you try reseting the dc machine connected to and see if that improves the situation ?

  • Rahul - MaxConcurrentAPI is set to 10 for Clients (Vista/7), 2 for Member servers, and 1 for Domain Controllers for 2008. For Windows Server 2012, they are set to 10, across the board, but if it is a large environment in co-existence with 2007 (where NTLM is needed for OutlookAnywhere proxying to function), 10 can easily become inadequate. You just need to watch the metrics to see if the current setup is problematic and adjust as necessary.

Your comment has been posted.   Close
Thank you, your comment requires moderation so it may take a while to appear.   Close
Leave a Comment