Yesterday I wrote about the “perfect storm” case where the root cause was disk bottleneck, today I would like to discuss a troubleshooting scenario for one of the 5 core scenarios that was presented yesterday, which is called “authentication”. Again, the symptom was exactly the same: ISA Server stops responding requests in random times of the day.
2. Data Gathering
Use the same approach as was presented on yesterday’s blog, in addition, enable netlogon log while preparing the ISA Data Packager, it will be very useful in this type of scenario.
3. Data Analysis
Today I’m going to start the analysis by reviewing Perfmon and for such scenario I like to review some core counters:
· Physical Disk\Average Disk Queue Length: to make sure that we don’t have disk bottleneck like yesterday’s blog.
· Firewall Packet Engine\Backlogged Packets: you know, if this guy start growing substantially we have strong indications of authentication or DNS problem.
· Web Proxy\Memory Pool for SSL Requests (%): you don’t want to see this value dropping, hitting zero and staying there. Usually when this occurs you will receive the event below:
Log Name: Application
Source: Microsoft Forefront TMG Web Proxy
Event ID: 31212
Task Category: None
The Forefront TMG Web Proxy memory pool that handles SSL connections is low. To specify a larger Web Proxy memory pool, set the ProxyVmemAlloc1pSize registry value in the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W3Proxy\Parameters registry key.
Note: to address this issue, follow http://support.microsoft.com/kb/842438
Those are the core counters, there are much more to look at and cross the data with the rest of the dataset that you might have it. You can see a full list of counters and what to expect on this post here from Tom Shinder.
In this particular case the only counter that was presenting a suspicious (very suspicious) behavior was Backlogged Packets, as you can see below:
Notice the black line, how it goes from zero to 283..wow, that’s huge for sure. Remember that we should not be higher than 10 on this counter. Now, let’s review the dump file to see if the issue showed on perfmon is confirmed via the dump of the wspsrv.exe process:
First step was to check if there was any thread locked in Critical Section, which in this case we haven’t. Since we haven’t then it will be interesting to know the information about the time consumed by each thread. To do that we use the runaway command as show below (just the top five):
User Mode Time
36:d9c 0 days 0:00:07.734
30:d78 0 days 0:00:06.781
37:da0 0 days 0:00:06.703
21:d24 0 days 0:00:06.390
29:d74 0 days 0:00:06.171
Let’s take a look on what the first thread (36) is doing:
ChildEBP RetAddr Args to Child
29e5ece8 7c827d29 77e61d1e 00004f89 00000000 ntdll!KiFastSystemCallRet
29e5ecec 77e61d1e 00004f89 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc
29e5ed5c 77c6a927 00004f89 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac
29e5ed78 77c69cbf 2a50ca04 00000000 ffffffff rpcrt4!UTIL_WaitForSyncIO+0x20
29e5ed9c 77c6a9fd 2a50c9cc 2a50ca04 29e5ede0 rpcrt4!UTIL_GetOverlappedResultEx+0x1d
29e5edb8 77c6a9bb 2a50c9cc 2a50ca04 29e5ede0 rpcrt4!UTIL_GetOverlappedResult+0x17
29e5edd8 77c69517 00000400 00000074 299626d0 rpcrt4!NMP_SyncSendRecv+0x73
29e5ee00 77c6972f 00000000 299626d0 00000074 rpcrt4!OSF_CCONNECTION::TransSendReceive+0x7d
29e5ee88 77c6969c 299626d0 2a5c9308 00000001 rpcrt4!OSF_CCONNECTION::SendFragment+0x2ae
29e5eee0 77c69842 00000000 0000005c 29e5ef24 rpcrt4!OSF_CCALL::SendNextFragment+0x1e2
29e5ef28 77c69aba 29e5efb0 29e5ef6c 00000001 rpcrt4!OSF_CCALL::FastSendReceive+0x148
29e5ef44 77c69a3d 29e5efb0 29e5ef6c 29e5efb0 rpcrt4!OSF_CCALL::SendReceiveHelper+0x5b
29e5ef74 77c7feb0 299626e8 29e5ef94 77c80845 rpcrt4!OSF_CCALL::SendReceive+0x41
29e5ef80 77c80845 29e5efb0 7d1f7690 29e5f3a4 rpcrt4!I_RpcSendReceive+0x24
29e5ef94 77ce415a 29e5efdc 29962744 00000000 rpcrt4!NdrSendReceive+0x2b
29e5f384 7d1fac12 7d1f7690 7d1f998e 29e5f3a4 rpcrt4!NdrClientCall2+0x22e
29e5f39c 7d1fadd3 2a408880 29e5f3d0 29e5f4c0 advapi32!LsarLookupSids2+0x1c
29e5f41c 7d1fac5f 2a408880 00000001 29e5f4d0 advapi32!LsaICLookupSids+0xbd
29e5f45c 7d1faf65 2a408880 00000001 29e5f4d0 advapi32!LsaLookupSids+0x41
29e5f4c4 7d1faee6 00000000 275ce740 00000000 advapi32!LookupAccountSid+0x8b
Notice (bottom to top) that we start this thread by retrieving the name of the account for this SID and the name of the first domain on which this SID is found using the function LookupAccountSid. After that we move to LsaLookupSids function, which looks up the names that correspond to an array of security identifiers (SIDs). This lookup process still going on until the Firewall Service (wspsrv.exe) leverages the function NdrClientCall2 to make a client side call using RPC. This RPC call keeps going back and forth until RPC gets stuck waiting for something.
Note: for more information on how to debug RPC Stuck Call Problem read this article.
By reviewing the other 4 threads on the top 5 threads from “runaway” output I notice that they all are doing the same type of operation and getting stuck on the same place. Here I already have enough data to say that we are hanging because we are waiting on authentication to come from the DC, but if you want to have the triple confirmation, open netlogon.log and you should see also events like these:
[CRITICAL] CONTOSO : NlFinishApiClientSession: timeout call to \\dc01.contoso.com Count: 1
[CRITICAL] CONTOSO : NlpUserValidateHigher: denying access after status: 0xc0020017 1
[SESSION] CONTOSO : NlSetStatusClientSession: Set connection status to c0020017
[SESSION] CONTOSO : NlSetStatusClientSession: Unbind from server to \\dc01.contoso.com(TCP) 1.
This keeps going on and on until we time out:
[SESSION] CONTOSO : NlSetStatusClientSession: Unbind from server \\dc01.contoso.com (PIPE) 0.
[SESSION] CONTOSO : NlSessionSetup: Session setup Failed
[CRITICAL] I_NetLogonGetAuthData failed: (null) CONTOSO (Flags 0x1): 0xc000005e
[LOGON] SamLogon: Network logon of CONTOSO\YuriDio from ISA02 Returns 0xC000005E
[SESSION] I_NetLogonGetAuthData called: (null) CONTOSO (Flags 0x1)
Notice that the error 0xC000005E means STATUS_NO_LOGON_SERVERS, which clearly states that we can’t get on hold of the DC for some reason.
In this case ISA/TMG performance was affected due the lack of response from the DC, in this particular case the DC was suffering with a bottleneck and causing authentication requests to queue up. Again, another good example of having the same symptom for a different root cause. This is a very important concept to keep in mind: it is not because you have the same symptom that the root cause of the issue will be the same. Yesterday and today you have two posts where the symptom was the same but with a completely different root cause.