Hi Folks, I just want to drop a quick note here about KB http://support.microsoft.com/kb/2433623/ that brings the list of the updates that are part of the new Software Update 2 for Forefront Threat Management Gateway (TMG) 2010 Service Pack 1.
Enjoy it!!
1. Introduction
Yesterday I wrote about the “perfect storm” case where the root cause was disk bottleneck, today I would like to discuss a troubleshooting scenario for one of the 5 core scenarios that was presented yesterday, which is called “authentication”. Again, the symptom was exactly the same: ISA Server stops responding requests in random times of the day.
2. Data Gathering
Use the same approach as was presented on yesterday’s blog, in addition, enable netlogon log while preparing the ISA Data Packager, it will be very useful in this type of scenario.
3. Data Analysis
Today I’m going to start the analysis by reviewing Perfmon and for such scenario I like to review some core counters:
· Physical Disk\Average Disk Queue Length: to make sure that we don’t have disk bottleneck like yesterday’s blog.
· Firewall Packet Engine\Backlogged Packets: you know, if this guy start growing substantially we have strong indications of authentication or DNS problem.
· Web Proxy\Memory Pool for SSL Requests (%): you don’t want to see this value dropping, hitting zero and staying there. Usually when this occurs you will receive the event below:
Log Name: Application
Source: Microsoft Forefront TMG Web Proxy
Event ID: 31212
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Description:
The Forefront TMG Web Proxy memory pool that handles SSL connections is low. To specify a larger Web Proxy memory pool, set the ProxyVmemAlloc1pSize registry value in the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W3Proxy\Parameters registry key.
Note: to address this issue, follow http://support.microsoft.com/kb/842438
Those are the core counters, there are much more to look at and cross the data with the rest of the dataset that you might have it. You can see a full list of counters and what to expect on this post here from Tom Shinder.
In this particular case the only counter that was presenting a suspicious (very suspicious) behavior was Backlogged Packets, as you can see below:
Notice the black line, how it goes from zero to 283..wow, that’s huge for sure. Remember that we should not be higher than 10 on this counter. Now, let’s review the dump file to see if the issue showed on perfmon is confirmed via the dump of the wspsrv.exe process:
First step was to check if there was any thread locked in Critical Section, which in this case we haven’t. Since we haven’t then it will be interesting to know the information about the time consumed by each thread. To do that we use the runaway command as show below (just the top five):
0:000> !runaway
User Mode Time
Thread Time
36:d9c 0 days 0:00:07.734
30:d78 0 days 0:00:06.781
37:da0 0 days 0:00:06.703
21:d24 0 days 0:00:06.390
29:d74 0 days 0:00:06.171
Let’s take a look on what the first thread (36) is doing:
0:000> ~36kb
ChildEBP RetAddr Args to Child
29e5ece8 7c827d29 77e61d1e 00004f89 00000000 ntdll!KiFastSystemCallRet
29e5ecec 77e61d1e 00004f89 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc
29e5ed5c 77c6a927 00004f89 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac
29e5ed78 77c69cbf 2a50ca04 00000000 ffffffff rpcrt4!UTIL_WaitForSyncIO+0x20
29e5ed9c 77c6a9fd 2a50c9cc 2a50ca04 29e5ede0 rpcrt4!UTIL_GetOverlappedResultEx+0x1d
29e5edb8 77c6a9bb 2a50c9cc 2a50ca04 29e5ede0 rpcrt4!UTIL_GetOverlappedResult+0x17
29e5edd8 77c69517 00000400 00000074 299626d0 rpcrt4!NMP_SyncSendRecv+0x73
29e5ee00 77c6972f 00000000 299626d0 00000074 rpcrt4!OSF_CCONNECTION::TransSendReceive+0x7d
29e5ee88 77c6969c 299626d0 2a5c9308 00000001 rpcrt4!OSF_CCONNECTION::SendFragment+0x2ae
29e5eee0 77c69842 00000000 0000005c 29e5ef24 rpcrt4!OSF_CCALL::SendNextFragment+0x1e2
29e5ef28 77c69aba 29e5efb0 29e5ef6c 00000001 rpcrt4!OSF_CCALL::FastSendReceive+0x148
29e5ef44 77c69a3d 29e5efb0 29e5ef6c 29e5efb0 rpcrt4!OSF_CCALL::SendReceiveHelper+0x5b
29e5ef74 77c7feb0 299626e8 29e5ef94 77c80845 rpcrt4!OSF_CCALL::SendReceive+0x41
29e5ef80 77c80845 29e5efb0 7d1f7690 29e5f3a4 rpcrt4!I_RpcSendReceive+0x24
29e5ef94 77ce415a 29e5efdc 29962744 00000000 rpcrt4!NdrSendReceive+0x2b
29e5f384 7d1fac12 7d1f7690 7d1f998e 29e5f3a4 rpcrt4!NdrClientCall2+0x22e
29e5f39c 7d1fadd3 2a408880 29e5f3d0 29e5f4c0 advapi32!LsarLookupSids2+0x1c
29e5f41c 7d1fac5f 2a408880 00000001 29e5f4d0 advapi32!LsaICLookupSids+0xbd
29e5f45c 7d1faf65 2a408880 00000001 29e5f4d0 advapi32!LsaLookupSids+0x41
29e5f4c4 7d1faee6 00000000 275ce740 00000000 advapi32!LookupAccountSid+0x8b
Notice (bottom to top) that we start this thread by retrieving the name of the account for this SID and the name of the first domain on which this SID is found using the function LookupAccountSid. After that we move to LsaLookupSids function, which looks up the names that correspond to an array of security identifiers (SIDs). This lookup process still going on until the Firewall Service (wspsrv.exe) leverages the function NdrClientCall2 to make a client side call using RPC. This RPC call keeps going back and forth until RPC gets stuck waiting for something.
Note: for more information on how to debug RPC Stuck Call Problem read this article.
By reviewing the other 4 threads on the top 5 threads from “runaway” output I notice that they all are doing the same type of operation and getting stuck on the same place. Here I already have enough data to say that we are hanging because we are waiting on authentication to come from the DC, but if you want to have the triple confirmation, open netlogon.log and you should see also events like these:
[CRITICAL] CONTOSO : NlFinishApiClientSession: timeout call to \\dc01.contoso.com Count: 1
[CRITICAL] CONTOSO : NlpUserValidateHigher: denying access after status: 0xc0020017 1
[SESSION] CONTOSO : NlSetStatusClientSession: Set connection status to c0020017
[SESSION] CONTOSO : NlSetStatusClientSession: Unbind from server to \\dc01.contoso.com(TCP) 1.
This keeps going on and on until we time out:
[SESSION] CONTOSO : NlSetStatusClientSession: Unbind from server \\dc01.contoso.com (PIPE) 0.
[SESSION] CONTOSO : NlSessionSetup: Session setup Failed
[CRITICAL] I_NetLogonGetAuthData failed: (null) CONTOSO (Flags 0x1): 0xc000005e
[LOGON] SamLogon: Network logon of CONTOSO\YuriDio from ISA02 Returns 0xC000005E
[SESSION] I_NetLogonGetAuthData called: (null) CONTOSO (Flags 0x1)
Notice that the error 0xC000005E means STATUS_NO_LOGON_SERVERS, which clearly states that we can’t get on hold of the DC for some reason.
4. Conclusion
In this case ISA/TMG performance was affected due the lack of response from the DC, in this particular case the DC was suffering with a bottleneck and causing authentication requests to queue up. Again, another good example of having the same symptom for a different root cause. This is a very important concept to keep in mind: it is not because you have the same symptom that the root cause of the issue will be the same. Yesterday and today you have two posts where the symptom was the same but with a completely different root cause.
As more I deal with Performance issue on ISA Server (or TMG), more I realize that there are not really a lot of new things on this area to be explored. The reason why I say that there are not much of new things on this area is because I can easily map the top five core causes of ISA/TMG stop responding requests and causing the “server hanging” symptom, which are:
· DNS – a wrong DNS configuration or a lack of response from the DNS Server can definitely cause issues on ISA. Please see the following related articles:
http://blogs.technet.com/b/isablog/archive/2009/08/27/side-effects-of-incorrect-dns-configuration-on-isa-server-10060-connection-timeout-scenario.aspx
http://blogs.technet.com/b/isablog/archive/2009/01/12/isa-server-2006-stops-answering-requests.aspx
· Authentication – if the DC doesn’t answer, ISA can’t authenticate and as result new authentication requests will start to accumulate. The infamous 5783/5719 scenario is a good example of that. Please see the following related articles:
http://blogs.technet.com/b/yuridiogenes/archive/2008/06/05/isa-server-losing-secure-channel-with-the-dc-the-5783-nightmare.aspx
· Logging – if we can’t log we will eventually stop responding. On ISA we go to lockdown mode, on TMG we start to write the LLQ files in the disk, which can fill up the disk and the server runs out of disk space, which will end up causing ISA/TMG stop responding. Please see the following related articles:
http://blogs.technet.com/b/yuridiogenes/archive/2008/08/06/intermittent-performance-problem-while-accessing-internet-through-isa-server-2006.aspx
· Disk – this is one key element, because if we have disk bottleneck, everything else will fall apart. Please see the following related articles:
http://blogs.technet.com/b/isablog/archive/2010/05/10/how-disk-bottleneck-can-affect-tmg-performance.aspx
· Antivirus – well, yeah…this is true. There are many elements here that can go wrong, for example: some antivirus also introduces firewall modules and cause conflicts with ISA/TMG firewall kernel engine, which is something that I already explained in here. Please see the following related articles:
http://blogs.technet.com/b/isablog/archive/2008/03/11/isolating-problems-that-seems-to-be-related-to-the-isa-server-part-iii.aspx
http://blogs.technet.com/b/yuridiogenes/archive/2009/07/18/isa-server-stop-answering-requests-and-firewall-service-hangs.aspx
As you can see this is a long list and the scenario that I’m about to describe on this post is a combination of all elements above and I like to call it: the perfect storm. What’s the symptom? The usual: ISA Server stop responding request and to fix ISA Admin have to restart Firewall Service.
On this type of scenario the most common action plan is to gather perfmon, dump of the wspsrv.exe process and ISA Data Packager in repro mode. Here are the core steps:
a. Install ISABPA from www.isabpa.com
b. Configure Performance Monitoring with the following objects:
> ISA Server Firewall Packet Engine/*
> ISA Server Firewall Service/*
> ISA Server Web Proxy/*
> Memory/*
> Processor/*
> Network Interface/*
> Process/*
> Physical Disk/*
> Threads/*
Note: configure the maximum size file for 200MB, the refresh time to 15 seconds and configure Perfmon to stop when the log is full and create a new file (Schedule Tab).
c. Install the DebugDiag (download from the link below):
http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=28bd5941-c458-46f1-b24d-f60151d875a3
When the issue happens, the following steps needs to be done in order to gather the correct data:
a. Go to Start / Programs / Microsoft ISA Server / ISA Tools / ISA Data Packager
b. On the option "Collect data using one of the following repro scenarios", select "Web Proxy and Web Publishing" and click Next;
c. Click in Modify Options;
d. In addition to the options that are already selected please select also:
- ISA BPA
- ISA Info
e. Click in Start data collection
f. The Data Packager will start to run. When the option "Press spacebar to start the capture" appears, press the spacebar and repro the issue by trying to connect from the client workstation.
g. After you finishing testing then press space bar again in the ISA Data Packager console.
h. When ISA Data Packager finishes collecting data, open DebugDiag.
i. On the first Debugdiag screen (Select Rule Type) click cancel.
j. Go to Processes tab and look for the wspsrv.exe process.
l. While this window is open, go back to the workstation and try to connect again.
m. While the workstation is trying to connect, go back to debugdiag window, right click on wspsrv.exe process and choose Create Full User Dump.
n. Stop Perfmon counter.
Having the correct data in hands you can start looking for obvious issues on the ISA BPA report. If there is nothing there that is relevant for this type of issue, move forward to analyze the perfmon or dump. In this particular case I’m going to review the dump first.
First step was to check if there was any thread locked in Critical Section:
0:000> !cs -l
-----------------------------------------
DebugInfo = 0x000c0fe8
Critical section = 0x011c9024 (+0x11C9024)
LOCKED
LockCount = 0x0
WaiterWoken = No
OwningThread = 0x0000113c
RecursionCount = 0x1
LockSemaphore = 0x0
SpinCount = 0x00000000
Notice that this thread is locked and we have the address of the owning thread; let’s see which thread is that:
45 Id: 1c38.113c Suspend: 1 Teb: 7ffa0000 Unfrozen
36d8f354 7c827d29 77e61d1e 00000d28 00000000 ntdll!KiFastSystemCallRet
36d8f358 77e61d1e 00000d28 00000000 36d8f39c ntdll!ZwWaitForSingleObject+0xc
36d8f3c8 77e61c8d 00000d28 00004e20 00000000 kernel32!WaitForSingleObjectEx+0xac
36d8f3dc 74cd2e3f 00000d28 00004e20 00004e20 kernel32!WaitForSingleObject+0x12
36d8f408 6d56ddde 002fb5b0 0138d2a0 00000009 DBmsLPCn!ConnectionRead+0xaf
36d8f428 6d5687fc 0138f2c0 0138d2a0 00000009 dbnetlib!WrapperRead+0x2c
36d8f480 4e2597ce 0138f2c0 0138d2a0 0138d2a0 dbnetlib!ConnectionRead+0x519
36d8f4b4 4e25982d 0138f2c0 0138d2a0 00000009 sqloledb!CDataSource::ConnectionRead+0x35
36d8f500 4e252358 0138d06e 00000001 00000000 sqloledb!CDBConnection::GetBytes+0x269
36d8f54c 4e2555c4 011da180 00000088 0000001e sqloledb!CDBConnection::ProcessTDSStream+0x157
36d8f608 4e255691 011c5680 0000003d 011fb198 sqloledb!CStmt::ExecDirect+0x786
36d8f620 4e254d32 011c5680 0000003d 00000000 sqloledb!CStmt::SQLExecDirect+0x28
36d8f650 4e25517d 00000000 4e25321c 0000003d sqloledb!CCommand::ExecuteHelper+0x157
36d8f6d4 4e254c4b 011d4888 00000000 4bbea778 sqloledb!CCommand::Execute+0x76b
36d8f70c 4bbea64d 011e8550 00000000 4bbea778 sqloledb!CImpICommandText::Execute+0xdd
36d8f74c 4bc0c79b 011c8d78 011fb22c 011f8738 msado15!CConnection::Execute+0x9d
36d8f91c 4bbea4a7 011cfed8 00000000 011d16c8 msado15!_ExecuteAsync+0x19f
36d8f930 4bbea385 011cfed8 ffffffff 00000000 msado15!ExecuteAsync+0x23
36d8fa18 4bbea258 00000000 7c828200 00000000 msado15!CQuery::Execute+0xa5e
36d8fa84 4bc21717 011d16c8 00000000 7c828200 msado15!CCommand::_Execute+0x153
The yellow line in the second stack shows that the machine is submitting a SQL statement using the SQLExecDirect function. Now let’s see what SQL command is being executed:
0:000> du 011c5680
011c5680 "SELECT RTRIM(filename) FROM ISAL"
011c56c0 "OG_20101107_FWS_000..sysfiles"
Logs starting with FWS suffix represent the Firewall log; in this case ISA was querying the SQL database for this log. Now where is SQL located? According to ISAInfo collected by ISA Data Packager the Log was located on the D: drive, which was actually part of the same disk as C:, only in a different partition. Now it is time to review perfmon and see if we can match this with something going on from the disk perspective.
Here it is sample of the time where the issue was happening:
The black line represents the Average Disk Queue Length that goes from zero to 26 (maximum should be 2 per spindle - in this case we just have 1 spindle) and got stuck there from 1:36PM to 1:39PM. During the same time we see the ISA Server Firewall Packet Engine\Backlogged Packets goes from zero to 113 (maximum should never be higher than 10). The logic here is the following:
1. ISA is trying to query a firewall log located on the SQL (MSDE in this case) database. ISA is waiting on SQL.
2. SQL is performing a reading operation for a piece of information located in disk. SQL is waiting on Disk.
3. Disk is having bottleneck and it is queuing up requests.
4. Since ISA can’t proceed (since is waiting on disk), ISA starts to accumulate requests (backlog starts to grow). ISA stop answering new requests.
5. Clients can’t browse.
You might be thinking, but this is only for 3 minutes, I can live with that. Really? I doubt your helpdesk will not overflow of calls if nobody can browse Internet for 3 minutes.
I hope this post gives you a big picture of what goes behind an ISA (or TMG) performance issue in scenarios where ISA/TMG stops responding. There are much more elements that needs to be investigated other than ISA/TMG itself.
The Microsoft Solution Accelerators team is pleased to announce the release of new resources that you can use in combination with the Microsoft Security Compliance Manager tool: the Windows Server 2008 R2 Security Baseline and the Office 2010 Security Baseline, and setting packs for Windows 7 and Internet Explorer 8.
Together with the tool, these resources are designed to help organizations efficiently manage the security and compliance process for some of the most widely used Microsoft products.
The Security Compliance Manager works with the Microsoft Assessment and Planning (MAP) Toolkit and the Microsoft Deployment Toolkit (MDT) to help you plan, securely deploy, and manage new Microsoft technologies—easier, faster, and at less cost. Learn more.
Next steps:
First, learn more about the Security Compliance Manager tool. Next, learn more about the new security baselines and setting packs:
Download the tool:
Last week I delivered a level 200 presentation for a customer about TMG, a brief overview of TMG in a Secure Web Gateway Scenario and some demos (which are not available here, but I’m working on the videos to share it here). The goal of this presentation is to introduce TMG for ISA administrators of for new edge administrators, the agenda includes:
Enjoy it !!
Today Microsoft Press published on their blog an overview of what this new book contains and how it is organized. For more information take a look at:
http://blogs.msdn.com/b/microsoft_press/archive/2010/11/03/new-book-deploying-microsoft-forefront-unified-access-gateway-2010.aspx
I also hope you enjoy it.