Yes, another case where TMG stops responding…bad, bad TMG right? NOT!!! Recently I worked on some scenarios where TMG was stopping every day at the same time and required a manual restart. The TMG admin claimed that nothing had really changed on TMG or on the client workstations; besides, the environment had been running rock solid for a long time and the issue had only started happening a couple of weeks earlier.
2. Digging In
After many sessions of data gathering (believe me, it can take more than one round to find out) using the usual approach to collect performance-related data, I found the following trend when the issue was happening:
This is the Forefront TMG Firewall Packet Engine\Backlogged Packets counter, which should be 10 and it was 2,485.000 (WOW…just WOW). Notice this beautiful line going from 0 to 2K and, worse, staying there forever. Now we know why TMG stops responding, but why is the backlog growing? Well, there are two core reasons: authentication and/or name resolution. In the user mode dump of wspsrv.exe we have hundreds of threads like this:
Child-SP RetAddr Call Site
00000000`11fbccc8 000007fe`fd82aa76 ntdll!ZwAlpcSendWaitReceivePort+0xa
00000000`11fbccd0 000007fe`fd8ccb64 rpcrt4!NdrDllCanUnloadNow+0x31c6
00000000`11fbcd90 000007fe`fd8ccd55 rpcrt4!Ndr64AsyncClientCall+0xe04
00000000`11fbd050 000007fe`fc9e1f95 rpcrt4!NdrClientCall3+0xf5
00000000`11fbd3e0 000007fe`fc9e1e74 dnsapi!DnsApiAlloc+0xdd1
00000000`11fbd440 000007fe`fc9e60a6 dnsapi!DnsApiAlloc+0xcb0
00000000`11fbd500 000007fe`fca0d012 dnsapi!DnsValidateName_W+0x186
00000000`11fbd580 00000000`72cab68f dnsapi!DnsQuery_A+0x36
00000000`11fbd5d0 00000000`72caaced msphlpr!COC_NameResolution_TargetImpl::FoundInNegativeCache+0x2b93
00000000`11fbd6a0 00000000`72ca6efe msphlpr!COC_NameResolution_TargetImpl::FoundInNegativeCache+0x21f1
00000000`11fbd960 00000001`3fa467ff msphlpr!ProxyGetHostByAddr+0x4a2
00000000`11fbdce0 00000001`3fa48420 wspsrv!FwGapaGetConfig+0x4b4a3
00000000`11fbddb0 00000001`3f8ded8d wspsrv!FwGapaGetConfig+0x4d0c4
00000000`11fbe660 00000001`3f92e07c wspsrv+0x5ed8d
00000000`11fbe730 00000001`3f92d79e wspsrv!IsChainingRequired+0x163cc
00000000`11fbef90 00000001`3f8ea240 wspsrv!IsChainingRequired+0x15aee
00000000`11fbf050 00000001`3f8f1c0a wspsrv+0x6a240
00000000`11fbf1f0 00000001`3f8f11d2 wspsrv+0x71c0a
00000000`11fbf270 00000001`3f9838bf wspsrv+0x711d2
00000000`11fbf320 00000001`3f97d871 wspsrv!DeleteFwEngFilter+0x249b
00000000`11fbf360 00000001`3fa1bedc wspsrv!IsChainingRequired+0x65bc1
00000000`11fbf550 00000001`3f971a53 wspsrv!FwGapaGetConfig+0x20b80
00000000`11fbf6a0 00000001`3f94185c wspsrv!IsChainingRequired+0x59da3
00000000`11fbf780 00000001`3f9415ce wspsrv!IsChainingRequired+0x29bac
00000000`11fbf7f0 00000000`771cf56d wspsrv!IsChainingRequired+0x2991e
00000000`11fbf880 00000000`77403021 kernel32!BaseThreadInitThunk+0xd
00000000`11fbf8b0 00000000`00000000 ntdll!RtlUserThreadStart+0x21
…and lots more like this:
00000000`021aece8 000007fe`fd3e10ac ntdll!ZwWaitForSingleObject+0xa
00000000`021aecf0 00000001`3f90609b KERNELBASE!WaitForSingleObjectEx+0x9c
00000000`021aed90 00000001`3f8ec283 wspsrv+0x8609b
00000000`021aee60 00000001`3f8f1c37 wspsrv+0x6c283
00000000`021af250 00000001`3f8f11d2 wspsrv+0x71c37
00000000`021af2d0 00000001`3f9838bf wspsrv+0x711d2
00000000`021af380 00000001`3f97d9fe wspsrv!DeleteFwEngFilter+0x249b
00000000`021af3c0 00000001`3fa1bedc wspsrv!IsChainingRequired+0x65d4e
00000000`021af5b0 00000001`3f971a53 wspsrv!FwGapaGetConfig+0x20b80
00000000`021af700 00000001`3f94185c wspsrv!IsChainingRequired+0x59da3
00000000`021af7e0 00000001`3f9415ce wspsrv!IsChainingRequired+0x29bac
00000000`021af850 00000000`771cf56d wspsrv!IsChainingRequired+0x2991e
00000000`021af8e0 00000000`77403021 kernel32!BaseThreadInitThunk+0xd
00000000`021af910 00000000`00000000 ntdll!RtlUserThreadStart+0x21
I cannot use the private symbols here (for obvious reasons: they are private), but this function is dealing with re-injection and we had 50 threads performing this operation. This is a magic number, because 50 is the default number of re-injection threads on TMG, as I explain here.
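With hundreds of threads in a dump, counting the stacks by hand gets old quickly. The following is only a minimal sketch of how the two patterns above could be tallied from a text file of thread stacks. It assumes every stack starts with the "Child-SP RetAddr Call Site" header line, and the function name and labels are made up for illustration; they are not part of TMG or of any debugger.

```python
from collections import Counter

# Frames taken from the two stacks shown above; the labels are hypothetical.
SIGNATURES = {
    "dns_lookup": "dnsapi!DnsQuery_A",
    "waiting": "KERNELBASE!WaitForSingleObjectEx",
}


def classify_stacks(dump_text):
    """Count thread stacks by the first signature frame found in each one."""
    counts = Counter()
    # Assumption: each stack begins with a "Child-SP" header line.
    for block in dump_text.split("Child-SP")[1:]:
        label = "other"
        for name, frame in SIGNATURES.items():
            if frame in block:
                label = name
                break
        counts[label] += 1
    return counts
```

Calling classify_stacks() on the saved debugger output returns a Counter, so a quick look at the "dns_lookup" and "waiting" totals tells you whether the threads are piling up behind name resolution.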
3. Moving forward
We now know why the TMG box hangs, but why do we have this gigantic amount of authentication if the environment didn’t suffer any change, the applications are the same, the users are the same…how’s that possible? Maybe malware sending burst traffic from inside to outside? We didn’t know, but we continued the investigation and concluded that the environment was clean of malware.
We used netmon to understand where the traffic was coming from and identified some IPs on the internal network that were sending this gigantic amount of traffic. We tracked those IPs and found the owners of those computers; they were contractors who were at the company working on a project. Guess what? They were using a P2P application to download “some stuff”. I started reviewing more info about this application and found the following statement on their website:
“If you use a software firewall (e.g. ZoneAlarm) you will need to make sure Ares P2P gets full and unlimited access to the Internet.” Source: http://www.aresp2p.net
What’s your reading on this? I will let you think about that.
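By the way, the "who is sending all this traffic" step is easy to script once the capture is exported to text. Here is a rough sketch, assuming the capture was saved as CSV with a column named "Source" holding the sender IP. That column name, the function name and the limit are my assumptions for illustration, not something netmon guarantees.

```python
import csv
import io
from collections import Counter


def top_talkers(csv_text, source_column="Source", limit=5):
    """Return the source addresses that appear most often in the export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # One row per captured frame, so counting rows per source counts frames.
    counts = Counter(row[source_column] for row in reader)
    return counts.most_common(limit)
```

The result is a list of (address, frame count) pairs, busiest first, which is usually enough to know whose desk to visit.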
This was not a TMG problem. TMG was hanging because it was waiting for the DC/DNS to reply to the plethora of requests that it was sending through the network; as long as it couldn’t authenticate the users, it couldn’t just allow them to go through. We really need to step back here and think about security in a broader manner: how do you validate the guest computers on your corporate network? Here, the environment had a good security policy for domain-joined computers, using software restriction policy to keep non-authorized applications from running in the corporate environment. But it didn’t have a validation process for guest computers. I suggest you start by reading “Protect Corporate Assets from Unmanaged Computers” and go from there. At the end of the day you don’t want to create a rule on your firewall allowing everything for all users just because there are some applications that need full Internet access, unless you are willing to take the risk for such action, are you?
(Note: read the follow up of this post here).
Yuri, this is a very complicated situation, because TMG should simply ignore or Drop the requests, not hang.
no wonder why no one wants to migrate to TMG....
As a firewall, TMG can't just ignore the request and drop it in a scenario like that; it needs to give an answer to the client. The same behavior happens on ISA.
Thanks for your comment.