Back in 2009 I wrote this post about PAL (Performance Analysis of Logs Tool). At that time we didn’t have an ISA/TMG template available. The good news is that we now have the TMG template available!!! Back in April 2010 we had our Security Summit in Porto (Portugal), where we started to elaborate a plan to make this happen. My contribution was very small compared with what the guys from PFE did; I pretty much assisted with some TMG perfmon counters/thresholds. The main kudos here go to: Clint Huffman (the tool owner), Shaf Mahmood, Zbigniew Kukowski (content owner), Dirk-Jan van der Vecht and Luís Galvão.
Download PAL from here. Once you install it you will notice that TMG is now on the list, as shown below:
The current template is V2.4 and the template file is located here:
Now you have a new PAL to help you out. Enjoy it!!
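To give a feel for what checking perfmon counters against thresholds means, here is a minimal sketch in Python. It is not how PAL itself is implemented. It assumes a perfmon log exported to CSV, where the first column is the timestamp and the other column headers are full counter paths. The host name TMG01, the sample values and the threshold of 10 are made up for illustration and are not taken from the TMG template file.

```python
import csv
import io


def find_breaches(csv_text, counter_name, threshold):
    """Return (timestamp, value) pairs where the counter exceeds the threshold."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Perfmon CSV headers hold full counter paths, so match on the trailing name.
    columns = [i for i, name in enumerate(header) if name.endswith(counter_name)]
    breaches = []
    for row in reader:
        for i in columns:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # blank or missing sample
            if value > threshold:
                breaches.append((row[0], value))
    return breaches


# Hypothetical sample data: host name and values are invented.
SAMPLE = (
    '"(PDH-CSV 4.0) (Pacific Standard Time)(480)",'
    '"\\\\TMG01\\Forefront TMG Firewall Packet Engine\\Backlogged Packets"\n'
    '"01/10/2011 09:00:00.000","3"\n'
    '"01/10/2011 09:00:15.000","2485"\n'
    '"01/10/2011 09:00:30.000"," "\n'
)

if __name__ == "__main__":
    for timestamp, value in find_breaches(SAMPLE, "\\Backlogged Packets", 10):
        print(timestamp, value)
```

PAL does far more than this (many counters, charts, a full report), so treat the sketch only as a way to picture the threshold idea.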
Yesterday I published this post about an issue that caused TMG to stop responding, and I want to clarify one key point here: TMG didn’t stop because it was not able to handle the load, let’s be clear on that. Maybe it was not clear for some readers who don’t know how TMG works, but the issue here was that the DC was not able to handle the gigantic number of authentication requests at the rate TMG was sending them and waiting for an answer. As a result, TMG’s backlog started to grow and caused this behavior. It’s plain and simple: the DC was not sized to handle that number of authentication requests. Again, not a TMG issue.
A couple of things can be done to keep incidents like this from fully affecting your environment. Here are some key tips (nothing new, but maybe you missed them):
· Don’t create rules allowing ALL OUTBOUND TRAFFIC as the Protocol. This may cause issues, as I explained in this post.
· Make sure to use Internet Explorer 7 or higher to take advantage of Kerberos, which will distribute the authentication load among the DCs. Back in 2008 I wrote this article that explains in detail all the advantages of using Kerberos for Proxy authentication.
By using those practices you offload the authentication requests that would go from TMG to the DC and leave this task to the workstation (again, read this article for more info), which dramatically impacts the backlog (by lowering the utilization). Last but not least, I want to say that it’s all about sizing: if the environment was sized to receive 20 x 100, there will be a negative impact if you see 2000 x 100. There is no magic here. In this case TMG was correctly sized, but as a secure firewall it couldn’t allow traffic to pass through without waiting for the DC to reply confirming that the request comes from a valid user; therefore it will fail safe and block the traffic from traversing the networks.
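The sizing argument can be pictured with a tiny queue model. This is only a sketch with invented numbers, not a model of TMG internals: it assumes a fixed number of authentication requests arriving per second and a fixed number the DC can answer per second, and anything left unanswered stays in the backlog.

```python
def simulate_backlog(arrivals_per_sec, served_per_sec, seconds):
    """Return the backlog size at the end of each second."""
    backlog = 0
    history = []
    for _ in range(seconds):
        # Requests that cannot be answered in this second stay queued.
        backlog = max(0, backlog + arrivals_per_sec - served_per_sec)
        history.append(backlog)
    return history


if __name__ == "__main__":
    # Hypothetical rates: a DC that answers 100 requests per second.
    print(simulate_backlog(80, 100, 5))   # load within capacity
    print(simulate_backlog(300, 100, 5))  # load above capacity
```

As long as arrivals stay under what the DC can answer, the backlog stays at zero. Once arrivals exceed it, the backlog grows every second and never drains on its own, no matter how well the firewall itself is sized.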
BTW, for those of you who still believe that a Hardware Firewall is better, I will leave you with the wise words of my friend Tom Shinder about this old discussion: Tom Shinder on “Hardware” Firewalls.
Yes, another case where TMG stops responding…bad, bad TMG right? NOT!!! Recently I worked on some scenarios where TMG was stopping every day, at the same time of day, and required a manual restart. The TMG admin claimed that nothing had really changed on TMG or on the client workstations; besides, the environment had been running rock solid for a long time and the issue had started happening a couple of weeks earlier.
2. Digging In
After many sessions of data gathering (believe me, it can take more than one round to find out) using the usual approach to collect performance-related data, I found the following trend when the issue was happening:
This is the Forefront TMG Firewall Packet Engine\Backlogged Packets counter, which should stay at about 10 or below, and it was 2,485.000 (WOW…just WOW). Notice this beautiful line going from 0 to 2K and, worse, staying there forever. Now we know why TMG stops responding, but why is the backlog growing? Well, there are two core reasons: authentication and/or name resolution. In the user mode dump of wspsrv.exe we have hundreds of threads like this:
Child-SP RetAddr Call Site
00000000`11fbccc8 000007fe`fd82aa76 ntdll!ZwAlpcSendWaitReceivePort+0xa
00000000`11fbccd0 000007fe`fd8ccb64 rpcrt4!NdrDllCanUnloadNow+0x31c6
00000000`11fbcd90 000007fe`fd8ccd55 rpcrt4!Ndr64AsyncClientCall+0xe04
00000000`11fbd050 000007fe`fc9e1f95 rpcrt4!NdrClientCall3+0xf5
00000000`11fbd3e0 000007fe`fc9e1e74 dnsapi!DnsApiAlloc+0xdd1
00000000`11fbd440 000007fe`fc9e60a6 dnsapi!DnsApiAlloc+0xcb0
00000000`11fbd500 000007fe`fca0d012 dnsapi!DnsValidateName_W+0x186
00000000`11fbd580 00000000`72cab68f dnsapi!DnsQuery_A+0x36
00000000`11fbd5d0 00000000`72caaced msphlpr!COC_NameResolution_TargetImpl::FoundInNegativeCache+0x2b93
00000000`11fbd6a0 00000000`72ca6efe msphlpr!COC_NameResolution_TargetImpl::FoundInNegativeCache+0x21f1
00000000`11fbd960 00000001`3fa467ff msphlpr!ProxyGetHostByAddr+0x4a2
00000000`11fbdce0 00000001`3fa48420 wspsrv!FwGapaGetConfig+0x4b4a3
00000000`11fbddb0 00000001`3f8ded8d wspsrv!FwGapaGetConfig+0x4d0c4
00000000`11fbe660 00000001`3f92e07c wspsrv+0x5ed8d
00000000`11fbe730 00000001`3f92d79e wspsrv!IsChainingRequired+0x163cc
00000000`11fbef90 00000001`3f8ea240 wspsrv!IsChainingRequired+0x15aee
00000000`11fbf050 00000001`3f8f1c0a wspsrv+0x6a240
00000000`11fbf1f0 00000001`3f8f11d2 wspsrv+0x71c0a
00000000`11fbf270 00000001`3f9838bf wspsrv+0x711d2
00000000`11fbf320 00000001`3f97d871 wspsrv!DeleteFwEngFilter+0x249b
00000000`11fbf360 00000001`3fa1bedc wspsrv!IsChainingRequired+0x65bc1
00000000`11fbf550 00000001`3f971a53 wspsrv!FwGapaGetConfig+0x20b80
00000000`11fbf6a0 00000001`3f94185c wspsrv!IsChainingRequired+0x59da3
00000000`11fbf780 00000001`3f9415ce wspsrv!IsChainingRequired+0x29bac
00000000`11fbf7f0 00000000`771cf56d wspsrv!IsChainingRequired+0x2991e
00000000`11fbf880 00000000`77403021 kernel32!BaseThreadInitThunk+0xd
00000000`11fbf8b0 00000000`00000000 ntdll!RtlUserThreadStart+0x21
…and lots more like this:
00000000`021aece8 000007fe`fd3e10ac ntdll!ZwWaitForSingleObject+0xa
00000000`021aecf0 00000001`3f90609b KERNELBASE!WaitForSingleObjectEx+0x9c
00000000`021aed90 00000001`3f8ec283 wspsrv+0x8609b
00000000`021aee60 00000001`3f8f1c37 wspsrv+0x6c283
00000000`021af250 00000001`3f8f11d2 wspsrv+0x71c37
00000000`021af2d0 00000001`3f9838bf wspsrv+0x711d2
00000000`021af380 00000001`3f97d9fe wspsrv!DeleteFwEngFilter+0x249b
00000000`021af3c0 00000001`3fa1bedc wspsrv!IsChainingRequired+0x65d4e
00000000`021af5b0 00000001`3f971a53 wspsrv!FwGapaGetConfig+0x20b80
00000000`021af700 00000001`3f94185c wspsrv!IsChainingRequired+0x59da3
00000000`021af7e0 00000001`3f9415ce wspsrv!IsChainingRequired+0x29bac
00000000`021af850 00000000`771cf56d wspsrv!IsChainingRequired+0x2991e
00000000`021af8e0 00000000`77403021 kernel32!BaseThreadInitThunk+0xd
00000000`021af910 00000000`00000000 ntdll!RtlUserThreadStart+0x21
I cannot use the private symbols here (for obvious reasons, they are private), but this function is dealing with re-injection and we had 50 threads performing this operation. This is a magic number, because 50 is the default number of re-injection threads on TMG, as I explain here.
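When a dump has hundreds of threads, counting how many of them share a given frame by hand gets tedious. Below is a minimal sketch of how that counting could be scripted. It assumes the stacks of all threads were saved to a text file from the debugger (for example with the ~*k command) and that each thread starts with a header line containing "Id:". The thread IDs and Teb addresses in the sample are invented.

```python
import re

# A thread header looks like "   0  Id: 1a2c.1b30 Suspend: 1 Teb: ... Unfrozen",
# optionally prefixed with "." or "#" by the debugger.
THREAD_HEADER = re.compile(r"^[.#\s]*\d+\s+Id:\s")


def count_threads_with_frame(dump_text, frame_substring):
    """Count threads whose call stack contains frame_substring."""
    count = 0
    in_thread = False
    matched = False
    for line in dump_text.splitlines():
        if THREAD_HEADER.match(line):
            in_thread = True
            matched = False
        elif in_thread and not matched and frame_substring in line:
            matched = True  # count each thread only once
            count += 1
    return count


# Hypothetical, shortened sample in the style of the stacks shown above.
SAMPLE_DUMP = """
   0  Id: 1a2c.1b30 Suspend: 1 Teb: 000007ff`fffde000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`11fbccc8 000007fe`fd82aa76 ntdll!ZwAlpcSendWaitReceivePort+0xa
00000000`11fbd580 00000000`72cab68f dnsapi!DnsQuery_A+0x36
00000000`11fbd960 00000001`3fa467ff msphlpr!ProxyGetHostByAddr+0x4a2

   1  Id: 1a2c.1b34 Suspend: 1 Teb: 000007ff`fffdc000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`021aece8 000007fe`fd3e10ac ntdll!ZwWaitForSingleObject+0xa
00000000`021aecf0 00000001`3f90609b KERNELBASE!WaitForSingleObjectEx+0x9c

.  2  Id: 1a2c.1b38 Suspend: 1 Teb: 000007ff`fffda000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`12fbccc8 000007fe`fd82aa76 ntdll!ZwAlpcSendWaitReceivePort+0xa
00000000`12fbd580 00000000`72cab68f dnsapi!DnsQuery_A+0x36
"""

if __name__ == "__main__":
    print(count_threads_with_frame(SAMPLE_DUMP, "dnsapi!DnsQuery_A"))
    print(count_threads_with_frame(SAMPLE_DUMP, "ntdll!ZwWaitForSingleObject"))
```

Running the same count for a name resolution frame and for a wait frame gives a quick picture of where the threads are stuck, which is the kind of split described above.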
3. Moving forward
We now know why the TMG box hangs, but why do we have this gigantic amount of authentication if the environment didn’t suffer any change, the applications are the same, the users are the same…how’s that possible? Maybe malware sending burst traffic from inside to outside? We didn’t know, but we continued the investigation and concluded that the environment was clean of malware.
We used netmon to understand where the traffic was coming from and identified some IPs on the internal network that were sending this gigantic amount of traffic. We tracked those IPs and found the owners of the computers; they were contractors who were at the company working on a project. Guess what? They were using a P2P application to download “some stuff”. I started reviewing more info about this application and found the following statement on their website:
“If you use a software firewall (e.g. ZoneAlarm) you will need to make sure Ares P2P gets full and unlimited access to the Internet.” Source: http://www.aresp2p.net
What’s your reading on this? I will let you think about that.
This was not a TMG problem. TMG was hanging because it was waiting for the DC/DNS to reply to the plethora of requests that it was sending through the network; since it couldn’t authenticate the users, it couldn’t just allow them to go through. We really need to step back here and think about security in a broader manner: how do you validate the guest computers on your corporate network? Here, the environment had a good security policy for domain-joined computers, using software restriction policy and disallowing non-authorized applications from running in the corporate environment. But it didn’t have a validation process for guest computers. I suggest you start by reading “Protect Corporate Assets from Unmanaged Computers” and go from there. At the end of the day you don’t want to create a rule on your firewall allowing everything for all users just because there are some applications that need full Internet access, unless you are willing to take the risk of such an action, are you?
(Note: read the follow up of this post here).
BPOS is growing at a fast pace, and as IT admins start to use this service they need to adjust their firewall to properly allow traffic to flow from the on-premises clients to the cloud. Microsoft Online Services did a good job documenting what needs to be in place from the firewall perspective to allow this traffic to flow correctly. Here are the main articles for this type of deployment:
KB2410859 Firewall prevents users from using Microsoft Online Services Directory Synchronization, rich clients, or the Microsoft Online Services Identity Federation Management tool in Office 365
KB2409256 You cannot connect to Lync Online, or certain features do not work, because an on-premises firewall blocks the connection
Both articles mention ISA Server as an example, and they also mention that for ISA you may need to use Firewall Client in order to make this deployment work. If you use Firewall Client, nothing else needs to be done on the client workstation; however, if you don’t want to install Firewall Client you will need to edit the file Program Files\Microsoft Online Services\Sign In\SignIn.exe.config and add the entry below:
<proxy usesystemdefault="True" />
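If you have to make this change on more than a handful of workstations, it can be scripted. The sketch below is one possible approach and rests on an assumption: that SignIn.exe.config follows the standard .NET configuration layout, where the proxy element sits under system.net/defaultProxy. Check the KB article for the exact placement before using anything like this, and work on a backup copy of the file. The function name and the sample input are hypothetical.

```python
import xml.etree.ElementTree as ET


def get_or_create(parent, tag):
    """Return the first direct child with this tag, creating it if missing."""
    for child in parent:
        if child.tag == tag:
            return child
    return ET.SubElement(parent, tag)


def enable_system_proxy(config_xml):
    """Return the config XML with usesystemdefault set to True on the proxy element.

    Assumes the standard .NET layout: configuration/system.net/defaultProxy/proxy.
    """
    root = ET.fromstring(config_xml)
    system_net = get_or_create(root, "system.net")
    default_proxy = get_or_create(system_net, "defaultProxy")
    proxy = get_or_create(default_proxy, "proxy")
    proxy.set("usesystemdefault", "True")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical minimal config used only to show the result.
    print(enable_system_proxy("<configuration><appSettings /></configuration>"))
```

Note that this simple parser does not keep XML comments from the original file, so compare the result with the backup before rolling it out.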
Consider a scenario where you have all the implementations in place: rules are correctly configured on ISA Server as per KB2410859 and Firewall Client is on the workstation. However, the issue persists, and in the ISA log you see access denied due to an anonymous request. When you look closely at the detailed logging (Monitoring/Logging/Lower Pane) you see that no rules appear there, which means that the request is getting processed in lower level mode (kernel).
The problem here was caused by the option below being enabled:
When you enable this option you might have issues with a variety of applications (not only BPOS), because this option completely disables anonymous access for Web Proxy requests on the network. This option forces the user’s credentials to be requested even before the firewall policy starts to be evaluated. This is the reason why, when you enable this option, you receive the warning below:
As you can see in this warning window, this option can cause compatibility issues with applications such as Windows Update (and I found out that with BPOS too). In order to avoid compatibility problems, disable this option and make sure to control your user access via Firewall Policy. There are many other scenarios where we recommend disabling this option; see this article for more information. After disabling this option the user was able to log in:
Have a good migration to the Cloud!!
First of all, happy new year!! It took me a long time to come back here due to many other projects going on…I’m actually feeling like I’m still in December, as I didn’t really have time off during the holidays. Very different from Brazil (where I’m originally from); there, things are slow during the holidays and keep going slow until Carnival (usually in February). This is actually very funny to me, because recently when I was writing my next book (about Security+ Certification, in Portuguese) I told my editor: “let’s release the book in March” and he said: “This year Carnival is in March, so nobody will really read books in March, let’s release in April”. He has a good point for sure. But since I moved to the US I’ve noticed that the year really starts on January 2nd :).
Anyway, here goes the first post of this year, and it is about a collaboration with a colleague of mine who was originally troubleshooting this issue. The problem occurred when trying to join a new node to an existing TMG Array, and the following error message appeared:
The user that was trying to join had permission on the Array level as shown below:
We also could see on ProcMon that this user was making the connection to the remote server while the issue was happening:
Unfortunately, in this situation, as the error message showed up right away, nothing really useful was in the TMG Setup logs (located at %windir%\temp). Now what? Well, now you need to move to deeper data gathering and use TMG Data Packager on both servers (the EMS and the node that is trying to join). In this particular scenario it was possible to see the error “ldap_modify_s failed” followed by 0x80070005 (which is Access Denied) while trying to change some properties on ADAM (AD LDS). After reviewing the source code for this specific error at the moment of the failure, it was possible to understand that in order to perform such an action the user needs Enterprise level rights; in this case the user was not there, as shown below:
Once we added the user there (Enterprise Level) it was possible to join without any issue. So…when deploying TMG, remember that the user who is joining new members to the array needs to have Enterprise Level permission.
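As a side note on reading 0x80070005 as Access Denied: an HRESULT packs a failure bit, a facility and an error code into 32 bits, and facility 7 means the low 16 bits are a plain Win32 error code. A minimal sketch of that decoding follows; the function name is hypothetical, and the constants are the standard Windows values.

```python
FACILITY_WIN32 = 7        # HRESULT wraps a Win32 error code
ERROR_ACCESS_DENIED = 5   # Win32 error code for access denied


def decode_hresult(hresult):
    """Split a 32-bit HRESULT into (is_failure, facility, code)."""
    hresult &= 0xFFFFFFFF
    is_failure = bool(hresult >> 31)      # top bit is the severity bit
    facility = (hresult >> 16) & 0x7FF    # 11-bit facility field
    code = hresult & 0xFFFF               # low 16 bits are the error code
    return is_failure, facility, code


if __name__ == "__main__":
    failed, facility, code = decode_hresult(0x80070005)
    print(failed, facility == FACILITY_WIN32, code == ERROR_ACCESS_DENIED)
```

The same split works for other 0x8007xxxx values you find in setup or Data Packager logs: take the low 16 bits and look them up as a Win32 error.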
Note: If you decide to add a group at the Enterprise level, remember the warning in the following window: