Consider a scenario where the TMG administrator publishes servers located behind TMG and, after enabling NLB on the external interface, users are no longer able to access those resources. If he uses the DIP (Dedicated IP) to publish the resource, it works. The basic diagram is shown in the figure below:
The traffic from inside (VLAN E) to an outside resource (for example, VLAN A) using the internal VIP (10.10.10.1) as the default gateway was working; however, the incoming traffic (from VLAN A, for instance) to the external VIP (192.168.4.10) was failing.
There is no clever troubleshooting trick here; in scenarios like this, always start from the basics. Which basics? Enable live logging and verify whether the traffic is ever arriving at TMG in the first place. In this case it was not, but to be on the safe side we installed Network Monitor on all three nodes to confirm whether the traffic was hitting the box at all. It was not; no traffic was arriving at TMG.
After reviewing the core switch we were able to see that the NLB MAC address registered there was wrong, hence the traffic was never arriving at TMG. This was a Cisco Catalyst 6500 Series switch, and we followed Cisco's recommendation below in order to address this issue:
Cisco Catalyst 6500 Series Switches - Catalyst Switches for Microsoft Network Load Balancing Configuration Example http://www.cisco.com/en/US/products/hw/switches/ps708/products_configuration_example09186a0080a07203.shtml
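As a rough sketch of what that Cisco document describes (the VIP below matches this scenario, but the VLAN number and interfaces are illustrative assumptions, not taken from this environment): for multicast NLB, the Catalyst workaround is a static ARP entry mapping the VIP to the cluster's multicast MAC, plus a static MAC address table entry binding that MAC to the ports facing the TMG nodes.

```
! Illustrative only -- VLAN 200 and the Gi1/x ports are assumptions.
! 03bf.c0a8.040a is the NLB multicast MAC derived from the VIP
! (03-BF followed by 192.168.4.10 in hex).
arp 192.168.4.10 03bf.c0a8.040a ARPA
mac-address-table static 03bf.c0a8.040a vlan 200 interface Gi1/1 Gi1/2 Gi1/3
```

The key point is that the switch will never learn this MAC dynamically, which is exactly why a wrong or missing entry silently blackholes the inbound VIP traffic.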
TMG NLB leverages Windows NLB, and all the recommendations that we have for Windows NLB apply to TMG NLB. In other words, if Windows NLB has restrictions, TMG will obey those restrictions. Here are some important points to remember when implementing NLB:
Switch is operating in Layer-3 mode
NLB is not supported when the hosts are homed to a switch operating at Layer-3. Instead, create a VLAN for all the nodes in the NLB cluster, and configure that VLAN to operate in Layer-2 mode.
An unusual number of TCP connections to the cluster are being reset.
See more “gotchas” at http://download.microsoft.com/download/3/2/3/32386822-8fc5-4cf1-b81d-4ee136cca2c5/NLB_Troubleshooting_Guide.htm
…and always remember the 5 commandments when troubleshooting NLB on TMG:
1. Never assume that the problem is on TMG in the first place.
2. Never think that because your network infrastructure worked fine for years it is free of issues.
3. Do not take it personally if someone says: "you've got a problem on your switch."
4. Always have a hub available for testing purposes; sometimes a dumb device that validates NLB functionality can save much more time than hours spent dealing with smart devices.
5. NLB unicast causes switch flooding, so don't be surprised to see flooding after enabling NLB on TMG, and don't blame TMG; this is how Windows NLB works.
Parts 2 and 3 of this webcast, which explains the core reasons to migrate from ISA 2006 to TMG 2010, are now available. We are working on having this series translated into English. Stay tuned to this channel.
Troubleshooting high CPU utilization can be challenging, even when you know the process causing the behavior, which in this case is wspsrv.exe. To better understand what is going on in scenarios like this, it is recommended to capture a perfmon snapshot covering the period from when the process is working normally until the issue happens. It is also recommended to get a user-mode dump of the wspsrv.exe process; to do that, read this article here. This post is based on a scenario where the wspsrv.exe process was consuming a lot of CPU during business hours. The issue did not happen outside business hours.
In this case perfmon was decisive because it was possible to see the normal behavior (wspsrv.exe averaging around 20%) and then, once business hours started, CPU utilization by the wspsrv.exe process growing steadily. However, having lots of perfmon counters without knowing exactly where to start looking is not very useful; it was the user-mode dump of the wspsrv.exe process that pointed to what to look for. For this particular scenario we had around 1000 threads running within the wspsrv.exe process, and around 200 of them showed the same pattern as below:
0:000> ~50 kv
*** ERROR: Symbol file could not be found. Defaulted to export symbols for mpengine.dll -
Child-SP RetAddr : Args to Child : Call Site
00000000`1ea0e520 00000000`70433461 : 00000000`889f7b60 00000000`889f7b50 00000000`889f7bb0 00000000`889f78d8 : 0x1e774914
00000000`1ea0e570 00000000`70432f68 : 00000000`889f7828 00000000`889f7828 00000000`889f7bb0 00000000`7046ba90 : mpengine+0x13461
00000000`1ea0e630 00000000`704cec70 : 00000000`889b0040 00000000`889f7bb0 00000000`00000012 00000000`889f78d8 : mpengine+0x12f68
00000000`1ea0e6e0 00000000`704cdb4b : 00000000`000000fa 00000000`00000001 00000000`00000009 00000000`00000000 : mpengine+0xaec70
00000000`1ea0e8e0 00000000`7043ddd0 : 00000000`16d285f0 00000000`16d285f0 00000000`00000000 00000000`16d285f0 : mpengine+0xadb4b
00000000`1ea0ea00 00000000`7043e447 : 00000000`16d28110 00000000`16d285f0 00000000`00000000 00000000`00000000 : mpengine+0x1ddd0
00000000`1ea0ebf0 00000000`7043e1bf : 00000001`804fbf93 00000000`00000000 00000000`09180492 00000000`00000001 : mpengine+0x1e447
00000000`1ea0ec90 00000000`704adfb6 : 00000000`16d28100 00000000`16c66f01 00000000`00000000 00000000`00000000 : mpengine+0x1e1bf
00000000`1ea0ed20 00000000`7057b1b1 : 00000000`183c0e00 00000000`7068aa80 00000000`00000000 00000000`16d28110 : mpengine+0x8dfb6
00000000`1ea0edb0 00000000`7057b5e3 : 00000000`21825080 00000000`7068aa80 00000000`00000001 00000000`183c0e00 : mpengine+0x15b1b1
00000000`1ea0ee10 00000000`7057b7ba : 00000000`21825080 00000000`1ea0ef90 00000000`21825080 00000000`21825080 : mpengine+0x15b5e3
00000000`1ea0eed0 00000000`7057b6ab : 00000000`00000000 00000000`0000800c 00000000`00000001 00000000`21825080 : mpengine+0x15b7ba
00000000`1ea0ef20 00000000`70686365 : 00000000`00000000 00000000`00000020 00000000`5d69f640 00000000`5d69f640 : mpengine+0x15b6ab
00000000`1ea0ef50 00000000`7057da92 : 00000000`00000020 00000000`00004032 00000000`21825080 000007fe`f1a029e4 : mpengine!GetSigFiles+0x5b035
00000000`1ea0f080 00000000`7057f4d6 : 00000000`569def80 00000000`00000001 00000000`1d730548 00000000`76ecccd4 : mpengine+0x15da92
00000000`1ea0f230 00000000`7057f605 : 00000000`569def80 00000000`1ea0f440 00000000`00004032 00000000`76fb3660 : mpengine+0x15f4d6
*** ERROR: Symbol file could not be found. Defaulted to export symbols for EmpScan.dll -
00000000`1ea0f260 000007fe`f1a0d2d9 : 00000000`00000000 00000000`712c32c3 00000000`0195da28 00000000`76e901f2 : mpengine!rsignal+0x105
00000000`1ea0f390 000007fe`f1a0c668 : 00000000`569decd0 00000000`00004032 00000000`1ea0f440 00000000`00000020 : EmpScan!SetProxyStatsObject+0x9289
00000000`1ea0f400 000007fe`f1a104ff : 00000000`569decd0 00000000`6c8eeb68 00000000`6c8eeb70 000007fe`f19f8bc8 : EmpScan!SetProxyStatsObject+0x8618
00000000`1ea0f550 000007fe`f1a0904f : 00000000`6c8eeb70 00000000`0385b5f0 00000000`60e43070 00000000`000003b8 : EmpScan!SetProxyStatsObject+0xc4af
EMPScan and MPEngine are part of Malware Inspection, as explained in this KB. This raises a flag that the large amount of CPU utilization by the wspsrv.exe process was caused by those 200 threads running functions that belong to Malware Inspection. Having those 200 threads was already suspicious, but it could also be that the server was simply very busy performing malware inspection during that time. In order to move forward, I verified whether any thread was blocked on a critical section. To do that the command !cs -l was used, and the result was:
DebugInfo = 0x0000000000343980
Critical section = 0x000000000032b500 (+0x32B500)
LockCount = 0x0
WaiterWoken = No
OwningThread = 0x0000000000000b28
RecursionCount = 0x1
LockSemaphore = 0x0
SpinCount = 0x0000000000000000
DebugInfo = 0x00000000324861f0
Critical section = 0x00000000648f12d0 (+0x648F12D0)
OwningThread = 0x0000000000000b1c
SpinCount = 0x0000000000000fa0
DebugInfo = 0x00000000328a9200
Critical section = 0x000000005f1b8240 (+0x5F1B8240)
OwningThread = 0x00000000000011d0
….and many more…
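As an aside, the pattern that !cs -l surfaces (one OwningThread holding a critical section while other threads stall behind it) is not TMG-specific. A minimal analogy, sketched here with a plain Python lock rather than a Win32 CRITICAL_SECTION, shows the same signature: one owner gets in for free, everyone else queues.

```python
import threading
import time

lock = threading.Lock()   # stands in for the CRITICAL_SECTION
blocked = []              # ids of threads that had to wait noticeably

def worker(tid):
    start = time.monotonic()
    with lock:            # waiters queue here, like EnterCriticalSection
        if time.monotonic() - start > 0.05:
            blocked.append(tid)
        time.sleep(0.1)   # simulate slow work while owning the lock

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Exactly one thread acquired the lock without waiting; the other four
# stalled behind the owner -- the same shape !cs -l reveals in the dump.
print(len(blocked))
```

Here only four of the five threads end up in `blocked`, because whichever thread wins the lock first never waits; in the dump, the debugger shows this as one OwningThread per contended critical section.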
Now that we know the threads that are locked on a critical section, we just need to get more details about those threads by using the command !cs -o <owning thread>. In this case the relevant thread was:
0:000> !cs -o 0x000000005f1d9460
Critical section = 0x000000005f1d9460 (+0x5F1D9460)
DebugInfo = 0x00000000565fd4f0
OwningThread = 0x000000000000194c
RecursionCount = 0x5
LockSemaphore = 0x43E80
OwningThread DbgId = ~889s
OwningThread Stack =
00000000`78fde658 000007fe`f0c0a6c9 : 00000000`566fa638 00000000`78fde6c4 00000000`566fa638 00000000`00000000 : EmpFilter!DllUnregisterServer+0x1c6b7
00000000`78fde660 000007fe`f0c22a52 : 00000000`566fa630 00000000`31eb9120 000007fe`f0bc9198 00000000`4a74d9d0 : EmpFilter!DllUnregisterServer+0x20431
00000000`78fde7b0 000007fe`f0c15f53 : 00000000`31eb9100 00000000`566fa6fc 00000000`4a74d9d0 00000000`00000000 : EmpFilter!DllUnregisterServer+0x387ba
00000000`78fde830 000007fe`f0c0c11b : 00000000`31eb9100 00000000`566fa6fc 00000000`4a74d9d0 00000000`4a74d9d0 : EmpFilter!DllUnregisterServer+0x2bcbb
00000000`78fde8a0 000007fe`f0c172ee : 00000000`566fa630 00000000`00000000 00000000`566fa648 00000000`78fde960 : EmpFilter!DllUnregisterServer+0x21e83
00000000`78fde920 000007fe`f0c0d7d4 : 00000000`566fa638 00000000`00000000 00000000`566fa740 00000000`00001355 : EmpFilter!DllUnregisterServer+0x2d056
00000000`78fde9a0 000007fe`f0c12c67 : 00000000`566fa630 00000000`618a3ba0 00000000`00001355 00000000`4a74d6b0 : EmpFilter!DllUnregisterServer+0x2353c
00000000`78fdeb20 000007fe`f0c1afad : 00000000`566fa630 00000000`618a3ba0 00000000`00001355 00000000`00000000 : EmpFilter!DllUnregisterServer+0x289cf
00000000`78fdebd0 000007fe`f0be40ee : 00000000`6cfd8640 00000000`30f6eaf0 000007fe`f0bca490 000007fe`f0bd133f : EmpFilter!DllUnregisterServer+0x30d15
00000000`78fdec60 000007fe`f0c3270b : 00000000`1af7a7b0 00000000`56cdf680 00000000`30f6eaf0 00000000`78fded64 : EmpFilter+0x240ee
00000000`78fded40 00000000`72ebf77b : 000007fe`f0c5eb18 00000000`56cdf680 00000000`00000400 00000000`30f6eaf0 : EmpFilter!DllUnregisterServer+0x48473
00000000`78fdeda0 00000000`72ed316d : 00000000`56cdf620 00000000`618a3ba0 00000000`00001355 00000000`00001355 : W3Filter!DllInstall+0x2419f
00000000`78fdeec0 00000000`72eabe84 : 00000000`562b4480 00000000`618a3ba0 00000000`00001355 00000000`78fdefe0 : W3Filter!DllInstall+0x37b91
00000000`78fdef30 00000000`72efcacd : 00000000`562b44b8 00000000`1ad135e0 00000000`618a3ba0 00000000`00001355 : W3Filter!DllInstall+0x108a8
00000000`78fdefd0 00000000`72f031bc : 00000000`27109010 00000000`00000001 00000000`000003e5 00000000`00000000 : W3Filter!DllInstall+0x614f1
00000000`78fdf040 00000000`72f038cb : 00000000`27109010 00000000`00000000 00000000`00000000 00000000`72e5d9d0 : W3Filter!DllInstall+0x67be0
00000000`78fdf140 00000000`72f04420 : 00000000`27109010 00000000`27109030 00000000`00000000 00000000`565da0e0 : W3Filter!DllInstall+0x682ef
00000000`78fdf180 00000000`72ede14f : 00000000`27109030 00000000`00001355 00000000`00000000 00000000`00000000 : W3Filter!DllInstall+0x68e44
00000000`78fdf1c0 00000001`3fcc016e : 00000000`4a5c5e80 00000000`00000001 00000000`00000000 00000000`565da0e0 : W3Filter!DllInstall+0x42b73
00000000`78fdf4a0 00000001`3fcbf557 : 00000000`3235bd50 00000000`00001355 00000000`00000000 00000000`00000001 : wspsrv!IsChainingRequired+0x582d2
The resolution for this problem is documented in this article here on the TMG Team Blog.
4. Another Approach to Troubleshooting Issues of This Nature
Another approach to troubleshooting high CPU utilization is to leverage perfmon's capabilities to identify the threads that are consuming the most resources within the process. For example, in this scenario perfmon showed the following result:
Here you have two key counters within the Thread object:
· %Processor Time: specifies how much processor time this thread is using.
· ID Thread: the ID of the thread, which you can use to correlate it with the dump of this process.
Using the recommendations from KB133385 we have:
1. Monitor each thread's %Processor Time and identify the instance number in Performance Monitor for the thread that takes the most processor time.
2. Find the ID Thread item that has the same instance number as the thread you identified in step 1 and select this ID Thread with the mouse. Note that its Last and Average status indicators have the same value. This value is the decimal thread ID.
3. Convert the decimal thread ID to a hexadecimal (hex) number to get the thread ID for desired component.
With the hex value of the thread ID in hand, you can go back to the dump file, locate that thread, and see what it is doing.
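As a quick sketch of step 3: the OwningThread value 0x194c shown earlier corresponds to decimal 6476, which is the form perfmon's ID Thread counter reports. The conversion is a one-liner:

```python
# ID Thread as reported by perfmon (decimal); 6476 matches the
# OwningThread 0x194c seen in the debugger output above.
id_thread_decimal = 6476

# Convert to hex to match the thread IDs the debugger displays
id_thread_hex = format(id_thread_decimal, "x")
print(id_thread_hex)  # -> 194c
```

In WinDbg you could then switch to that thread with ~~[194c]s and dump its stack.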
Note: there is a really good step-by-step guide on Marcelo’s blog. Also, make sure to read Chapter 13 of the Windows NT Resource Kit (an old buddy that still rocks).
Here is the first video (of 3) in which I go over some of the core reasons why you should migrate from ISA 2006 to TMG:
Almost one month without an update; that’s very rare for sure. Time is working against me, but pretty soon I should be able to breathe again. For the last couple of months I was studying for my CEH certification, and a couple of weeks ago I got it. Here are some resources I recommend reading if you are studying for this certification:
Some Complementary Articles
Now I’m working to finish my new book (in Portuguese only) about the Security+ certification (Portuguese readers, go to www.securityplusbr.org for more info). This project should be done by the end of this month, and the book should be released in Brazil in April next year.
In the community space, my latest contributions were:
My MVP friends are very active these days and here are some great articles that they recently published:
Next week I should have new posts here covering some interesting issues that I’m dealing with.