website statistics
October, 2010 - Yuri Diogenes's Blog - Site Home - TechNet Blogs

Yuri Diogenes's Blog

Thoughts from a Senior Technical Writer @ Microsoft Server and Cloud Division (Solutions Group) - Information Experience

October, 2010

Posts
  • Yuri Diogenes's Blog

    Updates

    • 2 Comments

    Almost one month without update, that’s very rare for sure, time is working against me but pretty soon I should be able to breathe again. Last couple of months I was studying to get my CEH certification and couple of weeks ago and I got it. Here are some resources that I recommend you to read if you are studying for this certification:

    Books

    Some Articles to Complement

    Now I’m working to finish my new book (in Portuguese only) about Security+ Certification (Portuguese’s readers go to www.securityplusbr.org for more info). This project should be done by end of this month and this book should be released in Brazil April next year.

    On the community space my last contributions were:

    My MVP friends are very active these days and here are some great articles that they recently published:

    Next week I should have new posts here with some interesting issues that I’m dealing with.

  • Yuri Diogenes's Blog

    Reasons to Migrate from ISA 2006 to TMG (Portuguese WebCast)

    • 0 Comments

    Here it goes the first video (out of 3) on which I go over some core reasons why you should migrate from ISA 2006 to TMG:

  • Yuri Diogenes's Blog

    High Processor Utilization by wspsrv.exe process on TMG 2010

    • 0 Comments

    1. Introduction

     

    Troubleshooting high CPU utilization can be challenging, even when you know the process that is causing such behavior, which in this case is the wspsrv.exe.  In order to better understand what’s going on in scenarios like this is recommended to get perfmon snapshot, including the time that the process is working normally, until the issue happens. It is also recommended to get a user mode dump of the wspsrv.exe process; to do that read this article here. This post is based on a scenario where wspsrv.exe process was consuming lots of CPU resource during business hours. The issue did not happen out of the business hours.

     

    2. Troubleshooting

     

    In this case perfmon was very decisive because it was possible to see the normal behavior (wspsrv.exe using around 20% average) and then after starting business hours the CPU utilization by wspsrv.exe process grows consistently. However, having lots of perfmon counters without knowing exactly what to start looking at is not very useful, it was the user mode dump of the wspsrv.exe process that triggered what to look for. For this particular scenario we had around 1000 threads running within wspsrv.exe process and around 200 threads had the same pattern as shown below:

     

    0:000> ~50 kv

    *** ERROR: Symbol file could not be found.  Defaulted to export symbols for mpengine.dll -

    Child-SP          RetAddr           : Args to Child                                                           : Call Site

    00000000`1ea0e520 00000000`70433461 : 00000000`889f7b60 00000000`889f7b50 00000000`889f7bb0 00000000`889f78d8 : 0x1e774914

    00000000`1ea0e570 00000000`70432f68 : 00000000`889f7828 00000000`889f7828 00000000`889f7bb0 00000000`7046ba90 : mpengine+0x13461

    00000000`1ea0e630 00000000`704cec70 : 00000000`889b0040 00000000`889f7bb0 00000000`00000012 00000000`889f78d8 : mpengine+0x12f68

    00000000`1ea0e6e0 00000000`704cdb4b : 00000000`000000fa 00000000`00000001 00000000`00000009 00000000`00000000 : mpengine+0xaec70

    00000000`1ea0e8e0 00000000`7043ddd0 : 00000000`16d285f0 00000000`16d285f0 00000000`00000000 00000000`16d285f0 : mpengine+0xadb4b

    00000000`1ea0ea00 00000000`7043e447 : 00000000`16d28110 00000000`16d285f0 00000000`00000000 00000000`00000000 : mpengine+0x1ddd0

    00000000`1ea0ebf0 00000000`7043e1bf : 00000001`804fbf93 00000000`00000000 00000000`09180492 00000000`00000001 : mpengine+0x1e447

    00000000`1ea0ec90 00000000`704adfb6 : 00000000`16d28100 00000000`16c66f01 00000000`00000000 00000000`00000000 : mpengine+0x1e1bf

    00000000`1ea0ed20 00000000`7057b1b1 : 00000000`183c0e00 00000000`7068aa80 00000000`00000000 00000000`16d28110 : mpengine+0x8dfb6

    00000000`1ea0edb0 00000000`7057b5e3 : 00000000`21825080 00000000`7068aa80 00000000`00000001 00000000`183c0e00 : mpengine+0x15b1b1

    00000000`1ea0ee10 00000000`7057b7ba : 00000000`21825080 00000000`1ea0ef90 00000000`21825080 00000000`21825080 : mpengine+0x15b5e3

    00000000`1ea0eed0 00000000`7057b6ab : 00000000`00000000 00000000`0000800c 00000000`00000001 00000000`21825080 : mpengine+0x15b7ba

    00000000`1ea0ef20 00000000`70686365 : 00000000`00000000 00000000`00000020 00000000`5d69f640 00000000`5d69f640 : mpengine+0x15b6ab

    00000000`1ea0ef50 00000000`7057da92 : 00000000`00000020 00000000`00004032 00000000`21825080 000007fe`f1a029e4 : mpengine!GetSigFiles+0x5b035

    00000000`1ea0f080 00000000`7057f4d6 : 00000000`569def80 00000000`00000001 00000000`1d730548 00000000`76ecccd4 : mpengine+0x15da92

    00000000`1ea0f230 00000000`7057f605 : 00000000`569def80 00000000`1ea0f440 00000000`00004032 00000000`76fb3660 : mpengine+0x15f4d6

    *** ERROR: Symbol file could not be found.  Defaulted to export symbols for EmpScan.dll -

    00000000`1ea0f260 000007fe`f1a0d2d9 : 00000000`00000000 00000000`712c32c3 00000000`0195da28 00000000`76e901f2 : mpengine!rsignal+0x105

    00000000`1ea0f390 000007fe`f1a0c668 : 00000000`569decd0 00000000`00004032 00000000`1ea0f440 00000000`00000020 : EmpScan!SetProxyStatsObject+0x9289

    00000000`1ea0f400 000007fe`f1a104ff : 00000000`569decd0 00000000`6c8eeb68 00000000`6c8eeb70 000007fe`f19f8bc8 : EmpScan!SetProxyStatsObject+0x8618

    00000000`1ea0f550 000007fe`f1a0904f : 00000000`6c8eeb70 00000000`0385b5f0 00000000`60e43070 00000000`000003b8 : EmpScan!SetProxyStatsObject+0xc4af

     

    EMPScan and MPEngine are part of Malware Inspection as explained in this KB. This raises a flag that the good amount of CPU utilization by wspsr.exe process was caused by those 200 threads that were using functions that belongs to Malware Inspection.  Having those 200 threads were already suspicious, but it could be that server was just too busy during that time performing malware inspection. In order to move I verified if there was any thread locked in Critical Section. To do that the following command was used !cs -l and the result was:

     

    -----------------------------------------

    DebugInfo          = 0x0000000000343980

    Critical section   = 0x000000000032b500 (+0x32B500)

    LOCKED

    LockCount          = 0x0

    WaiterWoken        = No

    OwningThread       = 0x0000000000000b28

    RecursionCount     = 0x1

    LockSemaphore      = 0x0

    SpinCount          = 0x0000000000000000

    -----------------------------------------

    DebugInfo          = 0x00000000324861f0

    Critical section   = 0x00000000648f12d0 (+0x648F12D0)

    LOCKED

    LockCount          = 0x0

    WaiterWoken        = No

    OwningThread       = 0x0000000000000b1c

    RecursionCount     = 0x1

    LockSemaphore      = 0x0

    SpinCount          = 0x0000000000000fa0

    -----------------------------------------

    DebugInfo          = 0x00000000328a9200

    Critical section   = 0x000000005f1b8240 (+0x5F1B8240)

    LOCKED

    LockCount          = 0x0

    WaiterWoken        = No

    OwningThread       = 0x00000000000011d0

    RecursionCount     = 0x1

    LockSemaphore      = 0x0

    SpinCount          = 0x0000000000000fa0

    -----------------------------------------

     

    ….and many more…

     

    Now that we know the threads that are locked in Critical Section, we just need to get more details about those threads by using the command !cs –o <owning thread>, in this case the relevant thread was:

     

    0:000> !cs -o 0x000000005f1d9460

    -----------------------------------------

    Critical section   = 0x000000005f1d9460 (+0x5F1D9460)

    DebugInfo          = 0x00000000565fd4f0

    LOCKED

    LockCount          = 0x0

    WaiterWoken        = No

    OwningThread       = 0x000000000000194c

    RecursionCount     = 0x5

    LockSemaphore      = 0x43E80

    SpinCount          = 0x0000000000000000

    OwningThread DbgId = ~889s

    OwningThread Stack =

     

    00000000`78fde658 000007fe`f0c0a6c9 : 00000000`566fa638 00000000`78fde6c4 00000000`566fa638 00000000`00000000 : EmpFilter!DllUnregisterServer+0x1c6b7

    00000000`78fde660 000007fe`f0c22a52 : 00000000`566fa630 00000000`31eb9120 000007fe`f0bc9198 00000000`4a74d9d0 : EmpFilter!DllUnregisterServer+0x20431

    00000000`78fde7b0 000007fe`f0c15f53 : 00000000`31eb9100 00000000`566fa6fc 00000000`4a74d9d0 00000000`00000000 : EmpFilter!DllUnregisterServer+0x387ba

    00000000`78fde830 000007fe`f0c0c11b : 00000000`31eb9100 00000000`566fa6fc 00000000`4a74d9d0 00000000`4a74d9d0 : EmpFilter!DllUnregisterServer+0x2bcbb

    00000000`78fde8a0 000007fe`f0c172ee : 00000000`566fa630 00000000`00000000 00000000`566fa648 00000000`78fde960 : EmpFilter!DllUnregisterServer+0x21e83

    00000000`78fde920 000007fe`f0c0d7d4 : 00000000`566fa638 00000000`00000000 00000000`566fa740 00000000`00001355 : EmpFilter!DllUnregisterServer+0x2d056

    00000000`78fde9a0 000007fe`f0c12c67 : 00000000`566fa630 00000000`618a3ba0 00000000`00001355 00000000`4a74d6b0 : EmpFilter!DllUnregisterServer+0x2353c

    00000000`78fdeb20 000007fe`f0c1afad : 00000000`566fa630 00000000`618a3ba0 00000000`00001355 00000000`00000000 : EmpFilter!DllUnregisterServer+0x289cf

    00000000`78fdebd0 000007fe`f0be40ee : 00000000`6cfd8640 00000000`30f6eaf0 000007fe`f0bca490 000007fe`f0bd133f : EmpFilter!DllUnregisterServer+0x30d15

    00000000`78fdec60 000007fe`f0c3270b : 00000000`1af7a7b0 00000000`56cdf680 00000000`30f6eaf0 00000000`78fded64 : EmpFilter+0x240ee

    00000000`78fded40 00000000`72ebf77b : 000007fe`f0c5eb18 00000000`56cdf680 00000000`00000400 00000000`30f6eaf0 : EmpFilter!DllUnregisterServer+0x48473

    00000000`78fdeda0 00000000`72ed316d : 00000000`56cdf620 00000000`618a3ba0 00000000`00001355 00000000`00001355 : W3Filter!DllInstall+0x2419f

    00000000`78fdeec0 00000000`72eabe84 : 00000000`562b4480 00000000`618a3ba0 00000000`00001355 00000000`78fdefe0 : W3Filter!DllInstall+0x37b91

    00000000`78fdef30 00000000`72efcacd : 00000000`562b44b8 00000000`1ad135e0 00000000`618a3ba0 00000000`00001355 : W3Filter!DllInstall+0x108a8

    00000000`78fdefd0 00000000`72f031bc : 00000000`27109010 00000000`00000001 00000000`000003e5 00000000`00000000 : W3Filter!DllInstall+0x614f1

    00000000`78fdf040 00000000`72f038cb : 00000000`27109010 00000000`00000000 00000000`00000000 00000000`72e5d9d0 : W3Filter!DllInstall+0x67be0

    00000000`78fdf140 00000000`72f04420 : 00000000`27109010 00000000`27109030 00000000`00000000 00000000`565da0e0 : W3Filter!DllInstall+0x682ef

    00000000`78fdf180 00000000`72ede14f : 00000000`27109030 00000000`00001355 00000000`00000000 00000000`00000000 : W3Filter!DllInstall+0x68e44

    00000000`78fdf1c0 00000001`3fcc016e : 00000000`4a5c5e80 00000000`00000001 00000000`00000000 00000000`565da0e0 : W3Filter!DllInstall+0x42b73

    00000000`78fdf4a0 00000001`3fcbf557 : 00000000`3235bd50 00000000`00001355 00000000`00000000 00000000`00000001 : wspsrv!IsChainingRequired+0x582d2

     

    3. Resolution

     

    The resolution for this problem is documented on this article here at TMG Team Blog.

     

    4. Another Approach to Troubleshoot Issues of this Nature

     

    Another approach to troubleshoot high CPU utilization is by leveraging the perfmon capabilities to identify the threads that are using more resources within the process. For example, in this scenario the perfmon was showing the following result:

     

    image 

     

    Here you have two key counters within thread object

    ·         %Processor Time: specify how much processor this thread is using.

    ·         ID Thread: represents the ID of the thread that you can use to correlate that on the dump for this process.

     

    Using the recommendations from KB133385 we have:

     

    1.       Monitor each thread's %Processor Time and identify the instance number in Performance Monitor for the thread that takes the most processor time.

    2.       Find the ID Thread item that has the same instance number as the thread you identified in step 1 and select this ID Thread with the mouse. Note that its Last and Average status indicators have the same value. This value is the decimal thread ID.

    3.       Convert the decimal thread ID to a hexadecimal (hex) number to get the thread ID for desired component.

     

    With the HEX value of the thread ID in hands you can go back to the dump file and look which thread is that and what this thread is doing.

     

    Note:  there is a really good step by step on Marcelo’s blog. Also, make sure to read the Chapter 13 of the Windows NT Resource Kit (old buddy that still rocking).

     

     

  • Yuri Diogenes's Blog

    Reasons to Migrate from ISA 2006 to TMG (Portuguese WebCast Parts 2 and 3)

    • 0 Comments

    Now we have available parts 2 and 3 of this webcast that explains the core reasons to migrate from ISA 2006 to TMG 2010. We are working on having this series converted in English. Stay tune in this channel.

     

    Part 2

     

    Part 3

  • Yuri Diogenes's Blog

    Unable to Access Resource behind TMG after enabling NLB

    • 0 Comments

    Consider a scenario where the TMG administrator is publishing servers that are behind TMG and after enabling NLB on the External interface the users are not able to access those resources. If he uses the DIP (Dedicated IP) to publish the resource it works. The basic diagram is showed in the figure below:

    image

    The traffic from inside (VLAN E) to outside resource (for example to VLAN A) using the internal VIP (10.10.10.1) as default gateway was working, however the incoming traffic (from VLAN A for instance) to the external VIP (192.168.4.10) was failing.

    Troubleshooting

    There is no clever troubleshooting here, always start from the basics in scenarios like this, which basic? Try enabling live logging and see the traffic is ever arriving to TMG in the first place. In this case was not but, to be in the safe side we installed netmon in all three nodes to see if the traffic was hitting the box at all. It was not, no traffic arriving on TMG.

    Resolution

    After reviewing the core switch we were able to see that the NLB MAC address in there was wrong, hence the traffic was never arriving on TMG. This was a Cisco 6500 Switch Series and we follow Cisco’s recommendation below in order to address this issue:

    Cisco Catalyst 6500 Series Switches - Catalyst Switches for Microsoft Network Load Balancing Configuration Example
    http://www.cisco.com/en/US/products/hw/switches/ps708/products_configuration_example09186a0080a07203.shtml

    Additional Info

    TMG NLB leverages Windows NLB and all the recommendations that we have on Windows NLB applies to TMG NLB. In other words, if Windows NLB has restrictions, TMG will obey those restrictions. Here are some important points to remember when implementing NLB:

    Switch is operating in Layer-3 mode

    NLB is not supported when the hosts are homed to a switch operating at Layer-3. Instead, create a VLAN for all the nodes in the NLB cluster, and configure that VLAN to operate in Layer-2 mode.

    An unusual number of TCP connections to the cluster are being reset.

    Possible Causes:

    • The switch to which the cluster hosts are connected may have learned the cluster MAC address, though this is rare.  This can cause TCP traffic to be delivered to the wrong NLB host, resulting in a connection reset (unicast mode only). See the section Switch is learning the MAC Address for details.
    • The switch to which the cluster hosts are connected is a layer 3 switch.  NLB requires layer 2 switching.


    See more “gotchas” at http://download.microsoft.com/download/3/2/3/32386822-8fc5-4cf1-b81d-4ee136cca2c5/NLB_Troubleshooting_Guide.htm

    …and always remember the 5 commandments when troubleshooting NLB on TMG:

    1. Never assume that the problem is on TMG in the first place.
    2. Never think that because your network infrastructure worked fine for years without problems that is free of issues.
    3. Do not take it personal if someone says: you’ve got a problem on your switch.
    4. Always have a hub available for testing purpose, sometimes having a dumb device to validate NLB functionality can save much more time than dealing with smart devices for hours.
    5. NLB Unicast causes switch flooding, don’t be surprise by having flooding after enabling NLB on TMG and blame TMG, this is how Windows NLB works.

Page 1 of 1 (5 items)