Welcome to TechNet Blogs Sign in | Join | Help

Mfartura's blog

Marcelo Fartura's blog about debugging and general troubleshooting
How to manually translate virtual addresses into physical ones

In previous posts we talked about virtual address space and how virtual memory is managed.  I've never posted anything about virtual address translation though, and for the ones interested on the details behing this operation I recommend reading the chapter 7 - Memory Management (specifically the Address Translation section) of the Windows Internals 4th Edition book.  So I won't go too deep on the details of how the system does the translation itself, my goal here is just how to show you how easily translate from one address to another once you have a memory dump or a live debugging session.

Once you choose the virtual address you want to do the translation the first step is to run the command !pte and find out some additional information for the corresponding PTE.  For this example I'm going to use the virtual address used as a base address to load the tcpip driver itself on a 32-bit Windows 2003 SP2 system:

 

0: kd> lmvm tcpip

start    end        module name

f7625000 f7686000   tcpip      (deferred)            

    Image path: \SystemRoot\system32\DRIVERS\tcpip.sys

    Image name: tcpip.sys

    Timestamp:        Thu Mar 24 19:40:31 2005 (42435DFF)

    CheckSum:         0005606F

    ImageSize:        00061000

Above highlighted in yellow we have the base address where the tcpip driver has been loaded (virtual address).  Now we run the command !pte passing this address as an argument:

0: kd> !pte f7625000

               VA f7625000

PDE at   C0300F74        PTE at C03DD894

contains 00ACC963      contains 1C055963

pfn acc -G-DA--KWEV    pfn 1c055 -G-DA—KWEV

 

The highlighted information above is about the PFN correlated to the physical address associated with the  virtual address we passed as an input to !pte.  However you noticed that 1c055 uses up to 20 bits (which are used to locate the PDE and PTE itself which will point to the physical page) and we know the completed addresses are 32 bits wide.  The last 12 bits on the translation process are used as a byte index to the physical page where the content is actually stored, so they come straight from the virtual address.

 

The final physical address will then be 1c055 followed by the last 12 bits of the virtual address which in this case is 000.  So putting them together we have that 1c055000 is the physical address correlated to the virtual address f7652000.  Let’s verify that by dumping both address and comparing their content:

 

Dumping the virtual address first:

0: kd> dc f7625000

f7625000  00905a4d 00000003 00000004 0000ffff  MZ..............

f7625010  000000b8 00000000 00000040 00000000  ........@.......

f7625020  00000000 00000000 00000000 00000000  ................

f7625030  00000000 00000000 00000000 000000e0  ................

f7625040  0eba1f0e cd09b400 4c01b821 685421cd  ........!..L.!Th

f7625050  70207369 72676f72 63206d61 6f6e6e61  is program canno

f7625060  65622074 6e757220 206e6920 20534f44  t be run in DOS

f7625070  65646f6d 0a0d0d2e 00000024 00000000  mode....$.......

 

Now let’s dump the physical address and compare the results:

0: kd> dc /p 1c055000

1c055000  00905a4d 00000003 00000004 0000ffff  MZ..............

1c055010  000000b8 00000000 00000040 00000000  ........@.......

1c055020  00000000 00000000 00000000 00000000  ................

1c055030  00000000 00000000 00000000 000000e0  ................

1c055040  0eba1f0e cd09b400 4c01b821 685421cd  ........!..L.!Th

1c055050  70207369 72676f72 63206d61 6f6e6e61  is program canno

1c055060  65622074 6e757220 206e6920 20534f44  t be run in DOS

1c055070  65646f6d 0a0d0d2e 00000024 00000000  mode....$.......

 

The content is identical as expected J.  

 

A new phase...

Even though this has been a strictly technical channel (I've never posted about anything else other than technical stuff) I know a lot of friends read it and so I decided to open an exception on this post and use it to publish something about me that also has a direct effect on how often and what type of content I will be able to keep publishing here.... Effective this week I started a new phase on my career at Microsoft assuming a non-technical role.  I will be working as a people manager for the same group I had once worked before and have a great admiration. 

As a manager I will need to change my focus to better serve my team and make sure they have everything they need from me to succeed.  I'm still passionate about technology as I think I will be forever, but keeping up with the level of knowledge of our engineers here requires a lot of dedication which from now on will be focused on other matters other than only tecnology.  I'm really excited about this new phase and have no doubt it happened at the right time for me. 

 

Kernel dump analysis - Bugcheck 0xA (IRQL_NOT_LESS_OR_EQUAL)

Yet another kernel memory dump to be analyzed - The bugcheck this time is the 0xA - IRQL_NOT_LESS_OR_EQUAL.   To better understand what this message means we would need a little background on Windows Internals but basically when executing anything at a interrupt request level (IRQL) = 2 or higher (in normal circumstances instructions get executed at IRQL = 0) we can't page fault.   On the situation of this specific dump, we would crash even on IRQL = 0 since we're trying to write to the memory address 0x0 which is in user mode address space and is reserved (the first useable address in the user mode address space is 0x00010000)

 

Let's jump to the analysis.

  

Bugcheck info:

0: kd> .bugcheck

Bugcheck code 0000000A

Arguments 00000000 00000002 00000001 8087bb19

                                |              |                    |                        |à This is the failing instruction address

                                |              |                    |à 0x1 means WRITE, 0x0 means READ, so it crashed when trying to WRITE to 0x0

                                |              |à This is the IRQL at which problem happened (IRQL=2)

                                |à This is the memory address where the problem happened

 

So it crashed basically because it failed to access a memory address (which at a lower IRQLs would result in a page fault) on the IRQL = 2.  At IRQL = 2 and above page faults are not allowed to happen.

 

Current thread stack:

 

0: kd> kvnL

 # ChildEBP RetAddr  Args to Child             

00 b7d6ebbc 8087bb19 badb0d00 00000000 00000000 nt!_KiTrap0E+0x2a7 (FPO: [0,0] TrapFrame @ b7d6ebbc)

01 b7d6ec40 80972b03 f7ca7008 80a56be4 00000000 nt!ExDeleteResourceLite+0x1f (FPO: [Non-Fpo]) (CONV: stdcall)

02 b7d6ec5c 80932ca2 ef4d0860 ef4d0848 00000000 nt!SepTokenDeleteMethod+0x9d (FPO: [Non-Fpo]) (CONV: stdcall)

03 b7d6ec74 8086c1a5 ef4d0860 00000000 8b478df0 nt!ObpRemoveObjectRoutine+0xde (FPO: [Non-Fpo]) (CONV: stdcall)

04 b7d6ec94 b7765ada f7ca3444 b7764007 f7ca8460 nt!ObfDereferenceObject+0x67 (FPO: [Non-Fpo]) (CONV: fastcall)

WARNING: Stack unwind information not available. Following frames may be wrong.

05 b7d6ecb0 b776037a f7cbf004 b7d6ece0 f82cde28 naiavf5x+0xcada

06 b7d6ecf0 b775b520 f82cde28 f82cde28 8b572548 naiavf5x+0x737a

07 b7d6ed04 8081dcdf 8b478990 f82cde28 f82cde38 naiavf5x+0x2520

08 b7d6ed18 808f8be4 00000000 00000000 f7ca8448 nt!IofCallDriver+0x45 (FPO: [Non-Fpo]) (CONV: fastcall)

09 b7d6ed50 80932ca2 00ca8460 f7ca8460 00000001 nt!IopDeleteFile+0x13a (FPO: [Non-Fpo]) (CONV: stdcall)

0a b7d6ed68 80932cec f7ca8460 00000001 8ad47770 nt!ObpRemoveObjectRoutine+0xde (FPO: [Non-Fpo]) (CONV: stdcall)

0b b7d6ed80 8087f92f 00000000 00000000 8ad47770 nt!ObpProcessRemoveObjectQueue+0x36 (FPO: [1,0,0]) (CONV: stdcall)

0c b7d6edac 80948bd0 00000000 00000000 00000000 nt!ExpWorkerThread+0xeb (FPO: [Non-Fpo]) (CONV: stdcall)

0d b7d6eddc 8088d4e2 8087f844 80000000 00000000 nt!PspSystemThreadStartup+0x2e (FPO: [Non-Fpo]) (CONV: stdcall)

0e 00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16

 

The current frame being executed is a post-exception frame, I mean, it's just handling the situation.  The problem really happened here:

 

0: kd> .trap b7d6ebbc

ErrCode = 00000002

eax=00000000 ebx=ef4d0848 ecx=00000000 edx=00000000 esi=f7ca7008 edi=00000000

eip=8087bb19 esp=b7d6ec30 ebp=b7d6ec40 iopl=0         nv up ei ng nz na po nc

cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010282

nt!ExDeleteResourceLite+0x1f:

8087bb19 8901            mov     dword ptr [ecx],eax  ds:0023:00000000=????????

0: kd> kvnL

  *** Stack trace for last set context - .thread/.cxr resets it

 # ChildEBP RetAddr  Args to Child             

00 b7d6ec40 80972b03 f7ca7008 80a56be4 00000000 nt!ExDeleteResourceLite+0x1f (FPO: [Non-Fpo]) (CONV: stdcall)

01 b7d6ec5c 80932ca2 ef4d0860 ef4d0848 00000000 nt!SepTokenDeleteMethod+0x9d (FPO: [Non-Fpo]) (CONV: stdcall)

02 b7d6ec74 8086c1a5 ef4d0860 00000000 8b478df0 nt!ObpRemoveObjectRoutine+0xde (FPO: [Non-Fpo]) (CONV: stdcall)

03 b7d6ec94 b7765ada f7ca3444 b7764007 f7ca8460 nt!ObfDereferenceObject+0x67 (FPO: [Non-Fpo]) (CONV: fastcall)

WARNING: Stack unwind information not available. Following frames may be wrong.

04 b7d6ecb0 b776037a f7cbf004 b7d6ece0 f82cde28 naiavf5x+0xcada

05 b7d6ecf0 b775b520 f82cde28 f82cde28 8b572548 naiavf5x+0x737a

06 b7d6ed04 8081dcdf 8b478990 f82cde28 f82cde38 naiavf5x+0x2520

07 b7d6ed18 808f8be4 00000000 00000000 f7ca8448 nt!IofCallDriver+0x45 (FPO: [Non-Fpo]) (CONV: fastcall)

08 b7d6ed50 80932ca2 00ca8460 f7ca8460 00000001 nt!IopDeleteFile+0x13a (FPO: [Non-Fpo]) (CONV: stdcall)

09 b7d6ed68 80932cec f7ca8460 00000001 8ad47770 nt!ObpRemoveObjectRoutine+0xde (FPO: [Non-Fpo]) (CONV: stdcall)

0a b7d6ed80 8087f92f 00000000 00000000 8ad47770 nt!ObpProcessRemoveObjectQueue+0x36 (FPO: [1,0,0]) (CONV: stdcall)

0b b7d6edac 80948bd0 00000000 00000000 00000000 nt!ExpWorkerThread+0xeb (FPO: [Non-Fpo]) (CONV: stdcall)

0c b7d6eddc 8088d4e2 8087f844 80000000 00000000 nt!PspSystemThreadStartup+0x2e (FPO: [Non-Fpo]) (CONV: stdcall)

0d 00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16

 

So it's crashing when trying to write to the address pointed by ECX - mov dword ptr [ecx], eax - This is the assembly code of the function where it crashed from the start to the crash instruction :

 

0: kd> u nt!ExDeleteResourceLite nt!ExDeleteResourceLite+0x1f + 1

nt!ExDeleteResourceLite [d:\nt\base\ntos\ex\resource.c @ 2250]:

8087bafa 8bff            mov     edi,edi

8087bafc 55              push    ebp

8087bafd 8bec            mov     ebp,esp

8087baff 83ec0c          sub     esp,0Ch

8087bb02 56              push    esi

8087bb03 8d55f4          lea     edx,[ebp-0Ch]

8087bb06 b9c0418b80      mov     ecx,offset nt!ExpResourceSpinLock (808b41c0)

8087bb0b ff1508118080    call    dword ptr [nt!_imp_KeAcquireInStackQueuedSpinLock (80801108)]

8087bb11 8b7508          mov     esi,dword ptr [ebp+8]  ßESI is obtaining its value from the content of EBP+8

8087bb14 8b4e04          mov     ecx,dword ptr [esi+4]  ß ECX is obtaining its value from the content of ESI+4

8087bb17 8b06            mov     eax,dword ptr [esi]

8087bb19 8901            mov     dword ptr [ecx],eax  ß This is where it crashed

 

 

ECX is getting its value from the address pointed by ESI + 4, and ESI is getting its value from the address pointed by EBP + 8 which is a pointer to the first parameter passed from the previous function on the stack.  Confirming:

 

0: kd> dd esi+4 l1

f7ca700c  00000000

0: kd> dd poi(ebp+8) l1

f7ca7008  00000000

 

Next step is look to the previous function on the stack.  The assembly for the next function up to the point it calls the current one is this:

 

 

0: kd> u nt!SepTokenDeleteMethod nt!SepTokenDeleteMethod+0x9d

nt!SepTokenDeleteMethod [d:\nt\base\ntos\se\token.c @ 2652]:

80972a66 8bff            mov     edi,edi

80972a68 55              push    ebp

80972a69 8bec            mov     ebp,esp

80972a6b 51              push    ecx

80972a6c 51              push    ecx

80972a6d 56              push    esi

80972a6e 8b7508          mov     esi,dword ptr [ebp+8] ß ESI is coming from the content  EBP + 8  (first parameter passed from the previous function)

80972a71 f6868800000020  test    byte ptr [esi+88h],20h

80972a78 57              push    edi

80972a79 7538            jne     nt!SepTokenDeleteMethod+0x4d (80972ab3)

80972a7b 8bbe94000000    mov     edi,dword ptr [esi+94h]

80972a81 53              push    ebx

80972a82 8d4f0c          lea     ecx,[edi+0Ch]

80972a85 eb0f            jmp     nt!SepTokenDeleteMethod+0x30 (80972a96)

80972a87 8d42ff          lea     eax,[edx-1]

80972a8a 8bd8            mov     ebx,eax

80972a8c 8bc2            mov     eax,edx

80972a8e f00fb119        lock cmpxchg dword ptr [ecx],ebx

80972a92 3bc2            cmp     eax,edx

80972a94 741c            je      nt!SepTokenDeleteMethod+0x4c (80972ab2)

80972a96 8b11            mov     edx,dword ptr [ecx]

80972a98 83fa01          cmp     edx,1

80972a9b 75ea            jne     nt!SepTokenDeleteMethod+0x21 (80972a87)

80972a9d 8b4704          mov     eax,dword ptr [edi+4]

80972aa0 8945f8          mov     dword ptr [ebp-8],eax

80972aa3 8b4708          mov     eax,dword ptr [edi+8]

80972aa6 8945fc          mov     dword ptr [ebp-4],eax

80972aa9 8d45f8          lea     eax,[ebp-8]

80972aac 50              push    eax

80972aad e890ecffff      call    nt!SepDeReferenceLogonSession (80971742)

80972ab2 5b              pop     ebx

80972ab3 8d4638          lea     eax,[esi+38h]

80972ab6 8b08            mov     ecx,dword ptr [eax]

80972ab8 0b4804          or      ecx,dword ptr [eax+4]

80972abb 6a00            push    0

80972abd 5f              pop     edi

80972abe 7407            je      nt!SepTokenDeleteMethod+0x61 (80972ac7)

80972ac0 57              push    edi

80972ac1 50              push    eax

80972ac2 e8b5140000      call    nt!SepModifyTokenPolicyCounter (80973f7c)

80972ac7 8b4678          mov     eax,dword ptr [esi+78h]

80972aca 3bc7            cmp     eax,edi

80972acc 7407            je      nt!SepTokenDeleteMethod+0x6f (80972ad5)

80972ace 57              push    edi

80972acf 50              push    eax

80972ad0 e86ff8f1ff      call    nt!ExFreePoolWithTag (80892344)

80972ad5 8b868c000000    mov     eax,dword ptr [esi+8Ch]

80972adb 3bc7            cmp     eax,edi

80972add 7406            je      nt!SepTokenDeleteMethod+0x7f (80972ae5)

80972adf 50              push    eax

80972ae0 e85766ffff      call    nt!SeFreeCapturedObjectTypeList (8096913c)

80972ae5 8b8690000000    mov     eax,dword ptr [esi+90h]

80972aeb 3bc7            cmp     eax,edi

80972aed 7407            je      nt!SepTokenDeleteMethod+0x90 (80972af6)

80972aef 57              push    edi

80972af0 50              push    eax

80972af1 e84ef8f1ff      call    nt!ExFreePoolWithTag (80892344)

80972af6 8b4630          mov     eax,dword ptr [esi+30h]  ß EAX is coming from the content of ESI + 0x30

80972af9 3bc7            cmp     eax,edi

80972afb 740f            je      nt!SepTokenDeleteMethod+0xa6 (80972b0c)

80972afd 50              push    eax  ß EAX is pushed on the stack as the first parameter

80972afe e8f78ff0ff      call    nt!ExDeleteResourceLite (8087bafa)

 

So once again we need to look at the previous function since this one is also receiving the bad value as parameter.  By revisiting the stack we see the first parameter it passes is also received as below:

 

0: kd> kbL5

ChildEBP RetAddr  Args to Child             

b7d6ec40 80972b03 f7ca7008 80a56be4 00000000 nt!ExDeleteResourceLite+0x1f

b7d6ec5c 80932ca2 ef4d0860 ef4d0848 00000000 nt!SepTokenDeleteMethod+0x9d

b7d6ec74 8086c1a5 ef4d0860 00000000 8b478df0 nt!ObpRemoveObjectRoutine+0xde

b7d6ec94 b7765ada f7ca3444 b7764007 f7ca8460 nt!ObfDereferenceObject+0x67

WARNING: Stack unwind information not available. Following frames may be wrong.

b7d6ecb0 b776037a f7cbf004 b7d6ece0 f82cde28 naiavf5x+0xcada

 

 

As we can go directly to the function nt!ObfDereferenceObject+0x67 and try to see where its setting up that parameter from.  The assembly code from this function to the point it calls the next one is this:

 

0: kd> u  nt!ObfDereferenceObject nt!ObfDereferenceObject+0x67 ß This is a fastcall function so the first two parameters are passed through the registers ECX and EDX instead of using the stack (EBP as a reference)

nt!ObfDereferenceObject [d:\nt\base\ntos\ob\obref.c @ 2441]:

8086c13e 8bff            mov     edi,edi

8086c140 55              push    ebp

8086c141 8bec            mov     ebp,esp

8086c143 51              push    ecx

8086c144 803de0088a8000  cmp     byte ptr [nt!ObpTraceEnabled (808a08e0)],0

8086c14b 53              push    ebx

8086c14c 56              push    esi

8086c14d 57              push    edi

8086c14e 894dfc          mov     dword ptr [ebp-4],ecx ß The location pointed by EBP-0x4 which is local variable is receiving the value of ECX which is the first parameter passed to this function

8086c151 8d71e8          lea     esi,[ecx-18h]

8086c154 7408            je      nt!ObfDereferenceObject+0x20 (8086c15e)

8086c156 6a00            push    0

8086c158 56              push    esi

8086c159 e8c8fcffff      call    nt!ObpPushStackInfo (8086be26)

8086c15e 83cbff          or      ebx,0FFFFFFFFh

8086c161 f00fc11e        lock xadd dword ptr [esi],ebx

8086c165 4b              dec     ebx

8086c166 7547            jne     nt!ObfDereferenceObject+0x71 (8086c1af)

8086c168 8b3d90108080    mov     edi,dword ptr [nt!_imp__KeGetCurrentIrql (80801090)]

8086c16e ffd7            call    edi

8086c170 64a124010000    mov     eax,dword ptr fs:[00000124h]

8086c176 6683787200      cmp     word ptr [eax+72h],0

8086c17b 752c            jne     nt!ObfDereferenceObject+0x6b (8086c1a9)

8086c17d ffd7            call    edi

8086c17f 3c01            cmp     al,1

8086c181 7326            jae     nt!ObfDereferenceObject+0x6b (8086c1a9)

8086c183 803de0088a8000  cmp     byte ptr [nt!ObpTraceEnabled (808a08e0)],0

8086c18a 740f            je      nt!ObfDereferenceObject+0x5d (8086c19b)

8086c18c 833dd8078a8000  cmp     dword ptr [nt!ObpTraceNoDeregister (808a07d8)],0

8086c193 7506            jne     nt!ObfDereferenceObject+0x5d (8086c19b)

8086c195 56              push    esi

8086c196 e805fcffff      call    nt!ObpDeregisterObject (8086bda0)

8086c19b 6a00            push    0

8086c19d ff75fc          push    dword ptr [ebp-4] ß Here it is pushing the content of EBP-4 which we know is coming from ECX (first parameter) as a first parameter to the next function

8086c1a0 e81f6a0c00      call    nt!ObpRemoveObjectRoutine (80932bc4)  ß this is the function call to the next function

 

So we know the bad value is coming from the previous function on the stack which is naiavf5x+0xcada.  By looking at the assembly for that we have this:

 

0: kd> u b7765ac1-10 b7765ad5+2

naiavf5x+0xcab1:

b7765ab1 ff1550a476b7    call    dword ptr [naiavf5x+0x11450 (b776a450)]

b7765ab7 5f              pop     edi

b7765ab8 8bc6            mov     eax,esi

b7765aba 5e              pop     esi

b7765abb 5b              pop     ebx

b7765abc 5d              pop     ebp

b7765abd c20800          ret     8

b7765ac0 56              push    esi

b7765ac1 8bf1            mov     esi,ecx  ß ESI is coming from ECX

b7765ac3 8d86a8000000    lea     eax,[esi+0A8h]

b7765ac9 50              push    eax

b7765aca e8818affff      call    naiavf5x+0x5550 (b775e550)

b7765acf 85c0            test    eax,eax

b7765ad1 7516            jne     naiavf5x+0xcae9 (b7765ae9)

b7765ad3 8bce            mov     ecx,esi ß ECX which will be the first parameter to the next function is coming from ESI

b7765ad5 e8b6f7ffff      call    naiavf5x+0xc290 (b7765290)

 

 

CONCLUSION:

So at this point we can’t move further since we have no symbols for the NAIAVF5X.SYS driver – which is a file system filter driver from McAfee.  The conclusion would be that this driver is bad until the  McAfee analyze this same dump and prove us wrong J.  Here is some addition info about the driver:

 

0: kd> lmvm naiavf5x

start    end        module name

b7759000 b776d440   naiavf5x   (no symbols)          

    Loaded symbol image file: naiavf5x.sys

    Image path: \SystemRoot\system32\drivers\naiavf5x.sys

    Image name: naiavf5x.sys

    Timestamp:        Mon Aug 04 20:08:00 2003 (3F2F0370) ß By the way, this is old stuff.  They might have already corrected this on newer versions of the driver and I wouldn’t be surprised to find some documentation about  this on McAfee’s web site.

 

    CheckSum:         00020D5B

    ImageSize:        00014440

    Translations:     0000.04b0 0000.04e0 0409.04b0 0409.04e0

 

By the way, for this dump, “!analyze –v” does the job and points to the right guilty driver.

 

Until the next dump analysis, I mean, next post :)

Kernel dump analysis - Bugcheck 1E (KMODE_EXCEPTION_NOT_HANDLED)

It’s been a long time since my last post, but for some reason lately I’ve been receiving so many nice feedbacks about the blog and the other posts that I feel really motivated again to post a new article.  See how important your feedback is for me? J

Ok, normally I post about user mode debugging, but last week I was working on developing some content and I got this kernel memory dump I decided to take the risk of analyzing it, even considering Kernel debugging is not my specialty. 

Normally the first step when you’re analyzing dumps resultant from Blue Screens (I will write about the differences between kernel mode dumps and user mode dumps in another post) is to run the command !analyze –v.  Well, for this specific dump, !analyze –v is just not pointing to the right driver, at least I did not think so and that’s what motivated me to make a manual analysis of this specific dump.  Ok, too much said and very little action here… Let’s start with some hands on!

 

After opening the dump with Windbg.exe, setting the correct symbol path and running the command !analyze –v, this is what we have in summary:

 

KMODE_EXCEPTION_NOT_HANDLED (1e)

Arguments:

Arg1: c0000005, The exception code that was not handled

Arg2: a012a6c5, The address that the exception occurred at

Arg3: 00000000, Parameter 0 of the exception

Arg4: 00000014, Parameter 1 of the exception

 

FAULTING_IP:

win32k!DdDxApiOpenCaptureDevice+18

a012a6c5 8b5814          mov     ebx,dword ptr [eax+14h]

EXCEPTION_PARAMETER1:  00000000

EXCEPTION_PARAMETER2:  00000014

READ_ADDRESS:  00000014

DEFAULT_BUCKET_ID:  DRIVER_FAULT

BUGCHECK_STR:  0x1E

 

FOLLOWUP_IP:

Dxapi!DxApi+43

bdae59f7 8bc6            mov     eax,esi

SYMBOL_STACK_INDEX:  3

SYMBOL_NAME:  Dxapi!DxApi+43

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: Dxapi

IMAGE_NAME:  Dxapi.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  382b8097

FAILURE_BUCKET_ID:  0x1E_Dxapi!DxApi+43

BUCKET_ID:  0x1E_Dxapi!DxApi+43

Followup: MachineOwner

 

So it’s telling us a lot of useful information like what was the bugcheck (1E), what kind of unhandled exception happened (Access Violation – 0xC0000005) and what was the instruction pointer (IP) when the exception happened.  In this case it was address 0xa012a6c5.  It’s also pointing to the driver Dxapi.sys as the probable guilty for the problem.  Well, my curiosity on that was exactly because the instruction pointer was pointing to an address that doesn’t belong to Dxapi.sys.  Let’s check that:

kd> lmvm Dxapi

start    end        module name

bdae4000 bdae6640   Dxapi      (pdb symbols)          c:\debuggers\sym\dxapi.pdb\382B80971\dxapi.pdb

    Loaded symbol image file: c:\debuggers\sym\Dxapi.dbg\382B80972640\Dxapi.dbg

    Image path: \SystemRoot\System32\DRIVERS\Dxapi.sys

    Image name: Dxapi.sys

    Timestamp:        Fri Nov 12 00:51:03 1999 (382B8097)

    CheckSum:         0000730A

    ImageSize:        00002640

    File version:     5.0.2180.1

    Product version:  5.0.2180.1

    File flags:       0 (Mask 3F)

    File OS:          40004 NT Win32

    File type:        3.7 Driver

    File date:        00000000.00000000

    Translations:     0409.04b0

    CompanyName:      Microsoft Corporation

    ProductName:      Microsoft(R) Windows (R) 2000 Operating System

    InternalName:     dxapi.sys

    OriginalFilename: dxapi.sys

    ProductVersion:   5.00.2180.1

    FileVersion:      5.00.2180.1

    FileDescription:  DirectX API Driver

    LegalCopyright:   Copyright (C) Microsoft Corp. 1981-1999

 

By checking the base address (yellow) of the module and its end address (green) we notice that the address for the instruction pointers is not in between them.  Actually the instruction point address was from Win32k.sys:

 

kd> lmvm win32k

start    end        module name

a0000000 a01a5e60   win32k     (pdb symbols)          c:\debuggers\sym\win32k.pdb\38E910C02\win32k.pdb

    Loaded symbol image file: c:\debuggers\sym\win32k.dbg\3947E2231a5e60\win32k.dbg

    Image path: \??\D:\WINNT\system32\win32k.sys

    Image name: win32k.sys

    Timestamp:        Wed Jun 14 16:50:59 2000 (3947E223)

    CheckSum:         001B02D1

    ImageSize:        001A5E60

    Translations:     0000.04b0 0000.04e0 0409.04b0 0409.04e0

 

So the address 0xa012a6c5 is indeed in between those addresses which means it was an instruction from this driver (Win32k.sys) that tried to an illegal operation.  But let’s wait a little bit before blaming the poor kernel image of the Win32 Subsystem (This is basically what the Win32k.sys is).  Let’s check our assembly code and try to see what really happened here…

 

Checking our thread’s stack:

 

kd> k

ChildEBP RetAddr 

bd2858a0 80463b2b nt!KiDispatchException+0x30e

bd285908 80463add nt!CommonDispatchException+0x4d

bd285908 a012a6c5 nt!KiUnexpectedInterruptTail+0x1f4

bd2859b4 f0593a8e Dxapi!DxApi+0x43

WARNING: Stack unwind information not available. Following frames may be wrong.

bd285a14 f0592bb4 3dfxV3TV+0x3a8e

bd285a34 f05928cc 3dfxV3TV+0x2bb4

bd285a4c f05a0cfd 3dfxV3TV+0x28cc

00000000 00000000 STREAM!SCIssueRequestToDevice+0xad

 

This actually happened after the exception happened – It was dispatching the exception.  The original context (call stack and registers) was actually saved on a trap frame.  From the debugger help we get the information which states that the third parameter passed to the function KiDispatchException is a pointer to our trap frame.  The command .trap will load the context saved on the trap.

 

kd> kb1

ChildEBP RetAddr  Args to Child             

bd2858a0 80463b2b bd2858bc 00000000 bd285910 nt!KiDispatchException+0x30e

 

kd> .trap bd285910

ErrCode = 00000000

eax=00000000 ebx=84f0ba10 ecx=a012a6ad edx=00000020 esi=bd2859dc edi=80d724a8

eip=a012a6c5 esp=bd285984 ebp=bd2859a0 iopl=0         nv up ei pl zr na pe nc

cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010246

win32k!DdDxApiOpenCaptureDevice+0x18:

a012a6c5 8b5814          mov     ebx,dword ptr [eax+14h] ds:0023:00000014=????????

 

Ok, now we have the right context where the exception happened.  Notice that our current EIP on this context matches the one pointed by the !analyze –v as the one where the exception happened (in red above).  We basically crashed because we’re trying to move the content pointed by EAX+0x14 to EBX.  Since EAX = 0 (green above) we failed when trying to read from the invalid address 0x00000014 (which is the resultant address from 0x0 + 0x14).

 

Let’s look at the assembly code that execute immediately before this a012a6c5.

 

kd> ub a012a6c8

win32k!DdDxApiOpenCaptureDevice+0x7:

a012a6b4 56              push    esi

a012a6b5 8b7508          mov     esi,dword ptr [ebp+8]

a012a6b8 57              push    edi

a012a6b9 8365fc00        and     dword ptr [ebp-4],0

a012a6bd 8b06            mov     eax,dword ptr [esi]

a012a6bf 8b7814          mov     edi,dword ptr [eax+14h]

a012a6c2 8b4604          mov     eax,dword ptr [esi+4]

a012a6c5 8b5814          mov     ebx,dword ptr [eax+14h] ß This is where we crashed

 

 

So we need to find out on the assembly block above where EAX gets set to 0x0. 

 

kd> ub a012a6c8

win32k!DdDxApiOpenCaptureDevice+0x7:

a012a6b4 56              push    esi

a012a6b5 8b7508          mov     esi,dword ptr [ebp+8]

a012a6b8 57              push    edi

a012a6b9 8365fc00        and     dword ptr [ebp-4],0

a012a6bd 8b06            mov     eax,dword ptr [esi] 

a012a6bf 8b7814          mov     edi,dword ptr [eax+14h]

 

a012a6c2 8b4604          mov     eax,dword ptr [esi+4] ß This is where EAX gets updated for the last time before the crash

 

a012a6c5 8b5814          mov     ebx,dword ptr [eax+14h]

 

 

So EAX is actually receiving its value from the content of ESI+4.  Following the same logic we need to find out where ESI is getting updated for the last time before being involved on the instruction with EAX.

 

 

kd> ub a012a6c8

win32k!DdDxApiOpenCaptureDevice+0x7:

a012a6b4 56              push    esi

 

a012a6b5 8b7508          mov     esi,dword ptr [ebp+8] ß This where ESI gets updated for the last time before the instruction at a012a6c2

 

a012a6b8 57              push    edi

a012a6b9 8365fc00        and     dword ptr [ebp-4],0

a012a6bd 8b06            mov     eax,dword ptr [esi] 

a012a6bf 8b7814          mov     edi,dword ptr [eax+14h]

a012a6c2 8b4604          mov     eax,dword ptr [esi+4]

a012a6c5 8b5814          mov     ebx,dword ptr [eax+14h]

 

So ESI is getting its value from the content of EBP+8.  EBP+8 happens to be a fixed position on the stack used as a reference for the first parameter passed to the function (considering the calling convention used is not fast call).  EBP+8 being the first parameter, EBP+0xC the second, EBP+0x10 the third and so on… There are other pre-defined positions on the stack you should be aware of like EBP+4 is where the return address gets saved or EBP-4 for the first local variable etc, but this is subject for another post…

 

Let’s confirm our theory:

 

kd> dd poi(ebp+8)+4 l1

bd2859e0  00000000

 

kd> dd ebp+8 l1

bd2859a8  bd2859dc

 

Ok, so this is really where the 0x0 on EAX is coming from.  At this point we exclude Win32K.sys from our list of suspects after all it’s just receiving an invalid parameter being passed from the previous function on the stack:

 

kd> kb3

ChildEBP RetAddr  Args to Child             

bd2859a0 bdae59f7 bd2859dc bd285a0c 00028bb1 win32k!DdDxApiOpenCaptureDevice+0x18

bd2859b4 f0593a8e 00000513 bd2859dc 00000020 Dxapi!DxApi+0x43

bd285a14 f0592bb4 81c83228 84f0ba10 822efac8 3dfxV3TV+0x3a8e

 

As we can see from the stack above, the first parameter passed to the Win32k!DdDxApiOpenCaptureDevice is really the one with the corrupted value.  The parameter is being passed from Dxapi!dDxApi.

 

Now take at that stack again…  Pay attention to the details on the stack..  Can you give me a reason for excluding DxApi.sys from our suspect list also?  No?  Ok, let’s do it on the funniest way then J,  shall we?

 

Now we need to look at the assembly code of Dxapi!Dxapi.  Here is the entire assembly block for such function:

 

kd> uf Dxapi!DxApi

Dxapi!DxApi:

bdae59b4 55              push    ebp

bdae59b5 8bec            mov     ebp,esp

bdae59b7 8b4508          mov     eax,dword ptr [ebp+8]

bdae59ba 56              push    esi

bdae59bb 2d00050000      sub     eax,500h

bdae59c0 33f6            xor     esi,esi

bdae59c2 83f816          cmp     eax,16h

bdae59c5 7330            jae     Dxapi!DxApi+0x43 (bdae59f7)

 

Dxapi!DxApi+0x13:

bdae59c7 8d0440          lea     eax,[eax+eax*2]

bdae59ca c1e002          shl     eax,2

 

bdae59cd 8b88005baebd    mov     ecx,dword ptr Dxapi!gDxApiEntryPoint (bdae5b00)[eax]

Here we’re loading ECX whith the pointer to an external function – In this case this is the function Win32k!DdDxApiOpenCaptureDevice

 

bdae59d3 85c9            test    ecx,ecx

bdae59d5 7420            je      Dxapi!DxApi+0x43 (bdae59f7)

 

Dxapi!DxApi+0x23:

bdae59d7 8b5510          mov     edx,dword ptr [ebp+10h]

bdae59da 3b90045baebd    cmp     edx,dword ptr Dxapi!gDxApiEntryPoint+0x4 (bdae5b04)[eax]

bdae59e0 7215            jb      Dxapi!DxApi+0x43 (bdae59f7)

 

Dxapi!DxApi+0x2e:

bdae59e2 8b80085baebd    mov     eax,dword ptr Dxapi!gDxApiEntryPoint+0x8 (bdae5b08)[eax]

bdae59e8 394518          cmp     dword ptr [ebp+18h],eax

bdae59eb 720a            jb      Dxapi!DxApi+0x43 (bdae59f7)

 

Dxapi!DxApi+0x39:

bdae59ed ff7514          push    dword ptr [ebp+14h]

bdae59f0 8bf0            mov     esi,eax

bdae59f2 ff750c          push    dword ptr [ebp+0Ch]

 

bdae59f5 ffd1            call    ecx  ß Here is where the function effectively gets called

 

Dxapi!DxApi+0x43:

bdae59f7 8bc6            mov     eax,esi  ß This is our return address as you check on the stack

 

bdae59f9 5e              pop     esi

bdae59fa 5d              pop     ebp

bdae59fb c21400          ret     14h

 

Look at this specific part of the block:

 

Dxapi!DxApi+0x39:

bdae59ed ff7514          push    dword ptr [ebp+14h]

bdae59f0 8bf0            mov     esi,eax

 

bdae59f2 ff750c          push    dword ptr [ebp+0Ch]  ß This is the first parameter being pushed on the stack.  This will be read by the next function at the EBP+8 relative position.

 

bdae59f5 ffd1            call    ecx  ß Here is where the function effectively gets called

 

So our bad parameter comes from EBP+0Ch on this specific function.   EBP+0Ch happens to be the location for the second parameter (the parameters, considering we’re using the calling convention STANDARD or CDECL – FAST_CALL will pass the two first parameters through the registers ECX and EDX instead) passed to this function from the previous one (EBP+0x8 is relative location where the first parameter can be retrieved, EBP+0xC is basically 4 bytes beyond and is the second parameter, the same way EBP+0x10 will store the third parameter and so on).

 

So our next step will be revisiting the stack and investigate the previous function.

 

 

kd> kb3

ChildEBP RetAddr  Args to Child             

bd2859a0 bdae59f7 bd2859dc bd285a0c 00028bb1 win32k!DdDxApiOpenCaptureDevice+0x18

bd2859b4 f0593a8e 00000513 bd2859dc 00000020 Dxapi!DxApi+0x43

bd285a14 f0592bb4 81c83228 84f0ba10 822efac8 3dfxV3TV+0x3a8e

 

So the module 3dfxV3TV.sys is the one whose an internal function is passing the bad parameter which eventually caused our crash.  This is a third party module which means we will have no symbols for it.  In this case it’s time to look at the assembly code once again J

This is the assembly code for the function in question:

kd> ub f0593a8e l10

3dfxV3TV+0x3a5f:

f0593a5f 2bc7            sub     eax,edi

f0593a61 7902            jns     3dfxV3TV+0x3a65 (f0593a65)

f0593a63 f7d8            neg     eax

f0593a65 3bc8            cmp     ecx,eax

f0593a67 7e06            jle     3dfxV3TV+0x3a6f (f0593a6f)

f0593a69 8993c4000000    mov     dword ptr [ebx+0C4h],edx

f0593a6f 8b83c4000000    mov     eax,dword ptr [ebx+0C4h]

f0593a75 8945d8          mov     dword ptr [ebp-28h],eax

f0593a78 8d45f8          lea     eax,[ebp-8]

f0593a7b 6a08            push    8

f0593a7d 50              push    eax

f0593a7e 8d45c8          lea     eax,[ebp-38h]

 

f0593a81 6a20            push    20h   ß  Third parameter being pushed (this will be read by the next function at the EBP+0x10 relative position)

 

f0593a83 50              push    eax  ß  Second parameter being pushed (this will be read by the next function at the EBP+0xC relative position)  This is the one we’re interested in.

 

f0593a84 6813050000      push    513h ß First parameter being pushed (this will be read by the next function at the EBP+0x8 relative position)

 

f0593a89 e874750000      call    3dfxV3TV!SetDecoderSize+0x32ec (f059b002)  ß This is where the next function on the stack gets called

 

Ok, so we identified on the assembly block what would be the second parameter passed to the next function.  EAX was the register  storing its value, so the next step is find out where EAX is being updated for the last time before being used to push the parameter on the stack.  Giving a second look at the same assembly block, now focused on any instruction involving EAX we see this:

 

kd> ub f0593a8e l10

3dfxV3TV+0x3a5f:

f0593a5f 2bc7            sub     eax,edi

f0593a61 7902            jns     3dfxV3TV+0x3a65 (f0593a65)

f0593a63 f7d8            neg     eax

f0593a65 3bc8            cmp     ecx,eax

f0593a67 7e06            jle     3dfxV3TV+0x3a6f (f0593a6f)

f0593a69 8993c4000000    mov     dword ptr [ebx+0C4h],edx

f0593a6f 8b83c4000000    mov     eax,dword ptr [ebx+0C4h]

f0593a75 8945d8          mov     dword ptr [ebp-28h],eax

f0593a78 8d45f8          lea     eax,[ebp-8]

f0593a7b 6a08            push    8

f0593a7d 50              push    eax

f0593a7e 8d45c8          lea     eax,[ebp-38h]

f0593a81 6a20            push    20h

f0593a83 50              push    eax

f0593a84 6813050000      push    513h

f0593a89 e874750000      call    3dfxV3TV!SetDecoderSize+0x32ec (f059b002)

 

The instruction above (lea eax, [ebp-38h]) is basically copying the content of the relative position EBP – 38h to EAX.   So our bad value is whatever EBP-0x38 points to.  Well, I told you EBP+0x8 is the reference for what the first parameter is being passed and then the following parameters would be 4 bytes apart…  EBP is also a reference for some other important things also stored on the stack (as long as we’re not FPO optimized – again this is subject for another post) like EBP+0x4 is always the location where the return address will be stored and EBP-0x4 is where the first local variable for the function will be stored. 

EBP-0x38 is for sure a location of a local variable used by the function (EBP-0x4 is the first local variable.  From this point on, the location of other variables will actually depend on the variable type since this will determine how much space we need to reserve on the stack to store such values).  Being a local variable on the function, there isn’t much we can do without symbols and source code, but we thing is for sure:  Different from what “!analyze –v” told us the problem is actually being caused by the driver 3DFXV3TV.SYS and the solution or part of it would be either remove or update such driver J.

Until the next post J

 

 

 

 

 

 

How to attach a debugger from the creation time of the Worker Process (w3wp.exe)

Normally the answer for this would be as simple as use the file image execution options (through direct editing the registry or using gflags.exe from Debugging Tools For Windows) and set the “debugger” option to you preferred debugger tool and this would do the job.  At least this is the way we do for troubleshooting services startup in general.  There is one detail though which is the fact the services are likely loaded in a different winstation (they would have not interactivity with the desktop) and so will be loaded your debugger meaning that even though the debugger will run as it should you will not have access to it.

 

Well, to fix this all you need to is to go to the services mmc, navigate to the specific service’s properties, make it use the Local System account and check the checkbox that says “allow interact with the desktop” and this would fix your problem.  Right, but on thing is not ok here: Worker Processes are spawned to run in background as well as services however they are not exactly services and so there is no option to “allow interact with desktop” which means your debugger will still be loaded in the background preventing you to access it from the desktop.

 

The solution?  There is more than one for sure, but I found this one interesting.  Let’s see how it goes:

 

Both debuggers cdb.exe and windbg.exe from Debugging Tools For Windows are able to create TCP sockets (or named pipes connection if you prefer) and listen to them so other debuggers would be able to connect to it and share the same debugging session.  So I thought we could use this socket feature to create a tunnel between the winstations where our w3wp.exe and the debugger will be originally located and the desktop where we would have second debugger running (this one in foreground).

 

So, first things first…

 

Step 1: Set the file image execution options to have your first debugger attached to the worker process when it get spawned.

 

To do that you need to use the regedit and create the following registry key and values:

 

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\w3wp.exe

SZ Debugger  = “c:\debuggers\cdb.exe –cf c:\debuggers\cmds.txt”

 

This is supposing you have the Debugging Tools For Windows installed in the c:\debuggers folder.

 

Step 2: Automate the debugger to open a socket for future usage

 

Notice the “-cf” option in the cdb.exe command line above.  This will tell the cdb.exe to read some initial commands from a specified file to be executed as soon as the debugger get’s attached to the process.  So we need to create a text file called cmds.txt located in the c:\debuggers folder.  The content of the file will be the following:

 

.logopen c:\debuggers\debugging.log

.server tcp:port=9999

g

 

Save the file.  See the comments below on what each command above does from the debugger:

 

.logopen – opens a log file where everything happening from this debugging session will be properly logged

.server tcp:port=9999  - listen to a TCP socket on the port 9999

g – go J

 

Step 3: Start the application pool and make at least one request.

 

When the w3wp.exe process gets created you should be able to see through the task manager that now you also have a cdb.exe process running but you can’t directly access it.

 

Step 4: Run your second debugger and connect it to the socket.

 

Run the windbg.exe from Debugging Tools For Windows passing the parameter –remote from the command line like below:

 

Windbg.exe  -remote tcp:Port=9001,Server=MYSERVER

 

The “MYSERVER” above should be replaced by the server’s name.

 

Voila!!!  The windbg.exe will connect to the socket created by the cdb.exe and you will now be able to live debug your w3wp.exe from the scratch!

 

Have fun!

 

 

 

 

 

 

 

Real case: Application Pool’s worker process (w3wp.exe) crashing during recycling

I was teaching a workshop in London last week and one of the students brought a very interesting issue he had been working on for some time and based on the dump analysis he was not able to point to anything different from our own code (Microsoft’s modules) as the ones causing the crash.  He had done some research and found out that some crashes when recycling the app pool have been reported around the internet due to a PHP ISAPI problem.  He had such ISAPI loaded in the crashing app pool so he suspected from that but the dumps still point to our code…

Direct to point:

By looking at the dumps generated the [DebugDiag] we could see that the crash is a typical Access Violation (0xC0000005) as below:

The AV details:
0:000> .exr -1
ExceptionAddress: 5a3b01b4 (w3core!W3_SERVER::Terminate+0x00000164)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 00000000
Parameter[1]: 608fc030
Attempt to read from address 608fc030

The thread stack when the crash happened was something like this:

0:000> kL
ChildEBP RetAddr
0006fcb4 5a3bc347 w3core!W3_SERVER::Terminate+0x164
0006ff0c 0100187c w3core!UlW3Start+0x280
0006ff44 01001a27 w3wp!wmain+0x22a
0006ffc0 77e6f23b w3wp!wmainCRTStartup+0x12f
0006fff0 00000000 kernel32!BaseProcessStart+0x23

 

So as we can see, the w3core.dll was in the top of the stack and this would lead us to conclude this is the culprit module crashing the process.  But let’s hold our conclusions for a moment… All the w3core.dll is trying to do is to read from an address that is actually part of another module (the ADMPROX.DLL) that is currently unloaded and that’s why we’re crashing here. 

So the W3_SERVER::Terminate is trying to access an invalid address. The address being referenced is somewhere in ADMWPROX.DLL as below:
0:000> ln 608fc030
(608fc030) &ltUnloaded_ADMWPROX.DLL>+0xc030

This module is already unloaded as below and that's why we're crashing.
Unloaded modules:
608f0000 60901000 ADMWPROX.DLL

78130000 781cb000 MSVCR80.dll
60060000 60066000 aspnet_filter.dll
10000000 1004c000 ISAPI_Rewrite.dll
5a320000 5a332000 w3isapi.dll
022e0000 022f7000 odbcint.dll
77530000 775c7000 COMCTL32.dll
7c8d0000 7d0ce000 SHELL32.dll
762b0000 762f9000 comdlg32.dll
48890000 488cd000 ODBC32.dll
01c90000 02167000 php5ts.dll
76cf0000 76d0a000 iphlpapi.dll
03670000 03726000 ZendOptimizer.dll
02180000 02185000 ZendExtensionManager.dll
02440000 02609000 php_mbstring.dll
02610000 02621000 php_bz2.dll

 


The ADMPROX.DLL is also our code. So the question is why are we holding a reference for something already unloaded? We noticed that indeed there were also some PHP modules loaded (at the time the dump was generated they were unloaded as well).  One possible reason would be somebody COM initializing on a not owned thread (I've seen this before in a case where the IIS mmc got locked because a reference being hold by an unloaded module that called COM initialize in one of the IIS threads instead of its own) - More details about that can be obtained in the following article:

http://support.microsoft.com/kb/289112/en-us

Next step is to live debug and set a bp in the CoInitializeEx (http://msdn2.microsoft.com/en-us/library/ms695279.aspx) and find out which module would be doing that (if this is the case – this is a theory and we are about to test it). We created a new app pool and website and pointed the web site to the same content folder as the one crashing and started working with this worker process instead.  We had some hard time using file image execution options and tyring to get a debugger attached when the worker process gets spawned. We workaround the situation by automating the cdb to attach and open a socket so we would be able to connect to the socket via Windbg (I will post something specifically about this later on).

After attaching the windbg and setting the proper bp as below, we caught who else is COM intializing as below:


Setting the bp at the Ole32::CoInitializeEx
[116714-SERVER1\rack (tcp 192.168.1.202:1083)] 0:000> bp 776bc826 ".time;k100;g"
[116714-SERVER1\rack (tcp 192.168.1.202:1083)] 0:000> bl
0 e 776bc826 0001 (0001) 0:**** ole32!CoInitializeEx ".time;k100;g"

Bp got hit! This is php5ts.dll as below:


Debug session time: Thu Aug 2 19:50:03.357 2007 (GMT+1)
System Uptime: 8 days 14:47:04.585
Process Uptime: 0 days 0:01:08.125
Kernel time: 0 days 0:00:00.046
User time: 0 days 0:00:00.093
ChildEBP RetAddr
00fbe744 776a0d1f ole32!CoInitializeEx
00fbe754 01c186f8 ole32!CoInitialize+0xf
00fbe7b4 10001ec8 php5ts!php_request_startup+0x18

00fbecd4 5a3968ff w3isapi!ProcessIsapiRequest+0x100
00fbed08 5a3967e0 w3core+0x68ff
00fbed28 5a396764 w3core+0x67e0
00fbed48 5a3966f4 w3core+0x6764

This is not a PHP thread, so they were right when suspecting from PHP modules.  You can’t call CoInitializeEx in a thread not owned by you.  Next step is contact PHP vendor and let them know about the problem.

 

Later the customer informed us they were able to workaround the problem by simple downgrading the version of the PHP modules.  This seems to happen only in the new PHP ISAPI versions.


Anyway it was a very good sample to ilustrate some debugging during the training.

Tip: How many users are hitting my web site?

This is a question we hear very often from our customers, in forums or through distribution lists.  The reason we keep hearing the same question over the time is very simple:  There isn’t a definitive answer for that – At least not for IIS since it’s not exactly driven by the number of users hitting it.  It would make more sense to monitor the number of connections and requests being made through these than simply working around a number of users.   Why I think this way?  This is a subject for another post J.  In this post I will just try to give you another (hopefully a more precise one) answer for “how do I know how many users are accessing my web site or application hosted in a IIS server?”

 

Direct to the point:  I’ve been seeing really good people making a very common mistake when trying to answer this question – The most popular answer is:  Go to the IIS logs and count how many different client IP you have logged during a specific period of time.  It’s a common mistake to think that client IPs are directly related to the number of users hitting the server.  Why?  Really simple, just think that you can have an entire organization, like here at Microsoft, accessing the web through proxy servers or entire IP intranet classes (192.168.0.0, 10.0.0.0, 172.16.0.0) behind routers providing NAT services.  This client machines (some times thousands of them) will show up in your IIS log as a single client IP address.

 

So, it’s very nice to use LogParser to generate some pretty neat charts showing how many different IPs have been hitting your server based on the c-ip field in the IIS logs, however they will be nice charts with useless information.

 

The answer: Well, fortunately there is another field called cs(cookie) that will make IIS log the cookies sent back and forth.  What does this have to do with the number of users?  Unless you have the session state feature completely disabled for you classic ASP applications, they will use cookies to store the information for that.  For instance, sessions being created by classic ASP will generate cookies always initiated by the string “ASPSESSION” followed bv a unique ID number.  Can you already feel where we going to?  Exactly, all I’m proposing here is replace the c-ip by the cs(cookie) in your LogParser queries.

 

Since every new browser session will generate a new cookie regardless of which IP the client is using, it makes much more sense to monitor it instead.  For instance the query below will tell you how many different classic ASP sessions were used during 06:00pm (GTM) and 07:00pm (GTM) of the day 7/20/2007.

 

logparser "select count(distinct cs(cookie)) as sessions_total from ex070720.txt where cs(cookie) like 'ASPSESSION%' and date = '2007-07-20' and time > '18:00:00' and time < '19:00:00'

 

Now your nice charts can be both nice and useful J.

 

 

Logparser fails when using the ChartType option

I'm posting this as I couldn't find this solution documented anywhere (including within MSFT through the discussions around logparser). 

 

Problem: After I installed the Office 2007, my logparser queries which use the parameter "ChartType" no longer work returning the following error:

 

Error: invalid parameter "chartType"

 

The solution?  Logparser depends on the Office Web Components to create the charts.  Just intall them (I just reinstalled the OWC for Office 2003) and everything will work again.

How to Extract SQL info from a thread stack

In the post about troubleshooting Asp.Net poor performance I showed you how to identify one possible cause for high response times when processing Asp.Net pages – the cause we explored on that post was a database server taking too long to respond.  We were able to verify all the information about the SQL side of the scenario through analyzing the Asp.Net thread stack and using the SOS.dll specific commands.  So the question now is:  What about when I’m processing a Classic Active Server Page (ASP) page instead of a Asp.Net one? 


Well in this case, as we already know, the SOS.dll won’t help, so let’s see how to do it manually.

 

First step is to identify the thread and look at it call stack as below:

 

      <

0:016> kb10

ChildEBP RetAddr  Args to Child             

00f2e1dc 7c822124 77e6baa8 000006b8 00000000 ntdll!KiFastSystemCallRet

00f2e1e0 77e6baa8 000006b8 00000000 00f2e224 ntdll!NtWaitForSingleObject+0xc

00f2e250 77e6ba12 000006b8 00004e20 00000000 kernel32!WaitForSingleObjectEx+0xac

00f2e264 74cd2e3f 000006b8 00004e20 00004e20 kernel32!WaitForSingleObject+0x12

00f2e290 6d56db96 00c1d2f8 01ab4270 00000009 DBmsLPCn!ConnectionRead+0xaf

00f2e2b0 6d56871b 01ab6290 01ab4270 00000009 DBNETLIB!WrapperRead+0x2c

00f2e308 4e252d91 01ab6290 01ab4270 01ab4270 DBNETLIB!ConnectionRead+0x519

00f2e33c 4e252df0 01ab6290 01ab4270 00000009 SQLOLEDB!CDataSource::ConnectionRead+0x35

00f2e388 4e2522b6 01ab403e 00000001 00000000 SQLOLEDB!CDBConnection::GetBytes+0x269

00f2e3d4 4e259f63 015c4e08 00000088 0000001e SQLOLEDB!CDBConnection::ProcessTDSStream+0x157

00f2e490 4e25a030 015c5a00 00000014 015c4970 SQLOLEDB!CStmt::ExecDirect+0x786

00f2e4a8 4e2596d1 015c5a00 00000014 00000000 SQLOLEDB!CStmt::SQLExecDirect+0x28

00f2e4d8 4e259b1c 00000000 4e2590b4 00000014 SQLOLEDB!CCommand::ExecuteHelper+0x157

00f2e55c 4e2595ea 015c5a68 00000000 4bbe9658 SQLOLEDB!CCommand::Execute+0x76b

00f2e594 4bbe952e 015c4d60 00000000 4bbe9658 SQLOLEDB!CImpICommandText::Execute+0xdd

00f2e5d4 4bc0c9db 015c12b0 015c4a04 015c1638 msado15!CConnection::Execute+0x9d

00f2e7a4 4bbe9388 015c5660 00000003 015c48a8 msado15!_ExecuteAsync+0x19f

00f2e7b8 4bbe90e0 015c5660 00f2e988 00000000 msado15!ExecuteAsync+0x23

00f2e8a0 4bbe9303 00000003 015c1a00 00000000 msado15!CQuery::Execute+0xa5e

00f2e90c 4bbe6128 015c48a8 00000003 015c1a00 msado15!CCommand::_Execute+0x153

00f2e9a0 4bbe6400 ffffffff 015c1a00 00f2eb34 msado15!CRecordset::_Open+0x30f

00f2eb18 11004397 005c1638 00000008 00000000 msado15!CRecordset::Open+0x5dc

 

So, that’s a classic OLEDB thread stack.  To obtain the SQL command being send to the SQL server identify the function call SQLOLEDB!CStmt::SQLExecDirect above.  Grab the first parameter and run the following command:

 

0:016> du 015c5a00

015c5a00  "SELECT * FROM TITLES"

 

Of course you still want to know which SQL server this thread holds a connection with.  So how about checking the connection string?  Just identify the the function call msado15!CConnection::Execute, grab the first parameter and execute the following:

 

0:016> du poi(015c12b0+64)

015c1aa8  "Provider=SQLOLEDB.1;Integrated S"

015c1ae8  "ecurity=SSPI;Persist Security In"

015c1b28  "fo=False;Initial Catalog=pubs;Da"

015c1b68  "ta Source=(local)"

 

So now you can call the DBA in charge of the server “local” and ask him to collaborate with you on this low performance problem your ASP app has been unfairly blamed about J


Have fun!

How to identify the process and thread being called in a COM call from a thread stack

I’ve just published a post on how to manually identify which classic ASP page is being processed by a specific thread.  Following the same idea you can also identify COM calls being made by a thread.   As always, first identify the thread doing the COM call you want investigate and then look at the thread stack as below:

 

<0:032> 

0:032> kb100

ChildEBP RetAddr  Args to Child             

0235e93c 7c822114 77e6711b 00000002 0235e98c ntdll!KiFastSystemCallRet

0235e940 77e6711b 00000002 0235e98c 00000001 ntdll!NtWaitForMultipleObjects+0xc

0235e9e8 7739cd08 00000002 0235ea10 00000000 kernel32!WaitForMultipleObjectsEx+0x11a

0235ea44 77697483 00000001 0235ea8c 000003e8 USER32!RealMsgWaitForMultipleObjectsEx+0x141

0235ea6c 776974f2 0235ea8c 000003e8 0235ea9c ole32!CCliModalLoop::BlockFn+0x8

0235ea94 7778866b ffffffff 0235eb94 000e513c ole32!ModalLoop+0x5b

0235eab0 77788011 00000000 00000000 00000000 ole32!ThreadSendReceive+0xa0

0235eacc 77787ed7 0235eb94 0011825c 0235ebf0 le32!CRpcChannelBuffer::SwitchAptAndDispatchCall+0x112

0235ebac 776975b8 0011825c 0235ecc0 0235ecb0 ole32!CRpcChannelBuffer::SendReceive2+0xc1

0235ebc8 7769756a 0235ecc0 0235ecb0 0011825c ole32!CCliModalLoop::SendReceive+0x1e

0235ec34 776c4eee 0011825c 0235ecc0 0235ecb0 ole32!CAptRpcChnl::SendReceive+0x6f

0235ec88 77ce127e 00000001 0235ecc0 0235ecb0 ole32!CCtxComChnl::SendReceive+0x91

0235eca4 77ce13ca 0011750c 0235ecec 00000000 RPCRT4!NdrProxySendReceive+0x43

0235f08c 77d0c947 77d12028 77d150c2 0235f0ac RPCRT4!NdrClientCall2+0x206

0235f0a4 77d0c911 0011750c 60030001 73464c18 OLEAUT32!IDispatch_RemoteInvoke_Proxy+0x1c

0235f364 73469f71 0011750c 60030001 73464c18 OLEAUT32!IDispatch_Invoke_Proxy+0xb6

0235f3b8 73468f37 01ef0100 0011750c 60030001 vbscript!CatchIDispatchInvoke+0x46

0235f3f8 73468ea6 01ef0288 0011750c 60030001 vbscript!IDispatchInvoke2+0xaf

0235f434 73469000 01ef0288 0011750c 60030001 vbscript!IDispatchInvoke+0x59

0235f548 73467bb6 01ef0288 00000000 60030001 vbscript!InvokeDispatch+0x13a

0235f56c 73467cad 01ef0288 0011750c 60030001 vbscript!InvokeByName+0x42

0235f848 73464940 00000000 00000000 01ef0288 vbscript!CScriptRuntime::Run+0x1331

0235f940 73464cd2 00000000 00000000 00000000 vbscript!CScriptEntryPoint::Call+0x5c

0235f998 73465522 01ef0768 00000000 00000000 vbscript!CSession::Execute+0xb4

0235f9e8 7346189b 00000000 00000000 709e19b4 vbscript!COleScript::ExecutePendingScripts+0x13e

0235fa04 709e2f5a 01612040 01612040 01112700 vbscript!COleScript::SetScriptState+0x150

0235fa30 709e2f1a 00000000 709e19b4 0235fb38 asp!CActiveScriptEngine::TryCall+0x19

0235fa6c 709e2e50 00000000 647246fe 014123e8 asp!CActiveScriptEngine::Call+0x31

0235fa88 709e2d54 0235fb0c 00000000 00000000 asp!CallScriptFunctionOfEngine+0x5b

0235fadc 709e2c7f 01911e90 00000000 0235fb68 asp!ExecuteRequest+0x17e

0235fb44 709e2a4d 01911e90 014123e8 0235fb68 asp!Execute+0x249

0235fb98 709e271a 00000000 00000000 00114c28 asp!CHitObj::ViperAsyncCallback+0x3e8

0235fbb4 75bd748e 01493048 00097be8 0235fd74 asp!CViperAsyncRequest::OnCall+0x92

0235fbd0 7770f153 00114c28 000a4178 00000000 COMSVCS!CSTAActivityWork::STAActivityWorkHelper+0x32

0235fc1c 7770fba0 00000000 000a4178 75bd745c ole32!EnterForCallback+0xc4

0235fd7c 777100aa 0235fc54 75bd745c 00114c28 ole32!SwitchForCallback+0x1a3

0235fda8 7769408c 000a4178 75bd745c 00114c28 ole32!PerformCallback+0x54

0235fe40 77712865 00097be8 75bd745c 00114c28 ole32!CObjectContext::InternalContextCallback+0x159

0235fe60 75bd7831 00097be8 75bd745c 00114c28 ole32!CObjectContext::DoCallback+0x1c

0235fecc 75bd7b95 00114e18 00114df8 001147c4 COMSVCS!CSTAActivityWork::DoWork+0x12d

0235fee4 75bd852e 00114c28 00000001 00114df8 COMSVCS!CSTAThread::DoWork+0x18

0235ff04 75bd897a 00000000 01eefcb0 00039608 COMSVCS!CSTAThread::ProcessQueueWork+0x37

0235ff84 77bcb530 00114df8 00000000 00000000 COMSVCS!CSTAThread::WorkerLoop+0x17c

0235ffb8 77e66063 00039608 00000000 00000000 msvcrt!_endthreadex+0xa3

0235ffec 00000000 77bcb4bc 00039608 00000000 kernel32!BaseThreadStart+0x34

 

 

Now identify the function call ole32!CRpcChannelBuffer::SendReceive2 above.  Grab the first parameter and run the following command:

 

0:032> dd poi(0011825c +18)+8 l2

000de8d0  00000d8c 00000ddc

 

The output above represents the process ID (in this case the hex number 0xd8c) of the server process for this COM call and the thread ID (in this case the hex number 0xddc).

 

Notice that the output values are in hex format and your PID (the one you identify by using task manager, tlist, tasklist etc) info will probably be in  decimal format so you still need to make the convertion:

 

In this case the PID of the server process being called in this thread would be 3462 = 00000d86.

 

How to identify which ASP page is being processed on a specific thread

I’ve been teaching one of our workshops offerings (Workshop Plus: IIS6 Critical Problem Management) and when we go through some debugging samples, one thing I like to demo is how to find which Asp.Net pages are being processed in which threads.  Well, the SOS.dll extension supports the “!aspxpages” command that will make your life really easier in a sense that all you need to do is run the command and it will return a list of the threads and their respective IDs as well as which Asp.Net page is being processed by such threads (aspx, asmx).

 

And then the students try to use the same command to identify classic Active Server Pages (.asp).  This will not work since the commands implemented by the SOS.dll are specifically for managed code (.Net) and ASP pages will be processed by threads running native code.  Well, the easiest way to accomplish that would be just feedback the memory dump to the Debug Diagnostics Tool (DebugDiag) and run the hang/crash analysis scripts against it.  But how about when you don’t have Debug Diagnostics Tool (DebugDiag) or any other extension available for some reason to do that for you?  Can you still do that manually through the Windbg from Debugging Tools For Windows?  The answer is YES!  And you don’t even need symbols for that ;-).

 

Here is what you can do to find out what ASP page is being processed in a specific thread:

 

First verify the thread call stack:

 

     0:032> kb

ChildEBP RetAddr

0235fa88 709e2d54 0235fb0c 00000000 00000000 asp!CallScriptFunctionOfEngine+0x5b

0235fadc 709e2c7f 01911e90 00000000 0235fb68 asp!ExecuteRequest+0x17e

0235fb44 709e2a4d 01911e90 014123e8 0235fb68 asp!Execute+0x249

0235fb98 709e271a 00000000 00000000 00114c28 asp!CHitObj::ViperAsyncCallback+0x3e8

0235fbb4 75bd748e 01493048 00097be8 0235fd74 asp!CViperAsyncRequest::OnCall+0x92

0235fbd0 7770f153 00114c28 000a4178 00000000 COMSVCS!CSTAActivityWork::STAActivityWorkHelper+0x32

0235fc1c 7770fba0 00000000 000a4178 75bd745c ole32!EnterForCallback+0xc4

0235fd7c 777100aa 0235fc54 75bd745c 00114c28 ole32!SwitchForCallback+0x1a3

0235fda8 7769408c 000a4178 75bd745c 00114c28 ole32!PerformCallback+0x54

0235fe40 77712865 00097be8 75bd745c 00114c28 ole32!CObjectContext::InternalContextCallback+0x159

0235fe60 75bd7831 00097be8 75bd745c 00114c28 ole32!CObjectContext::DoCallback+0x1c

0235fecc 75bd7b95 00114e18 00114df8 001147c4 COMSVCS!CSTAActivityWork::DoWork+0x12d

0235fee4 75bd852e 00114c28 00000001 00114df8 COMSVCS!CSTAThread::DoWork+0x18

0235ff04 75bd897a 00000000 01eefcb0 00039608 COMSVCS!CSTAThread::ProcessQueueWork+0x37

0235ff84 77bcb530 00114df8 00000000 00000000 COMSVCS!CSTAThread::WorkerLoop+0x17c

0235ffb8 77e66063 00039608 00000000 00000000 msvcrt!_endthreadex+0xa3

0235ffec 00000000 77bcb4bc 00039608 00000000 kernel32!BaseThreadStart+0x34

 

Identify the function asp!Execute or asp!ExecuteRequest as above.  If you have no symbols available and can’t see the function names in the stack, try to grab the third ASP function call loaded in the stack (that will be probably the one we need).

 

Now, grab the first parameter passed to this function (notice that for the functions Execute and ExecuteRequest it’s pretty much the same and that’s why it doesn’t really matter which function we will work with).  Execute the following command on that:

 

0:032> du poi(poi(poi(01911e90 +9c)))

01ba32e8  "/WebSite001/mypage.asp"

 

As you can notice, it returns the relative path in the URL of the web site with the ASP page info there!

 

Now, if you just dump it with an offset of 4 bytes as below:


0:032> du poi(poi(poi(01911e90 +9c))+4)

01ba42e8  "C:\INETPUB\WWWROOT\MYPAGE.ASP" 

 

It will return then the absolute path where the page is located in the file system.

 

So now you can identify which classic ASP page is being processed in a specific thread J  Have fun!

 

Tip: Kernel Debugging a VPC Server

Here is little tip for you that want to practice kernel mode debugging but either don’t have 2 machine machines to play the TARGET and HOST roles or simply don’t want to play with (for any reason) with the old null modem serial cable used to connect the HOST to the TARGET:  It’s possible to use a Virtual PC machine to play the TARGET role, so all you would need is the Virtual PC image correctly configured to boot up in DEBUG mode (through the regular boot.ini options - /DEBUG /DEBUGPORT=COM# /BAUDRATE=115200) and map its serial COM port to a named pipe.

 

How to do so:

Supposing you already have a Virtual PC image set up (it can be an image of Windows 2000, Windows XP or Windows 2003), you just need to follow the step by step list below:

 

1. Login as admin in the VPC image and edit the c:\boot.ini file to include a new boot up OS option as following:

 

A default boot.ini file will likely have the following content:

 

[boot loader]

timeout=3

default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS

[operating systems]

multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional" /fastdetect

 

Edit the boot.ini and add the highlighted line below:

 

[boot loader]

timeout=3

default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS

[operating systems]

multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional" /fastdetect

multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional – DEBUG" /fastdetect  / DEBUG /DEBUGPORT=COM1 /BAUDRATE=115200

 

2. Before rebooting the VPC image, go the VPC console, select the option “Settings” under the menu “Edit”.  Find the configuration item for the COM1 port, select the radio button “Named Pipe” and type the string \\.\pipe\COM1 as the picture below illustrates.

 

 VPC Settings

 

This VPC image (the virtual machine) will be the TARGET system, or the machine whose the kernel will be debugged.  Your real machine (the one hosting the VPC image) will be the HOST machine or the one which will run the debugger that will be attached to the TARGET’s system kernel.

 

3. Start a Windbg (from the Debugging Tools For Windows) instance with the following command line options:

 

windbg -k com:pipe,port=\\.\pipe\COM1,resets=0,reconnect

 

The Windbg will open up it’s command window and will wait to connect through the named pipe mapped port as soon as VPC OS starts up.  The picture below illustrates how the Windbg will start up with these command line parameters:

 

Windbg KD 

 

4. Now reboot the VPC image and select the new option you’ve included in the boot.ini in the step 1 of this procedure.  During the boot process, as soon as the kernel gets loaded the Windbg will attach to it.  All you need to do now if break the debugger at the point you want to start debugging and have fun J.

 

 

Real Case: Random apps running 100% CPU (advanced)

This is another 100% CPU issue.  This time I’m not working on any customer case as the issue is happening on my own Windows XP SP2 machine.  So I resolved trying to troubleshoot that just for fun J.

First, let me explain the scenario:

Random interactive applications, like Outlook, Word, Excel, Explorer, Internet Explorer etc, were unexpectedly consuming 100% CPU.  Once they started eating up all the CPU they get completely hung and do not stop using all the CPU until I actually kill their processes.  Some times after killing their process and restarting the application the issue was gone, some times the issue comes back right after the application restarts and I need to kill it again…

Ok, I made my part playing as an end user by just killing the processes for some time but it started buzzing me as it was becoming more frequent.  So now it’s time have some fun and go to the core of the issue.

All I need to do is follow the steps in the post Lab: Win32 Application Causing 100% Condition.  So let’s go start doing that:

Once the problem happened next time, it was in the infopath.exe application, I attached the Windbg (from the Debugging Tools For Windows) and executed the command below in order to find out which thread I should look at:

 

Ok, so now I know the thread 0 is the one I’m looking for since it’s the one who’s been consumed the CPU for the last 40 seconds.  You might be asking “what if there were other threads also consuming the CPU?”  Well, for this specific issue I knew we would find only one single thread since my machine is a dual CPU and whenever the problem happens the app takes only one of the CPUs.  

 

With that said, let’s take a look at the call stack of the thread 0:

 

  

Ok, so we don’t have that much… the return address is 0x0 and ebp register is 0x0 so since we don’t have  the proper symbols to work with this I guess we won’t be able to rebuild the stack.  So we will have even more fun diving in the ASM from now on!!!

 

We already know the application nview.dll is the one loaded in the call stack.  Let’s see what is it doing or trying to do.  Our current instruction pointer tells us this:

 

 

So all we need to do to follow the assembler execution path is to get the current CPU registers value and start reproducing the effects the real instructions would have.  The current value of the CPU registers is below:

 

 

Notice that both EAX and EDX (the two registers used by the first operation) are 0x0.  So the first instruction mov eax, edx won’t change anything.  It will make a copy of the EDX register to the EAX but since they are both 0x0 this will result in no changes at our registers.  Let’s take a look at the subsequent instructions and their effects:

 

and     eax, 01000000h

It will make an AND logical operation between EAX and the value 0x1000000 and load the result at the EAX register.  Again, this won’t change the EAX value since 0x0 AND 0x1000000 results in 0x0 which is already the current EAX value.

 

xor      ebp,ebp

It will make an Exclusive-OR operation, or XOR, between the stack base pointer EBP and itself and load the result at the EBP (which is the destination operand on this operation).  Needless to say this would load EBP with 0x0, however as we can notice below from the current register values, EBP is already 0x0:

 

 

 

or        eax,ebp

It will make an OR logical operation between EAX and EBP and load the result at the EAX.   We already know this will be the equivalent of making 0x0 OR 0x0 which will result in guess what… Exactly! – It will result in 0x0 which doesn’t change the original value of EAX.

 

je         nview!NVLoadDatabase+0xc8a (100020ea)

This will cause the instruction pointer EIP to “jump” to 100020ea just in case the operands of the previous OR operation were equal in value.  Well, EAX is indeed equal, in value, to EBP since they’re both 0x0.  So the EIP is loaded with the value 100020ea and our assembly execution path is properly changed as below:

 

 

So, let’s continue from the instruction located at 100020ea.

 

shl       edx,8

It will make a shift left on the EDX value of 8 positions.  EDX is a 32-bit representation of the value 0x0, so shifting it 8 positions wouldn’t change anything.  EDX remains 0x0 after the operation as we can see below:

 

 

 

mov    eax,edx

This one we already know.  It won’t change anything as EAX = EDX = 0x0.

 

sar      eax,8

It will a shift right on the EAX of 8 positions.  As we’ve seen before with the EDX and the shift left operation, this won’t change anything as EAX = 0x0.

 

add     eax,esi

Finally EAX changed its original value of 0x0 to the result of the sum between itself and ESI register.  ESI = 0003fff8 and so EAX will from now on be EAX = 0003fff8 as can see below:

 

sar      edx,1Fh

Again, a shift righ operation on the EDX with 31 positions.  EDX will remain with 0x0 after that as we’ve seen before.

 

cdq

This will convert the value of EAX to a 64-bit value.  But it will keep it original value of 0003fff8.

 

mov    eax,  esi

It will copy the value of ESI to EAX, however as we noticed before EAX was made equal to ESI by the operation add eax, esi previously executed (since EAX was 0x0 by that time).  So this operation doesn’t change anything either.

 

lea       eax,[ecx+esi]

This will load EAX with the value resultant from the operation ECX + ESI.  So let’s make a little math here… ECX + ESI = 00c90000 + 0003fff8 = 00ccfff8.  So EAX will be from now on = 00ccfff8 as we see below:

 

 

cmp    eax, edi

This will compare the values of EAX and EDI and change the proper flags accordingly.

 

mov    dword ptr [esp+14h],edx

This will copy the EDX value to the memory location referred by the stack pointer plus 20 bytes, or in other words, this will copy EDX value to 20 bytes down on the stack.  So, doing a little math again we will have that such a position would be the ESP = 0012c6c0 plus 20 bytes, so 0012c6c0 + 0x14 = 0012c6d4.  Let’s see what is current content of 0012c6d4:

 

 

All right, so it’s currently 0x0 and we are copying the value of EDX which is also 0x0 to that location in the memory.  So, we won’t actually change anything here.  Let’s move on.

 

jb         nview!NVLoadDatabase+0xc60 (100020c0)

This will make our execution path jumping to 100020c0 just in case the result of the operation cmp previously executed, tells us the value of EAX is lower (or below) EDI. By the time the cpm operation was executed, EAX was 00xxfff8 and EDI was 00cd0000 so EAX is indeed below EDI and will therefore jump in the execution path as below:

 

 

 

So we moved back some bytes on our execution path.  Let’s keep moving…

 

mov    edx, dword prt [eax]

It will copy the value referred by EAX to the register EDX.  The value currently referred by EAX is the following:

 

 

 So we’re just copying 0x0 to EDX which is already 0x0, so it doesn’t change our register’s value.

 

mov    eax, dword ptr [eax+4]

This will copy the value referred by the EAX plus 4 bytes to EAX.  The valued pointed by EAX + 4 is the value pointed by 00ccfffc which is 0x0.  So now we’re basically making EAX = 0x0 again – As we had in the very beginning of our execution path.

 

 

mov    dword ptr [esp+14h],eax

This will just copy the EAX value to a specific position on the stack (20 bytes under the top pointed by the ESP).  So it doesn’t change any register and it actually doesn’t change the stack value either since EAX is currently 0x0 once again and we had already saved the value 0x0 at that same position on the stack when executed the instruction located at 100020ff.

 

Now hold on:  The next instruction is exactly the one we executed in the very first step of all this investigation.  And guess what?  The registers and the stack have exactly the same values as we had in the very beginning.  So we don’t need to move forward on this to conclude that we will end up again in the same place, and again, and again, and again… indefinitely!  This turns out to be an infinite loop!! 

 

Conclusion:

This explains why this thread is using 100% of our CPU and concludes our investigation.  Unfortunately this is something we can’t fix since it’s happening within the nview.dll’s code and this is not a MS product.  

 

NVIEW.DLL is a NVidia module apparently used by the NVidia Desktop Manager software.  I’ve done some research through the NVidia web site and I haven’t found any reference to this so I believe there is no definitive solution for this yet.  As a workaround I simply disabled the Desktop Manager feature and everything seems to be fine now.

 

 

 

 

Real Case: .Net Apps - "No connection could be made because the target machine actively refused it" (basic)

Recently we’ve been receiving (by coincidence) some escalations to troubleshoot the same issue.  It’s about .Net apps returning the following exception:

 

SocketException::.ctor() 10061:No connection could be made because the target machine actively refused it

 

The cases I’ve been working with show different stack traces but the main point here is that exception is originally occurring when the application is trying to make use of a TCP socket to establish a connection with a remote machine.  For all the cases escalated to us so far, the problem has ended up being simply what the error message states, i.e the remote machine, or something in between, refuses the connection attempt.  This post is just to show how we’ve been able to troubleshoot these problems and isolate them to network related issues.

Let’s go through two different scenarios and see step by step how to troubleshoot them.  In the first scenario we have an IIS server (which we will call from now on WEB1 – IP:10.0.0.1) hosting a .Net application which is intermittently failing to make a web services call to another IIS server (let’s call it WEB2) located in a remote environment.  The problem has already been isolated to the point when the .Net app in the WEB1 makes a call to a web service hosted in the server WEB2.

The two last frames on the stack for this first scenario would be the following ones:

 

System.Net.Sockets.TcpClient.Connect(String hostname, Int32 port)

System.Web.Services.Protocols.WebClientProtocol.GetWebResponse(WebRequest request)

 

The first step, as always, is understand the infrastructure involved in the communication between WEB1 and WEB2.  Making it short, WEB2 is the virtual name used to mask a round robin DNS rotation between 10 other web servers and between the WEB1 and WEB2 there are at least 2 firewalls, one at the WEB1 location and another one at the WEB2 location (classic environment where the WEB servers are located in the DMZ provided by both firewalls on their respective locations).

Next step is to generate a network trace from the server WEB1 (it would be also nice if we could generate a trace simultaneously from the WEB2, however first in this scenario we don’t have access to the WEB2 and even if so we would need to take a trace from all the 10 web servers because we don’t know at this point to which the WEB1 is trying to establish a connection when the problem happens.  So we went ahead, installed the Microsoft Network Monitor tool in the WEB1 started collecting the network trace and waited until the problem happens.  In the mean time while the problem did not happen, we just got informed about the other 10 IP addresses of the 10 web servers hereby called simply by WEB2.

Ok, the problem happened, so now it’s time to dive in the network trace:

All the know-how we need now it’s a basic knowledge about how TCP connections happen.  At the moment of the problem the trace shows the following suspect frames:

 

2         {TCP:2, IPv4:1}            10.0.0.1            192.168.0.1       TCP      TCP: Flags=.S......, SrcPort=4973, DstPort=HTTP(80), Len=0, Seq=3411621893, Ack=0, Win=16384 (scale factor 0) = 16384

3         {TCP:2, IPv4:1}            192.168.0.1       10.0.0.1            TCP      TCP: Flags=.S..A..., SrcPort=HTTP(80), DstPort=4973, Len=0, Seq=1096790416, Ack=3411621894, Win=0 (scale factor 0) = 0

4         {TCP:2, IPv4:1}            10.0.0.1            192.168.0.1       TCP      TCP: Flags=....A..., SrcPort=4973, DstPort=HTTP(80), Len=0, Seq=3411621894, Ack=1096790417, Win=16616 (scale factor 0) = 16616

5         {HTTP:3, TCP:2, IPv4:1}            10.0.0.1            192.168.0.1       HTTP     HTTP: HTTP Payload

Tcp: Flags=....A..., SrcPort=4973, DstPort=HTTP(80), Len=1, Seq=3411621894 - 3411621895, Ack=1096790417, Win=16616 (scale factor 0) = 16616

6         {HTTP:3, TCP:2, IPv4:1}            10.0.0.1            192.168.0.1       HTTP     HTTP: HTTP Payload

Tcp: [ReTransmit #5]Flags=....A..., SrcPort=4973, DstPort=HTTP(80), Len=1, Seq=3411621894 - 3411621895, Ack=1096790417, Win=16616 (scale factor 0) = 16616

7         {HTTP:3, TCP:2, IPv4:1}            10.0.0.1            192.168.0.1       HTTP     HTTP: HTTP Payload

Tcp: [ReTransmit #5]Flags=....A..., SrcPort=4973, DstPort=HTTP(80), Len=1, Seq=3411621894 - 3411621895, Ack=1096790417, Win=16616 (scale factor 0) = 16616

8         {TCP:2, IPv4:1}            4.79.185.20       10.0.0.1            TCP      TCP: Flags=..R....., SrcPort=HTTP(80), DstPort=4973, Len=0, Seq=1096790417, Ack=843163868, Win=0 (scale factor 0) = 0

 

The frame #2 represents the server WEB1 starting a TCP 3 way hand shake with the server whose the IP address is 4.79.185.20 at the port TCP 80 (remember that this is a web service call) which at this point is the one representing our server WEB2 (there are 10 possible IP addresses depending on which one the DNS round robin will provide through the name resolution).  The 3 way hand shake has been successfully accomplished through the frames #2,#3 and #4 (the sequence SYN, ACK SYN, ACK has occurred).  After that something interesting happens:  The WEB1 starts sending packets (starting on frame #5) of 1 single byte (notice that difference between the sequence number results in 1 -> 3411621895 - 3411621894 = 1) instead of making the complete POST to the web service.

The WEB1 receives no acknowledge from WEB2 to the packet sent at the frame #5, so it sends it again (that’s why the frame #6 and #7 are being considered retransmissions) in the frame #6, again the WEB2 doesn’t ack that, so WEB1 keeps resending that 1 byte packet in the frame #7 and then finally the WEB2 responds to that but with a RESET instead.  The packet sent by the WEB2 at the frame #8 in unexpected and will cause the application to handle that as the exception mentioned in the beginning of this article.

The question is:  Why the server WEB1 keeps sending a 1 byte length packet after the hand shake? 

Before answering this question, you might make a mistake and think that the WEB2 is legitimately resetting the connection with the WEB1 since WEB1 has not behaved as expected (which would be send a POST to the web service and not the 1 byte packets) and this would take you to the assumption that there is something wrong with the server WEB1 and lead you to completely wrong troubleshooting line… Didn’t you see what’s wrong yet?  Ok, it’s really a little tricky but the problem here happens at the hand shake (event though is was successful one)… Do you see now?  No?  Ok, look again to the 3 frames that compose the hand shake, more precisely to the frame #3.  Notice that the WINDOW size is 0 (Win=0) which basically means the server WEB2 can not receive anything larger than 0 bytes… Interesting, isn’t it?  So after receiving the 0 bytes length window size advertised by the WEB2, the WEB1 sends a 1 byte packet in attempt to make the WEB2 to advertise a new valid WINDOW size, which doesn’t happen after the first attempt, so the WEB1 keeps trying until the WEB2 resets the connection.

So the problem has been isolated to be at the WEB2 end.  Another question remains:  Why is the problem intermittent?  After all, if this behavior of advertising an invalid window size is always happening, the problem should also always happen.  It’s time to take a new network trace…

On a new network trace, we have both situations:  When it works and when it doesn’t work.  We already know what happens when it doesn’t work, so now let’s see what happens when it works:

 

2         {TCP:2, IPv4:1}            10.0.0.1            192.168.0.2       TCP      TCP: Flags=.S......, SrcPort=4961, DstPort=HTTP(80), Len=0, Seq=843591585, Ack=0, Win=16384 (scale factor 0) = 16384

3         {TCP:2, IPv4:1}            8.7.82.20           10.0.0.1            TCP      TCP: Flags=.S..A..., SrcPort=HTTP(80), DstPort=4961, Len=0, Seq=1446804074, Ack=843591586, Win=8760 (scale factor 0) = 8760

4         {TCP:2, IPv4:1}            10.0.0.1            192.168.0.2       TCP      TCP: Flags=....A..., SrcPort=4961, DstPort=HTTP(80), Len=0, Seq=843591586, Ack=1446804075, Win=16560 (scale factor 0) = 16560

5         {HTTP:3, TCP:2, IPv4:1}            10.0.0.1            192.168.0.2       HTTP     HTTP: Request, POST /WebSrv/QueryWebSrv.asmx

6         {HTTP:3, TCP:2, IPv4:1}            192.168.0.2       10.0.0.1            HTTP     HTTP: Response, HTTP/1.1, Status Code = 100

 

Ok, now we don’t see the same window size problem and 3 way hand shake is ok and everything is fine.  However our WEB1 server is not connecting to the same server WEB2 as now the server WEB2 is being represented by a different IP address – 192.168.0.2 as oppose to 192.168.0.1 that we’ve see before – due to the round robin balance system in place.  So, it answers our second question:  The problem actually happens always when the WEB2 is the 192.168.0.1 server, however since there is a round robin balance in place, the server WEB2 will not always be the one whose the IP address is 192.168.0.1.

The solution of problem?  Well, after isolating the problem we contacted the people in charge of the WEB2 end and they generated new traces on their end and found out there was a firewall in between the servers which was misconfigured regarding the IP 192.168.0.1.  They reconfigure the firewall and this fixed the problem for our first scenario.

The second scenario is a IIS server (from now one WEB1 IP: 10.0.0.1) reporting the same error as above, however now the stack, even though the exception happens at the same place, is a little different as below:

 

System.Net.Sockets.TcpClient.Connect(String hostname, Int32 port)

Microsoft.AnalysisServices.AdmdCleintXmlaClient.GetTcpClient(ConnectionInfo connectionInfo)

The difference here is just that in this scenario our .Net application being hosted by IIS is calling a SQL Server (IP: 172.16.0.1) running the Analysis Services as opposed to a web service.  The architecture of this scenario is a little bit easier to work with since the problem has already been isolate to one specific SQL Server when it’s accessed from any web server.

By following the logic as before, we will first generate a network trace from the WEB1 server (again, we don’t have control over the other end – the SQL server, as it would be useful to also generate a network trace from that server).  By looking at the trace we see the following:

7447                {TCP:534, IPv4:533}     10.0.0.1            172.16.0.1         TCP      TCP: Flags=.S......, SrcPort=20531, DstPort=2383, Len=0, Seq=655839663, Ack=0, Win=16384 (scale factor 0) = 16384

7448                {TCP:534, IPv4:533}     172.16.0.1         10.0.0.1            TCP      TCP: Flags=..R.A..., SrcPort=2383, DstPort=20531, Len=0, Seq=0, Ack=655839664, Win=0 (scale factor 0) = 0

7467                {TCP:535, IPv4:533}     10.0.0.1            172.16.0.1         TCP      TCP: Flags=.S......, SrcPort=20531, DstPort=2383, Len=0, Seq=655839663, Ack=0, Win=16384 (scale factor 0) = 16384

7468                {TCP:535, IPv4:533}     172.16.0.1         10.0.0.1            TCP      TCP: Flags=..R.A..., SrcPort=2383, DstPort=20531, Len=0, Seq=0, Ack=655839664, Win=0 (scale factor 0) = 0

7554                {TCP:538, IPv4:533}     10.0.0.1            172.16.0.1         TCP      TCP: Flags=.S......, SrcPort=20531, DstPort=2383, Len=0, Seq=655839663, Ack=0, Win=16384 (scale factor 0) = 16384

7556                {TCP:538, IPv4:533}     172.16.0.1         10.0.0.1            TCP      TCP: Flags=..R.A..., SrcPort=2383, DstPort=20531, Len=0, Seq=0, Ack=655839664, Win=0 (scale factor 0) = 0

What we see here is the WEB1 making successive attempts to connect to the port TCP:2383 (the default port for non named instances of Analysis Services running in SQL Server 2005) and the SQL Server is resetting all of them.  We realized that there are two routers between the server WEB1 and the SQL Server.  We can’t identify if the resets are coming from the SQL or any of the routers since the MAC address in the reset packets will always be the one of the last router in the network path between the servers.  However by doing closer look at the reset packets, we notice that the TTL has been decremented by 2 (ttl=126) as below:

 

+ Ethernet: Etype = Internet IP (IPv4)

- Ipv4: Next Protocol = TCP, Packet ID = 14341, Total IP Length = 40

  + Versions: IPv4, Internet Protocol; Header Length = 20

  + DifferentiatedServicesField: DSCP: 0, ECN: 0

    TotalLength: 40 (0x28)

    Identification: 14341 (0x3805)

  + FragmentFlags: 0 (0x0)

    TimeToLive: 126 (0x7E)

    NextProtocol: TCP, 6(0x6)

    Checksum: 9687 (0x25D7)

    SourceAddress: 10.205.100.146

    DestinationAddress: 10.205.100.200

- Tcp: Flags=..R.A..., SrcPort=2383, DstPort=20531, Len=0, Seq=0, Ack=655839664, Win=0 (scale factor 0) = 0

    SrcPort: 2383

    DstPort: 20531

    SequenceNumber: 0 (0x0)

    AcknowledgementNumber: 655839664 (0x271751B0)

  + DataOffset: 80 (0x50)

  + Flags: ..R.A...

    Window: 0 (scale factor 0) = 0

    Checksum: 65170 (0xFE92)

    UrgentPointer: 0 (0x0)

 

The TTL being decremented by two basically means this packet has been routed twice and since we know there are two routers between the servers, we can conclude the resets are really being originated at the SQL server and not by any of the routers.  Why is the SQL Server resetting these connection attempts?  The answer for this question is also the solution for the problem and it is that the port TCP 2383 is the one used by SQL server 2005 but our SQL Server is a SQL Server 2000 which has the default instance for the Analysis Services running in the port TCP 2725 instead.

We confirmed that by running a “netstat –b” at the SQL Server.  Here is the abbreviated version of the output:

TCP    svp020064:2725         SVP060005:42802        ESTABLISHED     2344

 

TCP    svp020064:2725         SVP060002:58478        ESTABLISHED     2344

 

TCP    svp020064:2725         SVP060001:2064         ESTABLISHED     2344

We now use the command “tasklist” to identify which process has the PID 2344.  The abbreviated version of the output is below:

Image Name                     PID            Session Name                   Session#           Mem Usage

============   =======  =============       ===========           ==========

msmdsrv.exe                 2344                   Console                              0           734,136 K

 

The solution is to change the client application so it connects to the right port which is the TCP 2725 instead of TCP 2383.

 

 

Real Case: IIS and Asp.Net hanging or poor performance (intermediary)

This week I'm working on this case related to IIS6 server hosting an Asp.Net application (it's a C# app) whose the performance is poor even when a very low number of users is hitting it.  The architecture is the classical web layer accessing the backend DB servers and my goal is to isolate the problem, i.e pinpoint where the problem resides and try to find out why.

The first step after understanding the environment and being aware of all relevant servers and infrastructure devices, is to actually see the problem happening to have a better understanding of what is going on.   We had one single user accessing the application and there are indeed some actions the user takes on the pages that take a long time to return, so the next step is to  establish a baseline and agree on a SLA for the performance the application is supposed to have.

Another very important relevant point to this scenario is that since the environment counts on a NLB composed by multiple IIS servers, we need first to isolate which IIS server the client will connect to when we start trying to repro the problem.  So after checking their NLB configuration and making sure they have the affinity set to single we can just open the IIS logs of the servers and search the client IP at the c-ip field.   Since the affinity is single the server which has been serving the client will continue to be the same.

Once we know which server we will work with during the tests, my favorite tool to create the baseline is the System Monitor (perfrmon.exe) and we can also make some basic analysis of the IIS logs.  So I set the perfmon with some specific counters that will help me to establish a baseline on how their app is currently behaving and also made sure they have the IIS log fields I need to work with when establishing the baseline.

The counters I used in the perfmon were the following:

 

 

ASP.NET v1.1.4322\Applications Running
ASP.NET v1.1.4322\Request Execution Time
ASP.NET v1.1.4322\Request Wait Time
ASP.NET v1.1.4322\Requests Current
ASP.NET v1.1.4322\Requests Queued
ASP.NET v1.1.4322\State Server Sessions Active
ASP.NET v1.1.4322\Worker Process Restarts
ASP.NET v1.1.4322\Worker Processes Running

 

The fields I needed to make sure are being included in the IIS logs are the following:

 

 

c-bytes

s-bytes

cs-uri-stem

cs-uri-query

time-taken

 

 

 

After that we reproduced the problem, which by the way was really happening with only one user accessing it which already eliminates any hypothesis of the problem being related to the amount of load hitting the server, we started working on the baseline.  

 

The counters in perfmon pretty much confirm what we see in the IIS logs:  Their main application page takes an average time longer than 40 seconds to execute.  The perfmon counter “ASP.NET v.1.1.4322\Request Execution Time” will report that as well as the IIS logs if you use the LogParser tool and run the following query against them:


First we need to consolidate all the IIS logs in a single CSV file:

 

 

LogParser "select date,time,cs-uri-stem,cs-uri-query,sc-status,sc-win32-status,time-taken,sc-bytes,cs-bytes from ex061024.log, ex061025.log to merged.csv" -o:CSV

 

 

Now we run the following query to get the average time-taken for the URLs in the logs and send the output to a new CSV file named AvgTimeTaken.txt

 

 

logparser "select cs-uri-stem,count(cs-uri-stem), avg(time-taken) from merged.csv to AvgTimeTaken.txt where time-taken>=10000 GROUP BY cs-uri-stem ORDER BY avg(time-taken) DESC" -o:CSV -i:CSV

 

The result is as following:

 

cs-uri-stem

COUNT(ALL cs-uri-stem)

AVG(ALL time-taken)

/App/MainPage.aspx

15

48353

 

/App/MainPage.aspx

3

37738

 

 

Ok, now that we see the problem happening and have our baseline it’s time to start working on isolating problem. 


We don’t know what exactly the user does in order to have the mainpage.aspx taking so long to execute, so what we do is try to reproduce the problem again and keep following the perfmon counter “ASP.NET v1.1.4322\Requests Current”.  This method of tracking the “Requests Current” counter will not work for all scenarios, since when you have the server under some load you will likely not be able to tell when a specific request is taking long or not because there will be many requests at the same time and they will keep changing the value of “Requests Current”.  For this cases we would need to have the collaboration of the user (which would tell us when the page is taking long to return) or look at other counters like the “Request Queued”.  However for this specific scenario where we have the variables under control and there is a single user testing the system for us, it will work just fine.

 

When the “Requests Current” shows 1 request and don’t get back to zero after, let’s say, 10 seconds, that’s the point where we know this request is taking too long to execute.  Remember that our baseline is around 40 seconds…  Now it’s right time to generate a hung dump of the worker process.  To do that we first run the iisapp.vbs script in order to identify the PID of the worker process hosting our application pool and after that we can use any tool to simply generate multiple hang dumps (like Debugging Tools For Windows or Debug Diagnostics Tool (DebugDiag)) some seconds apart from each other.  In this case I used the adplus.vbs from Debugging Tools For Windows with the following sintaxe:

 

 

Adplus.vbs –hang –p 2364 –quiet –o c:\dumps

 

 

This will generate a memory dump of the w3wp.exe whose the PID is 2364 and store it at the c:\dumps folder.

 

The dumps analysis will require just some basic .Net debugging steps (I used the Windbg from Debugging Tools For Windows to analyze them) as below:

 

This is a managed application (it’s aspx) so we want to make sure we have, besides the proper symbols, the sos.dll extension properly loaded:

 

0:000> .chain

Extension DLL chain:

    aspvbscriptext.dll: API 1.0.0, built Fri Feb 06 18:20:44 2004

        [path: C:\debuggers\aspvbscriptext.dll]

    mdacexts.dll: image 2004.05.11.1, API 1.0.0, built Tue May 11 17:46:12 2004

        [path: C:\debuggers\mdacexts.dll]

    webdbg: API 3.5.5, built Thu Nov 15 18:26:28 2001

        [path: C:\debuggers\webdbg.dll]

    sieext: API 1.0.0, built Fri Nov 05 16:07:40 2004

        [path: C:\debuggers\winext\sieext.dll]

    dbghelp: image 6.6.0005.0, API 6.0.6, built Fri Apr 14 17:03:13 2006

        [path: C:\debuggers\dbghelp.dll]

    ext: image 6.6.0005.0, API 1.0.0, built Fri Apr 14 17:02:35 2006

        [path: C:\debuggers\winext\ext.dll]

    exts: image 6.6.0005.0, API 1.0.0, built Fri Apr 14 17:02:30 2006

        [path: C:\debuggers\WINXP\exts.dll]

    uext: image 6.6.0005.0, API 1.0.0, built Fri Apr 14 17:02:37 2006

        [path: C:\debuggers\winext\uext.dll]

    ntsdexts: image 6.0.5357.0, API 1.0.0, built Fri Apr 14 17:21:31 2006

        [path: C:\debuggers\WINXP\ntsdexts.dll]

    clr10\sos: image 6.6.0005.0, API 1.0.0, built Fri Apr 14 17:02:47 2006

        [path: C:\debuggers\clr10\sos.dll]

 

 

 

Ok, so once we know we have the right extension loaded, let’s start finding which threads are being used to process the Asp.Net mainpage.aspx.  We expect to see only one thread here since our user is simply accessing this single page.  By looking at the first dump we see the following:

 

 

0:000> !aspxpages

Loaded Son of Strike data table version 5 from "C:\WINDOWS\Microsoft.NET\Framework\v1.1.4322\mscorsvr.dll"

Going to print all threads with ASP.NET Pages running.

Thread 11: /App/Mainpage.aspx

 

 

 

Ok, as we suspected there is only one thread processing the Asp.Net request and it’s the thread 11.  Now let’s look on what this thread is currently doing:

 

 

 

0:011> !clrstack
Thread 11
ESP EIP
0x0604f4dc 0x7ffe0304 [FRAME: NDirectMethodFrameStandalone] [DEFAULT] UI2 System.Data.Common.UnsafeNativeMethods/Dbnetlib.ConnectionRead(ValueClass System.Runtime.InteropServices.HandleRef,SZArray UI1,UI2,UI2,UI2,ByRef I)
0x0604f4f8 0x069cf5e4 [DEFAULT] [hasThis] Void System.Data.SqlClient.TdsParser.ReadNetlib(I4) 
ß it seems we have some SQL stuff on the stack…
0x0604f558 0x069cf531 [DEFAULT] [hasThis] Void System.Data.SqlClient.TdsParser.ReadBuffer()
0x0604f560 0x069cf463 [DEFAULT] [hasThis] UI1 System.Data.SqlClient.TdsParser.ReadByte()
0x0604f568 0x069ced7a [DEFAULT] [hasThis] Boolean System.Data.SqlClient.TdsParser.Run(ValueClass System.Data.SqlClient.RunBehavior,Class System.Data.SqlClient.SqlCommand,Class System.Data.SqlClient.SqlDataReader)
0x0604f5b0 0x06aa3c79 [DEFAULT] [hasThis] Void System.Data.SqlClient.SqlDataReader.ConsumeMetaData()
0x0604f5c4 0x06aa3c0e [DEFAULT] [hasThis] SZArray Class System.Data.SqlClient._SqlMetaData System.Data.SqlClient.SqlDataReader.get_MetaData()
0x0604f5cc 0x06aa1be7 [DEFAULT] [hasThis] Class System.Data.SqlClient.SqlDataReader System.Data.SqlClient.SqlCommand.ExecuteReader(ValueClass System.Data.CommandBehavior,ValueClass

<cut for brevity>

 

 

So, the managed stack for the thread 11 seems to be waiting on SQL connection.  The native stack would be at a ntdll!NtWaitingForSingleObject() function which basically means this thread is doing nothing but waiting.  So let’s examine a little further to see if we can identify which results are we waiting from the SQL Server:

 

 

0:011> !dso
Thread 11
ESP/REG Object Name
eax 0x10073b34 System.Collections.Hashtable
ecx 0x10073b34 System.Collections.Hashtable
0x604f4c8 0x1801dad4 System.Byte[]
0x604f4d0 0x28061840 System.Data.SqlClient.TdsParser
0x604f4d4 0x1801dad4 System.Byte[]
0x604f4f8 0x28061840 System.Data.SqlClient.TdsParser
0x604f500 0x28090704 System.Data.SqlClient.SqlDataReader
0x604f504 0x28061840 System.Data.SqlClient.TdsParser
0x604f51c 0x28061840 System.Data.SqlClient.TdsParser
0x604f54c 0x2808fd28 System.Data.SqlClient.SqlCommand 
ß this is what we’re looking for


<cut for brevity>

 

 

 

Let’s taker a closer look at this specific object:

 

 

0:011> !do 0x2808fd28
Name: System.Data.SqlClient.SqlCommand
MethodTable 0x06854dd8
EEClass 0x0657fb60
Size 64(0x40) bytes
GC Generation: 0
mdToken: 0x02000194 (c:\windows\assembly\gac\system.data\1.0.5000.0__b77a5c561934e089\system.data.dll)
FieldDesc*: 0x068545c4
MT Field Offset Type Attr Value Name
0x79b80d44 0x40000b6 0x4 CLASS instance 0x00000000 __identity
0x0649824c 0x4000502 0x8 CLASS instance 0x00000000 site
0x0649824c 0x4000503 0xc CLASS instance 0x00000000 events
0x0649824c 0x4000501 0 CLASS shared static EventDisposed
>> Domain:Value 0x00099230:NotInit 0x000dfad8:0x100144d8 <<
0x06854dd8 0x4000eac 0x10 CLASS instance 0x10032b90 cmdText
ß this string seems promising
0x06854dd8 0x4000ead 0x24 System.Int32 instance 4 cmdType
0x06854dd8 0x4000eae 0x28 System.Int32 instance 3600 _timeout
0x06854dd8 0x4000eaf 0x2c System.Int32 instance 3 _updatedRowSource
0x06854dd8 0x4000eb0 0x38 System.Boolean instance 0 _inPrepare
0x06854dd8 0x4000eb1 0x30 System.Int32 instance -1 _handle
0x06854dd8 0x4000eb2 0x14 CLASS instance 0x28090450 _parameters
0x06854dd8 0x4000eb3 0x18 CLASS instance 0x2808f624 _activeConnection
0x06854dd8 0x4000eb4 0x39 System.Boolean instance 0 _dirty
0x06854dd8 0x4000eb5 0x3a System.Byte instance 0 _execType
0x06854dd8 0x4000eb6 0x1c CLASS instance 0x00000000 _cachedMetaData
0x06854dd8 0x4000eb7 0x34 System.Int32 instance -1 _rowsAffected
0x06854dd8 0x4000eb8 0x3b System.Boolean instance 0 designTimeVisible
0x06854dd8 0x4000eb9 0x20 CLASS instance 0x00000000 _transaction

 

Looking at such an instance:

 


0:011> !do 0x10032b90
Name: System.String
MethodTable 0x79b7daf8
EEClass 0x79b7de44
Size 80(0x50) bytes
GC Generation: 2
mdToken: 0x0200000f (c:\windows\microsoft.net\framework\v1.1.4322\mscorlib.dll)
String: dbo.[spGenerateOptions]
ß Bingo!!!  This is the stored procedure we’ve been waiting on to return
FieldDesc*: 0x79b7dea8
MT Field Offset Type Attr Value Name
0x79b7daf8 0x4000013 0x4 System.Int32 instance 31 m_arrayLength
0x79b7daf8 0x4000014 0x8 System.Int32 instance 30 m_stringLength
0x79b7daf8 0x4000015 0xc System.Char instance 0x64 m_firstChar
0x79b7daf8 0x4000016 0 CLASS shared static Empty
>> Domain:Value 0x00099230:0x28000224 0x000dfad8:0x28000224 <<
0x79b7daf8 0x4000017 0x4 CLASS shared static WhitespaceChars
>> Domain:Value 0x00099230:0x28000238 0x000dfad8:0x2000364c <<

 

 

 


So we know that at the point this dump was taken we were just waiting at the stored procedure spGenerateOptions to complete it’s execution at a SQL Server.

 

We are going now to look at the subsequent dump (taken 30 seconds later) to verify whether we have any progress or not on that thread, however let’s take a loot at one last thing on this dump before starting with the other one (this can save us a lot of time here):

 

 

0:011> !runaway

 User Mode Time

  Thread       Time

  13:d5c       0 days 0:00:37.500

  14:c00       0 days 0:00:24.718

  16:9c8       0 days 0:00:16.406

  15:9e8       0 days 0:00:13.562

  11:bb8       0 days 0:00:12.015 ß This is the CPU time spent by the thread 11 at this point

 

 

 

So we know that the thread 11 have had the CPU for the aprox. 12.015 seconds when this dump was generated.  So after opening the second dump, before jumping on any deeper analysis, lets just run the same command and verify if the thread 11 made any progress.

 

 

0:011> !runaway

 User Mode Time

  Thread       Time

  13:d5c       0 days 0:00:37.500

  14:c00       0 days 0:00:24.718

  16:9c8       0 days 0:00:16.406

  15:9e8       0 days 0:00:13.562

  11:bb8       0 days 0:00:12.015 ß The time is exactly the same 30 seconds later

 

 



Ok, so there is not a single additional millisecond being spent by the thread 11 on the second dump.  At this point we don’t even need to take a look at the thread stack to know it’s exactly the same.  In other words this thread has been doing nothing but waiting on that stored procedure to execute at the SQL Server.

 

The problem has been isolated as a database performance issue (mission accomplished!!! J).  The next step is work on the database server (running in another box) to understand why this sp is taking too long to execute and, of course, properly addressing the issue to provide a definitive solution.  IIS ended up just being the server where the symptom is faced and not the one where the root cause of the problem is really happening.

More Posts Next page »
Page view tracker