In previous posts we talked about virtual address space and how virtual memory is managed. I've never posted anything about virtual address translation though, and for the ones interested on the details behing this operation I recommend reading the chapter 7 - Memory Management (specifically the Address Translation section) of the Windows Internals 4th Edition book. So I won't go too deep on the details of how the system does the translation itself, my goal here is just how to show you how easily translate from one address to another once you have a memory dump or a live debugging session.
Once you choose the virtual address you want to do the translation the first step is to run the command !pte and find out some additional information for the corresponding PTE. For this example I'm going to use the virtual address used as a base address to load the tcpip driver itself on a 32-bit Windows 2003 SP2 system:
0: kd> lmvm tcpip
start end module name
f7625000 f7686000 tcpip (deferred)
Image path: \SystemRoot\system32\DRIVERS\tcpip.sys
Image name: tcpip.sys
Timestamp: Thu Mar 24 19:40:31 2005 (42435DFF)
CheckSum: 0005606F
ImageSize: 00061000
Above highlighted in yellow we have the base address where the tcpip driver has been loaded (virtual address). Now we run the command !pte passing this address as an argument:
0: kd> !pte f7625000
VA f7625000
PDE at C0300F74 PTE at C03DD894
contains 00ACC963 contains 1C055963
pfn acc -G-DA--KWEV pfn 1c055 -G-DA—KWEV
The highlighted information above is about the PFN correlated to the physical address associated with the virtual address we passed as an input to !pte. However you noticed that 1c055 uses up to 20 bits (which are used to locate the PDE and PTE itself which will point to the physical page) and we know the completed addresses are 32 bits wide. The last 12 bits on the translation process are used as a byte index to the physical page where the content is actually stored, so they come straight from the virtual address.
The final physical address will then be 1c055 followed by the last 12 bits of the virtual address which in this case is 000. So putting them together we have that 1c055000 is the physical address correlated to the virtual address f7652000. Let’s verify that by dumping both address and comparing their content:
Dumping the virtual address first:
0: kd> dc f7625000
f7625000 00905a4d 00000003 00000004 0000ffff MZ..............
f7625010 000000b8 00000000 00000040 00000000 ........@.......
f7625020 00000000 00000000 00000000 00000000 ................
f7625030 00000000 00000000 00000000 000000e0 ................
f7625040 0eba1f0e cd09b400 4c01b821 685421cd ........!..L.!Th
f7625050 70207369 72676f72 63206d61 6f6e6e61 is program canno
f7625060 65622074 6e757220 206e6920 20534f44 t be run in DOS
f7625070 65646f6d 0a0d0d2e 00000024 00000000 mode....$.......
Now let’s dump the physical address and compare the results:
0: kd> dc /p 1c055000
1c055000 00905a4d 00000003 00000004 0000ffff MZ..............
1c055010 000000b8 00000000 00000040 00000000 ........@.......
1c055020 00000000 00000000 00000000 00000000 ................
1c055030 00000000 00000000 00000000 000000e0 ................
1c055040 0eba1f0e cd09b400 4c01b821 685421cd ........!..L.!Th
1c055050 70207369 72676f72 63206d61 6f6e6e61 is program canno
1c055060 65622074 6e757220 206e6920 20534f44 t be run in DOS
1c055070 65646f6d 0a0d0d2e 00000024 00000000 mode....$.......
The content is identical as expected J.
Even though this has been a strictly technical channel (I've never posted about anything else other than technical stuff) I know a lot of friends read it and so I decided to open an exception on this post and use it to publish something about me that also has a direct effect on how often and what type of content I will be able to keep publishing here.... Effective this week I started a new phase on my career at Microsoft assuming a non-technical role. I will be working as a people manager for the same group I had once worked before and have a great admiration.
As a manager I will need to change my focus to better serve my team and make sure they have everything they need from me to succeed. I'm still passionate about technology as I think I will be forever, but keeping up with the level of knowledge of our engineers here requires a lot of dedication which from now on will be focused on other matters other than only tecnology. I'm really excited about this new phase and have no doubt it happened at the right time for me.
Yet another kernel memory dump to be analyzed - The bugcheck this time is the 0xA - IRQL_NOT_LESS_OR_EQUAL. To better understand what this message means we would need a little background on Windows Internals but basically when executing anything at a interrupt request level (IRQL) = 2 or higher (in normal circumstances instructions get executed at IRQL = 0) we can't page fault. On the situation of this specific dump, we would crash even on IRQL = 0 since we're trying to write to the memory address 0x0 which is in user mode address space and is reserved (the first useable address in the user mode address space is 0x00010000)
Let's jump to the analysis.
Bugcheck info:
0: kd> .bugcheck
Bugcheck code 0000000A
Arguments 00000000 00000002 00000001 8087bb19
| | | |à This is the failing instruction address
| | |à 0x1 means WRITE, 0x0 means READ, so it crashed when trying to WRITE to 0x0
| |à This is the IRQL at which problem happened (IRQL=2)
|à This is the memory address where the problem happened
So it crashed basically because it failed to access a memory address (which at a lower IRQLs would result in a page fault) on the IRQL = 2. At IRQL = 2 and above page faults are not allowed to happen.
Current thread stack:
0: kd> kvnL
# ChildEBP RetAddr Args to Child
00 b7d6ebbc 8087bb19 badb0d00 00000000 00000000 nt!_KiTrap0E+0x2a7 (FPO: [0,0] TrapFrame @ b7d6ebbc)
01 b7d6ec40 80972b03 f7ca7008 80a56be4 00000000 nt!ExDeleteResourceLite+0x1f (FPO: [Non-Fpo]) (CONV: stdcall)
02 b7d6ec5c 80932ca2 ef4d0860 ef4d0848 00000000 nt!SepTokenDeleteMethod+0x9d (FPO: [Non-Fpo]) (CONV: stdcall)
03 b7d6ec74 8086c1a5 ef4d0860 00000000 8b478df0 nt!ObpRemoveObjectRoutine+0xde (FPO: [Non-Fpo]) (CONV: stdcall)
04 b7d6ec94 b7765ada f7ca3444 b7764007 f7ca8460 nt!ObfDereferenceObject+0x67 (FPO: [Non-Fpo]) (CONV: fastcall)
WARNING: Stack unwind information not available. Following frames may be wrong.
05 b7d6ecb0 b776037a f7cbf004 b7d6ece0 f82cde28 naiavf5x+0xcada
06 b7d6ecf0 b775b520 f82cde28 f82cde28 8b572548 naiavf5x+0x737a
07 b7d6ed04 8081dcdf 8b478990 f82cde28 f82cde38 naiavf5x+0x2520
08 b7d6ed18 808f8be4 00000000 00000000 f7ca8448 nt!IofCallDriver+0x45 (FPO: [Non-Fpo]) (CONV: fastcall)
09 b7d6ed50 80932ca2 00ca8460 f7ca8460 00000001 nt!IopDeleteFile+0x13a (FPO: [Non-Fpo]) (CONV: stdcall)
0a b7d6ed68 80932cec f7ca8460 00000001 8ad47770 nt!ObpRemoveObjectRoutine+0xde (FPO: [Non-Fpo]) (CONV: stdcall)
0b b7d6ed80 8087f92f 00000000 00000000 8ad47770 nt!ObpProcessRemoveObjectQueue+0x36 (FPO: [1,0,0]) (CONV: stdcall)
0c b7d6edac 80948bd0 00000000 00000000 00000000 nt!ExpWorkerThread+0xeb (FPO: [Non-Fpo]) (CONV: stdcall)
0d b7d6eddc 8088d4e2 8087f844 80000000 00000000 nt!PspSystemThreadStartup+0x2e (FPO: [Non-Fpo]) (CONV: stdcall)
0e 00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16
The current frame being executed is a post-exception frame, I mean, it's just handling the situation. The problem really happened here:
0: kd> .trap b7d6ebbc
ErrCode = 00000002
eax=00000000 ebx=ef4d0848 ecx=00000000 edx=00000000 esi=f7ca7008 edi=00000000
eip=8087bb19 esp=b7d6ec30 ebp=b7d6ec40 iopl=0 nv up ei ng nz na po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010282
nt!ExDeleteResourceLite+0x1f:
8087bb19 8901 mov dword ptr [ecx],eax ds:0023:00000000=????????
0: kd> kvnL
*** Stack trace for last set context - .thread/.cxr resets it
# ChildEBP RetAddr Args to Child
00 b7d6ec40 80972b03 f7ca7008 80a56be4 00000000 nt!ExDeleteResourceLite+0x1f (FPO: [Non-Fpo]) (CONV: stdcall)
01 b7d6ec5c 80932ca2 ef4d0860 ef4d0848 00000000 nt!SepTokenDeleteMethod+0x9d (FPO: [Non-Fpo]) (CONV: stdcall)
02 b7d6ec74 8086c1a5 ef4d0860 00000000 8b478df0 nt!ObpRemoveObjectRoutine+0xde (FPO: [Non-Fpo]) (CONV: stdcall)
03 b7d6ec94 b7765ada f7ca3444 b7764007 f7ca8460 nt!ObfDereferenceObject+0x67 (FPO: [Non-Fpo]) (CONV: fastcall)
WARNING: Stack unwind information not available. Following frames may be wrong.
04 b7d6ecb0 b776037a f7cbf004 b7d6ece0 f82cde28 naiavf5x+0xcada
05 b7d6ecf0 b775b520 f82cde28 f82cde28 8b572548 naiavf5x+0x737a
06 b7d6ed04 8081dcdf 8b478990 f82cde28 f82cde38 naiavf5x+0x2520
07 b7d6ed18 808f8be4 00000000 00000000 f7ca8448 nt!IofCallDriver+0x45 (FPO: [Non-Fpo]) (CONV: fastcall)
08 b7d6ed50 80932ca2 00ca8460 f7ca8460 00000001 nt!IopDeleteFile+0x13a (FPO: [Non-Fpo]) (CONV: stdcall)
09 b7d6ed68 80932cec f7ca8460 00000001 8ad47770 nt!ObpRemoveObjectRoutine+0xde (FPO: [Non-Fpo]) (CONV: stdcall)
0a b7d6ed80 8087f92f 00000000 00000000 8ad47770 nt!ObpProcessRemoveObjectQueue+0x36 (FPO: [1,0,0]) (CONV: stdcall)
0b b7d6edac 80948bd0 00000000 00000000 00000000 nt!ExpWorkerThread+0xeb (FPO: [Non-Fpo]) (CONV: stdcall)
0c b7d6eddc 8088d4e2 8087f844 80000000 00000000 nt!PspSystemThreadStartup+0x2e (FPO: [Non-Fpo]) (CONV: stdcall)
0d 00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16
So it's crashing when trying to write to the address pointed by ECX - mov dword ptr [ecx], eax - This is the assembly code of the function where it crashed from the start to the crash instruction :
0: kd> u nt!ExDeleteResourceLite nt!ExDeleteResourceLite+0x1f + 1
nt!ExDeleteResourceLite [d:\nt\base\ntos\ex\resource.c @ 2250]:
8087bafa 8bff mov edi,edi
8087bafc 55 push ebp
8087bafd 8bec mov ebp,esp
8087baff 83ec0c sub esp,0Ch
8087bb02 56 push esi
8087bb03 8d55f4 lea edx,[ebp-0Ch]
8087bb06 b9c0418b80 mov ecx,offset nt!ExpResourceSpinLock (808b41c0)
8087bb0b ff1508118080 call dword ptr [nt!_imp_KeAcquireInStackQueuedSpinLock (80801108)]
8087bb11 8b7508 mov esi,dword ptr [ebp+8] ßESI is obtaining its value from the content of EBP+8
8087bb14 8b4e04 mov ecx,dword ptr [esi+4] ß ECX is obtaining its value from the content of ESI+4
8087bb17 8b06 mov eax,dword ptr [esi]
8087bb19 8901 mov dword ptr [ecx],eax ß This is where it crashed
ECX is getting its value from the address pointed by ESI + 4, and ESI is getting its value from the address pointed by EBP + 8 which is a pointer to the first parameter passed from the previous function on the stack. Confirming:
0: kd> dd esi+4 l1
f7ca700c 00000000
0: kd> dd poi(ebp+8) l1
f7ca7008 00000000
Next step is look to the previous function on the stack. The assembly for the next function up to the point it calls the current one is this:
0: kd> u nt!SepTokenDeleteMethod nt!SepTokenDeleteMethod+0x9d
nt!SepTokenDeleteMethod [d:\nt\base\ntos\se\token.c @ 2652]:
80972a66 8bff mov edi,edi
80972a68 55 push ebp
80972a69 8bec mov ebp,esp
80972a6b 51 push ecx
80972a6c 51 push ecx
80972a6d 56 push esi
80972a6e 8b7508 mov esi,dword ptr [ebp+8] ß ESI is coming from the content EBP + 8 (first parameter passed from the previous function)
80972a71 f6868800000020 test byte ptr [esi+88h],20h
80972a78 57 push edi
80972a79 7538 jne nt!SepTokenDeleteMethod+0x4d (80972ab3)
80972a7b 8bbe94000000 mov edi,dword ptr [esi+94h]
80972a81 53 push ebx
80972a82 8d4f0c lea ecx,[edi+0Ch]
80972a85 eb0f jmp nt!SepTokenDeleteMethod+0x30 (80972a96)
80972a87 8d42ff lea eax,[edx-1]
80972a8a 8bd8 mov ebx,eax
80972a8c 8bc2 mov eax,edx
80972a8e f00fb119 lock cmpxchg dword ptr [ecx],ebx
80972a92 3bc2 cmp eax,edx
80972a94 741c je nt!SepTokenDeleteMethod+0x4c (80972ab2)
80972a96 8b11 mov edx,dword ptr [ecx]
80972a98 83fa01 cmp edx,1
80972a9b 75ea jne nt!SepTokenDeleteMethod+0x21 (80972a87)
80972a9d 8b4704 mov eax,dword ptr [edi+4]
80972aa0 8945f8 mov dword ptr [ebp-8],eax
80972aa3 8b4708 mov eax,dword ptr [edi+8]
80972aa6 8945fc mov dword ptr [ebp-4],eax
80972aa9 8d45f8 lea eax,[ebp-8]
80972aac 50 push eax
80972aad e890ecffff call nt!SepDeReferenceLogonSession (80971742)
80972ab2 5b pop ebx
80972ab3 8d4638 lea eax,[esi+38h]
80972ab6 8b08 mov ecx,dword ptr [eax]
80972ab8 0b4804 or ecx,dword ptr [eax+4]
80972abb 6a00 push 0
80972abd 5f pop edi
80972abe 7407 je nt!SepTokenDeleteMethod+0x61 (80972ac7)
80972ac0 57 push edi
80972ac1 50 push eax
80972ac2 e8b5140000 call nt!SepModifyTokenPolicyCounter (80973f7c)
80972ac7 8b4678 mov eax,dword ptr [esi+78h]
80972aca 3bc7 cmp eax,edi
80972acc 7407 je nt!SepTokenDeleteMethod+0x6f (80972ad5)
80972ace 57 push edi
80972acf 50 push eax
80972ad0 e86ff8f1ff call nt!ExFreePoolWithTag (80892344)
80972ad5 8b868c000000 mov eax,dword ptr [esi+8Ch]
80972adb 3bc7 cmp eax,edi
80972add 7406 je nt!SepTokenDeleteMethod+0x7f (80972ae5)
80972adf 50 push eax
80972ae0 e85766ffff call nt!SeFreeCapturedObjectTypeList (8096913c)
80972ae5 8b8690000000 mov eax,dword ptr [esi+90h]
80972aeb 3bc7 cmp eax,edi
80972aed 7407 je nt!SepTokenDeleteMethod+0x90 (80972af6)
80972aef 57 push edi
80972af0 50 push eax
80972af1 e84ef8f1ff call nt!ExFreePoolWithTag (80892344)
80972af6 8b4630 mov eax,dword ptr [esi+30h] ß EAX is coming from the content of ESI + 0x30
80972af9 3bc7 cmp eax,edi
80972afb 740f je nt!SepTokenDeleteMethod+0xa6 (80972b0c)
80972afd 50 push eax ß EAX is pushed on the stack as the first parameter
80972afe e8f78ff0ff call nt!ExDeleteResourceLite (8087bafa)
So once again we need to look at the previous function since this one is also receiving the bad value as parameter. By revisiting the stack we see the first parameter it passes is also received as below:
0: kd> kbL5
ChildEBP RetAddr Args to Child
b7d6ec40 80972b03 f7ca7008 80a56be4 00000000 nt!ExDeleteResourceLite+0x1f
b7d6ec5c 80932ca2 ef4d0860 ef4d0848 00000000 nt!SepTokenDeleteMethod+0x9d
b7d6ec74 8086c1a5 ef4d0860 00000000 8b478df0 nt!ObpRemoveObjectRoutine+0xde
b7d6ec94 b7765ada f7ca3444 b7764007 f7ca8460 nt!ObfDereferenceObject+0x67
WARNING: Stack unwind information not available. Following frames may be wrong.
b7d6ecb0 b776037a f7cbf004 b7d6ece0 f82cde28 naiavf5x+0xcada
As we can go directly to the function nt!ObfDereferenceObject+0x67 and try to see where its setting up that parameter from. The assembly code from this function to the point it calls the next one is this:
0: kd> u nt!ObfDereferenceObject nt!ObfDereferenceObject+0x67 ß This is a fastcall function so the first two parameters are passed through the registers ECX and EDX instead of using the stack (EBP as a reference)
nt!ObfDereferenceObject [d:\nt\base\ntos\ob\obref.c @ 2441]:
8086c13e 8bff mov edi,edi
8086c140 55 push ebp
8086c141 8bec mov ebp,esp
8086c143 51 push ecx
8086c144 803de0088a8000 cmp byte ptr [nt!ObpTraceEnabled (808a08e0)],0
8086c14b 53 push ebx
8086c14c 56 push esi
8086c14d 57 push edi
8086c14e 894dfc mov dword ptr [ebp-4],ecx ß The location pointed by EBP-0x4 which is local variable is receiving the value of ECX which is the first parameter passed to this function
8086c151 8d71e8 lea esi,[ecx-18h]
8086c154 7408 je nt!ObfDereferenceObject+0x20 (8086c15e)
8086c156 6a00 push 0
8086c158 56 push esi
8086c159 e8c8fcffff call nt!ObpPushStackInfo (8086be26)
8086c15e 83cbff or ebx,0FFFFFFFFh
8086c161 f00fc11e lock xadd dword ptr [esi],ebx
8086c165 4b dec ebx
8086c166 7547 jne nt!ObfDereferenceObject+0x71 (8086c1af)
8086c168 8b3d90108080 mov edi,dword ptr [nt!_imp__KeGetCurrentIrql (80801090)]
8086c16e ffd7 call edi
8086c170 64a124010000 mov eax,dword ptr fs:[00000124h]
8086c176 6683787200 cmp word ptr [eax+72h],0
8086c17b 752c jne nt!ObfDereferenceObject+0x6b (8086c1a9)
8086c17d ffd7 call edi
8086c17f 3c01 cmp al,1
8086c181 7326 jae nt!ObfDereferenceObject+0x6b (8086c1a9)
8086c183 803de0088a8000 cmp byte ptr [nt!ObpTraceEnabled (808a08e0)],0
8086c18a 740f je nt!ObfDereferenceObject+0x5d (8086c19b)
8086c18c 833dd8078a8000 cmp dword ptr [nt!ObpTraceNoDeregister (808a07d8)],0
8086c193 7506 jne nt!ObfDereferenceObject+0x5d (8086c19b)
8086c195 56 push esi
8086c196 e805fcffff call nt!ObpDeregisterObject (8086bda0)
8086c19b 6a00 push 0
8086c19d ff75fc push dword ptr [ebp-4] ß Here it is pushing the content of EBP-4 which we know is coming from ECX (first parameter) as a first parameter to the next function
8086c1a0 e81f6a0c00 call nt!ObpRemoveObjectRoutine (80932bc4) ß this is the function call to the next function
So we know the bad value is coming from the previous function on the stack which is naiavf5x+0xcada. By looking at the assembly for that we have this:
0: kd> u b7765ac1-10 b7765ad5+2
naiavf5x+0xcab1:
b7765ab1 ff1550a476b7 call dword ptr [naiavf5x+0x11450 (b776a450)]
b7765ab7 5f pop edi
b7765ab8 8bc6 mov eax,esi
b7765aba 5e pop esi
b7765abb 5b pop ebx
b7765abc 5d pop ebp
b7765abd c20800 ret 8
b7765ac0 56 push esi
b7765ac1 8bf1 mov esi,ecx ß ESI is coming from ECX
b7765ac3 8d86a8000000 lea eax,[esi+0A8h]
b7765ac9 50 push eax
b7765aca e8818affff call naiavf5x+0x5550 (b775e550)
b7765acf 85c0 test eax,eax
b7765ad1 7516 jne naiavf5x+0xcae9 (b7765ae9)
b7765ad3 8bce mov ecx,esi ß ECX which will be the first parameter to the next function is coming from ESI
b7765ad5 e8b6f7ffff call naiavf5x+0xc290 (b7765290)
CONCLUSION:
So at this point we can’t move further since we have no symbols for the NAIAVF5X.SYS driver – which is a file system filter driver from McAfee. The conclusion would be that this driver is bad until the McAfee analyze this same dump and prove us wrong J. Here is some addition info about the driver:
0: kd> lmvm naiavf5x
start end module name
b7759000 b776d440 naiavf5x (no symbols)
Loaded symbol image file: naiavf5x.sys
Image path: \SystemRoot\system32\drivers\naiavf5x.sys
Image name: naiavf5x.sys
Timestamp: Mon Aug 04 20:08:00 2003 (3F2F0370) ß By the way, this is old stuff. They might have already corrected this on newer versions of the driver and I wouldn’t be surprised to find some documentation about this on McAfee’s web site.
CheckSum: 00020D5B
ImageSize: 00014440
Translations: 0000.04b0 0000.04e0 0409.04b0 0409.04e0
By the way, for this dump, “!analyze –v” does the job and points to the right guilty driver.
Until the next dump analysis, I mean, next post :)
It’s been a long time since my last post, but for some reason lately I’ve been receiving so many nice feedbacks about the blog and the other posts that I feel really motivated again to post a new article. See how important your feedback is for me? J
Ok, normally I post about user mode debugging, but last week I was working on developing some content and I got this kernel memory dump I decided to take the risk of analyzing it, even considering Kernel debugging is not my specialty.
Normally the first step when you’re analyzing dumps resultant from Blue Screens (I will write about the differences between kernel mode dumps and user mode dumps in another post) is to run the command !analyze –v. Well, for this specific dump, !analyze –v is just not pointing to the right driver, at least I did not think so and that’s what motivated me to make a manual analysis of this specific dump. Ok, too much said and very little action here… Let’s start with some hands on!
After opening the dump with Windbg.exe, setting the correct symbol path and running the command !analyze –v, this is what we have in summary:
KMODE_EXCEPTION_NOT_HANDLED (1e)
Arguments:
Arg1: c0000005, The exception code that was not handled
Arg2: a012a6c5, The address that the exception occurred at
Arg3: 00000000, Parameter 0 of the exception
Arg4: 00000014, Parameter 1 of the exception
FAULTING_IP:
win32k!DdDxApiOpenCaptureDevice+18
a012a6c5 8b5814 mov ebx,dword ptr [eax+14h]
EXCEPTION_PARAMETER1: 00000000
EXCEPTION_PARAMETER2: 00000014
READ_ADDRESS: 00000014
DEFAULT_BUCKET_ID: DRIVER_FAULT
BUGCHECK_STR: 0x1E
FOLLOWUP_IP:
Dxapi!DxApi+43
bdae59f7 8bc6 mov eax,esi
SYMBOL_STACK_INDEX: 3
SYMBOL_NAME: Dxapi!DxApi+43
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: Dxapi
IMAGE_NAME: Dxapi.sys
DEBUG_FLR_IMAGE_TIMESTAMP: 382b8097
FAILURE_BUCKET_ID: 0x1E_Dxapi!DxApi+43
BUCKET_ID: 0x1E_Dxapi!DxApi+43
Followup: MachineOwner
So it’s telling us a lot of useful information like what was the bugcheck (1E), what kind of unhandled exception happened (Access Violation – 0xC0000005) and what was the instruction pointer (IP) when the exception happened. In this case it was address 0xa012a6c5. It’s also pointing to the driver Dxapi.sys as the probable guilty for the problem. Well, my curiosity on that was exactly because the instruction pointer was pointing to an address that doesn’t belong to Dxapi.sys. Let’s check that:
kd> lmvm Dxapi
start end module name
bdae4000 bdae6640 Dxapi (pdb symbols) c:\debuggers\sym\dxapi.pdb\382B80971\dxapi.pdb
Loaded symbol image file: c:\debuggers\sym\Dxapi.dbg\382B80972640\Dxapi.dbg
Image path: \SystemRoot\System32\DRIVERS\Dxapi.sys
Image name: Dxapi.sys
Timestamp: Fri Nov 12 00:51:03 1999 (382B8097)
CheckSum: 0000730A
ImageSize: 00002640
File version: 5.0.2180.1
Product version: 5.0.2180.1
File flags: 0 (Mask 3F)
File OS: 40004 NT Win32
File type: 3.7 Driver
File date: 00000000.00000000
Translations: 0409.04b0
CompanyName: Microsoft Corporation
ProductName: Microsoft(R) Windows (R) 2000 Operating System
InternalName: dxapi.sys
OriginalFilename: dxapi.sys
ProductVersion: 5.00.2180.1
FileVersion: 5.00.2180.1
FileDescription: DirectX API Driver
LegalCopyright: Copyright (C) Microsoft Corp. 1981-1999
By checking the base address (yellow) of the module and its end address (green) we notice that the address for the instruction pointers is not in between them. Actually the instruction point address was from Win32k.sys:
kd> lmvm win32k
start end module name
a0000000 a01a5e60 win32k (pdb symbols) c:\debuggers\sym\win32k.pdb\38E910C02\win32k.pdb
Loaded symbol image file: c:\debuggers\sym\win32k.dbg\3947E2231a5e60\win32k.dbg
Image path: \??\D:\WINNT\system32\win32k.sys
Image name: win32k.sys
Timestamp: Wed Jun 14 16:50:59 2000 (3947E223)
CheckSum: 001B02D1
ImageSize: 001A5E60
Translations: 0000.04b0 0000.04e0 0409.04b0 0409.04e0
So the address 0xa012a6c5 is indeed in between those addresses which means it was an instruction from this driver (Win32k.sys) that tried to an illegal operation. But let’s wait a little bit before blaming the poor kernel image of the Win32 Subsystem (This is basically what the Win32k.sys is). Let’s check our assembly code and try to see what really happened here…
Checking our thread’s stack:
kd> k
ChildEBP RetAddr
bd2858a0 80463b2b nt!KiDispatchException+0x30e
bd285908 80463add nt!CommonDispatchException+0x4d
bd285908 a012a6c5 nt!KiUnexpectedInterruptTail+0x1f4
bd2859b4 f0593a8e Dxapi!DxApi+0x43
WARNING: Stack unwind information not available. Following frames may be wrong.
bd285a14 f0592bb4 3dfxV3TV+0x3a8e
bd285a34 f05928cc 3dfxV3TV+0x2bb4
bd285a4c f05a0cfd 3dfxV3TV+0x28cc
00000000 00000000 STREAM!SCIssueRequestToDevice+0xad
This actually happened after the exception happened – It was dispatching the exception. The original context (call stack and registers) was actually saved on a trap frame. From the debugger help we get the information which states that the third parameter passed to the function KiDispatchException is a pointer to our trap frame. The command .trap will load the context saved on the trap.
kd> kb1
ChildEBP RetAddr Args to Child
bd2858a0 80463b2b bd2858bc 00000000 bd285910 nt!KiDispatchException+0x30e
kd> .trap bd285910
ErrCode = 00000000
eax=00000000 ebx=84f0ba10 ecx=a012a6ad edx=00000020 esi=bd2859dc edi=80d724a8
eip=a012a6c5 esp=bd285984 ebp=bd2859a0 iopl=0 nv up ei pl zr na pe nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010246
win32k!DdDxApiOpenCaptureDevice+0x18:
a012a6c5 8b5814 mov ebx,dword ptr [eax+14h] ds:0023:00000014=????????
Ok, now we have the right context where the exception happened. Notice that our current EIP on this context matches the one pointed by the !analyze –v as the one where the exception happened (in red above). We basically crashed because we’re trying to move the content pointed by EAX+0x14 to EBX. Since EAX = 0 (green above) we failed when trying to read from the invalid address 0x00000014 (which is the resultant address from 0x0 + 0x14).
Let’s look at the assembly code that execute immediately before this a012a6c5.
kd> ub a012a6c8
win32k!DdDxApiOpenCaptureDevice+0x7:
a012a6b4 56 push esi
a012a6b5 8b7508 mov esi,dword ptr [ebp+8]
a012a6b8 57 push edi
a012a6b9 8365fc00 and dword ptr [ebp-4],0
a012a6bd 8b06 mov eax,dword ptr [esi]
a012a6bf 8b7814 mov edi,dword ptr [eax+14h]
a012a6c2 8b4604 mov eax,dword ptr [esi+4]
a012a6c5 8b5814 mov ebx,dword ptr [eax+14h] ß This is where we crashed
So we need to find out on the assembly block above where EAX gets set to 0x0.
kd> ub a012a6c8
win32k!DdDxApiOpenCaptureDevice+0x7:
a012a6b4 56 push esi
a012a6b5 8b7508 mov esi,dword ptr [ebp+8]
a012a6b8 57 push edi
a012a6b9 8365fc00 and dword ptr [ebp-4],0
a012a6bd 8b06 mov eax,dword ptr [esi]
a012a6bf 8b7814 mov edi,dword ptr [eax+14h]
a012a6c2 8b4604 mov eax,dword ptr [esi+4] ß This is where EAX gets updated for the last time before the crash
a012a6c5 8b5814 mov ebx,dword ptr [eax+14h]
So EAX is actually receiving its value from the content of ESI+4. Following the same logic we need to find out where ESI is getting updated for the last time before being involved on the instruction with EAX.
kd> ub a012a6c8
win32k!DdDxApiOpenCaptureDevice+0x7:
a012a6b4 56 push esi
a012a6b5 8b7508 mov esi,dword ptr [ebp+8] ß This where ESI gets updated for the last time before the instruction at a012a6c2
a012a6b8 57 push edi
a012a6b9 8365fc00 and dword ptr [ebp-4],0
a012a6bd 8b06 mov eax,dword ptr [esi]
a012a6bf 8b7814 mov edi,dword ptr [eax+14h]
a012a6c2 8b4604 mov eax,dword ptr [esi+4]
a012a6c5 8b5814 mov ebx,dword ptr [eax+14h]
So ESI is getting its value from the content of EBP+8. EBP+8 happens to be a fixed position on the stack used as a reference for the first parameter passed to the function (considering the calling convention used is not fast call). EBP+8 being the first parameter, EBP+0xC the second, EBP+0x10 the third and so on… There are other pre-defined positions on the stack you should be aware of like EBP+4 is where the return address gets saved or EBP-4 for the first local variable etc, but this is subject for another post…
Let’s confirm our theory:
kd> dd poi(ebp+8)+4 l1
bd2859e0 00000000
kd> dd ebp+8 l1
bd2859a8 bd2859dc
Ok, so this is really where the 0x0 on EAX is coming from. At this point we exclude Win32K.sys from our list of suspects after all it’s just receiving an invalid parameter being passed from the previous function on the stack:
kd> kb3
ChildEBP RetAddr Args to Child
bd2859a0 bdae59f7 bd2859dc bd285a0c 00028bb1 win32k!DdDxApiOpenCaptureDevice+0x18
bd2859b4 f0593a8e 00000513 bd2859dc 00000020 Dxapi!DxApi+0x43
bd285a14 f0592bb4 81c83228 84f0ba10 822efac8 3dfxV3TV+0x3a8e
As we can see from the stack above, the first parameter passed to the Win32k!DdDxApiOpenCaptureDevice is really the one with the corrupted value. The parameter is being passed from Dxapi!dDxApi.
Now take at that stack again… Pay attention to the details on the stack.. Can you give me a reason for excluding DxApi.sys from our suspect list also? No? Ok, let’s do it on the funniest way then J, shall we?
Now we need to look at the assembly code of Dxapi!Dxapi. Here is the entire assembly block for such function:
kd> uf Dxapi!DxApi
Dxapi!DxApi:
bdae59b4 55 push ebp
bdae59b5 8bec mov ebp,esp
bdae59b7 8b4508 mov eax,dword ptr [ebp+8]
bdae59ba 56 push esi
bdae59bb 2d00050000 sub eax,500h
bdae59c0 33f6 xor esi,esi
bdae59c2 83f816 cmp eax,16h
bdae59c5 7330 jae Dxapi!DxApi+0x43 (bdae59f7)
Dxapi!DxApi+0x13:
bdae59c7 8d0440 lea eax,[eax+eax*2]
bdae59ca c1e002 shl eax,2
bdae59cd 8b88005baebd mov ecx,dword ptr Dxapi!gDxApiEntryPoint (bdae5b00)[eax]
Here we’re loading ECX whith the pointer to an external function – In this case this is the function Win32k!DdDxApiOpenCaptureDevice
bdae59d3 85c9 test ecx,ecx
bdae59d5 7420 je Dxapi!DxApi+0x43 (bdae59f7)
Dxapi!DxApi+0x23:
bdae59d7 8b5510 mov edx,dword ptr [ebp+10h]
bdae59da 3b90045baebd cmp edx,dword ptr Dxapi!gDxApiEntryPoint+0x4 (bdae5b04)[eax]
bdae59e0 7215 jb Dxapi!DxApi+0x43 (bdae59f7)
Dxapi!DxApi+0x2e:
bdae59e2 8b80085baebd mov eax,dword ptr Dxapi!gDxApiEntryPoint+0x8 (bdae5b08)[eax]
bdae59e8 394518 cmp dword ptr [ebp+18h],eax
bdae59eb 720a jb Dxapi!DxApi+0x43 (bdae59f7)
Dxapi!DxApi+0x39:
bdae59ed ff7514 push dword ptr [ebp+14h]
bdae59f0 8bf0 mov esi,eax
bdae59f2 ff750c push dword ptr [ebp+0Ch]
bdae59f5 ffd1 call ecx ß Here is where the function effectively gets called
Dxapi!DxApi+0x43:
bdae59f7 8bc6 mov eax,esi ß This is our return address as you check on the stack
bdae59f9 5e pop esi
bdae59fa 5d pop ebp
bdae59fb c21400 ret 14h
Look at this specific part of the block:
Dxapi!DxApi+0x39:
bdae59ed ff7514 push dword ptr [ebp+14h]
bdae59f0 8bf0 mov esi,eax
bdae59f2 ff750c push dword ptr [ebp+0Ch] ß This is the first parameter being pushed on the stack. This will be read by the next function at the EBP+8 relative position.
bdae59f5 ffd1 call ecx ß Here is where the function effectively gets called
So our bad parameter comes from EBP+0Ch on this specific function. EBP+0Ch happens to be the location for the second parameter (the parameters, considering we’re using the calling convention STANDARD or CDECL – FAST_CALL will pass the two first parameters through the registers ECX and EDX instead) passed to this function from the previous one (EBP+0x8 is relative location where the first parameter can be retrieved, EBP+0xC is basically 4 bytes beyond and is the second parameter, the same way EBP+0x10 will store the third parameter and so on).
So our next step will be revisiting the stack and investigate the previous function.
kd> kb3
ChildEBP RetAddr Args to Child
bd2859a0 bdae59f7 bd2859dc bd285a0c 00028bb1 win32k!DdDxApiOpenCaptureDevice+0x18
bd2859b4 f0593a8e 00000513 bd2859dc 00000020 Dxapi!DxApi+0x43
bd285a14 f0592bb4 81c83228 84f0ba10 822efac8 3dfxV3TV+0x3a8e
So the module 3dfxV3TV.sys is the one whose an internal function is passing the bad parameter which eventually caused our crash. This is a third party module which means we will have no symbols for it. In this case it’s time to look at the assembly code once again J
This is the assembly code for the function in question:
kd> ub f0593a8e l10
3dfxV3TV+0x3a5f:
f0593a5f 2bc7 sub eax,edi
f0593a61 7902 jns 3dfxV3TV+0x3a65 (f0593a65)
f0593a63 f7d8 neg eax
f0593a65 3bc8 cmp ecx,eax
f0593a67 7e06 jle 3dfxV3TV+0x3a6f (f0593a6f)
f0593a69 8993c4000000 mov dword ptr [ebx+0C4h],edx
f0593a6f 8b83c4000000 mov eax,dword ptr [ebx+0C4h]
f0593a75 8945d8 mov dword ptr [ebp-28h],eax
f0593a78 8d45f8 lea eax,[ebp-8]
f0593a7b 6a08 push 8
f0593a7d 50 push eax
f0593a7e 8d45c8 lea eax,[ebp-38h]
f0593a81 6a20 push 20h ß Third parameter being pushed (this will be read by the next function at the EBP+0x10 relative position)
f0593a83 50 push eax ß Second parameter being pushed (this will be read by the next function at the EBP+0xC relative position) This is the one we’re interested in.
f0593a84 6813050000 push 513h ß First parameter being pushed (this will be read by the next function at the EBP+0x8 relative position)
f0593a89 e874750000 call 3dfxV3TV!SetDecoderSize+0x32ec (f059b002) ß This is where the next function on the stack gets called
Ok, so we identified on the assembly block what would be the second parameter passed to the next function. EAX was the register storing its value, so the next step is find out where EAX is being updated for the last time before being used to push the parameter on the stack. Giving a second look at the same assembly block, now focused on any instruction involving EAX we see this:
kd> ub f0593a8e l10
3dfxV3TV+0x3a5f:
f0593a5f 2bc7 sub eax,edi
f0593a61 7902 jns 3dfxV3TV+0x3a65 (f0593a65)
f0593a63 f7d8 neg eax
f0593a65 3bc8 cmp ecx,eax
f0593a67 7e06 jle 3dfxV3TV+0x3a6f (f0593a6f)
f0593a69 8993c4000000 mov dword ptr [ebx+0C4h],edx
f0593a6f 8b83c4000000 mov eax,dword ptr [ebx+0C4h]
f0593a75 8945d8 mov dword ptr [ebp-28h],eax
f0593a78 8d45f8 lea eax,[ebp-8]
f0593a7b 6a08 push 8
f0593a7d 50 push eax
f0593a7e 8d45c8 lea eax,[ebp-38h]
f0593a81 6a20 push 20h
f0593a83 50 push eax
f0593a84 6813050000 push 513h
f0593a89 e874750000 call 3dfxV3TV!SetDecoderSize+0x32ec (f059b002)
The instruction above (lea eax, [ebp-38h]) is basically copying the content of the relative position EBP – 38h to EAX. So our bad value is whatever EBP-0x38 points to. Well, I told you EBP+0x8 is the reference for what the first parameter is being passed and then the following parameters would be 4 bytes apart… EBP is also a reference for some other important things also stored on the stack (as long as we’re not FPO optimized – again this is subject for another post) like EBP+0x4 is always the location where the return address will be stored and EBP-0x4 is where the first local variable for the function will be stored.
EBP-0x38 is for sure a location of a local variable used by the function (EBP-0x4 is the first local variable. From this point on, the location of other variables will actually depend on the variable type since this will determine how much space we need to reserve on the stack to store such values). Being a local variable on the function, there isn’t much we can do without symbols and source code, but we thing is for sure: Different from what “!analyze –v” told us the problem is actually being caused by the driver 3DFXV3TV.SYS and the solution or part of it would be either remove or update such driver J.
Until the next post J
Normally the answer for this would be as simple as use the file image execution options (through direct editing the registry or using gflags.exe from Debugging Tools For Windows) and set the “debugger” option to you preferred debugger tool and this would do the job. At least this is the way we do for troubleshooting services startup in general. There is one detail though which is the fact the services are likely loaded in a different winstation (they would have not interactivity with the desktop) and so will be loaded your debugger meaning that even though the debugger will run as it should you will not have access to it.
Well, to fix this all you need to is to go to the services mmc, navigate to the specific service’s properties, make it use the Local System account and check the checkbox that says “allow interact with the desktop” and this would fix your problem. Right, but on thing is not ok here: Worker Processes are spawned to run in background as well as services however they are not exactly services and so there is no option to “allow interact with desktop” which means your debugger will still be loaded in the background preventing you to access it from the desktop.
The solution? There is more than one for sure, but I found this one interesting. Let’s see how it goes:
Both debuggers cdb.exe and windbg.exe from Debugging Tools For Windows are able to create TCP sockets (or named pipes connection if you prefer) and listen to them so other debuggers would be able to connect to it and share the same debugging session. So I thought we could use this socket feature to create a tunnel between the winstations where our w3wp.exe and the debugger will be originally located and the desktop where we would have second debugger running (this one in foreground).
So, first things first…
Step 1: Set the file image execution options to have your first debugger attached to the worker process when it get spawned.
To do that you need to use the regedit and create the following registry key and values:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\w3wp.exe
SZ Debugger = “c:\debuggers\cdb.exe –cf c:\debuggers\cmds.txt”
This is supposing you have the Debugging Tools For Windows installed in the c:\debuggers folder.
Step 2: Automate the debugger to open a socket for future usage
Notice the “-cf” option in the cdb.exe command line above. This will tell the cdb.exe to read some initial commands from a specified file to be executed as soon as the debugger get’s attached to the process. So we need to create a text file called cmds.txt located in the c:\debuggers folder. The content of the file will be the following:
.logopen c:\debuggers\debugging.log
.server tcp:port=9999
g
Save the file. See the comments below on what each command above does from the debugger:
.logopen – opens a log file where everything happening from this debugging session will be properly logged
.server tcp:port=9999 - listen to a TCP socket on the port 9999
g – go J
Step 3: Start the application pool and make at least one request.
When the w3wp.exe process gets created you should be able to see through the task manager that now you also have a cdb.exe process running but you can’t directly access it.
Step 4: Run your second debugger and connect it to the socket.
Run the windbg.exe from Debugging Tools For Windows passing the parameter –remote from the command line like below:
Windbg.exe -remote tcp:Port=9001,Server=MYSERVER
The “MYSERVER” above should be replaced by the server’s name.
Voila!!! The windbg.exe will connect to the socket created by the cdb.exe and you will now be able to live debug your w3wp.exe from the scratch!
Have fun!
I was teaching a workshop in London last week and one of the students brought a very interesting issue he had been working on for some time and based on the dump analysis he was not able to point to anything different from our own code (Microsoft’s modules) as the ones causing the crash. He had done some research and found out that some crashes when recycling the app pool have been reported around the internet due to a PHP ISAPI problem. He had such ISAPI loaded in the crashing app pool so he suspected from that but the dumps still point to our code…
Direct to point:
By looking at the dumps generated the [DebugDiag] we could see that the crash is a typical Access Violation (0xC0000005) as below:
The AV details:
0:000> .exr -1
ExceptionAddress: 5a3b01b4 (w3core!W3_SERVER::Terminate+0x00000164)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 00000000
Parameter[1]: 608fc030
Attempt to read from address 608fc030
The thread stack when the crash happened was something like this:
0:000> kL
ChildEBP RetAddr
0006fcb4 5a3bc347 w3core!W3_SERVER::Terminate+0x164
0006ff0c 0100187c w3core!UlW3Start+0x280
0006ff44 01001a27 w3wp!wmain+0x22a
0006ffc0 77e6f23b w3wp!wmainCRTStartup+0x12f
0006fff0 00000000 kernel32!BaseProcessStart+0x23
So as we can see, the w3core.dll was in the top of the stack and this would lead us to conclude this is the culprit module crashing the process. But let’s hold our conclusions for a moment… All the w3core.dll is trying to do is to read from an address that is actually part of another module (the ADMPROX.DLL) that is currently unloaded and that’s why we’re crashing here.
So the W3_SERVER::Terminate is trying to access an invalid address. The address being referenced is somewhere in ADMWPROX.DLL as below:
0:000> ln 608fc030
(608fc030) <Unloaded_ADMWPROX.DLL>+0xc030
This module is already unloaded as below and that's why we're crashing.
Unloaded modules:
608f0000 60901000 ADMWPROX.DLL
78130000 781cb000 MSVCR80.dll
60060000 60066000 aspnet_filter.dll
10000000 1004c000 ISAPI_Rewrite.dll
5a320000 5a332000 w3isapi.dll
022e0000 022f7000 odbcint.dll
77530000 775c7000 COMCTL32.dll
7c8d0000 7d0ce000 SHELL32.dll
762b0000 762f9000 comdlg32.dll
48890000 488cd000 ODBC32.dll
01c90000 02167000 php5ts.dll
76cf0000 76d0a000 iphlpapi.dll
03670000 03726000 ZendOptimizer.dll
02180000 02185000 ZendExtensionManager.dll
02440000 02609000 php_mbstring.dll
02610000 02621000 php_bz2.dll
The ADMPROX.DLL is also our code. So the question is why are we holding a reference for something already unloaded? We noticed that indeed there were also some PHP modules loaded (at the time the dump was generated they were unloaded as well). One possible reason would be somebody COM initializing on a not owned thread (I've seen this before in a case where the IIS mmc got locked because a reference being hold by an unloaded module that called COM initialize in one of the IIS threads instead of its own) - More details about that can be obtained in the following article:
http://support.microsoft.com/kb/289112/en-us
Next step is to live debug and set a bp in the CoInitializeEx (http://msdn2.microsoft.com/en-us/library/ms695279.aspx) and find out which module would be doing that (if this is the case – this is a theory and we are about to test it). We created a new app pool and website and pointed the web site to the same content folder as the one crashing and started working with this worker process instead. We had some hard time using file image execution options and tyring to get a debugger attached when the worker process gets spawned. We workaround the situation by automating the cdb to attach and open a socket so we would be able to connect to the socket via Windbg (I will post something specifically about this later on).
After attaching the windbg and setting the proper bp as below, we caught who else is COM intializing as below:
Setting the bp at the Ole32::CoInitializeEx
[116714-SERVER1\rack (tcp 192.168.1.202:1083)] 0:000> bp 776bc826 ".time;k100;g"
[116714-SERVER1\rack (tcp 192.168.1.202:1083)] 0:000> bl
0 e 776bc826 0001 (0001) 0:**** ole32!CoInitializeEx ".time;k100;g"
Bp got hit! This is php5ts.dll as below:
Debug session time: Thu Aug 2 19:50:03.357 2007 (GMT+1)
System Uptime: 8 days 14:47:04.585
Process Uptime: 0 days 0:01:08.125
Kernel time: 0 days 0:00:00.046
User time: 0 days 0:00:00.093
ChildEBP RetAddr
00fbe744 776a0d1f ole32!CoInitializeEx
00fbe754 01c186f8 ole32!CoInitialize+0xf
00fbe7b4 10001ec8 php5ts!php_request_startup+0x18
00fbecd4 5a3968ff w3isapi!ProcessIsapiRequest+0x100
00fbed08 5a3967e0 w3core+0x68ff
00fbed28 5a396764 w3core+0x67e0
00fbed48 5a3966f4 w3core+0x6764
This is not a PHP thread, so they were right when suspecting from PHP modules. You can’t call CoInitializeEx in a thread not owned by you. Next step is contact PHP vendor and let them know about the problem.
Later the customer informed us they were able to workaround the problem by simple downgrading the version of the PHP modules. This seems to happen only in the new PHP ISAPI versions.
Anyway it was a very good sample to ilustrate some debugging during the training.
This is a question we hear very often from our customers, in forums or through distribution lists. The reason we keep hearing the same question over the time is very simple: There isn’t a definitive answer for that – At least not for IIS since it’s not exactly driven by the number of users hitting it. It would make more sense to monitor the number of connections and requests being made through these than simply working around a number of users. Why I think this way? This is a subject for another post J. In this post I will just try to give you another (hopefully a more precise one) answer for “how do I know how many users are accessing my web site or application hosted in a IIS server?”
Direct to the point: I’ve been seeing really good people making a very common mistake when trying to answer this question – The most popular answer is: Go to the IIS logs and count how many different client IP you have logged during a specific period of time. It’s a common mistake to think that client IPs are directly related to the number of users hitting the server. Why? Really simple, just think that you can have an entire organization, like here at Microsoft, accessing the web through proxy servers or entire IP intranet classes (192.168.0.0, 10.0.0.0, 172.16.0.0) behind routers providing NAT services. This client machines (some times thousands of them) will show up in your IIS log as a single client IP address.
So, it’s very nice to use LogParser to generate some pretty neat charts showing how many different IPs have been hitting your server based on the c-ip field in the IIS logs, however they will be nice charts with useless information.
The answer: Well, fortunately there is another field called cs(cookie) that will make IIS log the cookies sent back and forth. What does this have to do with the number of users? Unless you have the session state feature completely disabled for you classic ASP applications, they will use cookies to store the information for that. For instance, sessions being created by classic ASP will generate cookies always initiated by the string “ASPSESSION” followed bv a unique ID number. Can you already feel where we going to? Exactly, all I’m proposing here is replace the c-ip by the cs(cookie) in your LogParser queries.
Since every new browser session will generate a new cookie regardless of which IP the client is using, it makes much more sense to monitor it instead. For instance the query below will tell you how many different classic ASP sessions were used during 06:00pm (GTM) and 07:00pm (GTM) of the day 7/20/2007.
logparser "select count(distinct cs(cookie)) as sessions_total from ex070720.txt where cs(cookie) like 'ASPSESSION%' and date = '2007-07-20' and time > '18:00:00' and time < '19:00:00'
Now your nice charts can be both nice and useful J.
I'm posting this as I couldn't find this solution documented anywhere (including within MSFT through the discussions around logparser).
Problem: After I installed the Office 2007, my logparser queries which use the parameter "ChartType" no longer work returning the following error:
Error: invalid parameter "chartType"
The solution? Logparser depends on the Office Web Components to create the charts. Just intall them (I just reinstalled the OWC for Office 2003) and everything will work again.
In the post about troubleshooting Asp.Net poor performance I showed you how to identify one possible cause for high response times when processing Asp.Net pages – the cause we explored on that post was a database server taking too long to respond. We were able to verify all the information about the SQL side of the scenario through analyzing the Asp.Net thread stack and using the SOS.dll specific commands. So the question now is: What about when I’m processing a Classic Active Server Page (ASP) page instead of a Asp.Net one?
Well in this case, as we already know, the SOS.dll won’t help, so let’s see how to do it manually.
First step is to identify the thread and look at it call stack as below:
<
0:016> kb10
ChildEBP RetAddr Args to Child
00f2e1dc 7c822124 77e6baa8 000006b8 00000000 ntdll!KiFastSystemCallRet
00f2e1e0 77e6baa8 000006b8 00000000 00f2e224 ntdll!NtWaitForSingleObject+0xc
00f2e250 77e6ba12 000006b8 00004e20 00000000 kernel32!WaitForSingleObjectEx+0xac
00f2e264 74cd2e3f 000006b8 00004e20 00004e20 kernel32!WaitForSingleObject+0x12
00f2e290 6d56db96 00c1d2f8 01ab4270 00000009 DBmsLPCn!ConnectionRead+0xaf
00f2e2b0 6d56871b 01ab6290 01ab4270 00000009 DBNETLIB!WrapperRead+0x2c
00f2e308 4e252d91 01ab6290 01ab4270 01ab4270 DBNETLIB!ConnectionRead+0x519
00f2e33c 4e252df0 01ab6290 01ab4270 00000009 SQLOLEDB!CDataSource::ConnectionRead+0x35
00f2e388 4e2522b6 01ab403e 00000001 00000000 SQLOLEDB!CDBConnection::GetBytes+0x269
00f2e3d4 4e259f63 015c4e08 00000088 0000001e SQLOLEDB!CDBConnection::ProcessTDSStream+0x157
00f2e490 4e25a030 015c5a00 00000014 015c4970 SQLOLEDB!CStmt::ExecDirect+0x786
00f2e4a8 4e2596d1 015c5a00 00000014 00000000 SQLOLEDB!CStmt::SQLExecDirect+0x28
00f2e4d8 4e259b1c 00000000 4e2590b4 00000014 SQLOLEDB!CCommand::ExecuteHelper+0x157
00f2e55c 4e2595ea 015c5a68 00000000 4bbe9658 SQLOLEDB!CCommand::Execute+0x76b
00f2e594 4bbe952e 015c4d60 00000000 4bbe9658 SQLOLEDB!CImpICommandText::Execute+0xdd
00f2e5d4 4bc0c9db 015c12b0 015c4a04 015c1638 msado15!CConnection::Execute+0x9d
00f2e7a4 4bbe9388 015c5660 00000003 015c48a8 msado15!_ExecuteAsync+0x19f
00f2e7b8 4bbe90e0 015c5660 00f2e988 00000000 msado15!ExecuteAsync+0x23
00f2e8a0 4bbe9303 00000003 015c1a00 00000000 msado15!CQuery::Execute+0xa5e
00f2e90c 4bbe6128 015c48a8 00000003 015c1a00 msado15!CCommand::_Execute+0x153
00f2e9a0 4bbe6400 ffffffff 015c1a00 00f2eb34 msado15!CRecordset::_Open+0x30f
00f2eb18 11004397 005c1638 00000008 00000000 msado15!CRecordset::Open+0x5dc
So, that’s a classic OLEDB thread stack. To obtain the SQL command being send to the SQL server identify the function call SQLOLEDB!CStmt::SQLExecDirect above. Grab the first parameter and run the following command:
0:016> du 015c5a00
015c5a00 "SELECT * FROM TITLES"
Of course you still want to know which SQL server this thread holds a connection with. So how about checking the connection string? Just identify the the function call msado15!CConnection::Execute, grab the first parameter and execute the following:
0:016> du poi(015c12b0+64)
015c1aa8 "Provider=SQLOLEDB.1;Integrated S"
015c1ae8 "ecurity=SSPI;Persist Security In"
015c1b28 "fo=False;Initial Catalog=pubs;Da"
015c1b68 "ta Source=(local)"
So now you can call the DBA in charge of the server “local” and ask him to collaborate with you on this low performance problem your ASP app has been unfairly blamed about J
Have fun!
I’ve just published a post on how to manually identify which classic ASP page is being processed by a specific thread. Following the same idea you can also identify COM calls being made by a thread. As always, first identify the thread doing the COM call you want investigate and then look at the thread stack as below:
<0:032>
0:032> kb100
ChildEBP RetAddr Args to Child
0235e93c 7c822114 77e6711b 00000002 0235e98c ntdll!KiFastSystemCallRet
0235e940 77e6711b 00000002 0235e98c 00000001 ntdll!NtWaitForMultipleObjects+0xc
0235e9e8 7739cd08 00000002 0235ea10 00000000 kernel32!WaitForMultipleObjectsEx+0x11a
0235ea44 77697483 00000001 0235ea8c 000003e8 USER32!RealMsgWaitForMultipleObjectsEx+0x141
0235ea6c 776974f2 0235ea8c 000003e8 0235ea9c ole32!CCliModalLoop::BlockFn+0x8
0235ea94 7778866b ffffffff 0235eb94 000e513c ole32!ModalLoop+0x5b
0235eab0 77788011 00000000 00000000 00000000 ole32!ThreadSendReceive+0xa0
0235eacc 77787ed7 0235eb94 0011825c 0235ebf0 le32!CRpcChannelBuffer::SwitchAptAndDispatchCall+0x112
0235ebac 776975b8 0011825c 0235ecc0 0235ecb0 ole32!CRpcChannelBuffer::SendReceive2+0xc1
0235ebc8 7769756a 0235ecc0 0235ecb0 0011825c ole32!CCliModalLoop::SendReceive+0x1e
0235ec34 776c4eee 0011825c 0235ecc0 0235ecb0 ole32!CAptRpcChnl::SendReceive+0x6f
0235ec88 77ce127e 00000001 0235ecc0 0235ecb0 ole32!CCtxComChnl::SendReceive+0x91
0235eca4 77ce13ca 0011750c 0235ecec 00000000 RPCRT4!NdrProxySendReceive+0x43
0235f08c 77d0c947 77d12028 77d150c2 0235f0ac RPCRT4!NdrClientCall2+0x206
0235f0a4 77d0c911 0011750c 60030001 73464c18 OLEAUT32!IDispatch_RemoteInvoke_Proxy+0x1c
0235f364 73469f71 0011750c 60030001 73464c18 OLEAUT32!IDispatch_Invoke_Proxy+0xb6
0235f3b8 73468f37 01ef0100 0011750c 60030001 vbscript!CatchIDispatchInvoke+0x46
0235f3f8 73468ea6 01ef0288 0011750c 60030001 vbscript!IDispatchInvoke2+0xaf
0235f434 73469000 01ef0288 0011750c 60030001 vbscript!IDispatchInvoke+0x59
0235f548 73467bb6 01ef0288 00000000 60030001 vbscript!InvokeDispatch+0x13a
0235f56c 73467cad 01ef0288 0011750c 60030001 vbscript!InvokeByName+0x42
0235f848 73464940 00000000 00000000 01ef0288 vbscript!CScriptRuntime::Run+0x1331
0235f940 73464cd2 00000000 00000000 00000000 vbscript!CScriptEntryPoint::Call+0x5c
0235f998 73465522 01ef0768 00000000 00000000 vbscript!CSession::Execute+0xb4
0235f9e8 7346189b 00000000 00000000 709e19b4 vbscript!COleScript::ExecutePendingScripts+0x13e
0235fa04 709e2f5a 01612040 01612040 01112700 vbscript!COleScript::SetScriptState+0x150
0235fa30 709e2f1a 00000000 709e19b4 0235fb38 asp!CActiveScriptEngine::TryCall+0x19
0235fa6c 709e2e50 00000000 647246fe 014123e8 asp!CActiveScriptEngine::Call+0x31
0235fa88 709e2d54 0235fb0c 00000000 00000000 asp!CallScriptFunctionOfEngine+0x5b
0235fadc 709e2c7f 01911e90 00000000 0235fb68 asp!ExecuteRequest+0x17e
0235fb44 709e2a4d 01911e90 014123e8 0235fb68 asp!Execute+0x249
0235fb98 709e271a 00000000 00000000 00114c28 asp!CHitObj::ViperAsyncCallback+0x3e8
0235fbb4 75bd748e 01493048 00097be8 0235fd74 asp!CViperAsyncRequest::OnCall+0x92
0235fbd0 7770f153 00114c28 000a4178 00000000 COMSVCS!CSTAActivityWork::STAActivityWorkHelper+0x32
0235fc1c 7770fba0 00000000 000a4178 75bd745c ole32!EnterForCallback+0xc4
0235fd7c 777100aa 0235fc54 75bd745c 00114c28 ole32!SwitchForCallback+0x1a3
0235fda8 7769408c 000a4178 75bd745c 00114c28 ole32!PerformCallback+0x54
0235fe40 77712865 00097be8 75bd745c 00114c28 ole32!CObjectContext::InternalContextCallback+0x159
0235fe60 75bd7831 00097be8 75bd745c 00114c28 ole32!CObjectContext::DoCallback+0x1c
0235fecc 75bd7b95 00114e18 00114df8 001147c4 COMSVCS!CSTAActivityWork::DoWork+0x12d
0235fee4 75bd852e 00114c28 00000001 00114df8 COMSVCS!CSTAThread::DoWork+0x18
0235ff04 75bd897a 00000000 01eefcb0 00039608 COMSVCS!CSTAThread::ProcessQueueWork+0x37
0235ff84 77bcb530 00114df8 00000000 00000000 COMSVCS!CSTAThread::WorkerLoop+0x17c
0235ffb8 77e66063 00039608 00000000 00000000 msvcrt!_endthreadex+0xa3
0235ffec 00000000 77bcb4bc 00039608 00000000 kernel32!BaseThreadStart+0x34
Now identify the function call ole32!CRpcChannelBuffer::SendReceive2 above. Grab the first parameter and run the following command:
0:032> dd poi(0011825c +18)+8 l2
000de8d0 00000d8c 00000ddc
The output above represents the process ID (in this case the hex number 0xd8c) of the server process for this COM call and the thread ID (in this case the hex number 0xddc).
Notice that the output values are in hex format and your PID (the one you identify by using task manager, tlist, tasklist etc) info will probably be in decimal format so you still need to make the convertion:
In this case the PID of the server process being called in this thread would be 3462 = 00000d86.
I’ve been teaching one of our workshops offerings (Workshop Plus: IIS6 Critical Problem Management) and when we go through some debugging samples, one thing I like to demo is how to find which Asp.Net pages are being processed in which threads. Well, the SOS.dll extension supports the “!aspxpages” command that will make your life really easier in a sense that all you need to do is run the command and it will return a list of the threads and their respective IDs as well as which Asp.Net page is being processed by such threads (aspx, asmx).
And then the students try to use the same command to identify classic Active Server Pages (.asp). This will not work since the commands implemented by the SOS.dll are specifically for managed code (.Net) and ASP pages will be processed by threads running native code. Well, the easiest way to accomplish that would be just feedback the memory dump to the Debug Diagnostics Tool (DebugDiag) and run the hang/crash analysis scripts against it. But how about when you don’t have Debug Diagnostics Tool (DebugDiag) or any other extension available for some reason to do that for you? Can you still do that manually through the Windbg from Debugging Tools For Windows? The answer is YES! And you don’t even need symbols for that ;-).
Here is what you can do to find out what ASP page is being processed in a specific thread:
First verify the thread call stack:
0:032> kb
ChildEBP RetAddr
0235fa88 709e2d54 0235fb0c 00000000 00000000 asp!CallScriptFunctionOfEngine+0x5b
0235fadc 709e2c7f 01911e90 00000000 0235fb68 asp!ExecuteRequest+0x17e
0235fb44 709e2a4d 01911e90 014123e8 0235fb68 asp!Execute+0x249
0235fb98 709e271a 00000000 00000000 00114c28 asp!CHitObj::ViperAsyncCallback+0x3e8
0235fbb4 75bd748e 01493048 00097be8 0235fd74 asp!CViperAsyncRequest::OnCall+0x92
0235fbd0 7770f153 00114c28 000a4178 00000000 COMSVCS!CSTAActivityWork::STAActivityWorkHelper+0x32
0235fc1c 7770fba0 00000000 000a4178 75bd745c ole32!EnterForCallback+0xc4
0235fd7c 777100aa 0235fc54 75bd745c 00114c28 ole32!SwitchForCallback+0x1a3
0235fda8 7769408c 000a4178 75bd745c 00114c28 ole32!PerformCallback+0x54
0235fe40 77712865 00097be8 75bd745c 00114c28 ole32!CObjectContext::InternalContextCallback+0x159
0235fe60 75bd7831 00097be8 75bd745c 00114c28 ole32!CObjectContext::DoCallback+0x1c
0235fecc 75bd7b95 00114e18 00114df8 001147c4 COMSVCS!CSTAActivityWork::DoWork+0x12d
0235fee4 75bd852e 00114c28 00000001 00114df8 COMSVCS!CSTAThread::DoWork+0x18
0235ff04 75bd897a 00000000 01eefcb0 00039608 COMSVCS!CSTAThread::ProcessQueueWork+0x37
0235ff84 77bcb530 00114df8 00000000 00000000 COMSVCS!CSTAThread::WorkerLoop+0x17c
0235ffb8 77e66063 00039608 00000000 00000000 msvcrt!_endthreadex+0xa3
0235ffec 00000000 77bcb4bc 00039608 00000000 kernel32!BaseThreadStart+0x34
Identify the function asp!Execute or asp!ExecuteRequest as above. If you have no symbols available and can’t see the function names in the stack, try to grab the third ASP function call loaded in the stack (that will be probably the one we need).
Now, grab the first parameter passed to this function (notice that for the functions Execute and ExecuteRequest it’s pretty much the same and that’s why it doesn’t really matter which function we will work with). Execute the following command on that:
0:032> du poi(poi(poi(01911e90 +9c)))
01ba32e8 "/WebSite001/mypage.asp"
As you can notice, it returns the relative path in the URL of the web site with the ASP page info there!
Now, if you just dump it with an offset of 4 bytes as below:
0:032> du poi(poi(poi(01911e90 +9c))+4)
01ba42e8 "C:\INETPUB\WWWROOT\MYPAGE.ASP"
It will return then the absolute path where the page is located in the file system.
So now you can identify which classic ASP page is being processed in a specific thread J Have fun!
Here is little tip for you that want to practice kernel mode debugging but either don’t have 2 machine machines to play the TARGET and HOST roles or simply don’t want to play with (for any reason) with the old null modem serial cable used to connect the HOST to the TARGET: It’s possible to use a Virtual PC machine to play the TARGET role, so all you would need is the Virtual PC image correctly configured to boot up in DEBUG mode (through the regular boot.ini options - /DEBUG /DEBUGPORT=COM# /BAUDRATE=115200) and map its serial COM port to a named pipe.
How to do so:
Supposing you already have a Virtual PC image set up (it can be an image of Windows 2000, Windows XP or Windows 2003), you just need to follow the step by step list below:
1. Login as admin in the VPC image and edit the c:\boot.ini file to include a new boot up OS option as following:
A default boot.ini file will likely have the following content:
|
[boot loader]
timeout=3
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional" /fastdetect |
Edit the boot.ini and add the highlighted line below:
|
[boot loader]
timeout=3
default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional" /fastdetect
multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Microsoft Windows XP Professional – DEBUG" /fastdetect / DEBUG /DEBUGPORT=COM1 /BAUDRATE=115200 |
2. Before rebooting the VPC image, go the VPC console, select the option “Settings” under the menu “Edit”. Find the configuration item for the COM1 port, select the radio button “Named Pipe” and type the string \\.\pipe\COM1 as the picture below illustrates.

This VPC image (the virtual machine) will be the TARGET system, or the machine whose the kernel will be debugged. Your real machine (the one hosting the VPC image) will be the HOST machine or the one which will run the debugger that will be attached to the TARGET’s system kernel.
3. Start a Windbg (from the Debugging Tools For Windows) instance with the following command line options:
|
windbg -k com:pipe,port=\\.\pipe\COM1,resets=0,reconnect |
The Windbg will open up it’s command window and will wait to connect through the named pipe mapped port as soon as VPC OS starts up. The picture below illustrates how the Windbg will start up with these command line parameters:
4. Now reboot the VPC image and select the new option you’ve included in the boot.ini in the step 1 of this procedure. During the boot process, as soon as the kernel gets loaded the Windbg will attach to it. All you need to do now if break the debugger at the point you want to start debugging and have fun J.
This is another 100% CPU issue. This time I’m not working on any customer case as the issue is happening on my own Windows XP SP2 machine. So I resolved trying to troubleshoot that just for fun J.
First, let me explain the scenario:
Random interactive applications, like Outlook, Word, Excel, Explorer, Internet Explorer etc, were unexpectedly consuming 100% CPU. Once they started eating up all the CPU they get completely hung and do not stop using all the CPU until I actually kill their processes. Some times after killing their process and restarting the application the issue was gone, some times the issue comes back right after the application restarts and I need to kill it again…
Ok, I made my part playing as an end user by just killing the processes for some time but it started buzzing me as it was becoming more frequent. So now it’s time have some fun and go to the core of the issue.
All I need to do is follow the steps in the post Lab: Win32 Application Causing 100% Condition. So let’s go start doing that:
Once the problem happened next time, it was in the infopath.exe application, I attached the Windbg (from the Debugging Tools For Windows) and executed the command below in order to find out which thread I should look at:

Ok, so now I know the thread 0 is the one I’m looking for since it’s the one who’s been consumed the CPU for the last 40 seconds. You might be asking “what if there were other threads also consuming the CPU?” Well, for this specific issue I knew we would find only one single thread since my machine is a dual CPU and whenever the problem happens the app takes only one of the CPUs.
With that said, let’s take a look at the call stack of the thread 0:
Ok, so we don’t have that much… the return address is 0x0 and ebp register is 0x0 so since we don’t have the proper symbols to work with this I guess we won’t be able to rebuild the stack. So we will have even more fun diving in the ASM from now on!!!
We already know the application nview.dll is the one loaded in the call stack. Let’s see what is it doing or trying to do. Our current instruction pointer tells us this:
So all we need to do to follow the assembler execution path is to get the current CPU registers value and start reproducing the effects the real instructions would have. The current value of the CPU registers is below:
Notice that both EAX and EDX (the two registers used by the first operation) are 0x0. So the first instruction mov eax, edx won’t change anything. It will make a copy of the EDX register to the EAX but since they are both 0x0 this will result in no changes at our registers. Let’s take a look at the subsequent instructions and their effects:
and eax, 01000000h
It will make an AND logical operation between EAX and the value 0x1000000 and load the result at the EAX register. Again, this won’t change the EAX value since 0x0 AND 0x1000000 results in 0x0 which is already the current EAX value.
xor ebp,ebp
It will make an Exclusive-OR operation, or XOR, between the stack base pointer EBP and itself and load the result at the EBP (which is the destination operand on this operation). Needless to say this would load EBP with 0x0, however as we can notice below from the current register values, EBP is already 0x0:
or eax,ebp
It will make an OR logical operation between EAX and EBP and load the result at the EAX. We already know this will be the equivalent of making 0x0 OR 0x0 which will result in guess what… Exactly! – It will result in 0x0 which doesn’t change the original value of EAX.
je nview!NVLoadDatabase+0xc8a (100020ea)
This will cause the instruction pointer EIP to “jump” to 100020ea just in case the operands of the previous OR operation were equal in value. Well, EAX is indeed equal, in value, to EBP since they’re both 0x0. So the EIP is loaded with the value 100020ea and our assembly execution path is properly changed as below:
So, let’s continue from the instruction located at 100020ea.
shl edx,8
It will make a shift left on the EDX value of 8 positions. EDX is a 32-bit representation of the value 0x0, so shifting it 8 positions wouldn’t change anything. EDX remains 0x0 after the operation as we can see below:

mov eax,edx
This one we already know. It won’t change anything as EAX = EDX = 0x0.
sar eax,8
It will a shift right on the EAX of 8 positions. As we’ve seen before with the EDX and the shift left operation, this won’t change anything as EAX = 0x0.
add eax,esi
Finally EAX changed its original value of 0x0 to the result of the sum between itself and ESI register. ESI = 0003fff8 and so EAX will from now on be EAX = 0003fff8 as can see below:

sar edx,1Fh
Again, a shift righ operation on the EDX with 31 positions. EDX will remain with 0x0 after that as we’ve seen before.
cdq
This will convert the value of EAX to a 64-bit value. But it will keep it original value of 0003fff8.
mov eax, esi
It will copy the value of ESI to EAX, however as we noticed before EAX was made equal to ESI by the operation add eax, esi previously executed (since EAX was 0x0 by that time). So this operation doesn’t change anything either.
lea eax,[ecx+esi]
This will load EAX with the value resultant from the operation ECX + ESI. So let’s make a little math here… ECX + ESI = 00c90000 + 0003fff8 = 00ccfff8. So EAX will be from now on = 00ccfff8 as we see below:
cmp eax, edi
This will compare the values of EAX and EDI and change the proper flags accordingly.
mov dword ptr [esp+14h],edx
This will copy the EDX value to the memory location referred by the stack pointer plus 20 bytes, or in other words, this will copy EDX value to 20 bytes down on the stack. So, doing a little math again we will have that such a position would be the ESP = 0012c6c0 plus 20 bytes, so 0012c6c0 + 0x14 = 0012c6d4. Let’s see what is current content of 0012c6d4:
All right, so it’s currently 0x0 and we are copying the value of EDX which is also 0x0 to that location in the memory. So, we won’t actually change anything here. Let’s move on.
jb nview!NVLoadDatabase+0xc60 (100020c0)
This will make our execution path jumping to 100020c0 just in case the result of the operation cmp previously executed, tells us the value of EAX is lower (or below) EDI. By the time the cpm operation was executed, EAX was 00xxfff8 and EDI was 00cd0000 so EAX is indeed below EDI and will therefore jump in the execution path as below:
So we moved back some bytes on our execution path. Let’s keep moving…
mov edx, dword prt [eax]
It will copy the value referred by EAX to the register EDX. The value currently referred by EAX is the following:
So we’re just copying 0x0 to EDX which is already 0x0, so it doesn’t change our register’s value.
mov eax, dword ptr [eax+4]
This will copy the value referred by the EAX plus 4 bytes to EAX. The valued pointed by EAX + 4 is the value pointed by 00ccfffc which is 0x0. So now we’re basically making EAX = 0x0 again – As we had in the very beginning of our execution path.
mov dword ptr [esp+14h],eax
This will just copy the EAX value to a specific position on the stack (20 bytes under the top pointed by the ESP). So it doesn’t change any register and it actually doesn’t change the stack value either since EAX is currently 0x0 once again and we had already saved the value 0x0 at that same position on the stack when executed the instruction located at 100020ff.
Now hold on: The next instruction is exactly the one we executed in the very first step of all this investigation. And guess what? The registers and the stack have exactly the same values as we had in the very beginning. So we don’t need to move forward on this to conclude that we will end up again in the same place, and again, and again, and again… indefinitely! This turns out to be an infinite loop!!
Conclusion:
This explains why this thread is using 100% of our CPU and concludes our investigation. Unfortunately this is something we can’t fix since it’s happening within the nview.dll’s code and this is not a MS product.
NVIEW.DLL is a NVidia module apparently used by the NVidia Desktop Manager software. I’ve done some research through the NVidia web site and I haven’t found any reference to this so I believe there is no definitive solution for this yet. As a workaround I simply disabled the Desktop Manager feature and everything seems to be fine now.
Recently we’ve been receiving (by coincidence) some escalations to troubleshoot the same issue. It’s about .Net apps returning the following exception:
|
SocketException::.ctor() 10061:No connection could be made because the target machine actively refused it
|
The cases I’ve been working with show different stack traces but the main point here is that exception is originally occurring when the application is trying to make use of a TCP socket to establish a connection with a remote machine. For all the cases escalated to us so far, the problem has ended up being simply what the error message states, i.e the remote machine, or something in between, refuses the connection attempt. This post is just to show how we’ve been able to troubleshoot these problems and isolate them to network related issues.
Let’s go through two different scenarios and see step by step how to troubleshoot them. In the first scenario we have an IIS server (which we will call from now on WEB1 – IP:10.0.0.1) hosting a .Net application which is intermittently failing to make a web services call to another IIS server (let’s call it WEB2) located in a remote environment. The problem has already been isolated to the point when the .Net app in the WEB1 makes a call to a web service hosted in the server WEB2.
The two last frames on the stack for this first scenario would be the following ones:
|
System.Net.Sockets.TcpClient.Connect(String hostname, Int32 port)
System.Web.Services.Protocols.WebClientProtocol.GetWebResponse(WebRequest request)
|
The first step, as always, is understand the infrastructure involved in the communication between WEB1 and WEB2. Making it short, WEB2 is the virtual name used to mask a round robin DNS rotation between 10 other web servers and between the WEB1 and WEB2 there are at least 2 firewalls, one at the WEB1 location and another one at the WEB2 location (classic environment where the WEB servers are located in the DMZ provided by both firewalls on their respective locations).
Next step is to generate a network trace from the server WEB1 (it would be also nice if we could generate a trace simultaneously from the WEB2, however first in this scenario we don’t have access to the WEB2 and even if so we would need to take a trace from all the 10 web servers because we don’t know at this point to which the WEB1 is trying to establish a connection when the problem happens. So we went ahead, installed the Microsoft Network Monitor tool in the WEB1 started collecting the network trace and waited until the problem happens. In the mean time while the problem did not happen, we just got informed about the other 10 IP addresses of the 10 web servers hereby called simply by WEB2.
Ok, the problem happened, so now it’s time to dive in the network trace:
All the know-how we need now it’s a basic knowledge about how TCP connections happen. At the moment of the problem the trace shows the following suspect frames:
|
2 {TCP:2, IPv4:1} 10.0.0.1 192.168.0.1 TCP TCP: Flags=.S......, SrcPort=4973, DstPort=HTTP(80), Len=0, Seq=3411621893, Ack=0, Win=16384 (scale factor 0) = 16384
3 {TCP:2, IPv4:1} 192.168.0.1 10.0.0.1 TCP TCP: Flags=.S..A..., SrcPort=HTTP(80), DstPort=4973, Len=0, Seq=1096790416, Ack=3411621894, Win=0 (scale factor 0) = 0
4 {TCP:2, IPv4:1} 10.0.0.1 192.168.0.1 TCP TCP: Flags=....A..., SrcPort=4973, DstPort=HTTP(80), Len=0, Seq=3411621894, Ack=1096790417, Win=16616 (scale factor 0) = 16616
5 {HTTP:3, TCP:2, IPv4:1} 10.0.0.1 192.168.0.1 HTTP HTTP: HTTP Payload
Tcp: Flags=....A..., SrcPort=4973, DstPort=HTTP(80), Len=1, Seq=3411621894 - 3411621895, Ack=1096790417, Win=16616 (scale factor 0) = 16616
6 {HTTP:3, TCP:2, IPv4:1} 10.0.0.1 192.168.0.1 HTTP HTTP: HTTP Payload
Tcp: [ReTransmit #5]Flags=....A..., SrcPort=4973, DstPort=HTTP(80), Len=1, Seq=3411621894 - 3411621895, Ack=1096790417, Win=16616 (scale factor 0) = 16616
7 {HTTP:3, TCP:2, IPv4:1} 10.0.0.1 192.168.0.1 HTTP HTTP: HTTP Payload
Tcp: [ReTransmit #5]Flags=....A..., SrcPort=4973, DstPort=HTTP(80), Len=1, Seq=3411621894 - 3411621895, Ack=1096790417, Win=16616 (scale factor 0) = 16616
8 {TCP:2, IPv4:1} 4.79.185.20 10.0.0.1 TCP TCP: Flags=..R....., SrcPort=HTTP(80), DstPort=4973, Len=0, Seq=1096790417, Ack=843163868, Win=0 (scale factor 0) = 0
|
The frame #2 represents the server WEB1 starting a TCP 3 way hand shake with the server whose the IP address is 4.79.185.20 at the port TCP 80 (remember that this is a web service call) which at this point is the one representing our server WEB2 (there are 10 possible IP addresses depending on which one the DNS round robin will provide through the name resolution). The 3 way hand shake has been successfully accomplished through the frames #2,#3 and #4 (the sequence SYN, ACK SYN, ACK has occurred). After that something interesting happens: The WEB1 starts sending packets (starting on frame #5) of 1 single byte (notice that difference between the sequence number results in 1 -> 3411621895 - 3411621894 = 1) instead of making the complete POST to the web service.
The WEB1 receives no acknowledge from WEB2 to the packet sent at the frame #5, so it sends it again (that’s why the frame #6 and #7 are being considered retransmissions) in the frame #6, again the WEB2 doesn’t ack that, so WEB1 keeps resending that 1 byte packet in the frame #7 and then finally the WEB2 responds to that but with a RESET instead. The packet sent by the WEB2 at the frame #8 in unexpected and will cause the application to handle that as the exception mentioned in the beginning of this article.
The question is: Why the server WEB1 keeps sending a 1 byte length packet after the hand shake?
Before answering this question, you might make a mistake and think that the WEB2 is legitimately resetting the connection with the WEB1 since WEB1 has not behaved as expected (which would be send a POST to the web service and not the 1 byte packets) and this would take you to the assumption that there is something wrong with the server WEB1 and lead you to completely wrong troubleshooting line… Didn’t you see what’s wrong yet? Ok, it’s really a little tricky but the problem here happens at the hand shake (event though is was successful one)… Do you see now? No? Ok, look again to the 3 frames that compose the hand shake, more precisely to the frame #3. Notice that the WINDOW size is 0 (Win=0) which basically means the server WEB2 can not receive anything larger than 0 bytes… Interesting, isn’t it? So after receiving the 0 bytes length window size advertised by the WEB2, the WEB1 sends a 1 byte packet in attempt to make the WEB2 to advertise a new valid WINDOW size, which doesn’t happen after the first attempt, so the WEB1 keeps trying until the WEB2 resets the connection.
So the problem has been isolated to be at the WEB2 end. Another question remains: Why is the problem intermittent? After all, if this behavior of advertising an invalid window size is always happening, the problem should also always happen. It’s time to take a new network trace…
On a new network trace, we have both situations: When it works and when it doesn’t work. We already know what happens when it doesn’t work, so now let’s see what happens when it works:
|
2 {TCP:2, IPv4:1} 10.0.0.1 192.168.0.2 TCP TCP: Flags=.S......, SrcPort=4961, DstPort=HTTP(80), Len=0, Seq=843591585, Ack=0, Win=16384 (scale factor 0) = 16384
3 {TCP:2, IPv4:1} 8.7.82.20 10.0.0.1 TCP TCP: Flags=.S..A..., SrcPort=HTTP(80), DstPort=4961, Len=0, Seq=1446804074, Ack=843591586, Win=8760 (scale factor 0) = 8760
4 {TCP:2, IPv4:1} 10.0.0.1 192.168.0.2 TCP TCP: Flags=....A..., SrcPort=4961, DstPort=HTTP(80), Len=0, Seq=843591586, Ack=1446804075, Win=16560 (scale factor 0) = 16560
5 {HTTP:3, TCP:2, IPv4:1} 10.0.0.1 192.168.0.2 HTTP HTTP: Request, POST /WebSrv/QueryWebSrv.asmx
6 {HTTP:3, TCP:2, IPv4:1} 192.168.0.2 10.0.0.1 HTTP HTTP: Response, HTTP/1.1, Status Code = 100
|
Ok, now we don’t see the same window size problem and 3 way hand shake is ok and everything is fine. However our WEB1 server is not connecting to the same server WEB2 as now the server WEB2 is being represented by a different IP address – 192.168.0.2 as oppose to 192.168.0.1 that we’ve see before – due to the round robin balance system in place. So, it answers our second question: The problem actually happens always when the WEB2 is the 192.168.0.1 server, however since there is a round robin balance in place, the server WEB2 will not always be the one whose the IP address is 192.168.0.1.
The solution of problem? Well, after isolating the problem we contacted the people in charge of the WEB2 end and they generated new traces on their end and found out there was a firewall in between the servers which was misconfigured regarding the IP 192.168.0.1. They reconfigure the firewall and this fixed the problem for our first scenario.
The second scenario is a IIS server (from now one WEB1 IP: 10.0.0.1) reporting the same error as above, however now the stack, even though the exception happens at the same place, is a little different as below:
|
System.Net.Sockets.TcpClient.Connect(String hostname, Int32 port)
Microsoft.AnalysisServices.AdmdCleintXmlaClient.GetTcpClient(ConnectionInfo connectionInfo) |
The difference here is just that in this scenario our .Net application being hosted by IIS is calling a SQL Server (IP: 172.16.0.1) running the Analysis Services as opposed to a web service. The architecture of this scenario is a little bit easier to work with since the problem has already been isolate to one specific SQL Server when it’s accessed from any web server.
By following the logic as before, we will first generate a network trace from the WEB1 server (again, we don’t have control over the other end – the SQL server, as it would be useful to also generate a network trace from that server). By looking at the trace we see the following:
|
7447 {TCP:534, IPv4:533} 10.0.0.1 172.16.0.1 TCP TCP: Flags=.S......, SrcPort=20531, DstPort=2383, Len=0, Seq=655839663, Ack=0, Win=16384 (scale factor 0) = 16384
7448 {TCP:534, IPv4:533} 172.16.0.1 10.0.0.1 TCP TCP: Flags=..R.A..., SrcPort=2383, DstPort=20531, Len=0, Seq=0, Ack=655839664, Win=0 (scale factor 0) = 0
7467 {TCP:535, IPv4:533} 10.0.0.1 172.16.0.1 TCP TCP: Flags=.S......, SrcPort=20531, DstPort=2383, Len=0, Seq=655839663, Ack=0, Win=16384 (scale factor 0) = 16384
7468 {TCP:535, IPv4:533} 172.16.0.1 10.0.0.1 TCP TCP: Flags=..R.A..., SrcPort=2383, DstPort=20531, Len=0, Seq=0, Ack=655839664, Win=0 (scale factor 0) = 0
7554 {TCP:538, IPv4:533} 10.0.0.1 172.16.0.1 TCP TCP: Flags=.S......, SrcPort=20531, DstPort=2383, Len=0, Seq=655839663, Ack=0, Win=16384 (scale factor 0) = 16384
7556 {TCP:538, IPv4:533} 172.16.0.1 10.0.0.1 TCP TCP: Flags=..R.A..., SrcPort=2383, DstPort=20531, Len=0, Seq=0, Ack=655839664, Win=0 (scale factor 0) = 0 |
What we see here is the WEB1 making successive attempts to connect to the port TCP:2383 (the default port for non named instances of Analysis Services running in SQL Server 2005) and the SQL Server is resetting all of them. We realized that there are two routers between the server WEB1 and the SQL Server. We can’t identify if the resets are coming from the SQL or any of the routers since the MAC address in the reset packets will always be the one of the last router in the network path between the servers. However by doing closer look at the reset packets, we notice that the TTL has been decremented by 2 (ttl=126) as below:
|
+ Ethernet: Etype = Internet IP (IPv4)
- Ipv4: Next Protocol = TCP, Packet ID = 14341, Total IP Length = 40
+ Versions: IPv4, Internet Protocol; Header Length = 20
+ DifferentiatedServicesField: DSCP: 0, ECN: 0
TotalLength: 40 (0x28)
Identification: 14341 (0x3805)
+ FragmentFlags: 0 (0x0)
TimeToLive: 126 (0x7E)
NextProtocol: TCP, 6(0x6)
Checksum: 9687 (0x25D7)
SourceAddress: 10.205.100.146
DestinationAddress: 10.205.100.200
- Tcp: Flags=..R.A..., SrcPort=2383, DstPort=20531, Len=0, Seq=0, Ack=655839664, Win=0 (scale factor 0) = 0
SrcPort: 2383
DstPort: 20531
SequenceNumber: 0 (0x0)
AcknowledgementNumber: 655839664 (0x271751B0)
+ DataOffset: 80 (0x50)
+ Flags: ..R.A...
Window: 0 (scale factor 0) = 0
Checksum: 65170 (0xFE92)
UrgentPointer: 0 (0x0)
|
The TTL being decremented by two basically means this packet has been routed twice and since we know there are two routers between the servers, we can conclude the resets are really being originated at the SQL server and not by any of the routers. Why is the SQL Server resetting these connection attempts? The answer for this question is also the solution for the problem and it is that the port TCP 2383 is the one used by SQL server 2005 but our SQL Server is a SQL Server 2000 which has the default instance for the Analysis Services running in the port TCP 2725 instead.
We confirmed that by running a “netstat –b” at the SQL Server. Here is the abbreviated version of the output:
|
TCP svp020064:2725 SVP060005:42802 ESTABLISHED 2344
TCP svp020064:2725 SVP060002:58478 ESTABLISHED 2344
TCP svp020064:2725 SVP060001:2064 ESTABLISHED 2344 |
We now use the command “tasklist” to identify which process has the PID 2344. The abbreviated version of the output is below:
|
Image Name PID Session Name Session# Mem Usage
============ ======= ============= =========== ==========
msmdsrv.exe 2344 Console 0 734,136 K
|
The solution is to change the client application so it connects to the right port which is the TCP 2725 instead of TCP 2383.
This week I'm working on this case related to IIS6 server hosting an Asp.Net application (it's a C# app) whose the performance is poor even when a very low number of users is hitting it. The architecture is the classical web layer accessing the backend DB servers and my goal is to isolate the problem, i.e pinpoint where the problem resides and try to find out why.
The first step after understanding the environment and being aware of all relevant servers and infrastructure devices, is to actually see the problem happening to have a better understanding of what is going on. We had one single user accessing the application and there are indeed some actions the user takes on the pages that take a long time to return, so the next step is to establish a baseline and agree on a SLA for the performance the application is supposed to have.
Another very important relevant point to this scenario is that since the environment counts on a NLB composed by multiple IIS servers, we need first to isolate which IIS server the client will connect to when we start trying to repro the problem. So after checking their NLB configuration and making sure they have the affinity set to single we can just open the IIS logs of the servers and search the client IP at the c-ip field. Since the affinity is single the server which has been serving the client will continue to be the same.
Once we know which server we will work with during the tests, my favorite tool to create the baseline is the System Monitor (perfrmon.exe) and we can also make some basic analysis of the IIS logs. So I set the perfmon with some specific counters that will help me to establish a baseline on how their app is currently behaving and also made sure they have the IIS log fields I need to work with when establishing the baseline.
The counters I used in the perfmon were the following:
|
ASP.NET v1.1.4322\Applications Running ASP.NET v1.1.4322\Request Execution Time ASP.NET v1.1.4322\Request Wait Time ASP.NET v1.1.4322\Requests Current ASP.NET v1.1.4322\Requests Queued ASP.NET v1.1.4322\State Server Sessions Active ASP.NET v1.1.4322\Worker Process Restarts ASP.NET v1.1.4322\Worker Processes Running
|
The fields I needed to make sure are being included in the IIS logs are the following:
|
c-bytes
s-bytes
cs-uri-stem
cs-uri-query
time-taken
|
After that we reproduced the problem, which by the way was really happening with only one user accessing it which already eliminates any hypothesis of the problem being related to the amount of load hitting the server, we started working on the baseline.
The counters in perfmon pretty much confirm what we see in the IIS logs: Their main application page takes an average time longer than 40 seconds to execute. The perfmon counter “ASP.NET v.1.1.4322\Request Execution Time” will report that as well as the IIS logs if you use the LogParser tool and run the following query against them:
First we need to consolidate all the IIS logs in a single CSV file:
|
LogParser "select date,time,cs-uri-stem,cs-uri-query,sc-status,sc-win32-status,time-taken,sc-bytes,cs-bytes from ex061024.log, ex061025.log to merged.csv" -o:CSV
|
Now we run the following query to get the average time-taken for the URLs in the logs and send the output to a new CSV file named AvgTimeTaken.txt
|
logparser "select cs-uri-stem,count(cs-uri-stem), avg(time-taken) from merged.csv to AvgTimeTaken.txt where time-taken>=10000 GROUP BY cs-uri-stem ORDER BY avg(time-taken) DESC" -o:CSV -i:CSV |
The result is as following:
|
cs-uri-stem |
COUNT(ALL cs-uri-stem) |
AVG(ALL time-taken) |
|
/App/MainPage.aspx |
15 |
48353 |
|
|
/App/MainPage.aspx |
3 |
37738 |
|
Ok, now that we see the problem happening and have our baseline it’s time to start working on isolating problem.
We don’t know what exactly the user does in order to have the mainpage.aspx taking so long to execute, so what we do is try to reproduce the problem again and keep following the perfmon counter “ASP.NET v1.1.4322\Requests Current”. This method of tracking the “Requests Current” counter will not work for all scenarios, since when you have the server under some load you will likely not be able to tell when a specific request is taking long or not because there will be many requests at the same time and they will keep changing the value of “Requests Current”. For this cases we would need to have the collaboration of the user (which would tell us when the page is taking long to return) or look at other counters like the “Request Queued”. However for this specific scenario where we have the variables under control and there is a single user testing the system for us, it will work just fine.
When the “Requests Current” shows 1 request and don’t get back to zero after, let’s say, 10 seconds, that’s the point where we know this request is taking too long to execute. Remember that our baseline is around 40 seconds… Now it’s right time to generate a hung dump of the worker process. To do that we first run the iisapp.vbs script in order to identify the PID of the worker process hosting our application pool and after that we can use any tool to simply generate multiple hang dumps (like Debugging Tools For Windows or Debug Diagnostics Tool (DebugDiag)) some seconds apart from each other. In this case I used the adplus.vbs from Debugging Tools For Windows with the following sintaxe:
|
Adplus.vbs –hang –p 2364 –quiet –o c:\dumps
|
This will generate a memory dump of the w3wp.exe whose the PID is 2364 and store it at the c:\dumps folder.
The dumps analysis will require just some basic .Net debugging steps (I used the Windbg from Debugging Tools For Windows to analyze them) as below:
This is a managed application (it’s aspx) so we want to make sure we have, besides the proper symbols, the sos.dll extension properly loaded:
|
0:000> .chain
Extension DLL chain:
aspvbscriptext.dll: API 1.0.0, built Fri Feb 06 18:20:44 2004
[path: C:\debuggers\aspvbscriptext.dll]
mdacexts.dll: image 2004.05.11.1, API 1.0.0, built Tue May 11 17:46:12 2004
[path: C:\debuggers\mdacexts.dll]
webdbg: API 3.5.5, built Thu Nov 15 18:26:28 2001
[path: C:\debuggers\webdbg.dll]
sieext: API 1.0.0, built Fri Nov 05 16:07:40 2004
[path: C:\debuggers\winext\sieext.dll]
dbghelp: image 6.6.0005.0, API 6.0.6, built Fri Apr 14 17:03:13 2006
[path: C:\debuggers\dbghelp.dll]
ext: image 6.6.0005.0, API 1.0.0, built Fri Apr 14 17:02:35 2006
[path: C:\debuggers\winext\ext.dll]
exts: image 6.6.0005.0, API 1.0.0, built Fri Apr 14 17:02:30 2006
[path: C:\debuggers\WINXP\exts.dll]
uext: image 6.6.0005.0, API 1.0.0, built Fri Apr 14 17:02:37 2006
[path: C:\debuggers\winext\uext.dll]
ntsdexts: image 6.0.5357.0, API 1.0.0, built Fri Apr 14 17:21:31 2006
[path: C:\debuggers\WINXP\ntsdexts.dll]
clr10\sos: image 6.6.0005.0, API 1.0.0, built Fri Apr 14 17:02:47 2006
[path: C:\debuggers\clr10\sos.dll]
|
Ok, so once we know we have the right extension loaded, let’s start finding which threads are being used to process the Asp.Net mainpage.aspx. We expect to see only one thread here since our user is simply accessing this single page. By looking at the first dump we see the following:
|
0:000> !aspxpages
Loaded Son of Strike data table version 5 from "C:\WINDOWS\Microsoft.NET\Framework\v1.1.4322\mscorsvr.dll"
Going to print all threads with ASP.NET Pages running.
Thread 11: /App/Mainpage.aspx
|
Ok, as we suspected there is only one thread processing the Asp.Net request and it’s the thread 11. Now let’s look on what this thread is currently doing:
|
0:011> !clrstack Thread 11 ESP EIP 0x0604f4dc 0x7ffe0304 [FRAME: NDirectMethodFrameStandalone] [DEFAULT] UI2 System.Data.Common.UnsafeNativeMethods/Dbnetlib.ConnectionRead(ValueClass System.Runtime.InteropServices.HandleRef,SZArray UI1,UI2,UI2,UI2,ByRef I) 0x0604f4f8 0x069cf5e4 [DEFAULT] [hasThis] Void System.Data.SqlClient.TdsParser.ReadNetlib(I4) ß it seems we have some SQL stuff on the stack… 0x0604f558 0x069cf531 [DEFAULT] [hasThis] Void System.Data.SqlClient.TdsParser.ReadBuffer() 0x0604f560 0x069cf463 [DEFAULT] [hasThis] UI1 System.Data.SqlClient.TdsParser.ReadByte() 0x0604f568 0x069ced7a [DEFAULT] [hasThis] Boolean System.Data.SqlClient.TdsParser.Run(ValueClass System.Data.SqlClient.RunBehavior,Class System.Data.SqlClient.SqlCommand,Class System.Data.SqlClient.SqlDataReader) 0x0604f5b0 0x06aa3c79 [DEFAULT] [hasThis] Void System.Data.SqlClient.SqlDataReader.ConsumeMetaData() 0x0604f5c4 0x06aa3c0e [DEFAULT] [hasThis] SZArray Class System.Data.SqlClient._SqlMetaData System.Data.SqlClient.SqlDataReader.get_MetaData() 0x0604f5cc 0x06aa1be7 [DEFAULT] [hasThis] Class System.Data.SqlClient.SqlDataReader System.Data.SqlClient.SqlCommand.ExecuteReader(ValueClass System.Data.CommandBehavior,ValueClass
<cut for brevity>
|
So, the managed stack for the thread 11 seems to be waiting on SQL connection. The native stack would be at a ntdll!NtWaitingForSingleObject() function which basically means this thread is doing nothing but waiting. So let’s examine a little further to see if we can identify which results are we waiting from the SQL Server:
|
0:011> !dso Thread 11 ESP/REG Object Name eax 0x10073b34 System.Collections.Hashtable ecx 0x10073b34 System.Collections.Hashtable 0x604f4c8 0x1801dad4 System.Byte[] 0x604f4d0 0x28061840 System.Data.SqlClient.TdsParser 0x604f4d4 0x1801dad4 System.Byte[] 0x604f4f8 0x28061840 System.Data.SqlClient.TdsParser 0x604f500 0x28090704 System.Data.SqlClient.SqlDataReader 0x604f504 0x28061840 System.Data.SqlClient.TdsParser 0x604f51c 0x28061840 System.Data.SqlClient.TdsParser 0x604f54c 0x2808fd28 System.Data.SqlClient.SqlCommand ß this is what we’re looking for
<cut for brevity>
|
Let’s taker a closer look at this specific object:
|
0:011> !do 0x2808fd28 Name: System.Data.SqlClient.SqlCommand MethodTable 0x06854dd8 EEClass 0x0657fb60 Size 64(0x40) bytes GC Generation: 0 mdToken: 0x02000194 (c:\windows\assembly\gac\system.data\1.0.5000.0__b77a5c561934e089\system.data.dll) FieldDesc*: 0x068545c4 MT Field Offset Type Attr Value Name 0x79b80d44 0x40000b6 0x4 CLASS instance 0x00000000 __identity 0x0649824c 0x4000502 0x8 CLASS instance 0x00000000 site 0x0649824c 0x4000503 0xc CLASS instance 0x00000000 events 0x0649824c 0x4000501 0 CLASS shared static EventDisposed >> Domain:Value 0x00099230:NotInit 0x000dfad8:0x100144d8 << 0x06854dd8 0x4000eac 0x10 CLASS instance 0x10032b90 cmdText ß this string seems promising 0x06854dd8 0x4000ead 0x24 System.Int32 instance 4 cmdType 0x06854dd8 0x4000eae 0x28 System.Int32 instance 3600 _timeout 0x06854dd8 0x4000eaf 0x2c System.Int32 instance 3 _updatedRowSource 0x06854dd8 0x4000eb0 0x38 System.Boolean instance 0 _inPrepare 0x06854dd8 0x4000eb1 0x30 System.Int32 instance -1 _handle 0x06854dd8 0x4000eb2 0x14 CLASS instance 0x28090450 _parameters 0x06854dd8 0x4000eb3 0x18 CLASS instance 0x2808f624 _activeConnection 0x06854dd8 0x4000eb4 0x39 System.Boolean instance 0 _dirty 0x06854dd8 0x4000eb5 0x3a System.Byte instance 0 _execType 0x06854dd8 0x4000eb6 0x1c CLASS instance 0x00000000 _cachedMetaData 0x06854dd8 0x4000eb7 0x34 System.Int32 instance -1 _rowsAffected 0x06854dd8 0x4000eb8 0x3b System.Boolean instance 0 designTimeVisible 0x06854dd8 0x4000eb9 0x20 CLASS instance 0x00000000 _transaction
|
Looking at such an instance:
|
0:011> !do 0x10032b90 Name: System.String MethodTable 0x79b7daf8 EEClass 0x79b7de44 Size 80(0x50) bytes GC Generation: 2 mdToken: 0x0200000f (c:\windows\microsoft.net\framework\v1.1.4322\mscorlib.dll) String: dbo.[spGenerateOptions] ß Bingo!!! This is the stored procedure we’ve been waiting on to return FieldDesc*: 0x79b7dea8 MT Field Offset Type Attr Value Name 0x79b7daf8 0x4000013 0x4 System.Int32 instance 31 m_arrayLength 0x79b7daf8 0x4000014 0x8 System.Int32 instance 30 m_stringLength 0x79b7daf8 0x4000015 0xc System.Char instance 0x64 m_firstChar 0x79b7daf8 0x4000016 0 CLASS shared static Empty >> Domain:Value 0x00099230:0x28000224 0x000dfad8:0x28000224 << 0x79b7daf8 0x4000017 0x4 CLASS shared static WhitespaceChars >> Domain:Value 0x00099230:0x28000238 0x000dfad8:0x2000364c <<
|
So we know that at the point this dump was taken we were just waiting at the stored procedure spGenerateOptions to complete it’s execution at a SQL Server.
We are going now to look at the subsequent dump (taken 30 seconds later) to verify whether we have any progress or not on that thread, however let’s take a loot at one last thing on this dump before starting with the other one (this can save us a lot of time here):
|
0:011> !runaway
User Mode Time
Thread Time
13:d5c 0 days 0:00:37.500
14:c00 0 days 0:00:24.718
16:9c8 0 days 0:00:16.406
15:9e8 0 days 0:00:13.562
11:bb8 0 days 0:00:12.015 ß This is the CPU time spent by the thread 11 at this point
|
So we know that the thread 11 have had the CPU for the aprox. 12.015 seconds when this dump was generated. So after opening the second dump, before jumping on any deeper analysis, lets just run the same command and verify if the thread 11 made any progress.
|
0:011> !runaway
User Mode Time
Thread Time
13:d5c 0 days 0:00:37.500
14:c00 0 days 0:00:24.718
16:9c8 0 days 0:00:16.406
15:9e8 0 days 0:00:13.562
11:bb8 0 days 0:00:12.015 ß The time is exactly the same 30 seconds later
|
Ok, so there is not a single additional millisecond being spent by the thread 11 on the second dump. At this point we don’t even need to take a look at the thread stack to know it’s exactly the same. In other words this thread has been doing nothing but waiting on that stored procedure to execute at the SQL Server.
The problem has been isolated as a database performance issue (mission accomplished!!! J). The next step is work on the database server (running in another box) to understand why this sp is taking too long to execute and, of course, properly addressing the issue to provide a definitive solution. IIS ended up just being the server where the symptom is faced and not the one where the root cause of the problem is really happening.