I had a great call yesterday. Great, meaning I was able to show a customer how to debug a memory dump so that he can resolve future STOP errors more quickly. Everyone wins in that scenario: the end customer is back up and running quickly, the consultant can go off and make more service calls to more customers, and Microsoft wins because he does not have to call for support (we lose money on each call).
Here's the issue: the server was randomly rebooting about once every 4-5 days. No bluescreen, no memory.dmp. It behaved as if someone was simply pulling the plug on the server! Not good!!
Well, checking System Properties -> Advanced -> Settings for Startup and Recovery showed the server was set to "Small Memory Dump (64k)" under "Write debugging information" and to "Automatically restart". A quick search on the server revealed a 64k mini-dump dated the time of the last "crash". What this tells me is that the server IS actually performing a bugcheck (Blue Screen of Death or BSOD) and rebooting, but since it is only configured for a "Small Memory Dump", it dumps so quickly that the BSOD is never seen. So, we set "Write debugging information" to "Kernel Memory Dump" and rebooted. Now we have to wait for the server to crash.
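For anyone who prefers to check these two settings from a script rather than the System Properties dialog, here is a minimal sketch. It assumes the settings live where they normally do, under HKLM\SYSTEM\CurrentControlSet\Control\CrashControl, as the CrashDumpEnabled and AutoReboot values. The function names are made up for illustration, and read_crash_control only works on Windows.

```python
# Values of CrashDumpEnabled under
# HKLM\SYSTEM\CurrentControlSet\Control\CrashControl.
DUMP_TYPES = {
    0: "(none)",
    1: "Complete memory dump",
    2: "Kernel memory dump",
    3: "Small memory dump (64 KB)",
}


def describe_crash_control(crash_dump_enabled, auto_reboot):
    """Summarize the two settings and flag the combination seen in this case."""
    dump_type = DUMP_TYPES.get(crash_dump_enabled,
                               "unknown (%d)" % crash_dump_enabled)
    warnings = []
    if crash_dump_enabled in (0, 3):
        warnings.append("no full MEMORY.DMP will be written")
    if auto_reboot:
        warnings.append("server restarts automatically after a bugcheck")
    return dump_type, warnings


def read_crash_control():
    """Read the live settings from the registry (Windows only)."""
    import winreg  # Windows-only standard library module
    key_path = r"SYSTEM\CurrentControlSet\Control\CrashControl"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
        dump, _ = winreg.QueryValueEx(key, "CrashDumpEnabled")
        reboot, _ = winreg.QueryValueEx(key, "AutoReboot")
    return dump, bool(reboot)


if __name__ == "__main__":
    # The configuration we found on the customer's server.
    print(describe_crash_control(3, True))
```

On a live server, describe_crash_control(*read_crash_control()) gives the same summary for the current configuration.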
After several days, it did, and we now have a Memory.dmp file. Here are the steps we performed to debug the memory dump:
Note the debugger does not *have* to be installed on the server itself. All you have to do is have local access to the dump file. You could copy the dump file to a Windows XP workstation and install the debugging tools on the workstation rather than the server.
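If you would rather script this than click through the debugger UI, something like the sketch below could work. It is not necessarily how the dump was opened on this call. It assumes kd.exe (the console kernel debugger from the Debugging Tools for Windows package) is on the PATH, and it reuses the public symbol server path with a local cache in c:\websymbols. The function names and the example dump path are hypothetical.

```python
import subprocess

# Public Microsoft symbol server with a local cache directory.
SYMBOL_PATH = r"SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols"


def build_kd_command(dump_path, kd_exe="kd.exe", symbol_path=SYMBOL_PATH):
    """Build a kd command line that opens a dump, runs !analyze -v, and quits."""
    return [
        kd_exe,
        "-z", dump_path,         # -z: open a crash dump file
        "-y", symbol_path,       # -y: symbol search path
        "-c", "!analyze -v; q",  # -c: commands to run at startup, then quit
    ]


def analyze_dump(dump_path):
    """Run kd and return its text output (needs the debugging tools installed)."""
    result = subprocess.run(build_kd_command(dump_path),
                            capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # Hypothetical location of a copied dump file on a workstation.
    print(" ".join(build_kd_command(r"C:\dumps\MEMORY.DMP")))
```

Building the command in its own function keeps the part you can inspect separate from the part that needs the debugger installed.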
Here's the output after loading the dump file (I did not run a single command). Around 80% of the calls we get (PSS SBS) regarding memory dumps are resolved by simply loading the dump in the debugger, as illustrated below.
Microsoft (R) Windows Debugger Version 6.6.0003.5
Copyright (c) Microsoft Corporation. All rights reserved.
Loading Dump File [C:\Documents and Settings\petergal\My Documents\MEMORY.DMP]
Kernel Summary Dump File: Only kernel address space is available
Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols
Executable search path is:
Windows Server 2003 Kernel Version 3790 MP (2 procs) Free x86 compatible
Product: LanManNt, suite: SmallBusiness TerminalServer SmallBusinessRestricted SingleUserTS
Built by: 3790.srv03_gdr.050225-1827
Kernel base = 0x804de000 PsLoadedModuleList = 0x8057b6a8
Debug session time: Wed Mar 22 02:59:01.750 2006 (GMT-6)
System Uptime: 1 days 9:42:01.500
Loading Kernel Symbols
.............................................................................................................
Loading User Symbols
PEB is paged out (Peb.Ldr = 7ffdf00c). Type ".hh dbgerr001" for details
Loading unloaded module list
.....
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
Use !analyze -v to get detailed debugging information.
BugCheck D1, {8a400000, 2, 0, f77e00a9}
*** ERROR: Module load completed but symbols could not be loaded for CSTDI50.sys
Probably caused by : CSTDI50.sys ( CSTDI50+10a9 )
Followup: MachineOwner
---------
Notice the "Probably caused by : CSTDI50.sys" line. OK, what the heck is that file? Find the file -> Properties -> Version -> whose file is this anyway? The file belongs to Colasoft. Once we knew who owned the file, we established that this software was installed last November, and a quick scroll through Event Viewer showed the problem started around November. A quick search on the internet for CSTDI50.sys confirmed that Colasoft has a "known issue".
The action is to uninstall the Colasoft software.
With the steps above, you *should* be able to determine the cause of the crash!
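If you look at a lot of dumps, the two lines that matter here (the BugCheck line and the "Probably caused by" line) are easy to pull out of saved debugger output. The sketch below is only an illustration: the summarize function and the regular expressions are my own, and they assume the output looks like the sample in this post.

```python
import re

# Matches e.g. "BugCheck D1, {8a400000, 2, 0, f77e00a9}"
BUGCHECK_RE = re.compile(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}")
# Matches e.g. "Probably caused by : CSTDI50.sys ( CSTDI50+10a9 )"
CAUSE_RE = re.compile(r"Probably caused by\s*:\s*(\S+)")


def summarize(debugger_output):
    """Return (bugcheck code, list of arguments, suspect module name)."""
    code, args, module = None, [], None
    match = BUGCHECK_RE.search(debugger_output)
    if match:
        code = int(match.group(1), 16)
        # The debugger prints the arguments as hex without a 0x prefix.
        args = [int(arg, 16) for arg in match.group(2).split(",")]
    match = CAUSE_RE.search(debugger_output)
    if match:
        module = match.group(1)
    return code, args, module


SAMPLE = """\
BugCheck D1, {8a400000, 2, 0, f77e00a9}
Probably caused by : CSTDI50.sys ( CSTDI50+10a9 )
"""

code, args, module = summarize(SAMPLE)
print(hex(code), [hex(a) for a in args], module)
```

The module name is then what you would look up under the file's Properties -> Version tab, or search for on the internet.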
To be really geeky, "!analyze -v" (without quotes) can be run in the debugger to give additional information (in this case pretty useless, as we already know the cause):
0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arguments:
Arg1: 8a400000, memory referenced
Arg2: 00000002, IRQL
Arg3: 00000000, value 0 = read operation, 1 = write operation
Arg4: f77e00a9, address which referenced memory
Debugging Details:
------------------
READ_ADDRESS: 8a400000
CURRENT_IRQL: 2
FAULTING_IP:
CSTDI50+10a9
f77e00a9 f3a5 rep movsd
DEFAULT_BUCKET_ID: DRIVER_FAULT
BUGCHECK_STR: 0xD1
LAST_CONTROL_TRANSFER: from 804e2f58 to 80543ac9
STACK_TEXT:
b87be8b0 804e2f58 0000000a 8a400000 00000002 nt!KeBugCheckEx+0x19
b87be8b0 f77e00a9 0000000a 8a400000 00000002 nt!KiTrap0E+0x224
WARNING: Stack unwind information not available. Following frames may be wrong.
b87be94c f77e02d5 888c0630 887e30c8 01ed6000 CSTDI50+0x10a9
b87be974 f77e0747 884db750 8000089c 888b8200 CSTDI50+0x12d5
b87be990 f77e0c58 00000001 00000000 88b8cf68 CSTDI50+0x1747
b87be9bc f77e0ea3 00000001 00000000 88b8cf68 CSTDI50+0x1c58
b87bea08 804f0154 00000000 88b8cf68 018b81c8 CSTDI50+0x1ea3
b87bea38 ad82394a 88843288 88b8cfd8 f77dffdb nt!IopfCompleteRequest+0xa0
b87bea44 f77dffdb 8982b200 88b8cf68 88b8cfd8 tcpip!TCPDispatchInternalDeviceControl+0x134
b87beaa0 f77e10f1 8982b200 88b8cf68 f77e3e40 CSTDI50+0xfdb
b87beafc ad6b6851 884db750 b87beb84 00000004 CSTDI50+0x20f1
b87beb74 ad6b80c7 89924e40 8000089c 8000089c afd!AfdCreateConnection+0x195
b87beba0 ad6b80f7 898bfab0 8831f0e4 8831f008 afd!AfdAddFreeConnection+0x37
b87bebb4 ad6c5997 00000000 00012083 ad6c57f0 afd!AfdReplenishListenBacklog+0x13
b87bec30 ad6c0043 8831f008 89a27a60 804f0473 afd!AfdSuperAccept+0x1cf
b87bec3c 804f0473 89927030 8831f008 883175f0 afd!AfdDispatchDeviceControl+0x4f
b87bec4c 80585208 8831f0e4 897c0e90 8831f008 nt!IofCallDriver+0x3f
b87bec60 805860e6 89927030 8831f018 897c0e90 nt!IopSynchronousServiceTail+0x6f
b87bed00 80586128 000006e4 00000000 00000000 nt!IopXxxControlFile+0x607
b87bed34 804dfd24 000006e4 00000000 00000000 nt!NtDeviceIoControlFile+0x28
b87bed34 7ffe0304 000006e4 00000000 00000000 nt!KiSystemService+0xd0
0545ff00 00000000 00000000 00000000 00000000 SharedUserData!SystemCallStub+0x4
STACK_COMMAND: .bugcheck ; kb
FOLLOWUP_IP:
CSTDI50+10a9
f77e00a9 f3a5 rep movsd
FAULTING_SOURCE_CODE:
SYMBOL_STACK_INDEX: 2
FOLLOWUP_NAME: MachineOwner
SYMBOL_NAME: CSTDI50+10a9
MODULE_NAME: CSTDI50
IMAGE_NAME: CSTDI50.sys
DEBUG_FLR_IMAGE_TIMESTAMP: 42538bbd
FAILURE_BUCKET_ID: 0xD1_CSTDI50+10a9
BUCKET_ID: 0xD1_CSTDI50+10a9
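For the curious, the four bugcheck arguments and the "CSTDI50+10a9" notation can be read with a little arithmetic. The sketch below uses only numbers from the output above. The helper names are hypothetical, and the meaning of each argument is taken from the debugger's own description of bugcheck D1.

```python
def decode_d1(arg1, arg2, arg3, arg4):
    """Label the four DRIVER_IRQL_NOT_LESS_OR_EQUAL (0xD1) arguments."""
    return {
        "memory_referenced": "0x%08x" % arg1,
        "irql": arg2,  # 2 is DISPATCH_LEVEL
        "operation": "read" if arg3 == 0 else "write",
        "faulting_instruction": "0x%08x" % arg4,
    }


def module_base(address, offset):
    """Recover a module's load address from an address and its module+offset."""
    return address - offset


info = decode_d1(0x8A400000, 0x2, 0x0, 0xF77E00A9)
print(info["operation"], "of", info["memory_referenced"], "at IRQL", info["irql"])

# FAULTING_IP is CSTDI50+10a9 at address f77e00a9.
base = module_base(0xF77E00A9, 0x10A9)
print("CSTDI50 loaded at 0x%08x" % base)

# Cross-check with the next stack frame: f77e02d5 is shown as CSTDI50+0x12d5.
print("offset of f77e02d5: 0x%x" % (0xF77E02D5 - base))
```

The cross-check on the second frame is a quick way to confirm the driver's load address, since the debugger warns that frames without unwind information may be wrong.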