In my last blog post I talked about the various ways to gather memory dumps. That post came about from trying to get my mind around all the different options for dumps, and it ended up as a living document that consolidates pieces of several articles into one place. And it just grew from there.
Once I got the 'How to get a memory dump' part down, I took a step back and thought about the other aspects of memory dumps. Why gather that information? How does it work? What if we don't need to crash the entire box? What configuration options do we have? Every question answered raised two more.
Why do we need to get a memory dump?
I've found that typically when a server becomes unresponsive (I'm going to leave that purposely vague) it does so for one of four reasons:
"I'm givin' her all she's got, Captain!"
The first reason is resource depletion. In some way, shape or form we've run out of steam.
Symptoms of this can include (but are not limited to):
How to crash the box:
If the system is completely locked up, a keyboard-initiated crash (CTRL+SCROLL LOCK+SCROLL LOCK) or an NMI dump is the best option. If it's just extremely slow, NotMyFault may also be an option. Perfmon and Poolmon are two other good tools for monitoring the server over time; they can show trends with memory but may not help to pinpoint the exact cause.
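What Perfmon buys you here is trend detection over time rather than a single data point. The idea can be sketched in a few lines; this is a hypothetical checker over simulated pool-usage samples, not a reader of real Perfmon counters:

```python
def looks_like_a_leak(samples, min_growth=0.10):
    """Flag a leak-like trend: usage never drops, and overall growth is significant."""
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    grew = samples[-1] >= samples[0] * (1 + min_growth)
    return never_drops and grew

# Simulated nonpaged pool usage (MB), sampled every few minutes.
print(looks_like_a_leak([100, 112, 130, 155, 190, 240]))  # True: steady climb
print(looks_like_a_leak([100, 140, 95]))                  # False: usage came back down
```

A counter that spikes and recovers is normal load; one that only ever climbs is the shape a leak leaves behind.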
What to look for:
If you load the memory dump in the debugger and run !vm 21, you want to look for stars! Asterisks in the output flag depleted resources, and that is typically a good indication that memory (virtual or physical) is being abused.
"You hang up first!" "No you hang up first!"
The second reason is a deadlock condition. This can happen when one thread needs exclusive access to a resource, gets it, and makes everyone else wait until the lock is released. In some cases that release never comes (for example, because the holder is itself waiting on something one of the waiters owns), causing a deadlock situation.
Symptoms of this can include (but are not limited to):
If the system is completely locked up, a keyboard-initiated crash (CTRL+SCROLL LOCK+SCROLL LOCK) or an NMI dump is the best option.
One indicator is seeing threads with a high tick count, indicating that they've been waiting a long time:
!THREAD fffffa8007d55b60 Cid 06d4.1780 Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (UserRequest) KernelMode Alertable
Owning Process fffffa8007ab7060 Image: svchost.exe
Attached Process N/A Image: N/A
Wait Start TickCount 66759 Ticks: 2339383 (0:10:09:12.859)
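The Ticks value counts clock ticks, and on most systems the clock interval is 15.625 ms (64 ticks per second); that interval is how the debugger derives the elapsed time in parentheses. A quick sanity-check conversion (the 15.625 ms figure is an assumption here; the actual increment is system-dependent):

```python
from datetime import timedelta

def ticks_to_elapsed(ticks, tick_interval_ms=15.625):
    """Convert a debugger tick count to wall-clock time (interval assumed)."""
    return timedelta(milliseconds=ticks * tick_interval_ms)

print(ticks_to_elapsed(2339383))  # 10:09:12.859375 -- matches the Ticks line above
```

Over ten hours stuck in one wait is a very loud breadcrumb.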
A second indication is seeing a long queue of items waiting to be processed, for example queued disk requests or processor queue length:
Scanning Transfer Packets List for outstanding disk requests
No. Transfer Num Original Original
Packet Retries IRP Driver
1 fffffadf`feea7450 8 fffffadf`f3277af0 \Driver\Disk
2 fffffadf`f93aba30 8 fffffadf`aadc1010 \Driver\Disk
3 fffffadf`fef36690 8 fffffadf`a12beaf0 \Driver\Disk
7 fffffadf`fef943c0 8 fffffadf`9db12af0 \Driver\Disk
9 fffffadf`ed6cbb00 8 fffffadf`a235a010 \Driver\Disk
10 fffffadf`fc7e6ad0 8 fffffadf`f996baf0 \Driver\Disk
14 fffffadf`eca7ce70 8 fffffad9`bd88eaf0 \Driver\Disk
187 blocked IRPs
Shampoo, Rinse, Repeat
Next we run into issues where the server slows down or hangs because code is doing the same thing over and over again: an infinite loop, like a dog chasing its tail. This can happen when a piece of code ends up pointed back at itself and doesn't have a way out.
If you've narrowed it down to a single process you can opt to crash that process and not the server.
Any of the methods will work to crash the box, but in the case of high CPU, Performance Monitor or Process Explorer may help narrow down the scope first.
With this one it's all about following the breadcrumbs. It can look similar to a deadlock situation, and it's good practice to draw out the flow of each process: Thread A to Thread B to Thread C, back to A, and around again.
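Drawing out that A-to-B-to-C-back-to-A flow is really just cycle detection on a graph of "who hands off to whom." A tiny sketch, with made-up step names and an iteration cap so the demo itself can't loop forever:

```python
# Hypothetical work-item chain: each step names the next step to run.
steps = {
    "parse":    "validate",
    "validate": "commit",
    "commit":   "parse",   # misconfigured: points back to the start
}

def follow(start, max_iterations=10):
    """Walk the chain, recording each step; stop when a step repeats."""
    seen, current = [], start
    for _ in range(max_iterations):
        if current in seen:
            return seen + [current]   # cycle found: the tail-chasing path
        seen.append(current)
        current = steps[current]
    return seen

print(follow("parse"))  # ['parse', 'validate', 'commit', 'parse']
```

The first repeated node in your hand-drawn diagram is exactly where the loop closes.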
The Honey-Do list
The last reason is that the server is just BUSY. Busy doing an amazingly long list of things. For example, I've run into several issues where software or a script mistakenly adds tens of thousands of entries to the Application event log registry key. At logon, winlogon.exe checks the registry for the list of Event Sources (the Source list that appears in the drop-down box when filtering event logs). If some bad code enters 50,000 registry entries, the server will be running through that list for a while. End result: extremely long logon times.
Sometimes in this case the question isn't how to crash the box, but when. If the system hangs at logon for 20 minutes, it may be good to get a memory dump at the 15-minute mark. It may also be good to get several memory dumps. This is where having a virtual environment is very handy: we can take snapshots at several points during the hang without forcing the system to reboot. The multiple data points help tell whether the server is stuck doing one thing for a very long time or doing many different tasks very slowly.
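Comparing those snapshots is mostly a matter of lining up the stacks. A sketch of the idea, using hypothetical frame names rather than real dump output: if the same frame tops the stack in every snapshot, the thread is stuck in one long wait; if the top frame keeps changing, it's churning through many slow tasks.

```python
# Hypothetical top-of-stack samples from three dumps taken minutes apart.
snapshots = [
    ["ntdll!NtWaitForSingleObject", "eventlog!ScanSources", "winlogon!Start"],
    ["ntdll!NtWaitForSingleObject", "eventlog!ScanSources", "winlogon!Start"],
    ["ntdll!NtWaitForSingleObject", "eventlog!ScanSources", "winlogon!Start"],
]

def stuck_on_one_thing(stacks):
    """Identical top frame across every snapshot suggests one long wait,
    not many different slow tasks."""
    return all(s[0] == stacks[0][0] for s in stacks)

print(stuck_on_one_thing(snapshots))  # True
```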
In the registry entry issue, winlogon had stalled threads all waiting on the event log. A check of the process's open handles confirmed this and pointed to the affected key:
05b4: Object: e166e1c8 GrantedAccess: 0002001b Entry: e46ccb68
Object: e166e1c8 Type: (8c78d330) Key
ObjectHeader: e166e1b0 (old version)
HandleCount: 1 PointerCount: 1
Directory Object: 00000000 Name: \REGISTRY\MACHINE\SYSTEM\CONTROLSET001\SERVICES\EVENTLOG\APPLICATION
There were almost 50,000 bad entries here.
Server hard hangs and extremely slow or unresponsive systems can have a variety of triggers and yet present the exact same symptoms. But when digging through the data gathered, the majority fit into the four causes above. I have found that when dipping a toe into the vast ocean of data that is the debugger, it's good to be familiar with those causes. They help in identifying the road signs that will lead you to your destination.