In my last blog post I talked about the various ways to gather memory dumps.  That came about as a result of trying to get my mind around all the different options for dumps, and it ended up as a living document that consolidates pieces of several articles into one place.  And it just grew from there.

 

Once I had the 'How to get a memory dump' part down, I took a step back and thought about the other aspects of memory dumps.  Why gather that information?  How does it work?  What if we don't need to crash the entire box? What configuration options do we have?  Every question answered asked two more.

 

Why do we need to get a memory dump?

 

I've found that typically when a server becomes unresponsive (I'm going to leave that purposely vague) it does so for one of 4 reasons:

 

"I'm given her all she's got, Captain!"

The first reason is resource depletion.  In some way, shape or form we've run out of steam.

 

Symptoms of this can include (but are not limited to):

  • Computer is completely unresponsive: ex - you can see the screen but the mouse/keyboard doesn't respond, Num Lock or Caps Lock keys don't work
  • Computer is very sluggish: ex - Start menu takes a long time to respond, applications don't open
  • Processes stop working: ex - Exchange servers may stop sending mail or file shares won't open
  • Application pop-ups reference out of memory or low resource errors
  • The System event log lists SRV 2019, 2020 or 333 events
  • This typically happens over time, and if it repeats the interval is usually consistent. Ex - the issue happens 4 days after rebooting, the issue happens every night at 10pm, the issue happens 20 minutes after printing to this one printer.

 

How to crash the box:

If the system is completely locked up, a CTRL+Scroll Lock+Scroll Lock (keyboard-initiated) or NMI dump is the best option. If it's just extremely slow, NotMyFault may also be an option.  Perfmon and Poolmon are two other good tools to use to monitor the server over time.  They can show trends with memory but may not help to pinpoint the exact cause.

 

What to look for:

If you load the memory dump in the debugger and do a !vm 21, you want to look for stars!

  • ********** 61783 pool allocations have failed **********
  • ******* 53 system cache map requests have failed ******
  • ********** 17206 commit requests have failed  **********
  • ********** Excessive NonPaged Pool Usage *****
  • ********** Running out of physical memory **********

Typically that is a good indication that memory - virtual or physical - is being abused.
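When the stars point at pool, a rough follow-up sequence (just a sketch, assuming the standard kernel debugger extensions are loaded) is to break usage down by pool tag:

    $$ Overall memory and pool statistics - look for the failure counters above
    !vm 21

    $$ Break pool usage down by tag: 2 sorts by NonPaged usage, 4 sorts by Paged usage
    !poolused 2
    !poolused 4

From there the top tag can usually be traced back to its driver, for example by searching the .sys files on the server for that tag string.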

 

"You hang up first!"  "No you hang up first!"

The second reason is a deadlock condition.  This can happen when one thread needs exclusive access to a resource, gets it, and makes everyone else wait until the lock is released.  In some cases that release never happens, causing a deadlock situation.

 

Symptoms of this can include (but are not limited to):

  • The system stops responding suddenly. (No gradual slowdown)
  • The system stops responding randomly (leaks tend to repeat at similar intervals)
  • Processes may stop working/responding.

 

How to crash the box:

If the system is completely locked up, a CTRL+Scroll Lock+Scroll Lock or NMI dump is the best option.

 

What to look for:

One indicator is threads with a high tick count, meaning they've been waiting a long time:

 

!THREAD fffffa8007d55b60  Cid 06d4.1780  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (UserRequest) KernelMode Alertable
    fffffa8007d60ed0  NotificationEvent
Not impersonating
DeviceMap                 fffff8a002db7a00
Owning Process            fffffa8007ab7060       Image:         svchost.exe
Attached Process          N/A            Image:         N/A
Wait Start TickCount      66759          Ticks: 2339383 (0:10:09:12.859)
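If the waits look like lock contention, a couple of commands can help map out who owns what. This is only a sketch - the addresses come from the dump in question:

    $$ Kernel ERESOURCE locks that currently have waiters, with the owning thread for each
    !locks

    $$ Summarize kernel thread stacks; lots of threads parked in the same wait is a good lead
    !stacks 2

    $$ Then dump the owning thread (address taken from the !locks output) to see what it is stuck on
    !thread <owning thread address>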

A second indication is seeing a long queue of items waiting to be processed. Ex - disk items queued, processor work items, etc.

Scanning Transfer Packets List for outstanding disk requests

                    No.  Transfer           Num      Original           Original
                         Packet             Retries  IRP                Driver

                      1  fffffadf`feea7450        8  fffffadf`f3277af0  \Driver\Disk
                      2  fffffadf`f93aba30        8  fffffadf`aadc1010  \Driver\Disk
                      3  fffffadf`fef36690        8  fffffadf`a12beaf0  \Driver\Disk
                      7  fffffadf`fef943c0        8  fffffadf`9db12af0  \Driver\Disk
                      9  fffffadf`ed6cbb00        8  fffffadf`a235a010  \Driver\Disk
                     10  fffffadf`fc7e6ad0        8  fffffadf`f996baf0  \Driver\Disk
                     14  fffffadf`eca7ce70        8  fffffad9`bd88eaf0  \Driver\Disk
........

187 blocked IRPs
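A more generic way to see how much I/O is backed up in any dump - regardless of which extension produced the list above - is to hunt for outstanding IRPs. A sketch:

    $$ Scan pool for IRPs; look for large numbers pending against the same driver
    !irpfind

    $$ Dump an individual IRP and its stack locations to see which driver is currently holding it
    !irp <irp address>

Expect a lot of output from !irpfind on a busy system, so it helps to know which stack (disk, network, etc.) you suspect first.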

 

Shampoo, Rinse, Repeat

Next we run into issues where the server will slow down or hang due to code doing the same thing over and over again - an infinite loop, like a dog chasing its tail.  This can happen when a piece of code gets pointed back to itself and doesn't have a way out.

 

Symptoms of this can include (but are not limited to):

  • High CPU utilization for a process, on a processor, or for the entire system
  • Computer is very slow to respond
  • Processes stop working
  • The server never recovers on its own, unless a limit is reached. Example: a program may count how many laps it has done and break out at lap 20.
  • There is typically a pattern to the issue: "when we push this button, X happens"

 

How to crash the box:

If you've narrowed it down to a single process, you can opt to crash that process and not the server.

Any method would work to crash the box, but in the case of high CPU, Performance Monitor or Process Explorer may help narrow down the scope.
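If it is a single process, one low-impact option is to attach a user-mode debugger to just that process and write a dump instead of bugchecking the whole server. A minimal sketch (the dump path and process are only examples):

    $$ From windbg/cdb attached to the busy process:
    $$ show per-thread CPU time - the spinning thread usually floats to the top
    !runaway

    $$ write a full user-mode dump, then quit and detach without killing the process
    .dump /ma C:\dumps\busyprocess.dmp
    qd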

 

What to look for:

With this it's all about following the breadcrumbs.  It could be similar to a deadlock situation, and it's a good practice to draw out the flow of each thread: Thread A to Thread B to Thread C, back to A, and all over again.
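In a kernel dump of a spinning system, a hedged starting point for following those breadcrumbs looks something like this:

    $$ Show what is actually running on each processor, with stacks - a looping thread usually shows up here
    !running -ti

    $$ If the busy thread belongs to one process, dump that process with all of its thread stacks
    !process 0 7 <image name>

    $$ Then walk the suspect thread in detail
    !thread <thread address>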

 

The Honey-Do list

The last reason is that the server is just BUSY. Busy doing an amazingly long list of things.  For example, I've run into several issues where software, a script, etc., mistakenly adds tens of thousands of entries to the Application event log registry key.  At logon, winlogon.exe will check the registry for the list of Event Sources (the Source list that appears in the drop-down box when filtering event logs).  If some bad code messes up and enters 50,000 registry entries, the server will be running through that list for a while. End result: extremely long logon times.

 

Symptoms of this can include (but are not limited to):

  • Long delays in common processes: ex - logon, logoff
  • Computer is very slow to respond
  • Eventually, in many instances, the server may return to normal behavior.

 

How to crash the box:

Sometimes in this case it's not how to crash the box - but when.  If the system hangs at logon for 20 minutes, then it may be good to get a memory dump at 15 minutes.  Also, it may be good to get several memory dumps.  This is where having a virtual environment is very handy, as we can take snapshots at certain points during the hang without causing the system to reboot.  The multiple points help tell whether the server is stuck doing one thing for a very long time or whether we're doing different tasks very slowly.
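When comparing several dumps from the same hang, I'll usually look at the same process in each one and see whether its threads have moved. A rough sketch (winlogon is just the example from this case):

    $$ In each dump, find the process and dump its threads with stacks
    !process 0 7 winlogon.exe

    $$ Compare the Ticks value and the top of the stack for the same thread across dumps:
    $$ same stack with a growing tick count -> stuck on one thing
    $$ a different stack each time          -> making progress, just slowly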

 

What to look for:

In the registry entry issue, winlogon had stalled threads, all waiting on the event log. A check of the active objects confirmed this and pointed to the affected key:

05b4: Object: e166e1c8  GrantedAccess: 0002001b Entry: e46ccb68
Object: e166e1c8  Type: (8c78d330) Key
    ObjectHeader: e166e1b0 (old version)
        HandleCount: 1  PointerCount: 1
        Directory Object: 00000000  Name: \REGISTRY\MACHINE\SYSTEM\CONTROLSET001\SERVICES\EVENTLOG\APPLICATION

There were almost 50,000 bad entries here.
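To get from "winlogon is stalled" to the specific key, one hedged approach is to enumerate the registry handles the process in question has open (the addresses and process name are from the dump at hand):

    $$ Find the process, then dump its handle table entries for Key objects with object details
    !process 0 0 winlogon.exe
    !handle 0 3 <process address> Key

The Name line on each Key object - like the EventLog\Application path above - shows exactly which key the process is working against.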

 

The Roundup

Server hard hangs and extremely slow or unresponsive systems can have a variety of triggers and yet show the exact same symptoms.  But when digging through the data gathered, the majority fit into the four causes above.  I have found that when dipping a toe into the vast ocean of data that is the debugger, it's good to be familiar with those causes.  They help in identifying the road signs that will lead you to your destination.