
The Cases of the Blue Screens: Finding Clues in a Crash Dump and on the Web


My last couple of posts have looked at the lighter side of blue screens by showing you how to customize their colors. Windows kernel-mode code reliability has gotten better with every release, to the point that many people never experience the infamous BSOD. But if you have had one (one that you didn’t purposefully trigger with Notmyfault, that is), as I explain in my Case of the Unexplained presentations, spending a few minutes to investigate might save you the inconvenience and possible data loss caused by future occurrences of the same crash. In this post I first review the basics of crash dump analysis. In many cases, this simple analysis leads to a buggy driver for which there’s a newer version available on the web, but sometimes the analysis is ambiguous. I’ll share two examples administrators sent me where a Web search with the right key words led them to a solution.

Debugging a crash starts with downloading the Debugging Tools for Windows package (part of the Windows SDK – note that you can do a web install of just the Debugging Tools instead of downloading and installing the entire SDK), installing it, and configuring it to point at the Microsoft symbol server so that the debugger can download the symbols for the kernel, which are required for it to be able to interpret the dump information. You do that by opening the symbol configuration dialog under the File menu and entering the symbol server URL along with the name of a directory on your system where you’d like the debugger to cache symbol files it downloads:
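If you prefer to type the path rather than use the dialog, the equivalent symbol path string looks like this (c:\symbols is just an example cache directory; any local folder works), and you can also put the same string in the _NT_SYMBOL_PATH environment variable so other debugging tools pick it up:

srv*c:\symbols*http://msdl.microsoft.com/download/symbols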

[Screenshot: the debugger’s symbol search path configuration dialog]

The next step is loading the crash dump into the debugger with the Open Crash Dump entry in the File menu. Where Windows saves dump files depends on what version of Windows you’re running and whether it’s a client or server edition. There’s a simple rule of thumb that will lead you to the dump file regardless, though: first check for a file named Memory.dmp in the %SystemRoot% directory (typically C:\Windows); if you don’t find it, look in the %SystemRoot%\Minidump directory and load the newest minidump file (assuming you want to debug the latest crash).
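You can also skip the menus and open a dump straight from a command prompt; this is just a sketch, and the paths below assume a default installation and the example symbol cache from above:

windbg -y "srv*c:\symbols*http://msdl.microsoft.com/download/symbols" -z C:\Windows\Memory.dmp

The -y option supplies the symbol path and -z points the debugger at the dump file; substitute a file from %SystemRoot%\Minidump if there’s no Memory.dmp.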

When you load a dump file into the debugger, it uses heuristics to try to determine the cause of the crash. It points you at the suspect by printing a line that says “Probably caused by:” with the name of the driver, Windows component, or type of hardware issue. Here’s an example that correctly identifies the problematic driver responsible for the crash, myfault.sys:

[Screenshot: debugger analysis identifying myfault.sys as the probable cause]

In my talks, I also show that clicking on the !analyze -v hyperlink will dump more information, including the kernel stack of the thread that was executing when the crash occurred. That’s often useful when the heuristics fail to pinpoint a cause, because a third-party driver that was active around the site of the crash may turn out to be the guilty party. Checking for a newer version of any third-party drivers displayed in this basic analysis often leads to a fix. I documented a troubleshooting case that followed this pattern in a previous blog post, The Case of the Crashed Phone Call.
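A few general-purpose commands are worth knowing at this point; none of them are specific to a particular crash, and the module name in the last one (myfault, from the Notmyfault example above) is only an illustration:

!analyze -v        (verbose analysis, including the stack of the crashing thread)
lm t n             (lists loaded modules with their timestamps)
lmvm myfault       (detailed version information for a single module)

The module list is a quick way to spot old third-party drivers whose vendors might have newer versions available on the web.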

When you don’t find any clues, perform a Web search with the textual description of the crash code (reported by the !analyze -v command) and any key words that describe the machine or software you think might be involved. For example, one administrator was experiencing intermittent crashes across a Citrix server farm. He didn’t realize he could even look at a crash dump file until he saw a Case of the Unexplained presentation. After returning to his office from the conference, he opened dumps from several of the affected systems.  Analysis of the dumps yielded the same generic conclusion in every case, that a driver had not released kernel memory related to remote user logons (sessions) when it was supposed to:

[Screenshot: debugger output reporting the SESSION_HAS_VALID_POOL_ON_EXIT crash]

Hoping that a Web search might offer a hint and not having anything to lose, he entered “session_has_valid_pool_on_exit and citrix” in the browser search box. To his amazement, the very first result was a Citrix Knowledge Base fix for the exact problem he was experiencing, and the article even displayed the same debugger output he was seeing:

[Screenshot: Citrix Knowledge Base article describing the fix]

After he downloaded and installed the fix, the server farm was crash-free.

In another example, an administrator saw a server crash three times within several days. Unfortunately, the analysis didn’t point at a solution; it just seemed to say that the crash occurred because some internal watchdog timer hadn’t fired within its time limit:

[Screenshot: debugger output describing the watchdog-timer crash]

As in the previous case, the administrator entered the crash text into a search engine and, to his relief, the very first hit announced a fix for the problem:

[Screenshot: search result pointing to a hotfix for the problem]

The server didn’t experience any more crashes after he applied the listed hotfix.

These cases show that troubleshooting is really about finding clues that lead you to a solution or a workaround. Those clues might be obvious, or they might require a little digging or some creativity. In the end it doesn’t matter how or where you find the clues, so long as you find a solution to your problem.

Comments
  • Good to find you driving people to open up crash dumps and give it a shot. Which book, in your opinion, is good for learning WinDbg? I mean doing it the big way... to hunt down any bluescreen that comes your way.

    Thanks.

  • www.dumpanalysis.org and the Advanced Windows Debugging book.

  • Anyone have any recommendations about how to diagnose hard locks that happen so randomly that removing/selectively re-adding hardware is no help and every device has the most up-to-date drivers available? I'd kill (okay, maybe not) for a BSOD here, since at least it'd give me something useful to work with.

    It also doesn't help that the lock up never triggers anything in Event Viewer and happens under non-MS OSes too.

  • @Raj:

    Addison Wesley  - Advanced Windows Debugging:

    www.amazon.com/.../0321374460

  • Mark, is windbg 6.13 a public release?

  • @Raj: The NT Debugging blog is also a good reference (blogs.msdn.com/.../ntdebugging)

    ---------

    @JohnW: 6.12.0002.633 is latest public release.

    ---------

    @benjamin: Have you tried getting a Ctrl+Scroll Lock+Scroll Lock (or NMI) dump? Since the handler runs at a high IRQL, it usually works on a hung server. Refer: support.microsoft.com/.../244139

    Here's the .reg file for PS/2 & USB support; note that it sets the dump type to complete (CrashDumpEnabled=1), instead of just kernel.

    -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

    Windows Registry Editor Version 5.00

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\kbdhid\Parameters]

    "CrashOnCtrlScroll"=dword:00000001

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters]

    "CrashOnCtrlScroll"=dword:00000001

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl]

    "AutoReboot"=dword:00000001

    "CrashDumpEnabled"=dword:00000001

    "Overwrite"=dword:00000001

    "LogEvent"=dword:00000001

    -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

  • Great topic. How would an IT pro troubleshoot the BSOD, 'An attempt to release a mutant object was made by a thread that was not the owner of the mutant object'? It happens during logout, and I know which processes are using mutants. But the vendor is taking a long time to figure it out. In my dreams, I could just turn on procmon boot logging and point to which thread released the mutant, but I don't think it's possible.

  • Excellent article.

    How does one obtain the windbg version 6.13?

  • @karl

    It's an internal version that will eventually make its way into the SDK.

    Thanks for the feedback.

  • Excellent Article as always Mark.

    With regard to symbol files, are they based on the client with WinDbg installed or on the type of dump file I'm analyzing?

  • @ Glenn:

    Symbol files are based on the dump file you are analyzing (to be more precise, they are based on the  type and version of each specific module located in the dump). You can use x86 WinDbg to look at x64 crash dumps and vice versa.

  • Great post as usual, and nice information by Andrew in the comments! I will have to remember this manual crash dump generation procedure to help try and find problems in a system.

    It's always hard finding out why an application isn't working, why it's crashing, and so on.

    Mark, I'd love to see a post about how to diagnose and troubleshoot installation/windows installer errors and how to do some analysis for cleanup. I've run into a lot of situations where for some reason certain applications won't install properly and aren't uninstalled cleanly. Trying to repair the app fails as well.

  • @ Great post:

    When diagnosing MSI issues, enable the 'voicewarmup' logging (support.microsoft.com/.../314852) and do a ProcMon capture. The MSI log will tell you the reason, or will help you identify what part of the ProcMon capture to review.
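    For a one-off install you can get an equivalent verbose log without touching the policy by launching the package from a command prompt (the .msi and log file names here are just placeholders):

    msiexec /i example.msi /L*v %TEMP%\example-install.log

    The 'voicewarmup' policy value in the KB article turns on the same logging machine-wide, which helps when you can't start the package from a command line yourself.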

  • Thanks for the post Mark. I'm working to wrap my brain around Windbg. Can someone answer the following?   When looking at a Minidump file, what does the PROCESS_NAME field refer to?

    Case in point: we have some HP notebooks and desktops that recently started crashing after KB2393802 was applied. The culprit was an Intel graphics driver. Once it was updated, the issue was resolved. The PROCESS_NAME in each of the notebook dumps referred to HPWA_Main.exe, while the desktops referred to iexplore.exe. I can draw the link between IE and the video driver, but where does the wireless card fit into this? Thanks for any insight!

  • @ MickC:

    The PROCESS_NAME is the process that was scheduled at the time of the bugcheck. The process itself may not have been the instigator, though, if it was pre-empted; that is, if an interrupt, DPC or APC occurred. In these cases, the thread's stack context is saved (in a trap frame) and the pre-empting code takes over control (i.e. it is run on the CPU core).

    In general, the PROCESS_NAME rarely relates to the cause.
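    If you want to see that for yourself in a dump, a few stock debugger commands help (nothing below is specific to your crash, and the trap frame address is whatever kv reports):

    !thread            (shows the running thread and its owning process)
    kv                 (stack trace; frames entered via a trap show a TrapFrame address)
    .trap <address>    (switches the register context to that trap frame)

    The stack, rather than the process name, is usually where the real clue lives.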