Mark Russinovich’s technical blog covering topics such as Windows troubleshooting, technologies and security.
When I experienced a crash in Internet Explorer (IE) on my home 64-bit gaming system one day, I chalked it up to random third-party plug-in memory corruption. I moved on, but a few days later had another crash in IE. Then, Windows Media Player (WMP) started crashing every third or fourth time I used it:
Crashes in different programs seemed to point at a more fundamental problem. I had over-clocked the CPU, so I speculated that the rash of crashes was a side-effect of CPU overheating and reluctantly dialed back the clock multiplier to the factory specification. To my dismay, however, the crashes continued. My next theory was that I had bad RAM, but the Windows Vista Memory Diagnostic failed to identify any problems.
Hardware problems seemingly cleared, my next move was to look at the process crash dumps to see if they held any clues. But first I had to find a crash dump to look at. Windows XP’s Application Error Reporting process always generates a dump before showing you the application crash dialog, and you can find the location of the dump by clicking to see the report details and then viewing the report’s technical information:
Windows Vista’s corresponding dialog doesn’t offer a way to get at a report’s technical information and it doesn’t generate a dump unless Microsoft’s Windows Error Reporting (WER) servers request it, which they only do for crashes reported in high volumes. Fortunately, WerFault, the process that presents the dialog, keeps the crashed process around until you press the Close Program button, which offers an opportunity to attach to the process with a debugger and examine it. You can see WerFault’s handle to a crashed Windows Media Player process in Process Explorer:
The next time I had a crash, I launched WinDbg, the Windows Debugger from the Debugging Tools for Windows package that’s available for free download from Microsoft. After making sure that I had the symbol configuration set to point at the Microsoft public symbol server (e.g. srv*c:\symbols*http://msdl.microsoft.com/download/symbols) in the Symbol File Path dialog, I went to the File menu and selected the “Attach to a Process...” menu entry:
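For readers who prefer to script their debugger setup, the symbol path above follows the pattern srv*<local cache>*<server URL>. The snippet below is only a small sketch of how that string is assembled; the helper name `symbol_path` is made up for illustration and is not part of WinDbg.

```python
# Default server URL taken from the symbol path quoted in the post.
MS_PUBLIC_SYMBOLS = "http://msdl.microsoft.com/download/symbols"


def symbol_path(cache_dir: str, server_url: str = MS_PUBLIC_SYMBOLS) -> str:
    """Build a symbol path of the form srv*<local cache>*<symbol server>.

    `symbol_path` is a hypothetical helper; it only formats the string that
    would be pasted into WinDbg's Symbol File Path dialog.
    """
    return f"srv*{cache_dir}*{server_url}"


if __name__ == "__main__":
    print(symbol_path(r"c:\symbols"))
```

The resulting string can be pasted into the Symbol File Path dialog exactly as described above.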
That opens the WinDbg process selection dialog, which I scrolled through to find the crashed process. When I selected the process, WinDbg opened it and presented the same interface it does when it loads a crash dump, with one difference. When you load a crash dump, you can execute the !analyze debugger command, which uses heuristics to try to pinpoint the cause of the crash; when you perform a debugger attach, an analysis will just tell you what you already know, that you attached with a debugger:
Looking for a potential cause of a crash when attached requires looking at the stack of each thread in the process, so I opened the Processes and Threads and Call Stack dialogs in the View menu:
I started examining threads by selecting the first entry in the threads dialog:
The WinDbg command window usually grays and says “Busy” as WinDbg pulls symbols from the symbol server, after which the call stack dialog populates with the function nesting of the selected thread at the time of the crash. I examined each thread’s stack in turn, moving between threads by pressing the down arrow and then the enter key, hunting for a stack that had function names with the words “exception” or “fault” in them. Near the end of the list I came across this one:
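The manual hunt described above (page through every thread and look for function names containing "exception" or "fault") can be sketched in a few lines. This is an illustration of the heuristic only, assuming the stacks have already been copied out of the debugger as text; the function name `suspicious_threads` and the sample frames are invented for the example.

```python
import re

# The two words searched for in function names while paging through stacks.
KEYWORDS = re.compile(r"exception|fault", re.IGNORECASE)


def suspicious_threads(stacks):
    """Return {thread_id: [matching frames]} for stacks that mention
    'exception' or 'fault'.

    `stacks` maps a thread id to its list of frame strings, innermost
    frame first. The input format is an assumption made for this sketch.
    """
    hits = {}
    for thread_id, frames in stacks.items():
        matches = [frame for frame in frames if KEYWORDS.search(frame)]
        if matches:
            hits[thread_id] = matches
    return hits


if __name__ == "__main__":
    # Hypothetical sample data; thread ids and offsets are made up.
    sample = {
        0: ["ntdll!NtWaitForSingleObject", "kernel32!WaitForSingleObjectEx"],
        7: ["ntdll!KiUserExceptionDispatcher", "kernel32!HeapFree", "nvappfilter+0x1a2b"],
    }
    print(suspicious_threads(sample))
```

A keyword match only flags a thread for a closer look; it does not by itself identify the cause of the crash.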
I noticed that the top of the list is full of functions with “Exception” in their names. Looking down the list (up the stack), I saw that a function in Nvappfilter called Kernel32.dll’s HeapFree function, leading to the crash. The exception in the heap’s free routines meant that either the caller passed a bogus heap address or that the heap was already corrupted when the function executed. If a Windows DLL had been the caller I would have suspected the latter, but in this case the caller was a third-party DLL, which I could tell by the fact that WinDbg couldn’t locate symbol information for it and hence didn’t know the names of the functions within it. I confirmed that by issuing the lm (list module) command to look at its version information:
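The reasoning in this step (find who called HeapFree, then check whether the debugger had symbols for that caller) can also be written down as a rough sketch. It assumes frames are strings shaped like module!Function+0xNN when a symbol was resolved and module+0xNN when it was not; the helper names and sample offsets are hypothetical. Note that a DLL's exported functions can still show up with names even without symbol files, so the symbol check is only a hint.

```python
def caller_of(frames, callee="HeapFree"):
    """Return the frame that called `callee`, or None if it is not found.

    `frames` lists one thread's stack, innermost frame first, so the caller
    is the entry that follows the callee's frame.
    """
    for index, frame in enumerate(frames[:-1]):
        _module, sep, func = frame.partition("!")
        # Strip a trailing "+0x..." offset before comparing the name.
        if sep and func.split("+")[0] == callee:
            return frames[index + 1]
    return None


def has_symbols(frame):
    """Rough heuristic: a resolved frame contains 'module!Function'."""
    return "!" in frame


if __name__ == "__main__":
    # Hypothetical stack; the offsets are invented for illustration.
    stack = [
        "ntdll!KiUserExceptionDispatcher+0x2e",
        "ntdll!RtlFreeHeap+0x10",
        "kernel32!HeapFree+0x14",
        "nvappfilter+0x1a2b",
    ]
    caller = caller_of(stack)
    print(caller, "has symbols:", has_symbols(caller))
```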
Nvappfilter was now my primary suspect, but I didn’t have direct evidence that it was responsible. I continued to use the system and followed the same debugging steps on the next several crashes. Whether it was IE, WMP or a game, the faulting stack was always the same, with Nvappfilter calling HeapFree. That’s still not conclusive proof, but the anecdotal evidence was pretty compelling.
At that point I went to see if there were updates for Nvappfilter, but I wasn’t sure what software package it was associated with. I entered its name in a Web search and discovered that it’s part of nVidia’s FirstPacket feature that prioritizes game traffic and that’s included in the nForce motherboard’s software:
I went to nVidia’s site and downloaded the most recent nForce driver package, but it failed to update Nvappfilter.dll and I continued to have the crashes.
The nVidia control panel offers no way that I could find to prevent Nvappfilter from loading, so my only recourse was to manually disable it. I wasn’t using the FirstPacket feature, which I had previously been unaware of, so I wouldn’t miss it, but first I had to figure out how it configured Windows to load it. For that I turned to Autoruns, where I found references to Nvappfilter’s 32-bit and 64-bit versions in the Winsock Layered Service Provider (LSP) section:
I deleted all of Nvappfilter’s entries, rebooted the system and have been crash-free since. While I was writing this post, I checked again for nForce software updates to see if Nvappfilter had been updated. The latest version doesn’t look like it includes Nvappfilter or any other Winsock LSP, so assuming Nvappfilter was at fault, it’s no longer an issue.
One other thing I’ve done since I investigated these crashes is take advantage of Vista SP1’s “local dumps” functionality so that I'll automatically get a crash dump to investigate for any application crash I experience. If you create a key named HKLM\Software\Microsoft\Windows\Windows Error Reporting\LocalDumps, WerFault will always save a dump. Dumps go by default into %LOCALAPPDATA%\Crashdumps, but you can override that with a Registry value and also specify a limit on the number of dumps WerFault will keep.
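As a hedged illustration of the local dumps setup, the sketch below only builds reg.exe command lines; it does not touch the Registry itself. The post does not name the values, so DumpFolder and DumpCount are an assumption here: they are the names given in Microsoft's "Collecting User-Mode Dumps" documentation and should be checked against it before use. The helper name and the C:\CrashDumps folder are hypothetical.

```python
# Key named in the post; the value names below are assumptions to verify
# against Microsoft's documentation.
LOCAL_DUMPS_KEY = r"HKLM\Software\Microsoft\Windows\Windows Error Reporting\LocalDumps"


def local_dumps_commands(dump_folder=None, dump_count=None):
    """Return reg.exe command lines that create the LocalDumps key and,
    optionally, set the dump folder and the number of dumps to keep.

    The commands are returned as strings, not executed; they would need to
    be run from an elevated command prompt.
    """
    commands = [f'reg add "{LOCAL_DUMPS_KEY}" /f']
    if dump_folder is not None:
        commands.append(
            f'reg add "{LOCAL_DUMPS_KEY}" /v DumpFolder /t REG_EXPAND_SZ /d "{dump_folder}" /f'
        )
    if dump_count is not None:
        commands.append(
            f'reg add "{LOCAL_DUMPS_KEY}" /v DumpCount /t REG_DWORD /d {int(dump_count)} /f'
        )
    return commands


if __name__ == "__main__":
    # C:\CrashDumps is a made-up folder used only for this example.
    for line in local_dumps_commands(r"C:\CrashDumps", 10):
        print(line)
```

Printing the commands rather than running them keeps the sketch safe to try on any machine and leaves the decision to change the Registry with the reader.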
Thank you for the great post.
I'm looking forward to the next one. :-)
I've got an nForce 4 motherboard (ASUS A8N SLI deluxe) with nVidia's "onboard firewall" and had many problems with it, mostly corrupted downloads. The problem even manifested itself in corrupted .JPEG's in webpages, which I at first thought was the webservers' fault...
It would also block certain communication.
I hate it when they use Apache to run a webbased configuration UI, which is only intended to be used locally anyway :|
I removed nVidia's "onboard firewall" software and switched to Windows' built-in firewall. It has worked very well.
Thanks for the post, appreciate it! Those damn 1st person shooters, UT2004 has taken so much of my time.
telling mark that his post is great is as normal as saying "you know, it kinda rains in redmond" :-) absolutely great post, as always! to have mark posting from microsoft is a great asset to all of us (more or less engaged in system development)
in the (far) future, it would be nice to see something "the case of windows kernel" where you will argue (on behalf of the best of us) about how great would be to be able to fully debug the windows kernel... :-)
great, thanks again mark!
i am a little bit amazed about how many people said "oh, this is so difficult, for the rest of us" when in fact, in os shops (not even at kernel32, user32, gdi32 level but higher, let's say ntdll) qa engineers are actually __required__ to narrow down issues using (business as usual) tools like windbg and so on... these cases are normal engineering procedures that most programmers (not even from r&d) should instinctively follow - except that mark is "dressing" them in inspired sherlock holmes paragraphs and nice screen shots (that's his unique talent! no doubts about that)
but the fact remains that most of you expressed perplexity about normal engineering processes... and of course, it makes me wonder... of course, in a positive and productive way (as always :-)
As Ian Boyd says:
Yes, it's possible that the root-cause of this problem lies at a deeper level - and that's likely true for a lot of issues you'd care to troubleshoot in this manner.
On the other hand, between the root-cause, and the top-level "Thing You Do That Causes The Crash", there are only so many opportunity-points, where the user or admin (or 3rd-party app developer) can intervene to fix a problem like this.
An admin can hack the registry like Mark did, to block the dll from loading, thus preventing THIS pathway from root-cause to use-case. Other pathways can exist, and you might find that the user can do something else that will cause the same error.
An admin (and sufficiently-privileged user) could also uninstall the nvidia firewall software, which (don't know about this specific case, but), in theory, would probably wipe out a wider range of use-cases that would trigger this crash.
The only way to eliminate all possible pathways to a root-cause is to fix the root-cause, and in most cases, only the OS developer can do that. (Or, in some cases, the problem may be a bug in lower-level software, but it's only exposed by, say, a bad parameter written into its registry key by a third-party app installer.) Often, this is not a viable or available fix, so we go for the intermediate solution: Least Amount of Cost, for the Greatest Amount of Benefit.
And as Daniel says:
Yes - this kind of debugging is often found among QA engineers (who couldn't otherwise code a fix if the source were opened in an editor in front of their face!) - for some reason, this kind of troubleshooting skill seems more prevalent in your higher-end QA guys. (maybe they get to practice it more often?)
Unfortunately, in some organizations, this is seen as "stepping on developer's toes". In other organizations, it's seen as "making most effective use of resources..." How many communication cycles back-and-forth between the QA guy who encounters the error, and the developer who fumbles around with the higher-level troubleshooting first, will occur at every stage of the process, thus slowing down the finding of the solution? (Or wasting the developer's time, when it turns out that the cause was actually pilot-error - because you're testing pre-release code that may not yet have a fully-developed UI?)
Personally, I'd really like to see this attitude in the industry change, because it's one of the very few tasks a QA engineer gets to do that is not mind-numbingly boring, and can even forestall career-burnout! :)
Some of the random "NVIDIA" crashes can be attributed to their partners that put the chips on the PCI express cards. Some of these (OEMs?) have even admitted this problem by suggesting that if you have problems you should 'clean' the pins. However that's not the whole extent of the issue.
I've verified on Intel P35 motherboards from different brands (Asus, Gigabyte) and two different Nvidia 8800 GT units that at least some of these boards are either too thin or the pins otherwise have poor contact.
The cards sway sideways so much that even if the card is "properly" seated and locked in, the computer can either not boot or you can have random crashes or even just random flickering to blank screen without any crash etc. Just supporting the card slightly so it sits more straight can resolve the issue.
Another part of the issue is that the PCI Express connector is shorter in height compared to, say, regular PCI, which makes this swaying more likely.
As someone coming from the administration/operations side of the tech field, not from the developer side, I very much appreciate the specific details of how to troubleshoot at this level, especially the WinDbg instructions.
I have to admit, until this post I had assumed (not sure why) that WinDbg and Symbols access were something that required a MSDN subscription. Now I know, thanks!
knowledge power, excellent teaching, simplificative, etc. etc. thanks for your postings
I did a review of your blog for my technical program at school. Very well done.
Here's the text
Troubleshooting skills are always valuable when any computer stops working. Especially when computers continually fail to work by crashing.
I found Mark's Blog interesting in that he identified the problems very clearly. It was interesting to read how he came to view certain things about services and memory dumps.
The tendency for users to blame the operating system because of incompatible drivers or services seems to be way too rampant. I realized this after reading some comments on this blog.
I'm not sure if my troubleshooting skills will reach the level of Mark's. However, with the right experience and tools, I can hope for success in similar problems.
Writing about technical complexities is much easier with screenshots and highlighted text. I did appreciate the screenshots, which broke up a possibly dull read.
Whether or not this article was entertaining, remains the responsibility of the reader. Teaching others the methodologies of fixing certain items will always be more important than just dishing out entertainment.
Quick note for guys that talked about a "skill gap". Do yourself a big favor and just try to repeat Mark's steps (attach to say IE, set path to public symbols, browse through threads). I'll bet a 6-pack that you are going to find it almost as easy as peeking at threads in Process Explorer.
The reason is that while windbg does have an expert level with command line etc., it also has a few pretty simple windows for just browsing around. Even if you don't get the symbol server (say your net is hosed) it will still nail some basic symbols, and more often than not that's enough to spot a potential offender.
I'm having the same problem. Can you give some simple instructions to a novice on how to fix this problem? Neither HP nor Microsoft is able to help me. I'm running Vista.
crash of window, errors not capability of download, now slow and hard to end all programs. not able to use explorer any more. sorry I'm not able to e-mail any one can receive just can't if explorer is needed. windows not running right. did upgrade memory, old to 512mb 64x64 board.
Mark - great post. I have always wondered how to debug crashes in Windows, this will get me started correctly. Thanks.
i have recently purchased a Vista-loaded Sony notebook. after starting it up, Vista proceeded to install WMP, after which WMP crashed at the first invocation.
this blog is very timely, as i can't get any information on the web - and i'm no Windows programmer to debug the details.
i have avoided using WMP by installing VideoLan's VLC, and SMPlayer from http://smplayer.sourceforge.net/
thank you, Mark!