Mark Russinovich’s technical blog covering topics such as Windows troubleshooting, technologies and security.
While I long for the day when I no longer experience the effects of buggy software, there’s something rewarding about solving my own troubleshooting cases. In the process, I often come up with new techniques to add to my bag of tricks and to share with you in my “Case of the Unexplained…” presentations and blog posts. The other day I successfully closed an especially interesting case that opened when Internet Explorer (IE) crashed as I was reading a web page:
Whenever I experience a crash, whether it’s the system or an application, I always take a look at it. There’s no guarantee, but many times after spending just a few minutes I find clues that point at an add-on as the cause and ultimately a fix or workaround. In most cases when it’s an application crash, the faulty process is obvious and I simply launch Windbg (from the free Debugging Tools for Windows package that comes with the Windows SDK and Windows DDK), attach it to the process, and start investigating.
Sometimes, however, the faulting process isn’t obvious, as was the case when I saw the IE crash dialog. That’s because I was running IE8, which has a multi-process model where different tabs are hosted in different processes:
I had multiple tabs open as usual, so I had to figure out which IE process of the four that were running (in addition to the parent broker instance) was the one that had crashed. I could have taken the brute-force approach of attaching to each process in turn and searching for the faulting thread, but there’s fortunately a simpler and more direct way to identify the target process.
When a process crashes, the Windows Error Reporting (WER) service launches its own process, called WerFault, in the session of the crashed process to display the error dialog to the user running the session and to generate a crash dump file. So that WerFault knows which process is the one that crashed, the WER service passes the process ID (PID) of the target on WerFault’s command line. You can easily view the command line with Process Explorer. Because I always have Process Explorer running with its icon visible in the tray area of the taskbar, I clicked on the icon to open it and found the WER process in the process tree:
I double-clicked on it to open the process properties dialog and the command line revealed the process ID of the problematic IE process:
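As a rough sketch of the same lookup done in code, the Python below pulls the target PID out of a WerFault command line. It assumes the PID follows a `-p` switch. That layout and the sample string are assumptions for illustration; only the PID value 4440 comes from this case.

```python
from typing import Optional


def pid_from_werfault_cmdline(cmdline: str) -> Optional[int]:
    """Return the integer that follows a '-p' switch, or None if there is none."""
    args = cmdline.split()
    # Stop one short of the end so args[i + 1] always exists.
    for i, arg in enumerate(args[:-1]):
        if arg.lower() == "-p":
            try:
                return int(args[i + 1])
            except ValueError:
                return None
    return None


# Hypothetical command line; the switch layout is an assumption.
sample = r"C:\Windows\system32\WerFault.exe -u -p 4440 -s 2080"
print(pid_from_werfault_cmdline(sample))
```

The function only parses a string, so the command line still has to come from somewhere, such as the Process Explorer properties dialog.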
Now that I knew it was process 4440 in which I was interested, I started Windbg, pressed F6 to open the process selection dialog, and double-clicked on Iexplore.exe process 4440. With Windbg attached, my next step was to locate the thread that had faulted so that I could examine its stack for signs of a buggy add-on. In some cases, relying on Windbg’s built-in crash analysis heuristics, which you can trigger with the !analyze command, will do the job for you, but it didn’t this time. Finding the faulting thread is fairly straightforward, though.
First, go to Windbg’s View menu and open both the Processes and Threads and the Call Stack dialogs, arranging them side by side. The goal is to find the thread that has functions with the words fault, exception, or unhandled in their names. You can quickly do this by selecting each thread in the Processes and Threads window, pressing Enter, and then scanning the stack that appears in the Call Stack window. After doing this for the first few threads, I came across the thread I was looking for, revealed by functions all over its stack containing the telltale strings:
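The manual scan described above amounts to a keyword filter over each thread's call stack. A minimal Python sketch of that filter follows. The thread IDs and stack contents are hypothetical sample data, not the output from this debugging session.

```python
KEYWORDS = ("fault", "exception", "unhandled")


def suspicious_threads(stacks):
    """Map thread ID -> frames whose names contain one of the telltale words."""
    hits = {}
    for tid, frames in stacks.items():
        matched = [f for f in frames if any(k in f.lower() for k in KEYWORDS)]
        if matched:
            hits[tid] = matched
    return hits


# Hypothetical stacks, written as module!function strings.
sample_stacks = {
    0: ["ntdll!NtWaitForMultipleObjects", "user32!MsgWaitForMultipleObjectsEx"],
    1: ["ntdll!NtWaitForSingleObject", "kernel32!WaitForSingleObjectEx"],
    2: [
        "ntdll!NtWaitForMultipleObjects",
        "kernel32!UnhandledExceptionFilter",
        "ntdll!KiUserExceptionDispatcher",
    ],
}

for tid, frames in suspicious_threads(sample_stacks).items():
    print(tid, frames)
```

With this sample data only thread 2 is reported, because it is the only stack with frames that contain "unhandled" or "exception".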
Unfortunately, I was at an apparent dead end as far as fingering an add-on: all the DLLs shown in the call stack were Microsoft’s. There was one indicator that there might be an add-on hidden from view though, and that was the text reporting that Windbg couldn’t find symbols for at least some of the stack’s frames, so it was forced to make guesses about the stack’s layout and was showing an address that didn’t lie within any DLL:
This happens when a DLL is compiled with frame pointer omission (FPO), an optimization that, in the absence of symbolic information for the DLL, prevents the debugger from finding stack frames just by following the frame-pointer chain. The return addresses for the functions the thread invoked must be on the stack (unless they were overwritten by the bug that caused the crash), but Windbg’s heuristics couldn’t locate them.
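To make the frame-pointer chain concrete, here is a simplified Python model of a 32-bit stack in which each frame stores the saved EBP at `[ebp]` and the return address at `[ebp+4]`. All addresses and values are invented for illustration, and the model ignores everything a real debugger does beyond following the chain.

```python
def walk_frame_chain(memory, ebp, max_frames=64):
    """Follow saved-EBP links and collect the return address of each frame."""
    returns = []
    while ebp in memory and ebp + 4 in memory and len(returns) < max_frames:
        returns.append(memory[ebp + 4])
        next_ebp = memory[ebp]
        # The stack grows downward, so each caller's frame sits at a higher address.
        if next_ebp <= ebp:
            break
        ebp = next_ebp
    return returns


# Hypothetical stack contents: address -> 32-bit value.
stack = {
    0x1000: 0x1020, 0x1004: 0x00401000,  # frame 1: saved EBP, return address
    0x1010: 0x00500000,                  # return address not linked into the chain
    0x1020: 0x1040, 0x1024: 0x00402000,  # frame 2
    0x1040: 0x0000, 0x1044: 0x00403000,  # frame 3: chain ends here
}

print([hex(r) for r in walk_frame_chain(stack, 0x1000)])
```

The value at 0x1010 stands in for a return address pushed on behalf of a function that never saved EBP. It is on the stack, but the walk never visits it, which is why a raw scan of the stack is needed to find it.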
There’s a Windbg command that you can use in these cases to hunt for the missing frame function addresses, the Display Words and Symbols command. If you’re debugging a 32-bit process, use the dds version of the command and if it’s a 64-bit process use dqs. You can also use dps (Display Pointer Symbols), which will interpret the function addresses as the appropriate size for a 32-bit or 64-bit process. The address to give to the command as the starting point should be the address of the stack frame immediately above the one where Windbg got lost. To see the address, click on the Addrs button in the call stack dialog:
The address on the frame in question was 2cbc5c8:
I passed it to dds as the argument and pressed enter:
The first page of results didn’t list any functions besides the expected one, KiUserException. I hit the enter key again without typing another command, because for address-based commands like dds, that tells Windbg to repeat the last command at the address where it left off. The second page of results yielded something more interesting, the name of a DLL I wasn’t familiar with:
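Conceptually, a dds-style scan reads consecutive 4-byte stack slots and labels any value that falls inside a loaded module. The Python sketch below models only that idea and is not how Windbg implements it. The starting address is the one from this case, while the module names, base addresses, sizes, and stack values are hypothetical.

```python
def scan_stack(memory, start, count, modules):
    """Return (address, value, label) for up to `count` 4-byte slots from `start`."""
    rows = []
    for i in range(count):
        addr = start + 4 * i
        value = memory.get(addr)
        if value is None:  # ran off the end of the captured stack
            break
        label = ""
        for name, (base, size) in modules.items():
            if base <= value < base + size:
                label = f"{name}+0x{value - base:x}"
                break
        rows.append((addr, value, label))
    return rows


# Hypothetical module map: name -> (base address, size in bytes).
modules = {"ntdll": (0x77000000, 0x140000), "yt": (0x10000000, 0x50000)}

# Hypothetical stack contents starting at the frame address from the call stack.
stack = {
    0x02CBC5C8: 0x77012345,
    0x02CBC5CC: 0x00000000,
    0x02CBC5D0: 0x10001A2B,
    0x02CBC5D4: 0x02CBC600,
}

for addr, value, label in scan_stack(stack, 0x02CBC5C8, 4, modules):
    print(f"{addr:08x}  {value:08x}  {label}")
```

Slots whose values land in no module get an empty label, as plain data and saved stack pointers would. A labeled slot is only a candidate return address, because stale values from earlier calls can also point into a module.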
An easy way to see version information for a module without leaving Windbg is to use the lm (List Modules) command. The output of that command told me that Yt.dll (the name of the DLL is the text to the left of the “!”) was part of the Yahoo Toolbar:
This came as a surprise because the system on which the crash occurred was my home gaming system, a computer that I’d only had for a few weeks. The only software I generally install on my gaming systems is Microsoft Office and games. I don’t use browser toolbars and if I did, I would obviously use the one from Bing, not Yahoo’s. Further, the date on the DLL showed that it was almost two years old. I’m pretty diligent about looking for opt-out checkboxes on software installers, so the likely explanation was that the toolbar had come onto my system piggybacking on the installation of one of the several video-card stress testing and temperature profiling tools I used while overclocking the system. I find the practice of forcing users to opt out annoying and not giving them a choice even more so, so I was pretty annoyed at this point. A quick trip to the Control Panel and a few minutes later, my system was free of the undesired and out-of-date toolbar.
Using a couple of handy troubleshooting techniques, in less than five minutes I had identified the probable cause of the crash I experienced, made my system more reliable, and probably even improved its performance. Case closed.
This was very interesting. But to me it just highlights the much bigger problem - what would my 70 year-old mother have done? "Brian, I got this 'toast' that says something about the program not responding." She calls anything that pops up 'toast' but she would have NO idea what to do here.
How can we, as an industry, help her and the growing number of folks like her who just want to use the 'dang puter' to buy something online or get email from her grandkids?
Mark, thanks for sharing! Next step: interview Carol Bartz.
Color me impressed (as usual)!
Now, besides reading this blog, where can I find this type of troubleshooting information? I have used some of your tools in the past (ProcessExplorer, ProcessMonitor, etc.) but where can I find a learning guide(s) to this high level troubleshooting?
Thanks in advance!
Good info Mark! Have you looked over the tools you installed to see if there was an obvious culprit? If you still have the install packages, you could probably just rip them open and look for the Yahoo cab files. I'm sure even the Yahoo folks would like to know who's distributing a 2 year old version of their toolbar with these kinds of bugs in it...
Drive-by installs like this should be treated like a virus, and all AV software manufacturers should be made aware of these issues. Did you figure out what software did the rogue install?
Once again a great read with some great tips! I just don't know how the home user is expected to get remotely close to solving a problem like that. The vast majority of people would probably just re-install Windows, wasting a valuable amount of time. It's a shame that even though it is a third-party DLL, IE just crashes with no useful information whatsoever. I long for the day when Windows apps crash and give me some useful information to work with.
Your technique is very interesting, but it's very hard to reach this level of knowledge! Congrats!
You could have just looked in Add/Remove programs for anything browser related :)
But thanks for this tip. I wasn't familiar with the FPO issue and the dds/dqs commands.
But you have to ask why after all these years doesn't microsoft write code that identifies the cause of the problem to a typical computer user in language that they could understand, and offer to uninstall the Yahoo Toolbar for them? This would be a valuable addition to the operating system, and is the type of thing that Microsoft should be doing if they want to see Windows retain its relevance for the typical home user. I'm sure Mark could whip off something like this in a week or two.
We need to implement the intelligence and programming skills of Mark in the operating system, so the typical user doesn't require Mark's skills to keep their home installation of Windows running.
Another awesome entry! Thank you for sharing your knowledge.
Thanks for this fantastic article, Mark! I could keep reading and reading and reading.
And you always make it sound so simple. :-)
> Now, besides reading this blog, where can I find this type of troubleshooting information?
Check out the videos that Mark and David Solomon did a few years back, known as the SysInternals Video Library. They are presented by Mark and Dave and cover Server 2003 and Windows XP analysis and troubleshooting.
In my opinion, these videos are the next best thing to taking an in-person class from Mark and/or Dave:
Another great post. Thanks, Mark!
I have to chime in with some of the other commenters here though. How is an "ordinary user" supposed to resolve this problem? While I have all of the Sysinternals tools and Windbg installed on my boxes, I'm quite certain that my neighbor doesn't. He also wouldn't have the patience or knowledge to step through a troubleshooting procedure like this.
He'd probably just curse Microsoft for making shoddy software (undeserved in this case) and then install Firefox or Chrome. In today's world of stealth installers and add-ons, how can we get Windows to report errors more clearly? That "IE has stopped working" message should say: "Not my fault! Might be the fault of add-on Yahoo! Toolbar." (insert your own joke about NotMyFault.sys here :) )
At least that would point someone in a general troubleshooting direction.
You can use 'dps' which dumps 'pointer-size'.
@Brian: This, again, drives home the point that the 70 year old grandmother is woefully out of luck with Windows unless her grandson happens to be Mark Russinovich. Buy her an Apple.
The same goes for the tales about Mark's wife's computer: Each time something goes wicked in it, his wife hollers for him to help, and each time he supermanly, if not batmanly, manages to hunt down the cunning driver or spyware problem. The point is not that Mark is a Super Hero (we all knew that already), but that just about anybody else's wife would've been out of luck. Not necessarily because their husbands are wimps, but because Windows is doomed to suffer from spyware, viruses and sucky drivers no matter how cleanly you try to use it.
(For the record, I'm not an Apple nor Batman fan. I use Windows at work and linux at home. Linux is sucky in other ways, but is good at doing what it's told and nothing else (I haven't seen an unwanted Yahoo!! bar on my linux boxes). MacOS X I don't have enough experience to talk about, but I would guess it is not perfect either.)