Mark Russinovich’s technical blog covering topics such as Windows troubleshooting, technologies and security.
While I long for the day when I no longer experience the effects of buggy software, there’s something rewarding about solving my own troubleshooting cases. In the process, I often come up with new techniques to add to my bag of tricks and to share with you in my “Case of the Unexplained…” presentations and blog posts. The other day I successfully closed an especially interesting case that opened when Internet Explorer (IE) crashed as I was reading a web page:
Whenever I experience a crash, whether it’s the system or an application, I always take a look at it. There’s no guarantee, but many times after spending just a few minutes I find clues that point at an add-on as the cause and ultimately a fix or workaround. In most cases when it’s an application crash, the faulty process is obvious and I simply launch Windbg (from the free Debugging Tools for Windows package that comes with the Windows SDK and Windows DDK), attach it to the process, and start investigating.
Sometimes, however, the faulting process isn’t obvious, as was the case when I saw the IE crash dialog. That’s because I was running IE8, which has a multi-process model where different tabs are hosted in different processes:
I had multiple tabs open as usual, so I had to figure out which IE process of the four that were running (in addition to the parent broker instance) was the one that had crashed. I could have taken the brute-force approach of attaching to each process in turn and searching for the faulting thread, but there’s fortunately a simpler and more direct way to identify the target process.
When a process crashes, the Windows Error Reporting (WER) service launches its own process, called WerFault, in the session of the crashed process to display the error dialog to the user running the session and to generate a crash dump file. So that WerFault knows which process is the one that crashed, the WER service passes the process ID (PID) of the target on WerFault’s command line. You can easily view the command line with Process Explorer. Because I always have Process Explorer running with its icon visible in the tray area of the taskbar, I clicked on the icon to open it and found the WER process in the process tree:
I double-clicked on it to open the process properties dialog and the command line revealed the process ID of the problematic IE process:
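If you'd rather pull the PID out of the command line programmatically, a minimal Python sketch might look like the following. It assumes the PID follows a -p switch, which is how the WerFault command line typically appears in Process Explorer but is not something the text above states; the function name and the sample command line (including the -u and -s values) are made up for illustration.

```python
import shlex
from typing import Optional

def crashed_pid_from_werfault(cmdline: str) -> Optional[int]:
    """Return the PID that follows the -p switch, or None if there isn't one.

    Assumption: WerFault receives the target PID as '-p <pid>'.
    """
    # posix=False keeps Windows backslashes in the path intact
    args = shlex.split(cmdline, posix=False)
    for i, arg in enumerate(args[:-1]):
        if arg.lower() == "-p":
            try:
                return int(args[i + 1])
            except ValueError:
                return None
    return None

# Hypothetical command line, modeled on what Process Explorer displays
sample = r"C:\Windows\system32\WerFault.exe -u -p 4440 -s 1680"
print(crashed_pid_from_werfault(sample))  # prints 4440
```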
Now that I knew it was process 4440 in which I was interested, I started Windbg, pressed F6 to open the process selection dialog, and double-clicked on Iexplore.exe process 4440. With Windbg attached, my next step was to locate the thread that had faulted so that I could examine its stack for signs of a buggy add-on. In some cases, relying on Windbg’s built-in crash analysis heuristics, which you can trigger with the !analyze command, will do the job for you, but it didn’t this time. Finding the faulting thread is fairly straightforward, though.
First, go to Windbg’s View menu and open both the Processes and Threads and the Call Stack dialogs, arranging them side by side. The goal is to find the thread that has functions with the words fault, exception, or unhandled in their names. You can quickly do this by selecting each thread in the Processes and Threads window, pressing Enter, and then scanning the stack that appears in the Call Stack window. After doing this for the first few threads, I came across the thread I was looking for, revealed by functions all over its stack containing the telltale strings:
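The same scan can be expressed as a small Python sketch, assuming you've already captured each thread's stack as a list of frame names. The thread IDs and frames in the sample data are illustrative, not taken from the actual crash, and suspicious_threads is a hypothetical helper name.

```python
# Substrings that tend to mark the thread that took the exception
KEYWORDS = ("fault", "exception", "unhandled")

def suspicious_threads(stacks):
    """stacks maps a thread id to its list of frame names (module!function).

    Returns only the threads that have at least one frame whose name
    contains one of KEYWORDS, along with the matching frames.
    """
    hits = {}
    for tid, frames in stacks.items():
        matched = [f for f in frames if any(k in f.lower() for k in KEYWORDS)]
        if matched:
            hits[tid] = matched
    return hits

# Illustrative data only
stacks = {
    0: ["ntdll!NtWaitForSingleObject", "kernel32!WaitForSingleObjectEx"],
    7: ["ntdll!NtWaitForMultipleObjects",
        "kernel32!UnhandledExceptionFilter",
        "ntdll!KiUserExceptionDispatcher"],
}
print(suspicious_threads(stacks))
```

With this sample data only thread 7 is reported, because two of its frames contain the telltale strings.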
Unfortunately, I was at an apparent dead end as far as fingering an add-on: all the DLLs shown in the call stack were Microsoft’s. There was one indicator that there might be an add-on hidden from view though, and that was the text reporting that Windbg couldn’t find symbols for at least some of the stack’s frames, so was forced to make guesses about the stack’s layout and was showing an address that didn’t lie within any DLL:
This happens when a DLL uses frame pointer omitted (FPO) calling conventions, which in the absence of symbolic information for the DLL prevents the debugger from finding stack frames just by following the frame-pointer chain. The return addresses for the functions the thread invoked must be on the stack (unless they were overwritten by the bug that caused the crash), but Windbg’s heuristics couldn’t locate them.
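To make the frame-pointer chain concrete, here's a toy Python model of a 32-bit stack. It assumes the conventional x86 layout in which each frame stores the caller's saved EBP at [ebp] and the return address at [ebp+4]. All addresses are made up, and the sketch is a simplification of what a debugger does, not Windbg's actual algorithm.

```python
def walk_frame_chain(memory, ebp, max_frames=64):
    """Follow saved-EBP links and collect the return addresses found.

    memory maps a stack address to the 32-bit value stored there.
    """
    returns = []
    while ebp in memory and ebp + 4 in memory and len(returns) < max_frames:
        returns.append(memory[ebp + 4])  # return address sits just above saved EBP
        next_ebp = memory[ebp]           # the caller's frame pointer
        if next_ebp <= ebp:              # a valid chain moves toward higher addresses
            break
        ebp = next_ebp
    return returns

# Three well-formed frames at 0x1000, 0x1020 and 0x1040 (made-up values)
intact = {
    0x1000: 0x1020, 0x1004: 0x401000,
    0x1020: 0x1040, 0x1024: 0x402000,
    0x1040: 0x0,    0x1044: 0x403000,
}
print([hex(a) for a in walk_frame_chain(intact, 0x1000)])
# ['0x401000', '0x402000', '0x403000']

# An FPO caller used EBP as a general-purpose register, so the value its
# callee saved at 0x1000 is not a frame pointer at all.
broken = dict(intact)
broken[0x1000] = 0x12
print([hex(a) for a in walk_frame_chain(broken, 0x1000)])
# ['0x401000']
```

In the second case the return addresses 0x402000 and 0x403000 are still sitting in stack memory, but the chain no longer leads to them.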
There’s a Windbg command that you can use in these cases to hunt for the missing frame function addresses, the Display Words and Symbols command. If you’re debugging a 32-bit process, use the dds version of the command and if it’s a 64-bit process use dqs. You can also use dps (Display Pointer Symbols), which will interpret the function addresses as the appropriate size for a 32-bit or 64-bit process. The address to give to the command as the starting point should be the address of the stack frame immediately above the one where Windbg got lost. To see the address, click on the Addrs button in the call stack dialog:
The address on the frame in question was 2cbc5c8:
I passed it to dds as the argument and pressed enter:
The first page of results didn’t list any functions besides the expected one, KiUserException. I hit the enter key again without typing another command, because for address-based commands like dds, that tells Windbg to repeat the last command at the address where it left off. The second page of results yielded something more interesting, the name of a DLL I wasn’t familiar with:
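Conceptually, what dds does here is read the stack one pointer-sized value at a time and report any value that falls inside a loaded module. The Python sketch below illustrates that idea under simplifying assumptions: it labels hits as module+offset, whereas the real command resolves to the nearest symbol. The module ranges and stack contents are invented for the example; only the starting address 2cbc5c8 comes from the case above.

```python
def scan_stack(memory, start, count, modules, word=4):
    """Read `count` word-sized slots starting at `start`.

    memory  maps a stack address to the value stored there
    modules maps a module name to its (base, end) address range
    Returns (address, value, label) rows; label is '' when the value
    doesn't fall inside any known module.
    """
    rows = []
    for i in range(count):
        addr = start + i * word
        value = memory.get(addr)
        if value is None:  # ran off the end of the captured stack
            break
        label = ""
        for name, (base, end) in modules.items():
            if base <= value < end:
                label = f"{name}+0x{value - base:x}"
                break
        rows.append((addr, value, label))
    return rows

# Hypothetical module ranges and stack contents
modules = {"ntdll": (0x77000000, 0x77140000), "yt": (0x10000000, 0x10100000)}
memory = {
    0x02cbc5c8: 0x00000000,
    0x02cbc5cc: 0x77012345,  # lands inside ntdll
    0x02cbc5d0: 0x02cbc600,  # a stack address, not code
    0x02cbc5d4: 0x10001234,  # lands inside yt
}
for addr, value, label in scan_stack(memory, 0x02cbc5c8, 4, modules):
    print(f"{addr:08x}  {value:08x}  {label}")
```

Any row whose label names a module you don't recognize is a candidate return address worth following up.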
An easy way to see version information for a module without leaving Windbg is to use the lm (List Modules) command. The output of that command told me that Yt.dll (the name of the DLL is the text to the left of the “!”) was part of the Yahoo Toolbar:
This came as a surprise because the system on which the crash occurred was my home gaming system, a computer that I’d only had for a few weeks. The only software I generally install on my gaming systems is Microsoft Office and games. I don’t use browser toolbars and if I did, I would obviously use the one from Bing, not Yahoo’s. Further, the date on the DLL showed that it was almost two years old. I’m pretty diligent about looking for opt-out checkboxes on software installers, so the likely explanation was that the toolbar had come onto my system piggybacking on the installation of one of the several video-card stress testing and temperature profiling tools I used while overclocking the system. I find the practice of forcing users to opt-out annoying and not giving them a choice even more so, so I was pretty annoyed at this point. A quick trip to the Control Panel and a few minutes later, my system was free of the undesired and out-of-date toolbar.
Using a couple of handy troubleshooting techniques, in less than five minutes I had identified the probable cause of the crash I experienced, made my system more reliable, and probably even improved its performance. Case closed.
To all the people who say "why doesn't Microsoft just...": first of all, correctly diagnosing these problems without prior knowledge is very hard to automate. The general problem of figuring out who *exactly* is to blame for a crash is practically unsolvable -- modules can cause problems that won't show up for a long time and then in somebody *else's* code. It takes a good deal of sleuthing to identify the real culprit in such cases -- Mark's example here was fairly trivial, the only hurdle being the FPO. Misblaming someone on an automated basis would be a costly mistake.
The best you could hope for is an application compatibility database that would say "this version of the Yahoo Toolbar is known to cause instability in IE8, warn the user about that". There already is such a compatibility database, actually (you may have seen it in action in the early days of Vista and Win7), but obviously keeping it up to date is lots of work. I have no idea what Microsoft's policies on it are; if I were Microsoft I'd have a team assigned to dissecting crash reports sent by WER not just for problems in my code but especially for problems in third-party code -- they probably have such a team. Even then you'd want to get permission from Yahoo to warn about incompatibilities with Yahoo's stuff -- otherwise they're opening themselves up to defamation and monopoly lawsuits.
In other words -- this stuff costs a lot of money, money Microsoft is probably already investing if they're smart, but a system like that will never be perfect.
Good point. I've updated the text to note that 'dps' will do the right thing.
Nice job! Reading your articles is so much fun!
I have to agree with the others, even advanced users would struggle to figure this one out. Surely a lot of what Mark did could be automated: (1) find which thread caused the exception (2) get its callstack (3) resolve symbols, as much as is possible, automatically via Microsoft's symbol server (4) If an address doesn't lie within any DLL, find the missing frame functions by doing what dds/dqs does automatically (5) show any non-MS modules as suspicious, and as much of a callstack as possible.
So why can't an advanced dialog in WER do this?
@Jeroen, that's simply not true. Everything Mark described in this post could have been automated. Don't let perfect be the enemy of good.
Mark, you draw our attention to one of the major tradeoffs in today's complex OSes: more, possibly overlapping, functions loosely controllable by average users versus a more rigid but efficient approach. I guess we need to add profile capabilities to the OS (instead of making Windows more closed) that CUT off all unneeded
Mac OS crash reports always show you the thread where the app crashed, along with all the other threads from the same process; sometimes they can even tell you the line and file in the source where it crashed.
We all know that nothing is impossible; it's just a matter of commercial priorities, I guess.
"...I find the practice of forcing users to opt-out annoying and not giving them a choice even more so."
Please have a word or two with your colleagues on the Windows Live team.
Unless the user opts-out, the Windows Live Essentials "all-in-one installer" will install WL Messenger, WLMail, WLToolbar, WLWriter, WLPhoto Gallery, WLMovie Maker, Silverlight, (and if Outlook is installed) Outlook Connector, and Office Live Add-in by default.
MS MVP-IE, Mail, Security, Windows Client - since 2002
@Tilly: Windows is only targeted because it's a large, well-known platform from a large, well-known corporation. I wouldn't want to be a bank robber and rob a convenience store with only $50 in the register when I can rob the bank right next door and not be caught either way (in the best-case scenario!). The same thinking goes for hackers and programmers. We don't go about doing our jobs because we want to annoy users. We do it because we enjoy it. We do NOT, however, intend the software to be "buggy", and users not reporting bugs in detail DOES NOT HELP US fix the problem in the next update. Hackers and malicious programmers write their code to do what they need and to get the information they need from the target. They don't worry if it doesn't work on a few computers due to some minor glitch they didn't catch. They most likely only tested their application on their own virtual network or something, under one, maybe two, operating systems, and were done with it and started attacking people. In some cases not even that test takes place.
@Jon: the problem here is what Microsoft should present to the end user. "This analysis indicates that the Yahoo Toolbar may be the problem, but this analysis is not perfect so in case we're wrong, please don't blame Yahoo"? Remember, we're talking about an application that's supposed to automate this for end users with no deeper understanding of the issues.
In particular, Mark's conclusion that the Yahoo Toolbar was probably to blame was exactly that -- an educated guess that the Yahoo Toolbar was probably to blame. Even ignoring the possibility of stack corruption, another plugin could have corrupted IE's state in such a way to make YT crash IE -- without the possibility of debugging YT effectively, this would have been very hard to show except through rigorous analysis. It would be a mistake to think scenarios like this are rare.
Don't get me wrong -- automated debugging in general is awesome, and it should be used more. But you cannot very well say "we mustn't let perfect be the enemy of good" to the company lawyers when your new automated crash analysis tool is unjustly fingering the wrong third-party product(s) as unstable to your end users. "Show any non-MS modules as suspicious", indeed.
This sort of thing *can* be built, it's just a question of cost. My point is just that it's more costly than you seem to think. If you want to script stuff like this through WinDbg to play with it yourself, you know how to go about it, and lots of Microsoft people (including Mark) are making this information available. That's miles from something you can give your millions of end users, though.
That said, if they do want to make something out of this, people like Mark would be prime candidates for recruitment into the team for Project Fingerpointing. :-)
This is an amazing article!! As always, many thanks Mark!!
@ Brian -> You will surely find it interesting to read one of the gems of MSPress: Windows Internals 5th edition, Mark Russinovich & David Solomon with Alex Ionescu
Robear: This is a completely different case. Mark is referring to an installer for App X that installs, along the way, unrelated App Y, usually for commercial gain. The Windows Live Installer is an installer for the Windows Live tools - it's very upfront about what it's doing - it's doing what you expect, installing Windows Live tools.
Would I prefer it if all checkboxes were unchecked by default? Probably. But there's a big difference between an unsatisfactory default configuration, like your Windows Live example or Microsoft Office, which automatically opts into installing Word and Excel and many other tools, and an overclocking app that sneaks in a completely unrelated tool.
Post by post, your cases are getting more interesting.
@Jeroen Mostert , " if I were Microsoft I'd have a team assigned to dissecting crash reports sent by WER not just for problems in my code but especially for problems in third-party code -- they probably have such a team"
You're right, they do have such a team. It's possible that this issue hasn't occurred before, or if it had, it was still under investigation (as you can imagine, there's a lot more bad code than hours to investigate it).
And @Jon, a lot of this is likely automated, but not presented to the user. For example, the Watson that is sent to MS obviously has the callstack for the relevant process. No one has to figure out which process to look at. The FPO issue is trickier. Mark knew he was looking for an addin, but an automated tool wouldn't necessarily know to look for an addin. And even if it used a heuristic to favor non-MS DLL locations, there could be several of them on the stack.
I agree that dumping callstacks for the faulting process is useful, and even running things like dds for the user would be a plus. But I'd stop short of making a diagnosis unless the callstack matched a known issue.