Thoughts from the EPS Windows Server Performance Team
Almost everyone who has ever used Windows has either heard of or experienced a bugcheck - the infamous "Blue Screen of Death." A system may bugcheck for different reasons, but the bottom line is that the operating system has experienced a catastrophic fault that prevents the system from continuing to run. We're going to cover some basic information about why a server may crash, explain how to configure and capture crash dumps, and review some basic debugging of a crash dump.
Before we get started, however, remember that there is a difference between a bugcheck and an application crash. A bugcheck is a kernel-mode crash, whereas an application crash is a user-mode event. We covered the differences between kernel- and user-mode memory in our Memory Management 101 post several months ago. So what are some common reasons why you may experience a bugcheck?
OK - so if Windows knows that something is wrong, why does it crash? Wouldn't it be better to ignore the failure and carry on working? In some cases, there is a possibility that the problem is isolated and that the failing component will recover on its own. However, it is more likely that there is a deeper issue, such as memory corruption or a hardware failure. If the system simply ignored these issues and continued to run, then the risk of further errors and data corruption would increase - a risk that is too high to take.
An analogy for this would be the "Check Engine Light" in your car suddenly coming on. When this light comes on, you don't immediately know how serious the problem is. It could be something as simple as a gas cap that has not been tightened properly. In this instance, pulling over and tightening the gas cap would resolve the issue. However, there could be a far more serious issue that you won't be able to resolve until you have the diagnostic trouble codes in your car's on-board computer memory reviewed. In either case, it would be inadvisable to ignore the "Check Engine Light."
So what actually happens on a system when it bugchecks? There is a function documented in the Windows DDK called KeBugCheckEx. This function brings down the system in a controlled manner. After it masks out all interrupts on all processors on the system, it switches the display into VGA mode, paints the blue background and displays the STOP code, along with four parameters that are interpreted based on the nature of the STOP code. There may also be text displayed that provides standard suggestions for the user.
Windows XP Service Pack 1 and higher, as well as Windows Server 2003, introduced a new function - KeRegisterBugCheckReasonCallback. Drivers use this function to register routines that execute during a system bugcheck. These routines may append driver-specific data to the crash dump or write crash dump data to alternate devices.
Although there are over one hundred unique STOP codes, a few common ones account for the majority of bugchecks on Windows systems. The Help file included with the Windows Debugging Tools contains information on the different STOP codes. The Help file can assist you in interpreting the errors; however, it may be necessary to review the crash dump file that is created when the system bugchecks.
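To make the "STOP code plus four parameters" idea concrete, here is a minimal Python sketch that formats a STOP code with its parameters and looks up a symbolic name. It is only an illustration: the function name `format_stop_line` is hypothetical, the table holds just a handful of well-known codes, the output layout is an assumption rather than the exact on-screen text, and the parameter values in the example are made up.

```python
# A few well-known STOP codes. The real list is far longer; the
# Debugging Tools Help file is the place to look up the rest.
KNOWN_STOP_CODES = {
    0x0000000A: "IRQL_NOT_LESS_OR_EQUAL",
    0x00000019: "BAD_POOL_HEADER",
    0x0000001E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x00000050: "PAGE_FAULT_IN_NONPAGED_AREA",
    0x0000007B: "INACCESSIBLE_BOOT_DEVICE",
    0x000000D1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
}


def format_stop_line(code, params):
    """Render a STOP code and its four parameters as one line of text.

    The layout is illustrative only; the text Windows actually displays
    differs between versions.
    """
    if len(params) != 4:
        raise ValueError("a bugcheck carries exactly four parameters")
    name = KNOWN_STOP_CODES.get(code, "UNKNOWN_STOP_CODE")
    args = ", ".join(f"0x{p:08X}" for p in params)
    return f"STOP: 0x{code:08X} ({args}) {name}"


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the formatting.
    print(format_stop_line(0x19, (0x20, 0x0, 0x0, 0x0)))
```

The meaning of each parameter depends on the STOP code, which is why the sketch leaves them as raw numbers rather than trying to interpret them.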
Bugchecks most often occur after a change has been made to the system - for example the installation of new software or hardware. If you have just added a driver, rebooted the machine and the system bugchecks early during the system initialization process, then using the Last Known Good Configuration option can sometimes bring the system back online so that troubleshooting can be performed, and the offending driver removed (if necessary). This is because the installation of a new driver creates the associated registry entries that determine the driver startup type and file path. Until the system reboots successfully after this installation, the entry is not committed to the ControlSet number referenced in the LastKnownGood value in the HKLM\System\Select key. However, this same troubleshooting does not work if you update an existing driver because the associated registry entries that call for that driver to be loaded will already be present on the system as a result of the last successful boot. Since the actual files have changed, the Last Known Good Configuration option will not work.
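If you want to see which control set Last Known Good would use, the values under the Select key can be read directly. The following Python sketch assumes the usual layout of that key: DWORD values named Current, Default, Failed and LastKnownGood, each holding the number of a ControlSetNNN key. Only LastKnownGood is mentioned in this post; the other names, the helper functions and the sample numbers are assumptions for illustration.

```python
import sys

SELECT_VALUE_NAMES = ("Current", "Default", "Failed", "LastKnownGood")


def control_set_name(number):
    """Map a Select value such as 2 to its key name, 'ControlSet002'."""
    return f"ControlSet{number:03d}"


def describe_select(select):
    """Return {value name: control set key name} for the Select values.

    A value of 0 is assumed to mean that no control set is recorded
    for that entry, so it maps to None.
    """
    return {
        name: control_set_name(number) if number else None
        for name, number in select.items()
    }


def read_select_values():
    """Read the Select values from the live registry (Windows only)."""
    import winreg  # standard library, but only present on Windows

    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, r"SYSTEM\Select") as key:
        for name in SELECT_VALUE_NAMES:
            values[name], _value_type = winreg.QueryValueEx(key, name)
    return values


if __name__ == "__main__":
    if sys.platform == "win32":
        select = read_select_values()
    else:
        # Hypothetical sample so the sketch can run on other platforms.
        select = {"Current": 1, "Default": 1, "Failed": 0, "LastKnownGood": 2}
    for name, key_name in describe_select(select).items():
        print(f"{name:14} -> {key_name}")
```

When LastKnownGood points at a different control set than Current, a newly added driver's registry entries exist only in the current set, which is why booting with Last Known Good can skip that driver.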
And that brings us to the end of our quick look at Understanding Bugchecks. In our next post on this topic, we will cover the properties of Crash Dump Files. Until next time ...
- CC Hameed
That was useful! Meanwhile, I'm trying to shoot a bugcheck crash that happens whenever I use a backup program--like ntbackup, but others too--that uses the volume shadow copy service. They bomb with a stop code indicating a bad_pool_header. From what I've read this is most likely caused by a device driver that has corrupted storage pool management data. I have a kernel mode dump and have poked at it with windbg but do not know how to trace the pool structure to find out what driver is at fault. Any help much appreciated : )
Having all of this information gathered together in one place with a friendly overview was really helpful. There were a couple odd things about memory dumps that were never quite clear before and this cleared it right up. Nice work!
Jason - I recommend opening up a Support Incident with Microsoft Product Support. Two reasons - first, they can help you review the dump file. Second, I suspect that you might have some older versions of volsnap or other VSS related files - they can help you get those files up to date. Take a look at KB 940349 - there's a VSS Update Rollup package available.
Ed - Glad we could help!