In case you aren’t familiar with the word, a bugcheck is one of several technical terms for the situation in which an operating system halts because it has encountered an error that prevents it from safely continuing to operate. Other technical terms used to describe this condition include:

  • Kernel panic
  • System halt
  • Fatal system error
  • Stop error

And some non-technical terms to describe this condition include:

  • System crash
  • Blue screen of death (BSOD)

When this condition occurs, the system creates a system dump (also known as a memory dump or crash dump) that records what the system was doing at the time. The dump can be very useful in debugging the problem and determining why the bugcheck occurred in the first place. Depending on how the administrator has configured the operating system, after the system dump is written to disk (if possible), the operating system may restart itself as a form of self-corrective action.

Exploiting Bugcheck Behavior

I sometimes hear administrators describe a bugcheck as a bad thing. The bugcheck behavior itself is a good thing. It’s the problem that caused the bugcheck to occur that is the bad thing. Simply put, bugchecking is there because it enables the system to try to recover from an otherwise unrecoverable error. Understanding bugchecks for what they are lends itself to understanding how an application might exploit this behavior to its own advantage. For example, in Windows Server 2008 R2, new logic was added to Windows Failover Clustering (WFC) that enabled WFC to self-recover under specific conditions. When certain catastrophic and unrecoverable errors occur in a cluster running Windows Server 2008 R2, WFC will intentionally bugcheck the server as a last-resort method of recovery.

Exchange 2010 SP1 Bugcheck Behavior

In Exchange 2010 SP1, we added logic to the system that leverages bugcheck behavior when certain conditions occur, specifically when hung IO occurs. In SP1, the Extensible Storage Engine (ESE) has been updated to detect hung IO and to take corrective action to automatically recover the server. ESE keeps an IO watchdog thread that detects when an IO has been outstanding for too long. If an IO is outstanding for more than one minute, ESE will log an event. If an Exchange database has an IO outstanding for more than four minutes, ESE will log a specific failure event, if it is possible to do so. ESE event 507, 508, 509 or 510 may or may not be logged, depending on the nature of the hung IO. Obviously, if the problem affects the OS volume or the ability to write to the event log, the events will not be logged. If the events are logged, the Microsoft Exchange Replication service (MSExchangeRepl.exe) will detect those failure events and intentionally cause a bugcheck of Windows by terminating the wininit.exe process.
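To make the two thresholds concrete, here is a minimal Python sketch of how a watchdog might classify outstanding IOs by age. It is an illustration only, not ESE's actual implementation: the names IoWatchdog, classify_outstanding_io, WARN_AFTER_SECONDS and FAIL_AFTER_SECONDS are hypothetical, and only the one-minute and four-minute values come from the description above.

```python
import time

# Thresholds taken from the text; the constant names are hypothetical.
WARN_AFTER_SECONDS = 60       # outstanding for more than one minute: log an event
FAIL_AFTER_SECONDS = 4 * 60   # outstanding for more than four minutes: log a failure event


def classify_outstanding_io(issued_at, now):
    """Return 'ok', 'warning' or 'failure' based on how long an IO has been outstanding."""
    age = now - issued_at
    if age > FAIL_AFTER_SECONDS:
        return "failure"
    if age > WARN_AFTER_SECONDS:
        return "warning"
    return "ok"


class IoWatchdog:
    """Tracks issue times of in-flight IOs (hypothetical illustration, not ESE code)."""

    def __init__(self):
        self.outstanding = {}  # io_id -> time the IO was issued

    def io_issued(self, io_id, now=None):
        self.outstanding[io_id] = time.monotonic() if now is None else now

    def io_completed(self, io_id):
        self.outstanding.pop(io_id, None)

    def scan(self, now=None):
        """Return {io_id: state} for every IO that is past a threshold."""
        now = time.monotonic() if now is None else now
        overdue = {}
        for io_id, issued_at in self.outstanding.items():
            state = classify_outstanding_io(issued_at, now)
            if state != "ok":
                overdue[io_id] = state
        return overdue
```

A real watchdog would run scan() periodically on its own thread and write events for whatever it returns. The sketch takes the current time as a parameter so the threshold logic can be exercised without waiting four minutes.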

In many of the hung IO incidents we have seen, the entire stack has been affected by the hang, making it impossible to write failure events to the crimson channel or any other area of the event log. So ESE also monitors the crimson channel by verifying that the event log can be written to. If writing to the event log fails for a long period of time, MSExchangeRepl will intentionally cause a bugcheck of Windows by terminating wininit.exe. In this case, no ESE events will appear in the event log, because the system was unable to write them.
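The event log check can be thought of as a heartbeat: keep track of the last time a write succeeded, and escalate once too much time has passed without one. The sketch below illustrates that idea in Python under stated assumptions. EventLogHeartbeat, write_probe and max_silence_seconds are hypothetical names, and the 300-second value in the usage example is an arbitrary placeholder, since the actual threshold is not given here.

```python
class EventLogHeartbeat:
    """Hypothetical heartbeat check; not the actual Exchange implementation."""

    def __init__(self, write_probe, max_silence_seconds, started_at=0.0):
        # write_probe: callable returning True if a test write to the log succeeded.
        # max_silence_seconds: assumed limit; the real threshold is not documented here.
        self.write_probe = write_probe
        self.max_silence_seconds = max_silence_seconds
        self.last_success = started_at

    def check(self, now):
        """Probe the log once; return True if escalation is warranted."""
        if self.write_probe():
            self.last_success = now
        return (now - self.last_success) > self.max_silence_seconds


# Usage: a probe that always fails eventually trips the heartbeat.
heartbeat = EventLogHeartbeat(write_probe=lambda: False, max_silence_seconds=300)
should_escalate_early = heartbeat.check(now=100.0)
should_escalate_late = heartbeat.check(now=301.0)
```

Note that a write against a hung stack tends to block rather than fail, so a real probe would need its own timeout. The sketch sidesteps that by assuming the probe simply reports success or failure.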

When the bugcheck does occur, it will always be as follows:

CRITICAL_OBJECT_TERMINATION (f4)
A process or thread crucial to system operation has unexpectedly exited or been terminated.

NOTE: The presence of this bugcheck does not necessarily mean Exchange was the cause. Any termination of wininit.exe, including one performed by an administrator using Task Manager or some other task management tool, will produce this bugcheck code.

Conclusion

The hung IO detection feature in Exchange 2010 is designed to make recovery from hung IO or a hung controller fast, rather than retrying or waiting until the storage stack raises an error that causes failover. It’s a great addition to the set of high availability features built into Exchange 2010.