Thoughts from the EPS Windows Server Performance Team
When customers call us with issues, in particular application or program failures, one of the first questions that we ask is, “What changed in the environment?” More often than not, the answer is, “Nothing”. In some cases that may be true; however, in the majority of cases there has been some change of which the system administrator we are working with is unaware. Tim Newton discussed some aspects of program crashes in his recent post, Access Violation? How dare you …, but let’s go ahead and recap some of them. The most common cause of an application crash is a program trying to read or write memory that is not allocated for reading or writing by the application: a general protection fault. Some other causes are listed below:
At this point, let’s digress a little bit and introduce a couple of quirky terms that we use to discuss “bugs”.
Heisenbug: The Heisenbug takes its name from the Heisenberg Uncertainty Principle. A Heisenbug is a bug that disappears or alters its characteristics when it is observed. The most common example of a Heisenbug is being unable to reproduce a problem when running a program in debug mode. In debug mode, memory is often cleared before the program starts, and variables may be forced onto stack locations rather than being kept in registers; either difference can mask the fault. Another reason that you may see a Heisenbug in debug mode is that debuggers commonly provide watches or other user interfaces that cause code (such as property accessors) to be executed, which in turn may alter the state of the program.
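To make the last point concrete, here is a minimal sketch in Python of how a watch on a property accessor can hide a bug. Everything in it is hypothetical and made up for illustration: the `Connection` class, its `handle` property, and the `run` helper, where the `observed` flag simply stands in for a debugger watch window evaluating `conn.handle`.

```python
class Connection:
    """Hypothetical class whose property accessor has a side effect."""

    def __init__(self):
        self._handle = None

    @property
    def handle(self):
        # Lazy initialization: merely reading the property changes state.
        if self._handle is None:
            self._handle = object()
        return self._handle

    def send(self, data):
        # Bug: reads the private field directly and assumes it was
        # already initialized by someone touching the property.
        if self._handle is None:
            raise RuntimeError("connection not open")
        return len(data)


def run(observed):
    """Return 'ok' or 'crash' depending on whether the object was observed."""
    conn = Connection()
    if observed:
        # Stand-in for a debugger watch evaluating conn.handle.
        _ = conn.handle
    try:
        conn.send(b"ping")
        return "ok"
    except RuntimeError:
        return "crash"
```

In this sketch `run(False)` fails while `run(True)` succeeds, so the act of inspecting the object is what makes the problem go away.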
Bohrbug: The Bohrbug takes its name from the Bohr Atomic Model. A Bohrbug is a bug that manifests reliably under a well-defined (but possibly unknown) set of conditions. Thus, in contrast with a Heisenbug, a Bohrbug does not disappear or alter its characteristics when it is investigated. Bohrbugs include the easiest bugs to fix (where the nature of the problem is obvious), but also bugs that are hard to find and fix and so remain in the software during the operational phase.
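As a rough illustration, a Bohrbug might look like the Python sketch below. The function names and the idea of averaging job sizes are assumptions made up for this example, not anything taken from a real product. The failure is triggered by one well-defined condition, an empty list, and it reproduces every time that condition is met, whether or not anyone is watching.

```python
def average_job_size(job_sizes):
    # Bug: no guard for an empty list, so this raises ZeroDivisionError
    # every single time the list is empty.
    return sum(job_sizes) / len(job_sizes)


def try_average(job_sizes):
    """Return the average, or 'crash' if the hypothetical bug is hit."""
    try:
        return average_job_size(job_sizes)
    except ZeroDivisionError:
        return "crash"
```

Calling `try_average([])` hits the bug on every attempt, while any non-empty list works, which is why the condition can go unnoticed for a long time and then appear to come from nowhere.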
Most of the application issues that we deal with are Bohrbugs, although we often encounter Heisenbugs when dealing with applications that exhibit heap corruption. In some cases, simply enabling PageHeap on an application causes the problem to no longer occur. OK, getting back to our original discussion, let’s take a look at a couple of common scenarios:
Scenario One: The Spooler Service is crashing on a print cluster. The cluster has been online “since forever” (yes, that’s actually how some administrators may describe their problem to us!) and ran without problems until today, and no changes have been made. From the administrator’s perspective, nothing has changed in the environment. By this, the administrator usually means that the drivers are still the same and there have been no recent updates to the OS. However, there are some variables to consider:
As you can see, from the Print Server administrator’s perspective, nothing in fact has changed. However, subtle changes in related system or external conditions are causing a problem. With that, let’s take a look at our second scenario …
Scenario Two: The server is experiencing a hang. It has been running fine since the day it was brought online, and all of a sudden the server is experiencing issues. The last server maintenance was performed a couple of months ago, but beginning yesterday morning, the server keeps locking up. So what’s going on?
In many enterprises, IT departments are somewhat autonomous. A single server may have components that are managed by several different teams. For example, Antivirus and Anti-Spyware software are managed by the Security team, while the Storage team is responsible for the SAN environment, Host Bus Adapters (HBAs) and related firmware. Meanwhile, the Windows team is responsible for the Server Operating System, including the overall system configuration and performance. With this type of division of ownership, it can become difficult for all the teams to stay in sync. This is not an indictment of any of the teams; it is an unavoidable by-product of decentralization. So what might be going on in this scenario?
Again, based on the scenario above, there are some fairly innocuous changes that, at the time of implementation, did not result in issues. However, over time or under certain conditions, problems do surface, and yet “Nothing changed in the environment” …
With that, it’s time to bring this post to a close. Thanks for stopping by! By the way, you can find more information on the quirky terms Heisenbug and Bohrbug as well as other similar terms on the Wikipedia page devoted to Unusual Software Bugs.
- Pushkar Prasad
EDIT (6/23): Added Wikipedia link to article