If there's one thing that's true of all busy Exchange servers, it's that they generate massive amounts of disk I/O. There's a joke around here that Exchange is the world's biggest hard disk diagnostics program.

Typically, your disks will be the first component of your Exchange server that starts groaning as you add load. And, frequently, you'll find that if you get your disks out of the redline area of the dial, that other performance issues suddenly heal themselves too. Why is this so?

Exchange databases use transactional logging. As new data comes in, the most urgent priority is getting the new stuff secured on disk in a log file. If you are experiencing "log stalls," then everything else that needs to happen with that data must wait. This can lead to a cascade of other bottlenecks.

There is a very good KB article on log stalls. The article tells you how to use System Monitor to tell if you have a log stall problem, and how to tune Exchange if necessary:

XADM: Log Stalls/sec Are Regularly Greater than 0 (Zero)

http://support.microsoft.com/?id=188676

But all the fine tuning in the world won't help if you are plain just demanding too much from your disk system. How can you tell what kind of load your disk system can really sustain?

For years, the Exchange database test team here at Microsoft has used a homegrown tool called Jetstress to simulate heavy disk I/O loads. It can be downloaded from:

http://www.microsoft.com/downloads/details.aspx?FamilyId=94B9810B-670E-433A-B5EF-B47054595E9C&displaylang=en

You may have used LoadSim in the past. JetStress has some similarities, but is not a replacement for LoadSim. JetStress is a more sharply focused tool than LoadSim. It is intended only to simulate Exchange disk I/O activity. LoadSim lets you simulate network and client activity, and thus indirectly works out the disk system. JetStress goes right at the disk system, no indirection about it.

You don't even have to have Exchange installed to use JetStress. You simply copy a few files to a server and start pounding it to its limits. JetStress generates a test database from scratch, of whatever size you want. Typically, to get valid results, you only need to generate a database that is 5% the size of your intended real database. You can then tell JetStress to make the same changes to the database that happen during normal operation. It adds, deletes, replaces and reads records from the database. By using System Monitor, you can see how much real Exchange load your disks can handle. You can change your disk configurations and re-run the same tests to see what kind of difference it makes.

There are two basic kinds of testing you can do with JetStress:

  • Performance Testing
  • Disk Subsystem Stability Testing

We usually recommend that you let JetStress run at least 2 hours when you're testing to see what kind of sustained throughput your disk system can handle. If you're doing stability testing, the recommendation is 24 hours. Now, what exactly do I mean by stability testing?

Exchange can subject your server to very complex random I/O. As you push computer systems closer and closer to their tested limits, and as you run huge amounts of data through the system, you're more likely to encounter glitches and even bugs in the ability of the system to reliably process and preserve data. JetStress will let you load your system up till it's running as fast as it can, and will keep it under stress to see if it remains reliable in both storing and retrieving data.

The way you can tell if the system is performing reliably is to look for error -1018 from your Exchange database. This error occurs whenever a page is read from the database, and the checksum on the page is wrong. Every page in an Exchange database is checksummed as it is written, and the checksum is verified every time the page is looked at again. If even a single bit is wrong on the page, Exchange declares the page bad and reports a -1018 error. You can learn more about Exchange page checksums and how we detect corruption in the database in this KB article:
XADM: Understanding and Analyzing -1018, -1019 and -1022 Exchange Database Errors
http://support.microsoft.com/?id=314917

If your database has -1018 pages after the stress test, then the disk system cannot be considered reliable at the load level under which it was tested.

When you download the JetStress utility, you get excellent documentation along with it. The documentation will walk you through every phase, from setting up tests and monitoring their progress, to interpreting their results. It even tells you which System Monitor (Performance Monitor) counters to look at, and what values are OK. It also tells you how to validate the integrity of the database after a stability test.

Mike Lee