Microsoft Exchange Server 2013 continues to innovate in the areas of storage, high availability, and site resilience. In this three-part blog series, I’ll describe the improvements in Exchange 2013 related to these three areas. Part 1 focuses on the storage improvements that we’ve made. Part 2 will focus on high availability. And part 3 will focus on site resilience.
Exchange 2013 continues to use DAGs and mailbox database copies, along with other features such as Single Item Recovery, Retention Policies, lagged database copies, etc., to provide Exchange Native Data Protection. The high availability platform, the Exchange Information Store, and the Extensible Storage Engine (ESE) have all been enhanced to provide greater availability and easier management, and to reduce costs.
With respect to storage, these enhancements include:

- Reduced IOPS on passive database copies
- Multiple databases per volume
- Automatic reseed (AutoReseed)
- Automatic recovery from storage failures
- Lagged copy enhancements
Of course, we also introduced the Managed Store. The Managed Store is the name of the newly rewritten Information Store processes (Microsoft.Exchange.Store.Service.exe and Microsoft.Exchange.Store.Worker.exe) in Exchange 2013. The new Managed Store is written in C# and tightly integrated with MSExchangeRepl.exe to provide higher availability through improved resiliency. In addition, the Managed Store has been architected to enable more granular management of resource consumption and faster root cause analysis through improved diagnostics. The Managed Store works with the Microsoft Exchange Replication service to manage mailbox databases, and it continues to leverage the Extensible Storage Engine (ESE) as the database engine. Exchange 2013 includes significant changes to the mailbox database schema that provide many optimizations over previous versions of Exchange. In addition to these changes, the Microsoft Exchange Replication service is responsible for all service availability related to back-end servers. The architectural changes enable faster database failover and better physical disk failure handling. The Managed Store will be the subject of a future blog post.
Let’s dig a little deeper into each of these features by first providing some context. While most of these features were designed primarily for configurations that use just a bunch of disks (JBOD), all of these features can work with any supported Exchange storage. JBOD environments bring unique challenges to Exchange:

- Large disks offer far more capacity and IOPS than a single database copy can use, so single-database-per-disk designs leave both underutilized (in Exchange 2010, the disk hosting a passive copy was particularly underutilized in terms of IOPS).
- As databases grow to fill large disks, reseeding after a disk failure takes longer and longer.
- Individual disks fail, so the system must be able to detect failures and restore redundancy quickly.
In Exchange 2010, passive database copies have a very low checkpoint depth (5 MB), which is required for fast failover. In addition, the passive copy performs aggressive pre-reading of data in order to keep up with that low checkpoint depth. As a result of using a low checkpoint depth and performing these aggressive pre-read operations, IOPS for a passive database copy was equal to IOPS for an active copy in Exchange 2010.
In Exchange 2013, we’re able to get fast failover without a low checkpoint depth on the passive copy, and in fact, we now have fast failover with a high checkpoint depth on the passive copy (100 MB). Since we now have 100 MB checkpoint depth on the passive, the passive copy has been de-tuned to no longer be so aggressive. As a result of increasing the checkpoint depth and de-tuning the aggressive pre-reads, IOPS for a passive copy is about 50% of the active copy IOPS in Exchange 2013.
Of course, we had to deal with having a higher checkpoint depth on the passive copy. On failover in Exchange 2010, we flush the cache as we are converting the copy from a passive copy to an active copy, and with a high checkpoint depth we would not have fast failover.
In Exchange 2013, the ESE team re-wrote logging in ESE to persist the cache as the transition from passive to active is made. Because ESE doesn’t need to flush the cache, we get fast failover. In fact, in Exchange 2013, failover times are generally down by 50% over Exchange 2010. Internally, in our Exchange 2013 dogfood environment, we see failovers around 10 seconds. In our production Exchange 2010 environment, they are around 20 seconds.
One other change we made was around background database maintenance. Based on our bad block experience in Office 365, we decided that BDM did not need to be as aggressive. So, BDM is now throttled back from 5MB per sec/copy to ~1-2MB per sec/copy (basically, where we had a 20ms sleep in the process, we now have a 100ms sleep in the process). At this new rate of activity, BDM doesn’t cause any IO or latency issues.
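The effect of stretching the sleep interval can be sketched numerically. This is a hypothetical illustration only: the 100 KB chunk size is an assumption chosen so the resulting rates line up with the figures above, not the actual BDM scan unit.

```powershell
# Hypothetical: model BDM as scanning fixed-size chunks with a sleep between them.
function Get-BdmRateMBPerSec([double]$ChunkKB, [double]$SleepSec) {
    ($ChunkKB / 1024) / $SleepSec
}

Get-BdmRateMBPerSec 100 0.020   # ~4.9 MB/sec per copy (old 20 ms sleep)
Get-BdmRateMBPerSec 100 0.100   # ~1.0 MB/sec per copy (new 100 ms sleep)
```

Whatever the real scan unit, lengthening the per-iteration sleep by 5x is what drives the throughput down by roughly the same factor.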
This feature is about Exchange optimizing for large disks. These optimizations result in a much more efficient use of large disks in terms of capacity, IOPS and re-seed times, and they are meant to address the challenges I describe above that are associated with running in a JBOD storage configuration:
Continuing a long-standing practice for Exchange, we have optimized Exchange 2013 so that it can use very large disks (e.g., 8 TB) in a JBOD configuration more efficiently. In Exchange 2013, with multiple databases per disk, you can have the same sized disks storing multiple database copies, including lagged copies. The goal is to drive the distribution of users across the number of volumes that exist, providing you with a symmetric design where during normal operations each DAG member hosts a combination of active, passive and optional lagged copies on the same volumes.
An example of a configuration that uses multiple databases per volume is illustrated below:
The above configuration provides a nice, symmetrical design. All four servers have the same four databases all hosted on a single disk per server. The key here is that the number of copies of each database that you have should be equal to the number of database copies per disk. In the above example, there are four copies of each database: one active copy, two passive copies, and one lagged copy. Because there are four copies of each database, the proper configuration is one that has 4 copies per volume. In addition, Activation Preference (AP) is configured so that it is balanced across the DAG and across each server. For example, the active copy will have an AP value of 1, the first passive copy will have an AP value of 2, the second passive copy will have an AP value of 3, and the lagged copy will have an AP value of 4.
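A layout like this might be created with the Exchange Management Shell roughly as follows. This is an illustrative sketch only; the database and server names are assumptions, and the active copy on EXCH1 is presumed to already exist.

```powershell
# DB1: active on EXCH1; passive copies on EXCH2/EXCH3; lagged copy on EXCH4.
# Activation preference mirrors the copy's role (1 = active, 4 = lagged).
Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer EXCH2 -ActivationPreference 2
Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer EXCH3 -ActivationPreference 3
Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer EXCH4 -ActivationPreference 4 `
    -ReplayLagTime 7.00:00:00   # the replay lag is what makes this the lagged copy
```

The same pattern would be repeated for DB2 through DB4, rotating which server holds the active copy so that preferences stay balanced across the DAG.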
In addition to having a better distribution of users across the existing volumes, another benefit of using multiple databases per disk is that it reduces the amount of time to restore data protection in the event of a failure that necessitates a reseed (e.g., disk failure).
As a database gets bigger, reseeding the database takes longer and longer. For example, a 2TB database could take 23 hours to reseed, whereas an 8TB database could take as long as 93 hours (almost 4 days). Both seeds would occur at around 20MB/sec. This generally means that a very large database cannot be seeded within an operationally reasonable amount of time.
In the case of a single-database-copy-per-disk scenario, the seeding operation is effectively source-bound, because it is always seeding the disk from a single source. By dividing the volume into multiple database copies, and by having the active copy of the passive databases on a given volume stored on separate DAG members, the system is no longer source-bound in the context of reseeding the disk. When a failed disk is replaced, it can be reseeded from multiple sources. This allows the system to reseed and restore data protection for these databases in a much shorter amount of time.
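As a back-of-the-envelope sketch of the reseed-time figures above (assuming decimal terabytes and an effective seed rate of roughly 24 MB/sec, close to the quoted ~20 MB/sec; both are illustrative assumptions, not measured values):

```powershell
# Naive estimate: size divided by sustained seed rate, expressed in hours.
function Get-ReseedHours([double]$SizeTB, [double]$RateMBPerSec = 24) {
    ($SizeTB * 1e6 / $RateMBPerSec) / 3600
}

Get-ReseedHours 8   # ~93 hours for a single 8 TB database from one source
Get-ReseedHours 2   # ~23 hours per 2 TB database; with 4 x 2 TB databases
                    # reseeded in parallel from 4 different sources, the whole
                    # disk completes in roughly this time instead of ~93 hours
```

The parallelism, not the raw per-copy rate, is what brings the reseed back into an operationally reasonable window.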
Another benefit is a 25% increase in aggregate disk utilization, and a balanced configuration of four databases with four copies across four disks and four servers. In the event you lose partial service (for example, you lose two DAG members and you have a datacenter failover), you are still at only 50% IOPS utilization.
There are specific architecture requirements when using multiple databases per volume; in particular, the number of database copies per volume should equal the number of copies of each database. We also recommend adhering to best practices such as balancing copies and activation preferences symmetrically across the DAG, as in the example above.
Automatic reseed, or AutoReseed for short, is a feature that replaces what would otherwise be an administrator-driven action in response to a disk failure, database corruption event, or other issue that necessitates a reseed of a database copy. AutoReseed is designed to automatically restore database redundancy after a disk failure by using spare disks that have been provisioned on the system.
In an AutoReseed configuration, a standardized storage presentation structure is used, and the administrator picks the starting point. AutoReseed is about restoring redundancy as soon as possible after a drive fails. This involves pre-mapping a set of volumes (including spare volumes) and databases using mount points. In the event of a disk failure where the disk is no longer available to the operating system, or is no longer writable, a spare volume is allocated by the system, and the affected database copies are re-seeded automatically. AutoReseed uses the following process:

1. The Microsoft Exchange Replication service periodically scans for database copies that have failed and been suspended.
2. When such a copy is found on a disk that is no longer available or writable, the system performs prerequisite checks (for example, verifying that a spare volume is available and that nothing blocks seeding).
3. A spare volume is allocated and remapped in place of the failed volume.
4. The affected database copies are automatically reseeded onto the spare.
5. Once seeding completes, replication for the reseeded copies resumes.
At this point, if the failure was a disk failure, it would require manual intervention by an operator or administrator to remove and replace the failed disk, format it, initialize it and re-configure it as a spare.
AutoReseed is configured using three properties of the DAG. Two of the properties refer to the two mount points that are in use. Exchange 2013 Preview leverages the fact that Windows Server allows multiple mount points per volume. The AutoDagVolumesRootFolderPath property refers to the mount point that contains all of the volumes that are available. This includes volumes that host databases and spare volumes. The AutoDagDatabasesRootFolderPath property refers to the mount point that contains the databases. A third DAG property, AutoDagDatabaseCopiesPerVolume, is used to configure the number of database copies per volume.
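For example, these three properties might be set as follows (the DAG name, paths, and copy count are illustrative):

```powershell
# Configure the DAG's AutoReseed roots and the number of database copies per volume.
Set-DatabaseAvailabilityGroup DAG1 `
    -AutoDagVolumesRootFolderPath   "C:\ExchVols" `
    -AutoDagDatabasesRootFolderPath "C:\ExchDbs" `
    -AutoDagDatabaseCopiesPerVolume 4
```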
An example AutoReseed configuration is illustrated below:
In this example, there are three volumes, two of which will contain databases (VOL1 and VOL2), and one of which is a blank, formatted spare (VOL3).
To configure AutoReseed:
In this configuration, if DB1 or DB2 were to experience a failure, VOL3 will be automatically re-purposed by the system, and a copy of the failed database will be automatically reseeded to VOL3.
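The storage presentation for this example might be prepared roughly as follows. This is a sketch under assumptions: the folder names, database and server names are illustrative, and the volume GUID is a placeholder you would obtain from `mountvol` on your own system.

```powershell
# Mount each physical volume under the volumes root (VOL3 stays a blank spare).
md C:\ExchVols\VOL1
mountvol C:\ExchVols\VOL1 \\?\Volume{<volume-guid>}\

# Give each database a mount point under the databases root on the same volume.
md C:\ExchDbs\DB1
mountvol C:\ExchDbs\DB1 \\?\Volume{<volume-guid>}\

# Create the database with its EDB file and logs under the database mount point.
New-MailboxDatabase -Name DB1 -Server EXCH1 `
    -EdbFilePath   C:\ExchDbs\DB1\DB1.db\DB1.edb `
    -LogFolderPath C:\ExchDbs\DB1\DB1.log
```

Because both the volumes root and the databases root are mount points onto the same underlying volume, the system can later remap a database path onto a spare volume without changing the database's configured file paths.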
This feature continues the innovation introduced in Exchange 2010 to allow the system to recover from failures that affect resiliency or redundancy. In addition to the Exchange 2010 bugcheck behaviors, Exchange 2013 Preview includes additional recovery behaviors for long IO times, excessive memory consumption by the Microsoft Exchange Replication service (MSExchangeRepl.exe), and severe cases where the system is in such a bad state that threads cannot be scheduled.
Even in JBOD environments, storage array controllers can have issues, such as crashing or hanging. Exchange 2010 included hung IO detection and recovery features that provided enhanced resilience. Exchange 2013 enhances server and storage resilience by including new behaviors for other serious conditions. These conditions and behaviors are described below:

- System bad state: no threads, including non-managed threads, can be scheduled. Action: restart the server.
- Long IO times: IO operation latency measurements exceed a threshold. Action: restart the server.
- Replication service memory use: the working set of MSExchangeRepl.exe is measured; if it grows too large, the service is terminated and restarted (or the server is restarted if the service cannot be terminated).
Lagged copy enhancements include integration with Safety Net and automatic play-down of log files in certain scenarios. Safety Net is a feature of transport that replaces the Exchange 2010 feature known as transport dumpster. Safety Net is similar to transport dumpster, in that it is a delivery queue that's associated with the Transport service on a Mailbox server. This queue stores copies of messages that were successfully delivered to the active mailbox database on the Mailbox server. Each active mailbox database on the Mailbox server has its own queue that stores copies of the delivered messages. You can specify how long Safety Net stores copies of the successfully delivered messages before they expire and are automatically deleted.
Safety Net also takes over some responsibility from shadow redundancy in DAG environments. In DAG environments, shadow redundancy doesn't need to keep another copy of the delivered message in a shadow queue while it waits for the delivered message to replicate to the passive copies of the mailbox database on the other Mailbox servers in the DAG. The copy of the delivered message is already stored in Safety Net, so shadow redundancy can re-deliver the message from Safety Net if necessary.
With the introduction of Safety Net, activating a lagged database copy becomes significantly easier. Say, for example, you have a lagged copy with a 2-day replay lag. In that case, you would also configure Safety Net for a period of 2 days. If you encounter a situation in which you need to use your lagged copy, you can suspend replication to it and copy it twice (to preserve the lagged nature of the database and to create an extra copy in case you need it). Then, take one copy and discard all of the log files, except for those in the required range. Mount the copy, which triggers an automatic request to Safety Net to redeliver the last two days of mail. No more hunting for the point at which corruption was introduced. You get back the last two days' worth of mail, minus, of course, the data that is ordinarily lost on a lossy failover.
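Aligning the two windows described above might look like this in the Exchange Management Shell (the copy identity is illustrative):

```powershell
# Give the lagged copy a 2-day replay lag...
Set-MailboxDatabaseCopy -Identity DB1\EXCH4 -ReplayLagTime 2.00:00:00

# ...and hold delivered messages in Safety Net for the same 2-day window,
# so a redelivery request after lagged-copy activation can cover the full lag.
Set-TransportConfig -SafetyNetHoldTime 2.00:00:00
```

If the Safety Net hold time were shorter than the replay lag, some of the lagged window could not be redelivered, which is why the two values are configured to match.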
Lagged copies can now care for themselves by invoking automatic log replay to play down the log files in certain scenarios:

- When a low disk space threshold is reached
- When the lagged copy has physical corruption and needs to be page patched
- When there are fewer than three available healthy copies
Lagged copy playdown behavior is disabled by default, and can be enabled by running the following command:
Set-DatabaseAvailabilityGroup <DAGName> -ReplayLagManagerEnabled $True
This enables playdown when there are fewer than the configured number of available copies. The default value is 3, but the number is configurable by modifying the following registry value:
To enable playdown for low disk space thresholds, you must configure another registry entry:
Great post, Scott. Looking forward to part 2.
Thanks Scott for this post and for your explanations. Waiting for part 2.
Great info Scott
I always enjoy your presentations, which are always interesting and funny :)
Keeps me focused :)
Nice one from Australia about this
Keep it going
Great post Scott and I really enjoyed your session on this topic at MEC as well.
One thing I was curious about was where the Exchange Team got the 20MB/sec number from for reseeding. I'm not questioning it but am genuinely curious what factors and areas of analysis were considered to come to that number. Thanks.
I have a question about the passive copy IO in Exchange 2010. Early in the blog you state that the disk IOPS for the passive copy were underutilized, but later state that the IOPS for a passive copy was equal to the IOPS for an active copy in Exchange 2010. How are the disks underutilized in terms of IOPS between the active and passive copies if the IOPS is the same? Am I misunderstanding something?