Is this thing on?

Scott Schnoll's weblog

Storage, High Availability and Site Resilience in Exchange Server 2013, Part 1


Microsoft Exchange Server 2013 continues to innovate in the areas of storage, high availability, and site resilience.  In this three-part blog series, I’ll describe the improvements in Exchange 2013 related to these three areas. Part 1 focuses on the storage improvements that we’ve made.  Part 2 will focus on high availability.  And part 3 will focus on site resilience.

Exchange 2013 continues to use DAGs and mailbox database copies, along with other features such as Single Item Recovery, Retention Policies, and lagged database copies, to provide Exchange Native Data Protection. The high availability platform, the Exchange Information Store, and the Extensible Storage Engine (ESE) have all been enhanced to provide greater availability, easier management, and reduced costs.

With respect to storage, these enhancements include:

  • Reduction in IOPS over Exchange 2010 – enables you to leverage larger disks in terms of capacity and IOPS as efficiently as possible.
  • Multiple databases per volume – enables you to host multiple databases (mixtures of active and passive copies) on the same volume, thereby leveraging larger disks in terms of capacity and IOPS as efficiently as possible.
  • AutoReseed – enables you to quickly restore database redundancy after a disk failure. If a disk fails, the database copy stored on that disk is copied from the active database copy to a spare disk on the same server. If multiple database copies were stored on the failed disk, they can all be automatically reseeded to a spare disk. This enables faster reseeds, as the active databases are likely to be on multiple servers and the data is copied in parallel.
  • Automatic recovery from storage failures – allows the system to recover from failures that affect resiliency or redundancy. Exchange 2013 includes recovery behaviors for long IO times, excessive memory consumption by the Microsoft Exchange Replication service (MSExchangeRepl.exe), and severe cases where the system is in such a bad state that threads cannot be scheduled.
  • Lagged copy enhancements – Lagged copies can now care for themselves to a certain extent using automatic log play down. In addition, lagged copies can leverage Safety Net, making recovery or activation much easier.

Of course, we also introduced the Managed Store. The Managed Store is the name of the newly rewritten Information Store processes (Microsoft.Exchange.Store.Service.exe and Microsoft.Exchange.Store.Worker.exe) in Exchange 2013. The new Managed Store is written in C# and tightly integrated with MSExchangeRepl.exe to provide higher availability through improved resiliency. In addition, the Managed Store has been architected to enable more granular management of resource consumption and faster root cause analysis through improved diagnostics. The Managed Store works with the Microsoft Exchange Replication service to manage mailbox databases, which continue to use the Extensible Storage Engine (ESE) as the database engine. Exchange 2013 includes significant changes to the mailbox database schema that provide many optimizations over previous versions of Exchange. In addition to these changes, the Microsoft Exchange Replication service is responsible for all service availability related to back-end servers. These architectural changes enable faster database failover and better physical disk failure handling. The Managed Store will be the subject of a future blog post.

Improved Resilience for JBOD Environments

Let’s dig a little deeper into each of these features by first providing some context.  While most of these features were designed primarily for configurations that use just a bunch of disks (JBOD), all of these features can work with any supported Exchange storage. JBOD environments bring unique challenges to Exchange:

  • The trend of capacity increase continues, with 8 TB drives expected to be available soon. When you pair 8 TB drives with Exchange’s database size best practices guideline (2 TB), you waste 5+ TB of disk space. One solution would be to simply grow the databases larger, but that inhibits manageability because it introduces very long reseed times, including, in some cases, operationally unmanageable reseed times, to say nothing of the reliability risk of copying that much data over the network.
  • In the Exchange 2010 model, the disk storing a passive copy is under-utilized in terms of IOPS. And in the case of a lagged passive copy, not only is the disk under-utilized in terms of IOPS, but it’s also asymmetric in terms of its size, relative to the disks used to store the active and non-lagged passive copies.
  • In Exchange 2010, if you are running out of disk space in a JBOD configuration, you have only a few options. You could, for example, throw away the content index catalogs while you move users off. That does impair the search experience, but it gives you breathing room when you need it. Still, you are limited in terms of available options.

Reduction in IOPS over Exchange 2010

In Exchange 2010, passive database copies have a very low checkpoint depth, which is required for fast failover. In addition, the passive copy in Exchange 2010 performs aggressive pre-reading of data in order to keep up with its 5 MB checkpoint depth. As a result of using a low checkpoint depth and performing these aggressive pre-read operations, IOPS for a passive database copy was equal to IOPS for an active copy in Exchange 2010.

In Exchange 2013, we’re able to get fast failover without a low checkpoint depth on the passive copy; in fact, we now have fast failover with a high checkpoint depth on the passive copy (100 MB). Because the passive copy now has a 100 MB checkpoint depth, it has been de-tuned so that its pre-reads are no longer so aggressive. As a result of increasing the checkpoint depth and de-tuning the aggressive pre-reads, IOPS for a passive copy is about 50% of the active copy IOPS in Exchange 2013.

Of course, we had to deal with the consequences of a higher checkpoint depth on the passive copy. In Exchange 2010, on failover we flush the cache as we convert the copy from passive to active, and with a high checkpoint depth that flush would prevent fast failover.

In Exchange 2013, the ESE team rewrote ESE logging to persist the cache as the transition from passive to active is made. Because ESE doesn’t need to flush the cache, we get fast failover. In fact, failover times in Exchange 2013 are generally down by 50% compared with Exchange 2010. Internally, in our Exchange 2013 dogfood environment, we see failovers of around 10 seconds; in our production Exchange 2010 environment, they are around 20 seconds.

One other change we made was around background database maintenance (BDM). Based on our bad block experience in Office 365, we decided that BDM did not need to be as aggressive, so BDM is now throttled back from 5 MB per sec per copy to ~1-2 MB per sec per copy (basically, where we had a 20 ms sleep in the process, we now have a 100 ms sleep). At this new rate of activity, BDM doesn’t cause any IO or latency issues.

Multiple Databases Per Volume

This feature is about Exchange optimizing for large disks. These optimizations result in a much more efficient use of large disks in terms of capacity, IOPS and re-seed times, and they are meant to address the challenges I describe above that are associated with running in a JBOD storage configuration:

  • Database sizes must be manageable
  • Reseed operations must be fast and reliable
  • Storage capacity is increasing, but IOPS are not
  • Disks hosting passive database copies are underutilized in terms of IOPS
  • Lagged copies have asymmetric storage requirements
  • You have limited agility to recover from low disk space conditions

Continuing a long-standing practice for Exchange, we have optimized Exchange 2013 so that it can use very large disks (e.g., 8 TB) in a JBOD configuration more efficiently. In Exchange 2013, with multiple databases per disk, you can have same-sized disks storing multiple database copies, including lagged copies. The goal is to drive the distribution of users across the number of volumes that exist, providing you with a symmetric design where, during normal operations, each DAG member hosts a combination of active, passive, and optional lagged copies on the same volumes.

An example of a configuration that uses multiple databases per volume is illustrated below:

[Figure: Multiple databases per volume]

The above configuration provides a nice, symmetrical design. All four servers have the same four databases all hosted on a single disk per server. The key here is that the number of copies of each database that you have should be equal to the number of database copies per disk. In the above example, there are four copies of each database: one active copy, two passive copies, and one lagged copy. Because there are four copies of each database, the proper configuration is one that has 4 copies per volume. In addition, Activation Preference (AP) is configured so that it is balanced across the DAG and across each server. For example, the active copy will have an AP value of 1, the first passive copy will have an AP value of 2, the second passive copy will have an AP value of 3, and the lagged copy will have an AP value of 4.
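To make that concrete, here is a minimal sketch of how the activation preferences for one such database might be set from the Exchange Management Shell. The database and server names (DB1, MBX1 through MBX4) are hypothetical placeholders, not from the design above:

# Hypothetical names: database DB1 with copies on servers MBX1-MBX4;
# AP 1 is the preferred active copy and AP 4 is the lagged copy.
Set-MailboxDatabaseCopy -Identity "DB1\MBX1" -ActivationPreference 1
Set-MailboxDatabaseCopy -Identity "DB1\MBX2" -ActivationPreference 2
Set-MailboxDatabaseCopy -Identity "DB1\MBX3" -ActivationPreference 3
Set-MailboxDatabaseCopy -Identity "DB1\MBX4" -ActivationPreference 4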

In addition to having a better distribution of users across the existing volumes, another benefit of using multiple databases per disk is that it reduces the amount of time to restore data protection in the event of a failure that necessitates a reseed (e.g., disk failure).

As a database gets bigger, reseeding the database takes longer and longer. For example, a 2 TB database could take 23 hours to reseed, whereas an 8 TB database could take as long as 93 hours (almost 4 days). Both seeds would occur at around 20 MB/sec. This generally means that a very large database cannot be seeded within an operationally reasonable amount of time.
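As a back-of-the-envelope check (my arithmetic, not from the original post), reseed time is simply database size divided by sustained throughput; the 23- and 93-hour figures imply an effective rate in the low-to-mid 20s of MB/sec:

# Rough reseed-time arithmetic in PowerShell; 2TB and 8TB are byte literals,
# and 24MB is an assumed effective rate consistent with the figures above.
$rate = 24MB
'{0:N0} hours' -f (2TB / $rate / 3600)   # ~24 hours for a 2 TB database
'{0:N0} hours' -f (8TB / $rate / 3600)   # ~97 hours for an 8 TB database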

In the case of a single-database-copy-per-disk scenario, the seeding operation is effectively source-bound, because it is always seeding the disk from a single source. By dividing the volume into multiple database copies, and by having the active copies of the passive databases on a given volume located on separate DAG members, the system is no longer source-bound when reseeding the disk. When a failed disk is replaced, it can be reseeded from multiple sources in parallel. This allows the system to reseed and restore data protection for these databases in a much shorter amount of time.

Another benefit is a 25% increase in aggregate disk utilization, and a balanced configuration of four databases with four copies across four disks and four servers. In the event you lose partial service (for example, you lose two DAG members and you have a datacenter failover), you are still at only 50% IOPS utilization.

There are specific architecture requirements when using multiple databases per volume:

  • You must use a single logical disk partition per physical disk. Do not create multiple partitions on the disk.
  • Each database copy and its companion files (transaction logs, content index, etc.) must be hosted in a unique directory on the single partition.

We also recommend adhering to the following best practices:

  • Database copies should have the same neighbors (e.g., they should all share the same disk on each server).
  • Balance activation preference across the DAG, such that each database copy on a given disk has a unique Activation Preference value.

Automatic Reseed

Automatic reseed, or AutoReseed for short, replaces what is normally an administrator-driven action in response to a disk failure, database corruption event, or other issue that necessitates a reseed of a database copy. AutoReseed is designed to automatically restore database redundancy after a disk failure by using spare disks that have been provisioned on the system.

In an AutoReseed configuration, a standardized storage presentation structure is used, and the administrator picks the starting point. AutoReseed is about restoring redundancy as soon as possible after a drive fails. This involves pre-mapping a set of volumes (including spare volumes) and databases using mount points. In the event of a disk failure where the disk is no longer available to the operating system, or is no longer writable, a spare volume is allocated by the system, and the affected database copies are re-seeded automatically. AutoReseed uses the following process:

  1. The Microsoft Exchange Replication service periodically scans for copies that have a status of FailedAndSuspended.
  2. When it finds a copy with that status, it performs some prerequisite checks, such as checking whether this is a single-copy situation, checking whether spares are available, and making sure there is nothing preventing the system from performing an automatic reseed.
  3. If the prerequisite checks pass, the Replication service allocates and remaps a spare.
  4. Next, the seeding operation is performed.
  5. Once the seed has been completed, the Replication service verifies that the newly seeded copy is healthy.

At this point, if the failure was a disk failure, it would require manual intervention by an operator or administrator to remove and replace the failed disk, format it, initialize it and re-configure it as a spare.
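If you want to watch for the same condition that the Replication service looks for, an illustrative check from the Exchange Management Shell might look like this:

# List any database copies currently in the FailedAndSuspended state
Get-MailboxDatabaseCopyStatus * | Where-Object { $_.Status -eq "FailedAndSuspended" } | Format-Table Name, Status -AutoSize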

AutoReseed is configured using three properties of the DAG. Two of the properties refer to the two mount points that are in use. Exchange 2013 Preview leverages the fact that Windows Server allows multiple mount points per volume. The AutoDagVolumesRootFolderPath property refers to the mount point that contains all of the volumes that are available. This includes volumes that host databases and spare volumes. The AutoDagDatabasesRootFolderPath property refers to the mount point that contains the databases. A third DAG property, AutoDagDatabaseCopiesPerVolume, is used to configure the number of database copies per volume.
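For example (a sketch only; DAG1 and the paths are placeholders that happen to match the illustration below), the three properties could be set and inspected like this:

# Configure the three AutoReseed-related DAG properties (placeholder values)
Set-DatabaseAvailabilityGroup DAG1 -AutoDagVolumesRootFolderPath "C:\ExchVols"
Set-DatabaseAvailabilityGroup DAG1 -AutoDagDatabasesRootFolderPath "C:\ExchDbs"
Set-DatabaseAvailabilityGroup DAG1 -AutoDagDatabaseCopiesPerVolume 1
Get-DatabaseAvailabilityGroup DAG1 | Format-List AutoDag*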

An example AutoReseed configuration is illustrated below:

[Figure: AutoReseed configuration]

In this example, there are three volumes, two of which will contain databases (VOL1 and VOL2), and one of which is a blank, formatted spare (VOL3).

To configure AutoReseed:

  1. All three volumes are mounted under a single mount point. In this example, the administrator has configured a mount point of C:\ExchVols. This represents the root directory under which storage for Exchange databases is presented.
  2. Then, the root directory of the mailbox databases is mounted as another mount point. In this example, the administrator has configured a mount point of C:\ExchDbs. Next, a directory structure is created so that each database has two directories: one for the database file and one for the log files.
  3. Then, the databases are created. The above example illustrates a simple design using a single database per volume. Thus, on VOL1, there are two directories: one for DB1’s database file, and one for its logs. On VOL2, there are two directories: one for DB2’s database file, and one for its logs.
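As a rough sketch of that layout (the volume GUID is a placeholder you would obtain from mountvol’s listing output; directory names follow the example above):

# Mount the same physical volume at both the volume root and the database root
md C:\ExchVols\Volume1
mountvol C:\ExchVols\Volume1 '\\?\Volume{<volume-guid>}\'
md C:\ExchDbs\DB1
mountvol C:\ExchDbs\DB1 '\\?\Volume{<volume-guid>}\'
# Create the per-database directories for the database file and its logs
md C:\ExchDbs\DB1\DB1.db
md C:\ExchDbs\DB1\DB1.log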

In this configuration, if DB1 or DB2 were to experience a failure, VOL3 would be automatically repurposed by the system, and a copy of the failed database would be automatically reseeded to VOL3.

Automatic Recovery from Storage Failures

This feature continues the innovation introduced in Exchange 2010 to allow the system to recover from failures that affect resiliency or redundancy. In addition to the Exchange 2010 bugcheck behaviors, Exchange 2013 Preview includes additional recovery behaviors for long IO times, excessive memory consumption by the Microsoft Exchange Replication service (MSExchangeRepl.exe), and severe cases where the system is in such a bad state that threads cannot be scheduled.

Even in JBOD environments, storage array controllers can have issues, such as crashing or hanging. Exchange 2010 included hung IO detection and recovery features that provided enhanced resilience. Exchange 2013 enhances server and storage resilience by including new behaviors for other serious conditions. These conditions and behaviors are described in the following table.

Name: System bad state
Check: No threads, including non-managed threads, can be scheduled
Action: Restart the server
Threshold: 302 seconds

Name: Long IO times
Check: IO operation latency measurements
Action: Restart the server
Threshold: 41 seconds

Name: Replication service memory use
Check: Measure the working set of MSExchangeRepl.exe
Action: 1) Log event 4395 in the crimson channel with a service termination request; 2) initiate termination of MSExchangeRepl.exe; 3) if service termination fails, restart the server
Threshold: 4 GB

Lagged Copy Enhancements

Lagged copy enhancements include integration with Safety Net and automatic play-down of log files in certain scenarios. Safety Net is a feature of transport that replaces the Exchange 2010 feature known as transport dumpster. Safety Net is similar to transport dumpster, in that it is a delivery queue that's associated with the Transport service on a Mailbox server. This queue stores copies of messages that were successfully delivered to the active mailbox database on the Mailbox server. Each active mailbox database on the Mailbox server has its own queue that stores copies of the delivered messages. You can specify how long Safety Net stores copies of the successfully delivered messages before they expire and are automatically deleted.
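For example, to align the Safety Net hold time with the replay lag of a lagged copy (as discussed below), you might set it like this; the two-day value is illustrative:

# Set how long Safety Net keeps copies of successfully delivered messages
Set-TransportConfig -SafetyNetHoldTime 2.00:00:00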

Safety Net also takes over some responsibility from shadow redundancy in DAG environments. In DAG environments, shadow redundancy doesn't need to keep another copy of the delivered message in a shadow queue while it waits for the delivered message to replicate to the passive copies of the mailbox database on the other Mailbox servers in the DAG. The copy of the delivered message is already stored in Safety Net, so shadow redundancy can re-deliver the message from Safety Net if necessary.

With the introduction of Safety Net, activating a lagged database copy becomes significantly easier. Say, for example, you have a lagged copy with a 2-day replay lag. In that case, you would also configure Safety Net for a period of 2 days. If you encounter a situation in which you need to use your lagged copy, you can suspend replication to it and copy it twice (to preserve the lagged nature of the database and to create an extra copy in case you need it). Then, take one copy, throw away all of the log files except those in the required range, and mount the database. Mounting triggers an automatic request to Safety Net to redeliver the last two days of mail. No more hunting for the point where corruption was introduced; you get back the last two days’ worth of mail, minus, of course, the data that is ordinarily lost in a lossy failover.
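A heavily abridged sketch of the first and last steps of that procedure, using cmdlets that exist for this purpose (the database and server names are placeholders, and the file-copy and log-pruning steps are deliberately left as comments; this is not the full documented procedure):

# Suspend replication to the lagged copy before touching its files
Suspend-MailboxDatabaseCopy -Identity "DB1\MBX4" -SuspendComment "Activating lagged copy"
# ...make two file-level copies of the database and log folders, then prune
# ...the log files outside the required range on the copy you intend to use
# Activate the copy, skipping the lag-related safety checks
Move-ActiveMailboxDatabase DB1 -ActivateOnServer MBX4 -SkipLagChecks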

Lagged copies can now care for themselves by invoking automatic log replay to play down the log files in certain scenarios:

  • When a low disk space threshold is reached
  • When the lagged copy has physical corruption and needs to be page patched
  • When there are fewer than 3 available healthy copies (active or passive) for more than 24 hours

Lagged copy playdown behavior is disabled by default, and can be enabled by running the following command:

Set-DatabaseAvailabilityGroup <DAGName> -ReplayLagManagerEnabled $True

This enables playdown when there are fewer than X available healthy copies. The default value is 3, but the number is configurable by modifying the following registry value:

HKLM\Software\Microsoft\ExchangeServer\v15\Replay\Parameters\ReplayLagManagerNumAvailableCopies

To enable playdown for low disk space thresholds, you must configure another registry entry:

HKLM\Software\Microsoft\ExchangeServer\v15\Replay\Parameters\ReplayLagPlayDownPercentDiskFreeSpace
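As an illustration (my sketch, not from the post; I'm assuming these are DWORD values, so verify the expected type before relying on this), the two values could be created from PowerShell like this:

# Create the Replay parameters (assumed DWORD); the values are placeholders
$key = "HKLM:\Software\Microsoft\ExchangeServer\v15\Replay\Parameters"
New-ItemProperty -Path $key -Name ReplayLagManagerNumAvailableCopies -PropertyType DWord -Value 3
New-ItemProperty -Path $key -Name ReplayLagPlayDownPercentDiskFreeSpace -PropertyType DWord -Value 10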

Comments
  • Great post, Scott. Looking forward to part 2.

  • Thanks Scott for this post and for your explanations. Waiting for part 2.

  • Great info Scott

    i always enjoy your presentations which are always interesting and funny :)

    keeps me focused :)

    nice one from Australia about this

    keep it going

  • Great post Scott and I really enjoyed your session on this topic at MEC as well.

    One thing I was curious about was where the Exchange Team got the 20MB/sec number from for reseeding. I'm not questioning it but am genuinely curious what factors and areas of analysis were considered to come to that number. Thanks.

  • I have a question about the passive copy IO in Exchange 2010. Early in the blog you state that the disk IOPs for the passive copy were underutilized but later state that the IOPS for a passive copy was equal to the IOPs for a active copy in exchange 2010. How are the disks underutilized in terms of IOPS between the active and passive copies if the IOPS is the same? Am I misunderstanding something?
