One of the goals of Exchange 2010 mailbox resiliency is to minimize data loss. In Exchange 2010 SP1 we added continuous replication block mode to help further reduce data loss when a failover occurs. However, on a very busy mailbox database with a high log generation rate, there is a greater chance for data loss if replication to the passive database copies cannot keep up with log generation.

One scenario that can introduce a high log generation rate is mailbox moves. Consider the following two examples:

  • Example 1: As an administrator you decide to move a mailbox from DatabaseA to DatabaseB. The mailbox move completes successfully. However, immediately following the move operation, the server hosting the active copy of DatabaseB fails. Another copy of DatabaseB is activated with data loss because AttemptCopyLastLogs cannot complete successfully. As a result, a portion of the mailbox data could be lost.
  • Example 2: As an administrator you decide to move a collection of mailboxes within your Exchange 2010 RTM environment, and their entire data set fits in a single 1 MB log file. You schedule the moves and the mailboxes are successfully moved from DatabaseA to DatabaseB. Immediately following the move, DatabaseB’s server fails. Another copy of DatabaseB is activated with data loss because AttemptCopyLastLogs cannot complete successfully. At the time of the failure, the active log file that contained all of the data and transactions associated with the mailbox moves had not been replicated to the other copies. As a result, the copy mounts, but the moved mailboxes are not in DatabaseB. In addition, because the Exchange Mailbox Replication service marked the mailbox moves as complete, the mailboxes are no longer in DatabaseA.

As you can imagine, these are serious data loss issues. Thankfully, we thought of these while developing Exchange 2010.

Data Guarantee API & the Mailbox Replication Service

Exchange 2010 includes a Data Guarantee API that is used by services like the Mailbox Replication service (MRS) to check the health of the database copy architecture based on a defined setting of the database, as set by the system or an administrator. Specifically, the Data Guarantee API can be used to:

  1. Check Replication Health - Confirm that the prerequisite number of database copies is available.
  2. Check Replication Flush - Confirm that the required log files have been replayed against the prerequisite number of database copies.

When executed, the API returns the following information back to the calling application:

  1. Status information returns one of the following values:
    • Retry: returned as a result of transient errors that prevent a condition from being checked against the database.
    • Satisfied: returned when the database meets the required conditions, or if the database is not replicated.
    • NotSatisfied: returned when the database does not meet the required conditions. In addition, information is provided back to the calling application as to why the NotSatisfied response was returned.
  2. How long the calling application should wait before attempting to check again.
    1. If copy information has not been collected, the default wait time is 10 seconds.
    2. If no healthy database copies are found, the default wait time is 2 minutes.
    3. If a healthy copy is found, but is slightly behind in replication, the default wait time is 1 minute.
    The maximum possible wait time is 10 minutes.
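
The shape of that response can be sketched in Python. This is only an illustration, not the actual API: the names GuaranteeStatus, GuaranteeResult, and suggested_wait are hypothetical, and only the status values and wait times are taken from the list above.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class GuaranteeStatus(Enum):
    """The three status values described above."""
    RETRY = "Retry"
    SATISFIED = "Satisfied"
    NOT_SATISFIED = "NotSatisfied"


@dataclass
class GuaranteeResult:
    """Hypothetical container for what the caller gets back."""
    status: GuaranteeStatus
    wait: timedelta      # how long to wait before checking again
    reason: str = ""     # filled in for NotSatisfied responses


MAX_WAIT = timedelta(minutes=10)


def suggested_wait(copy_info_collected: bool, healthy_copy_found: bool) -> timedelta:
    """Return the default wait time for each case in the list above."""
    if not copy_info_collected:
        wait = timedelta(seconds=10)
    elif not healthy_copy_found:
        wait = timedelta(minutes=2)
    else:
        # A healthy copy exists but is slightly behind in replication.
        wait = timedelta(minutes=1)
    return min(wait, MAX_WAIT)
```

A caller would read the status first and, for anything other than Satisfied, sleep for the returned wait before querying again.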

DataMoveReplicationConstraint

The value for the DataMoveReplicationConstraint property of the mailbox database determines how many database copies should be evaluated as part of the request. The DataMoveReplicationConstraint property has the following possible values:

  • None: This is the default value when a mailbox database is created. When set to None, the data guarantee API conditions are ignored. This setting should only be used for mailbox databases that are not replicated.
  • SecondCopy: At least one passive database copy must meet the data guarantee API conditions. This is the default value when you add the second copy of a mailbox database.
  • SecondDatacenter: At least one passive database copy in another Active Directory site must meet the data guarantee API conditions.
  • AllDatacenters: At least one passive database copy in each Active Directory site must meet the data guarantee API conditions.
  • AllCopies: All copies of the mailbox database must meet the data guarantee API conditions.
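
To make the difference between these values concrete, here is a minimal Python sketch of which copies each value requires. It is not Exchange code: the Copy class and constraint_satisfied function are hypothetical, meets_conditions stands in for the result of the per-copy checks, and treating "each Active Directory site" as "each site that hosts a passive copy" is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Copy:
    server: str
    site: str               # Active Directory site hosting the copy
    is_active: bool
    meets_conditions: bool  # assumed outcome of the per-copy checks


def constraint_satisfied(constraint: str, copies: List[Copy]) -> bool:
    passives = [c for c in copies if not c.is_active]
    # None ignores the conditions; a database that is not replicated
    # is reported as Satisfied.
    if constraint == "None" or not passives:
        return True
    if constraint == "SecondCopy":
        return any(c.meets_conditions for c in passives)
    if constraint == "SecondDatacenter":
        active_site = next(c.site for c in copies if c.is_active)
        return any(c.meets_conditions for c in passives if c.site != active_site)
    if constraint == "AllDatacenters":
        # Assumption: every site that hosts a passive copy needs one good copy.
        sites = {c.site for c in passives}
        return all(
            any(c.meets_conditions for c in passives if c.site == s) for s in sites
        )
    if constraint == "AllCopies":
        return all(c.meets_conditions for c in passives)
    raise ValueError(f"unknown constraint: {constraint}")
```

With one good passive copy in the local site and one unhealthy copy in a remote site, SecondCopy passes while SecondDatacenter, AllDatacenters, and AllCopies do not.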

Check Replication Health

When the Data Guarantee API is executed to evaluate the health of the database copy infrastructure, the following items are evaluated:

  1. If the DataMoveReplicationConstraint is set to SecondCopy, then for a given replicated database at least one passive database copy must:
    1. Be healthy.
    2. Have a replay queue within 10 minutes of replay lag time.
    3. Have a copy queue length less than 10 logs.
    4. Have an average copy queue length less than 10 logs. The average copy queue length is computed based on the number of times the application has queried the database status.
  2. If the DataMoveReplicationConstraint is set to SecondDatacenter, then for a given database at least one passive database copy in another Active Directory site must:
    1. Be healthy.
    2. Have a replay queue within 10 minutes of replay lag time.
    3. Have a copy queue length less than 10 logs.
    4. Have an average copy queue length less than 10 logs.
  3. If the DataMoveReplicationConstraint is set to AllDatacenters, then for a given database, the active copy must be mounted, and a passive copy in each AD site must:
    1. Be healthy.
    2. Have a replay queue within 10 minutes of replay lag time.
    3. Have a copy queue length less than 10 logs.
    4. Have an average copy queue length less than 10 logs.
  4. If the DataMoveReplicationConstraint is set to AllCopies, then for a given database, the active copy must be mounted, and all passive database copies must:
    1. Be healthy.
    2. Have a replay queue within 10 minutes of replay lag time.
    3. Have a copy queue length less than 10 logs.
    4. Have an average copy queue length less than 10 logs.
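
The four per-copy conditions are the same for every constraint value, so they can be sketched as a single predicate. The Python below is a hedged illustration: copy_passes_health_check and its parameters are hypothetical names, the replay condition is simplified to an upper bound of 10 minutes on how far replay trails (an assumption about what "within 10 minutes" means), and the average is taken over one copy queue sample per status query.

```python
from datetime import timedelta
from typing import List

MAX_QUEUE_LOGS = 10                      # copy queue limits from the list above
MAX_REPLAY_DELAY = timedelta(minutes=10)


def copy_passes_health_check(
    healthy: bool,
    replay_delay: timedelta,
    copy_queue_samples: List[int],
) -> bool:
    """copy_queue_samples holds one copy queue length per status query,
    oldest first; the last entry is the current copy queue length."""
    if not copy_queue_samples:
        # No status has been collected yet, so nothing can be confirmed.
        return False
    current = copy_queue_samples[-1]
    average = sum(copy_queue_samples) / len(copy_queue_samples)
    return (
        healthy
        and replay_delay <= MAX_REPLAY_DELAY
        and current < MAX_QUEUE_LOGS
        and average < MAX_QUEUE_LOGS
    )
```

Note that a copy whose queue has just drained can still fail the check if earlier samples pull the average to 10 logs or more.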

Check Replication Flush

In Exchange 2010 SP1, the Data Guarantee API can also be used to validate that a prerequisite number of database copies have replayed the required transaction logs. This is verified by comparing the last log replayed timestamp with that of the calling service’s commit time stamp (in most cases, this is the time stamp of the last log file that contains required data) plus an additional 5 seconds (to deal with system time clock skews or drift). If the replay time stamp is greater than the commit time, then the DataMoveReplicationConstraint is satisfied.

If replay time stamp is not greater than the commit time, then the DataMoveReplicationConstraint is not satisfied.
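
The comparison itself is small enough to show as a Python sketch. The function name flush_satisfied and the constant name are hypothetical; the 5-second allowance and the strict "greater than" comparison come from the description above.

```python
from datetime import datetime, timedelta

# Allowance for system clock skew or drift, as described above.
CLOCK_SKEW_ALLOWANCE = timedelta(seconds=5)


def flush_satisfied(last_log_replayed: datetime, commit_time: datetime) -> bool:
    """True when the copy has replayed past the commit time plus the allowance."""
    return last_log_replayed > commit_time + CLOCK_SKEW_ALLOWANCE
```

For example, with a commit time of 12:00:00, a copy whose last replayed log is stamped 12:00:06 satisfies the check, while one stamped 12:00:05 or earlier does not.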

Mailbox Replication Service

MRS calls into the Data Guarantee API several times throughout the lifetime of the move request. As documented in Understanding Move Requests, mailbox moves are performed as follows:

  1. The Move Request updates Active Directory and injects a message within the system mailbox of a mailbox database in the target Active Directory site. MRS will query the Data Guarantee API to determine the health of the target database copy infrastructure. As long as the returned status is Satisfied, the move request will continue.
  2. MRS will begin the data move by cloning the mailbox structure in the target mailbox database. MRS will query the Data Guarantee API to determine the health of the target database copy infrastructure. As long as the returned status is Satisfied, the move request will continue.
  3. MRS will perform the initial synchronization by taking a snapshot of the source mailbox and replicating folders and content. Throughout this process, MRS will query the Data Guarantee API every 10 seconds to determine the health of the target database copy infrastructure. As long as the returned status is Satisfied, the move request will continue.
  4. MRS will perform incremental synchronization events and replicate the delta changes (when compared with the initial snapshot). Throughout this process, MRS will query the Data Guarantee API every 10 seconds to determine the health of the target database copy infrastructure. As long as the returned status is Satisfied, the move request will continue.
  5. MRS will lock the source mailbox.
  6. MRS will perform an incremental synchronization to obtain the changes made since the last synchronization event, in addition to copying other data structures within the mailbox. Beginning with SP1, MRS will force the target database to roll the active transaction log file if the log isn’t rolled naturally, thereby ensuring continuous replication can replicate the log file data that contains the moved mailbox synchronization data. MRS determines whether this activity has been successful by using the Check Replication Flush capability within the Data Guarantee API.
  7. MRS will query the Data Guarantee API to determine the health of the target database copy infrastructure. As long as the returned status is Satisfied, the move request will continue.
  8. MRS will update the mailbox-enabled user account in Active Directory to indicate that the move is complete.
  9. MRS will unlock the target mailbox.
  10. MRS will change the state of the mailbox in the source database to soft-deleted. This feature was added in Exchange 2010 SP1 and ensures that in the event the target database is lost, you can still recover the mailbox from its previous database.

For Steps 1 through 4, if at any time the Data Guarantee API returns a NotSatisfied or a Retry response, MRS will queue the move request and retry the query every 30 seconds. MRS will queue the move request for up to 15 minutes before failing the move request. If a Satisfied response is returned within the 15-minute stalling period, MRS will automatically resume the move request.

During Step 6, MRS will wait a maximum of 30 minutes for the Data Guarantee API to return a Satisfied response (retrying the query every 10 seconds). If a Satisfied response is not returned, MRS will fail the mailbox move.
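
The two stalling behaviors differ only in their retry interval and time limit, which the following Python sketch models. It is a simplified illustration, not MRS code: wait_for_satisfied and the two constants are hypothetical names, the check callable stands in for a Data Guarantee API query, and elapsed time is counted from the retry intervals rather than a real clock.

```python
import time
from typing import Callable

# (retry interval, time limit) in seconds, taken from the text above.
STEPS_1_TO_4 = (30, 15 * 60)
STEP_6 = (10, 30 * 60)


def wait_for_satisfied(
    check: Callable[[], str],
    retry_interval_s: int,
    timeout_s: int,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns "Satisfied" or the time limit passes.

    Returns True if the move can continue, False if it should be failed.
    """
    waited = 0
    while True:
        if check() == "Satisfied":
            return True
        if waited >= timeout_s:
            return False
        # NotSatisfied and Retry are handled the same way: wait, then ask again.
        sleep(retry_interval_s)
        waited += retry_interval_s
```

Calling wait_for_satisfied(query, *STEPS_1_TO_4) models the stall during Steps 1 through 4, and wait_for_satisfied(query, *STEP_6) models the wait during Step 6; passing a custom sleep function makes the loop easy to exercise without real delays.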

When a move request has failed, MRS will not resume it automatically. Before running Resume-MoveRequest, the administrator should run Get-MoveRequestStatistics to troubleshoot why the move request failed. After addressing the cause of the failure, the administrator can then run Resume-MoveRequest.

Note that if both the primary mailbox and the personal archive are being moved at the same time, both completions need to be guaranteed for the total move request to proceed.

Determining the Appropriate DataMoveReplicationConstraint for your Environment

You should configure the DataMoveReplicationConstraint property on each mailbox database according to the following:

| If you are deploying... | Set DataMoveReplicationConstraint to |
| --- | --- |
| Mailbox databases that do not have any database copies | None |
| A DAG within a single Active Directory site | SecondCopy |
| A DAG in multiple datacenters using a stretched Active Directory site | SecondCopy |
| A DAG that spans two Active Directory sites and you will have highly available database copies in each site | SecondDatacenter |
| A DAG that spans two Active Directory sites and you will have only lagged database copies in the second site | SecondCopy |
| A DAG that spans three or more Active Directory sites and each site will contain highly available database copies | AllDatacenters |

SecondCopy is recommended when the second site contains only lagged database copies because the Data Guarantee API does not consider data committed until the log file is replayed into the database copy. Because replay is deliberately delayed on a lagged copy, a constraint that depends on that copy (such as SecondDatacenter) will fail the move request, unless the lagged database copy's ReplayLagTime value is less than 30 minutes.
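
The recommendations above can also be read as a small decision function. The Python sketch below is only a restatement of that guidance: recommended_constraint and its parameters are hypothetical, a stretched Active Directory site is counted as one site, and the three-or-more-sites branch assumes each site holds highly available copies because the guidance does not cover other layouts.

```python
def recommended_constraint(
    has_copies: bool,
    ad_sites: int,
    remote_copies_lagged_only: bool = False,
) -> str:
    """Map a deployment description to a DataMoveReplicationConstraint value."""
    if not has_copies:
        return "None"
    if ad_sites <= 1:
        # Single site, including a stretched site across datacenters.
        return "SecondCopy"
    if ad_sites == 2:
        # Lagged-only copies in the second site cannot satisfy the guarantee
        # in time, so fall back to SecondCopy.
        return "SecondCopy" if remote_copies_lagged_only else "SecondDatacenter"
    # Assumption: three or more sites, each with highly available copies.
    return "AllDatacenters"
```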

Conclusion

In order to minimize data loss as a result of moving mailboxes in your highly available Exchange 2010 environment, set the correct DataMoveReplicationConstraint on each mailbox database.

Ross Smith IV