High Availability and Site Resilience:

I have noticed that many of us get stuck while performing a DR exercise, typically while restoring DAG services to the production servers/site.

In most cases the issue occurs because of an incorrect action or step, so this post attempts to streamline the datacenter switchover process.

By combining the native site resilience capabilities in Microsoft Exchange Server 2010 Service Pack 1 and later with proper planning, a second datacenter can be rapidly activated to serve the failed datacenter's clients.

 

Note: This process has been tested with two AD sites and three DAG members.

Environment used:

  • Exchange 2010 SP1 and later
  • Windows Server 2008 R2 SP1
  • Active Directory sites: Two
  • DAG members: Three (two production servers and one DR server).
  • HUB/CAS: One or more in each site.
  • DatacenterActivationMode: DagOnly
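
The steps below rely on the DAG being in DAC mode. As a minimal sketch, using the DAGName placeholder from this post, you can confirm (and, if needed, enable) it like this:

  # Check the current activation mode of the DAG
  Get-DatabaseAvailabilityGroup -Identity DAGName | Format-List Name,DatacenterActivationMode

  # Enable DAC mode if it is currently set to Off
  Set-DatabaseAvailabilityGroup -Identity DAGName -DatacenterActivationMode DagOnly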

 

Important: If network or Active Directory infrastructure reliability has been compromised as a result of the primary datacenter failure, we recommend that all messaging services remain offline until these dependencies are restored to healthy service.

 

  1. Terminate a partially running datacenter:  This step involves terminating Mailbox and Unified Messaging services in the primary datacenter, if any services are still running. This is particularly important for the Mailbox server role because it uses an active/passive high availability model. If services in a partially failed datacenter aren't stopped, it's possible for problems from the partially failed datacenter to negatively affect the services during a switchover back to the primary datacenter.

 

  2. Validate and confirm the prerequisites for the second datacenter:  This step can be performed in parallel with step 1 because validation of the health of the infrastructure dependencies in the second datacenter is largely independent of the first datacenter services. Each organization typically requires its own method for performing this step. For example, you may decide to complete this step by reviewing health information collected and filtered by an infrastructure monitoring application, or by using a tool that's unique to your organization's infrastructure. This is a critical step, because activating the second datacenter when its infrastructure is unhealthy and unstable is likely to yield poor results (a validation sketch follows this list).

 

  3. Activate the Mailbox servers:  This step begins the process of activating the second datacenter. This step can be performed in parallel with step 4 because the Microsoft Exchange services can handle database outages and recover. Activating the Mailbox servers involves a process of marking the failed servers from the primary datacenter as unavailable followed by activation of the servers in the second datacenter. The activation process for Mailbox servers depends on whether the DAG is in database activation coordination (DAC) mode. For more information about database activation coordination mode, see Understanding Datacenter Activation Coordination Mode.

If the DAG is in DAC mode, you can use the Exchange site resilience cmdlets to terminate a partially failed datacenter (if necessary) and activate the Mailbox servers. For example, in DAC mode, this step is performed by using the Stop-DatabaseAvailabilityGroup cmdlet. In some cases, the servers must be marked as unavailable twice (once in each datacenter). Next, the Restore-DatabaseAvailabilityGroup cmdlet is run to restore the remaining members of the database availability group (DAG) in the second datacenter by reducing the DAG members to those that are still operational, thereby reestablishing quorum. If the DAG isn't in DAC mode, you must use the Windows Failover Cluster tools to activate the Mailbox servers. After either process is complete, the database copies that were previously passive in the second datacenter can become active and be mounted. At this point, Mailbox server recovery is complete.

  4. Activate the other server roles:  This involves using the URL mapping information and the Domain Name System (DNS) change methodology to perform all required DNS updates. The mapping information describes what DNS changes to perform. The amount of time required to complete the update depends on the methodology used and the Time to Live (TTL) settings on the DNS record (and whether the deployment’s infrastructure honors the TTL).
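
As a rough illustration of the validation in step 2, the checks on the DR side might look like the sketch below. This is a minimal example, not an exhaustive validation, and it assumes the DR mailbox server is MailboxServer03 (the placeholder name used later in this post):

  # Replication health of the database copies on the DR mailbox server
  Test-ReplicationHealth -Identity MailboxServer03

  # Core Exchange service health on the DR server (also run against the DR HUB/CAS servers)
  Test-ServiceHealth -Server MailboxServer03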

 

Terminate a partially running datacenter: 

 

  • Confirm that the database copies are in a healthy state on all nodes (a health-check sketch follows this list).
  • Shut down Nodes A and B in the production site.
  • Run the following commands in sequence on Node 3 in the DR site:
  1. Stop-DatabaseAvailabilityGroup -identity DAGName -MailboxServer MailboxServer01 -ConfigurationOnly
  2. Stop-DatabaseAvailabilityGroup -identity DAGName -MailboxServer MailboxServer02 -ConfigurationOnly
  3. net stop clussvc
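
A minimal health-check sketch for the first bullet above, using the placeholder server names from this post:

  # Copy status and queue lengths for every database copy on each DAG member
  Get-MailboxDatabaseCopyStatus -Server MailboxServer01 | Format-Table Name,Status,CopyQueueLength,ReplayQueueLength
  Get-MailboxDatabaseCopyStatus -Server MailboxServer02 | Format-Table Name,Status,CopyQueueLength,ReplayQueueLength
  Get-MailboxDatabaseCopyStatus -Server MailboxServer03 | Format-Table Name,Status,CopyQueueLength,ReplayQueueLength

The two Stop-DatabaseAvailabilityGroup commands with -ConfigurationOnly mark the down production servers as stopped in Active Directory without trying to contact them, and net stop clussvc stops the Cluster service on the DR node so that Restore-DatabaseAvailabilityGroup can later reestablish quorum with the surviving member.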

 

 

Restoring DAG Services in the DR site:

 

 

  1. Restore-DatabaseAvailabilityGroup -identity DAGName -AlternateWitnessDirectory "C:\AlternateWitnessDirectoryPath" -AlternateWitnessServer AlternateWitnessServer

 

  2. Move-ActiveMailboxDatabase -Identity "DatabaseName" -ActivateOnServer MailboxServer03
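
After the restore and move, a quick verification sketch (again using the placeholder names from this post) can confirm that quorum was reestablished with the alternate witness and that the databases are mounted on the DR node:

  # DAG membership, started/stopped servers, and witness in use after the restore
  Get-DatabaseAvailabilityGroup -Identity DAGName -Status | Format-List Name,Servers,StartedMailboxServers,StoppedMailboxServers,WitnessServer,WitnessShareInUse

  # Databases should now be mounted on the DR server
  Get-MailboxDatabaseCopyStatus -Server MailboxServer03 | Format-Table Name,Status,ContentIndexState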

 

  • Now make the required CAS, HUB, and other DNS changes as described in step 4 above. For the RPC Client Access endpoint, repoint the databases that were hosted on the production servers to the CAS array/CAS name used in the DR site:
  1. Get-MailboxDatabase -Server MailboxServer01 | Set-MailboxDatabase -RpcClientAccessServer <CASArrayName/CASName>
  2. Get-MailboxDatabase -Server MailboxServer02 | Set-MailboxDatabase -RpcClientAccessServer <CASArrayName/CASName>
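
The exact DNS changes depend on your namespace design, so the following is only an illustrative sketch. It assumes a hypothetical zone contoso.com, a CAS array record named outlook, a DR CAS IP of 10.1.1.10, and a DNS server named DNS01; substitute your own values:

  # Verify the databases now point at the intended RPC Client Access endpoint
  Get-MailboxDatabase | Format-Table Name,RpcClientAccessServer

  # Repoint the CAS array / namespace record at the DR site (run against your DNS server)
  dnscmd DNS01 /RecordDelete contoso.com outlook A /f
  dnscmd DNS01 /RecordAdd contoso.com outlook 300 A 10.1.1.10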

The site switchover is now COMPLETE!

Important: The HUB (Hub Transport) role switches over automatically; the HUB servers do not require any manual intervention.

 

Restoring Service to the Primary Datacenter:

Generally, datacenter failures are either temporary or permanent. With a permanent failure, such as an event that has caused the permanent destruction of a primary datacenter, there's no expectation that the primary datacenter will be activated. However, with a temporary failure (for example, an extended power loss or extensive but repairable damage), there's an expectation that the primary datacenter will eventually be restored to full service.

The process of restoring service to a previously failed datacenter is referred to as a switchback. The steps used to perform a datacenter switchback are similar to the steps used to perform a datacenter switchover. A significant distinction is that datacenter switchbacks are scheduled, and the duration of the outage is often much shorter.

It's important that switchback not be performed until the infrastructure dependencies for Exchange have been reactivated, are functioning and stable, and have been validated. If these dependencies aren't available or healthy, it's likely that the switchback process will cause a longer than necessary outage, and it's possible the process could fail altogether.

 

Mailbox Server Role Switchback

  • Start the servers in the primary site.
  • Run the below commands in sequence in the Exchange Management Shell:
  1. Start-DatabaseAvailabilityGroup -identity DAGName -MailboxServer MailboxServer01
  2. Start-DatabaseAvailabilityGroup -identity DAGName -MailboxServer MailboxServer02
  3. Set-DatabaseAvailabilityGroup -Identity DAGName
  4. Dismount-Database "DatabaseName1"
  5. Dismount-Database "DatabaseName2"
  6. Move-ActiveMailboxDatabase "DatabaseName1" -ActivateOnServer MailboxServer01
  7. Move-ActiveMailboxDatabase "DatabaseName2" -ActivateOnServer MailboxServer02
  8. Mount-Database "DatabaseName1"
  9. Mount-Database "DatabaseName2"

 

Important: Steps 4 through 9 must be repeated for each database, and each database must be activated on the appropriate production server as per its activation preference.
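
A rough sketch of that per-database loop, assuming every database moves back to MailboxServer01; in practice, choose the -ActivateOnServer value per each database's activation preference:

  # Repeat the dismount / move / mount sequence for each database hosted on the DR node
  foreach ($db in Get-MailboxDatabase -Server MailboxServer03) {
      Dismount-Database -Identity $db.Name -Confirm:$false
      Move-ActiveMailboxDatabase -Identity $db.Name -ActivateOnServer MailboxServer01 -Confirm:$false
      Mount-Database -Identity $db.Name
  }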

  • Change the DNS entries for the CAS servers back to the primary site.

 

The failback is now COMPLETE!

 

Caution: Before planning a datacenter switchover, make sure the alternate FSW (file share witness) is configured with the appropriate permissions: the Exchange Trusted Subsystem group and the DAG CNO need full permission on the FSW.
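
A minimal sketch of granting those permissions on the alternate witness server, assuming a hypothetical domain CONTOSO, a witness directory of C:\DAGName-FSW, and a DAG CNO computer account named DAGName$; adjust the names and paths to your environment:

  # NTFS full control on the witness directory for the Exchange Trusted Subsystem group and the DAG CNO
  icacls "C:\DAGName-FSW" /grant "CONTOSO\Exchange Trusted Subsystem:(OI)(CI)F"
  icacls "C:\DAGName-FSW" /grant "CONTOSO\DAGName$:(OI)(CI)F"

If the witness server is not an Exchange server, the Exchange Trusted Subsystem group also needs to be a member of its local Administrators group.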

 

If a failure occurs, review the DAGTask log to identify the issue.

Path: C:\ExchangeSetupLogs\DAGTask\DAGtask.txt

 

 

Thank you,

Mukut-