High Availability and Site Resilience:
I have noticed many of us getting stuck while performing a DR exercise, typically while restoring DAG services to the production servers or site.
In most of the cases I have observed, the issue is caused by an incorrect action or step, so this post tries to streamline the datacenter switchover process.
By combining the native site resilience capabilities in Microsoft Exchange Server 2010 Service Pack 1 and later with proper planning, a second datacenter can be rapidly activated to serve the failed datacenter's clients.
Note: This process has been tested with two AD sites and three DAG members.
Important: If network or Active Directory infrastructure reliability has been compromised as a result of the primary datacenter failure, we recommend that all messaging services be off until these dependencies are restored to healthy service.
If the DAG is in DAC mode, you can use the Exchange site resilience cmdlets to terminate a partially failed datacenter (if necessary) and activate the Mailbox servers. In DAC mode, the failed datacenter's DAG members are marked as unavailable by using the Stop-DatabaseAvailabilityGroup cmdlet. In some cases, the servers must be marked as unavailable twice (once in each datacenter). Next, the Restore-DatabaseAvailabilityGroup cmdlet is run in the second datacenter; it reduces the database availability group (DAG) membership to the servers that are still operational, thereby reestablishing quorum. If the DAG isn't in DAC mode, you must use the Windows Failover Cluster tools to activate the Mailbox servers. After either process is complete, the database copies that were previously passive in the second datacenter can become active and be mounted. At this point, Mailbox server recovery is complete.
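The DAC-mode sequence described above can be sketched roughly as follows. This is a minimal illustration and not the exact procedure tested for this post: the names DAG1, PrimarySite and DRSite are placeholders, and the commands are assumed to run from the Exchange Management Shell in the surviving datacenter.

```powershell
# Placeholder names (assumptions): DAG1, PrimarySite, DRSite.

# 0. Confirm that the DAG is in DAC mode (expected value: DagOnly).
Get-DatabaseAvailabilityGroup -Identity DAG1 | Format-List Name, DatacenterActivationMode

# 1. Mark the primary-site DAG members as stopped.
#    -ConfigurationOnly updates Active Directory only, which is what you want
#    when the primary Mailbox servers cannot be reached. If they are still
#    reachable, run the cmdlet in the primary site without -ConfigurationOnly
#    first, then repeat it here with the switch.
Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite PrimarySite -ConfigurationOnly

# 2. On each surviving DAG member in the DR site, stop the Cluster service.
Stop-Service -Name ClusSvc

# 3. Shrink the DAG to the surviving members and reestablish quorum.
Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite DRSite

# 4. Check which members are now listed as started and stopped.
Get-DatabaseAvailabilityGroup -Identity DAG1 -Status |
    Format-List Name, StartedMailboxServers, StoppedMailboxServers
```

After step 3 the databases in the DR site should mount on their own if their copies are healthy; if they do not, they can be mounted or activated manually.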
Terminate a partially running datacenter:
Restoring DAG Services in the DR site:
The site switchover is now COMPLETE!
Important: The Hub Transport servers switch over automatically; in other words, the Hub Transport role does not require any manual intervention.
Restoring Service to the Primary Datacenter:
Generally, datacenter failures are either temporary or permanent. With a permanent failure, such as an event that has caused the permanent destruction of a primary datacenter, there's no expectation that the primary datacenter will be activated. However, with a temporary failure (for example, an extended power loss or extensive but repairable damage), there's an expectation that the primary datacenter will eventually be restored to full service.
The process of restoring service to a previously failed datacenter is referred to as a switchback. The steps used to perform a datacenter switchback are similar to the steps used to perform a datacenter switchover. A significant distinction is that datacenter switchbacks are scheduled, and the duration of the outage is often much shorter.
It's important that switchback not be performed until the infrastructure dependencies for Exchange have been reactivated, are functioning and stable, and have been validated. If these dependencies aren't available or healthy, it's likely that the switchback process will cause a longer than necessary outage, and it's possible the process could fail altogether.
Mailbox Server Role Switchback
Important: Steps 4 to 9 must be performed for each database, and each database must be activated on the production servers according to the Primary Active Manager (PAM).
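As a rough sketch of what the Mailbox server switchback can look like in DAC mode, the commands below bring the primary-site members back into the DAG and then move one database back. The names DAG1, PrimarySite, DB01 and MBX-PROD1 are placeholders, and the exact order in your environment may differ.

```powershell
# Placeholder names (assumptions): DAG1, PrimarySite, DB01, MBX-PROD1.

# Bring the primary-site Mailbox servers back into the DAG.
Start-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite PrimarySite

# Run with no other parameters so the DAG re-evaluates its quorum and witness settings.
Set-DatabaseAvailabilityGroup -Identity DAG1

# Wait until the copies in the primary site are healthy and the queues are low.
Get-MailboxDatabaseCopyStatus -Identity DB01 |
    Format-Table Name, Status, CopyQueueLength, ReplayQueueLength -AutoSize

# Activate the database on the production server (repeat for each database).
Move-ActiveMailboxDatabase -Identity DB01 -ActivateOnServer MBX-PROD1
```

Checking the copy status before each move helps avoid activating a copy that is still resynchronizing after the outage.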
The failback is now COMPLETE!
Caution: Before planning a datacenter switchover, make sure the alternate file share witness (FSW) is set up with the appropriate permissions. The Exchange Trusted Subsystem group and the DAG cluster name object (CNO) need full permission on the FSW.
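One way to pre-stage and verify the alternate witness ahead of a switchover is sketched below. The server name HUB-DR1 and the directory C:\DAGFSW\DAG1 are placeholders chosen for illustration, not values from this environment.

```powershell
# Placeholder names (assumptions): DAG1, HUB-DR1, C:\DAGFSW\DAG1.

# Record the alternate witness server and directory on the DAG object.
Set-DatabaseAvailabilityGroup -Identity DAG1 `
    -AlternateWitnessServer HUB-DR1 `
    -AlternateWitnessDirectory C:\DAGFSW\DAG1

# Confirm the primary and alternate witness settings.
Get-DatabaseAvailabilityGroup -Identity DAG1 |
    Format-List Name, WitnessServer, WitnessDirectory, AlternateWitnessServer, AlternateWitnessDirectory
```

If the alternate witness is hosted on a server without Exchange installed, the Exchange Trusted Subsystem group typically has to be added to that server's local Administrators group first.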
If a step fails, review the DAG task logs (DagTasks) to identify the issue.
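A quick way to find the most recent DAG task logs is sketched below. It assumes the default log location C:\ExchangeSetupLogs\DagTasks on the server where the cmdlet was run; adjust the path if your system drive or setup differs.

```powershell
# Assumption: DAG task logs are in the default folder on the local server.
$logPath = 'C:\ExchangeSetupLogs\DagTasks'

# List the five most recently written log files.
Get-ChildItem -Path $logPath |
    Sort-Object LastWriteTime -Descending |
    Select-Object -First 5 Name, LastWriteTime

# Show the last 50 lines of the newest log.
$latest = Get-ChildItem -Path $logPath | Sort-Object LastWriteTime -Descending | Select-Object -First 1
Get-Content -Path $latest.FullName | Select-Object -Last 50
```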