When running Restore-DatabaseAvailabilityGroup as part of the datacenter switchover process, servers in the secondary datacenter are forced online from a quorum and cluster perspective, and servers in the primary datacenter are evicted from the DAG’s cluster. When nodes in the primary datacenter come back online and network connectivity is restored, these restored nodes are not aware that any changes to cluster membership have occurred. The cluster services on the nodes in the primary datacenter will attempt to join/form a cluster with the nodes running in the secondary datacenter. When this occurs, the nodes in the secondary datacenter inform the nodes in the primary datacenter that they were evicted.
After a datacenter switchover has occurred, unless the original datacenter is gone or otherwise unrecoverable, eventually services in the primary datacenter will be restored. When services are restored, including full network connectivity, database availability group (DAG) administrators can begin the switchback process by using the Start-DatabaseAvailabilityGroup cmdlet.
Before performing a switchback, you can perform the following tasks to verify that it is safe to run Start-DatabaseAvailabilityGroup for servers in the primary datacenter.
The first task is to ensure that the following events are present in the system log of the servers on the StoppedMailboxServers list:
Log Name: System
Source: Service Control Manager
Date: 5/27/2012 1:13:35 PM
Event ID: 7040
Task Category: None
Level: Information
Keywords: Classic
User: SYSTEM
Computer: MBX-1.exchange.msft
Description: The start type of the Cluster Service service was changed from auto start to disabled.

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 5/27/2012 1:13:35 PM
Event ID: 4621
Task Category: Cluster Evict/Destroy Cleanup
Level: Information
Keywords:
User: SYSTEM
Computer: MBX-1.exchange.msft
Description: This node was successfully removed from the cluster.

Log Name: System
Source: Service Control Manager
Date: 5/27/2012 1:13:35 PM
Event ID: 7036
Task Category: None
Level: Information
Keywords: Classic
User: N/A
Computer: MBX-1.exchange.msft
Description: The Cluster Service service entered the stopped state.
In this example, MBX-1 was informed of the eviction, had its cluster services cleaned up, and had its Cluster service startup type set to Disabled.

The second task is to verify that the Cluster service startup type is set to Disabled. You can use the Services snap-in to verify this.
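If you prefer to script the first two checks rather than use Event Viewer and the Services snap-in, the following is a minimal sketch. It assumes the server name shown (a placeholder taken from the example events), that remote event log and WMI access to the server are permitted, and that the Cluster service uses its usual service name, ClusSvc.

```powershell
# Placeholder server name; replace with a server from the StoppedMailboxServers list.
$server = 'MBX-1.exchange.msft'

# Task 1: look for the eviction-related events (7040, 4621, 7036) in the System log.
# Events 7040 and 7036 are logged for every service, so keep only those that
# mention the Cluster service.
Get-WinEvent -ComputerName $server -FilterHashtable @{
        LogName = 'System'
        Id      = 7040, 4621, 7036
    } -ErrorAction SilentlyContinue |
    Where-Object { $_.Id -eq 4621 -or $_.Message -like '*Cluster Service*' } |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, ProviderName, Message -AutoSize -Wrap

# Task 2: check the Cluster service startup type. StartMode should be "Disabled".
Get-WmiObject -Class Win32_Service -ComputerName $server -Filter "Name='ClusSvc'" |
    Select-Object Name, StartMode, State
```

If the first command returns nothing, or StartMode is anything other than Disabled, treat the cleanup as incomplete.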
The third and last task is to verify that the cluster registry has been successfully cleaned up. This is an important step because any remnants of the cluster registry can lead the server to believe it is still in a cluster even though it has been evicted. You can open Registry Editor and navigate to HKEY_LOCAL_MACHINE (HKLM). If there is a hive called Cluster under the root of HKLM, then the cleanup did not complete successfully.
Here is an example of a node where a successful cleanup was performed:
Here is an example of a node where the Cluster service has not been successfully cleaned up:
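The registry check can also be done from PowerShell instead of Registry Editor. This is a minimal sketch, assuming it is run locally on the evicted node; the messages are illustrative only.

```powershell
# Run locally on the evicted node.
# Test-Path returns True if a Cluster key still exists under the root of HKLM.
if (Test-Path -Path 'HKLM:\Cluster') {
    Write-Warning 'HKLM\Cluster still exists: the cluster cleanup did not complete.'
} else {
    Write-Output 'HKLM\Cluster not found: the cluster registry was cleaned up.'
}
```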
Any time part of the cleanup process fails, it typically means that Start-DatabaseAvailabilityGroup will also fail. If any of these three tasks shows that cleanup did not complete successfully, the issue is relatively easy to fix. Administrators can force the cleanup to occur by running a cluster command.
Cluster node /force
Windows 2008 R2 / Windows 2012:
Clear-ClusterNode <NODENAME> -Force
Some administrators proactively include this as a step in their datacenter switchover documentation when bringing resources back to the primary datacenter. This is not a bad idea. Proactively running this command, even on a node that was cleaned up successfully, has no ill effects and eliminates the need to perform the three tasks listed above.
Therefore, I recommend administrators either incorporate the three tasks or proactively run the cleanup command as a part of their datacenter switchover procedures.
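As an illustration of the proactive approach, the sketch below runs the cleanup against every server on the StoppedMailboxServers list. It makes several assumptions: 'DAG1' is a hypothetical DAG name, the commands are run from the Exchange Management Shell on a machine with the failover clustering PowerShell module installed, and each entry on the list resolves as a host name. Only run it against nodes that have actually been evicted as part of the switchover.

```powershell
# Load the failover clustering cmdlets (provides Clear-ClusterNode).
Import-Module FailoverClusters

# 'DAG1' is a hypothetical DAG name; replace it with your own.
$dag = Get-DatabaseAvailabilityGroup -Identity 'DAG1'

foreach ($server in $dag.StoppedMailboxServers) {
    # Force cleanup of any leftover cluster configuration on the evicted node.
    # "$server" converts the list entry to a string (assumed to be a resolvable name).
    Write-Output "Cleaning up cluster configuration on $server"
    Clear-ClusterNode -Name "$server" -Force
}
```

Once the loop completes without errors, the servers should be ready for Start-DatabaseAvailabilityGroup.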
Datacenter Activation Coordination Series:
Part 1: My databases do not mount automatically after I enabled Datacenter Activation Coordination (http://aka.ms/F6k65e)
Part 2: Datacenter Activation Coordination and the File Share Witness (http://aka.ms/Wsesft)
Part 3: Datacenter Activation Coordination and the Single Node Cluster (http://aka.ms/N3ktdy)
Part 4: Datacenter Activation Coordination and the Prevention of Split Brain (http://aka.ms/C13ptq)
Part 5: Datacenter Activation Coordination: How do I Force Automount Concensus? (http://aka.ms/T5sgqa)
Part 6: Datacenter Activation Coordination: Who has a say? (http://aka.ms/W51h6n)
Part 7: Datacenter Activation Coordination: When to run start-databaseavailabilitygroup to bring members back into the DAG after a datacenter switchover. (http://aka.ms/Oieqqp)
Part 8: Datacenter Activation Coordination: Stop! In the Name of DAG... (http://aka.ms/Uzogbq)
Part 9: Datacenter Activation Coordination: An error cause a change in the current set of domain controllers (http://aka.ms/Qlt035)