Restore-DatabaseAvailabilityGroup is one of the cmdlets used as part of the datacenter switchover process. Its purpose is to read the DAG’s list of stopped servers and evict the listed servers from the DAG’s underlying cluster. The list of servers in this scenario typically includes all DAG members in the failed primary datacenter. This allows the DAG and the cluster to shrink, and because the cluster now has fewer members, it requires fewer servers to maintain quorum and perform DAG operations.
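As a rough illustration of why a smaller cluster needs fewer servers to keep quorum, the sketch below computes a simple vote majority before and after eviction. The member counts and the helper name Get-QuorumMajority are assumptions chosen for illustration, and a witness (where one is in use) is simply counted as one extra vote.

```powershell
# Hypothetical helper: votes needed for a simple majority of the total votes.
function Get-QuorumMajority {
    param([int]$TotalVotes)
    [int]([math]::Floor($TotalVotes / 2) + 1)
}

# Assumed example: 6 DAG members plus a file share witness = 7 votes.
$before = Get-QuorumMajority -TotalVotes 7

# Assumed example: the 3 members in the failed datacenter are evicted,
# leaving 3 votes (no witness counted here).
$after = Get-QuorumMajority -TotalVotes 3

"Votes needed before eviction: $before"
"Votes needed after eviction:  $after"
```

With these assumed numbers, the majority drops from four votes to two, which the surviving datacenter can supply on its own.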
When the cmdlet runs, it performs two operations:
1) Starts a surviving node in the second datacenter using /forceQuorum.
2) Forcibly evicts each server listed on the stopped servers list.
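To see which servers would be evicted, the stopped servers list can be inspected from the Exchange Management Shell before running the restore. This is a minimal sketch; the DAG name 'DAG' is a placeholder, not a value from a real environment.

```powershell
# 'DAG' is a placeholder; substitute the name of your database availability group.
# StoppedMailboxServers is the list the restore reads; StartedMailboxServers
# shows the members that are expected to remain.
Get-DatabaseAvailabilityGroup -Identity 'DAG' |
    Format-List Name, StoppedMailboxServers, StartedMailboxServers
```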
I have worked support cases where this eviction process fails with an exception. In these cases, Restore-DatabaseAvailabilityGroup issued the eviction while the Cluster service was still initializing (even though the Service Control Manager reported the service as started). While the Cluster service is initializing, it cannot process eviction requests, so the commands failed. For a few customers, the error is consistently reproducible, necessitating a workaround for Restore-DatabaseAvailabilityGroup to work.
Note: Customers should upgrade to Exchange 2010 Service Pack 1 (SP1) before following these instructions. These instructions will only work with Exchange 2010 SP1.
Prior to SP1, the Cluster service must be in a stopped state before Restore-DatabaseAvailabilityGroup can be used. After SP1, the Cluster service no longer needs to be stopped in order to proceed.
The following error may appear when running:
restore-databaseAvailabilityGroup –site <DRSite>
WARNING: Server 'PrimarySiteServer' was marked as stopped in database availability group 'DAG' but couldn't be removed from the cluster. Error: A server-side database availability group administrative operation failed. Error: The operation failed. CreateCluster errors may result from incorrectly configured static addresses. Error: An error occurred while attempting a cluster operation. Error: Cluster API '"EvictClusterNodeEx(node.domain.com) failed with 0x46. Error: The remote server has been paused or is in the process of being started"' failed. [Server: DRSiteServer.domain.com] WARNING: The operation wasn't successful because an error was encountered. You may find more details in log file "C:\ExchangeSetupLogs\DagTasks\dagtask_2010-09-02_14-54-39.766_restore-databaseavailabilitygroup.log".
The error 0x46 translates to:
ERROR_SHARING_PAUSED (winerror.h): The remote server has been paused or is in the process of being started.
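The same lookup can be sketched in PowerShell by handing the code to the .NET Win32Exception class, which asks the operating system for the matching message text. This is only an illustration of the translation step; on a Windows server the message should correspond to the ERROR_SHARING_PAUSED text quoted in the warning.

```powershell
# 0x46 is the code reported by EvictClusterNodeEx in the warning above.
$code = 0x46

# The same code in decimal form.
"Decimal value: $code"

# Ask the operating system for the message text tied to this Win32 error code.
$ex = New-Object System.ComponentModel.Win32Exception -ArgumentList $code
"Message: $($ex.Message)"
```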
Upon further review, the Service Control Manager reported the Cluster service as started, and Failover Cluster Manager could connect to the Cluster service. Despite the error message, the attempt to start the Cluster service by using /forceQuorum was successful.
So the solution is simply to re-run Restore-DatabaseAvailabilityGroup. By then the Cluster service has finished initializing, and the stopped DAG members will be successfully evicted.
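For anyone who wants to script the re-run, a minimal sketch follows. It assumes the eviction failure surfaces as the warning text quoted earlier and that the warning can be captured with the common -WarningVariable parameter in your shell session. The DAG name, site name, retry limit, and delay are all hypothetical values, and the sketch uses the cmdlet's -Identity and -ActiveDirectorySite parameters.

```powershell
# Hypothetical values; substitute your own DAG name and DR Active Directory site.
$dagName     = 'DAG'
$drSite      = 'DRSite'
$maxAttempts = 3     # assumed retry limit
$delaySecs   = 30    # assumed pause to let the Cluster service finish initializing

for ($attempt = 1; $attempt -le $maxAttempts; $attempt++) {
    $warnings = @()
    Restore-DatabaseAvailabilityGroup -Identity $dagName -ActiveDirectorySite $drSite -WarningVariable warnings

    # Look for the eviction warning quoted in this post.
    $evictionFailed = @($warnings | Where-Object { "$_" -like "*couldn't be removed from the cluster*" })
    if ($evictionFailed.Count -eq 0) { break }

    if ($attempt -lt $maxAttempts) {
        Write-Host "Attempt $attempt reported an eviction failure; retrying in $delaySecs seconds."
        Start-Sleep -Seconds $delaySecs
    }
}
```

The cmdlet may prompt for confirmation on each attempt, so this is meant for interactive use. If the warning still appears after the last attempt, check the dagtask log referenced in the warning.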
Great tip...I just hit this one on my SP1 RU5 DAG.....
Has this still not been resolved in Exchange 2010 SP2? Or even Rollup 2? I can reproduce this exact problem in our production environment, albeit intermittently.
There's really nothing to fix here. In SP1 and later we no longer require the cluster service to be stopped in order to run the restore if the remaining nodes are running and have quorum. Prior to SP1 you could end up in a loop. This allows you to simply execute the command again and it should complete with success.
Negative ghost rider. We are and have been experiencing the same as of last Friday 11/9/12. Re-running the command didn't do the trick. Here we are on SP2 with rollup 2, my how far we have come.
We are running Exchange 2010 SP2 rollup 5 v2 and having issues running our BCM datacentre failover testing.
Can you confirm that we need to stop the cluster services in the secondary datacenter prior to running restore-databaseavailabilitygroup? We are noticing that sometimes these services automatically restart afterwards and sometimes they don't. We are not sure exactly whether we need to run this command in our version of Exchange, or what should happen.
Most likely you are hitting a pretty common timing issue.
If you look at the properties of the cluster service, on the recovery actions tab, you'll see that the default action is restart. Every time the cluster service attempts to start after a lost quorum condition, it will eventually be killed (terminated). Upon every termination the restart interval increases.
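The same recovery settings can also be read from the command line. A minimal sketch, run locally on a DAG member:

```powershell
# Show the configured failure (recovery) actions for the Cluster service (ClusSvc).
# sc.exe is spelled out because "sc" alone is an alias for Set-Content in Windows PowerShell.
sc.exe qfailure ClusSvc
```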
When the service has been terminated it visually looks like it is in a stopped state. Then, by the time you come around to running the command, service control manager has issued a restart on it...and the circle continues.
The answer, though, is that the cluster service must be stopped on all remaining nodes before running restore-databaseavailabilitygroup.
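One way to check and stop the service on each surviving node is sketched below. The server names and the 60-second timeout are hypothetical, and the sketch assumes Windows PowerShell, where Get-Service accepts -ComputerName.

```powershell
# Hypothetical names of the surviving DAG members in the secondary datacenter.
$drNodes = 'DRSiteServer1', 'DRSiteServer2'

foreach ($node in $drNodes) {
    # Get-Service -ComputerName is available in Windows PowerShell (not PowerShell 7).
    $svc = Get-Service -ComputerName $node -Name ClusSvc

    if ($svc.Status -ne 'Stopped') {
        Stop-Service -InputObject $svc -Force
    }

    # Wait up to 60 seconds (assumed timeout) for the service to report Stopped.
    $svc.WaitForStatus('Stopped', [TimeSpan]::FromSeconds(60))
    $svc.Refresh()
    "{0}: {1}" -f $node, $svc.Status
}
```

Because of the restart behavior described above, it is worth checking the status again immediately before running the restore.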