Tim McMichael

Navigating the world of high availability...

Part 6: Datacenter Activation Coordination – Who has a say?


I recently worked with a customer that had a three-member database availability group (DAG) that was extended to two sites in a site resilience configuration. During scheduled maintenance in the primary datacenter, the customer encountered an interesting situation. In this case, the customer had two DAG members deployed in their primary datacenter and the third member deployed in a remote datacenter. In addition, Datacenter Activation Coordination (DAC) mode was enabled for their DAG.

 

The maintenance required shutting down the servers in the primary datacenter.  After the maintenance tasks were completed, the servers in the primary datacenter were powered on.  It was then noted that all of the databases were dismounted on the servers in the primary datacenter.  This was verified with Get-MailboxDatabase -Status | fl name,mounted:

 

[PS] C:\>Get-MailboxDatabase -Status | fl name,mounted


Name    : Mailbox Database 1252068500
Mounted : False

Name    : Mailbox Database 1370762657
Mounted : False

Name    : Mailbox Database 1511135053
Mounted : False

Name    : Mailbox Database 1757981393
Mounted : False

 

So, the administrator issued a mount command, but an error was returned:

 

[PS] C:\>Mount-Database "Mailbox Database 1370762657"
Couldn't mount the database that you specified. Specified database: Mailbox Database 1370762657; Error code: An Active Manager operation failed. Error An Active Manager operation encountered an error. To perform this operation, the server must be a member of a database availability group, and the database availability group must have quorum. Error: Automount consensus not reached.. [Server: MBX-1.exchange.msft].
    + CategoryInfo          : InvalidOperation: (Mailbox Database 1370762657:ADObjectId) [Mount-Database], InvalidOperationException
    + FullyQualifiedErrorId : FE7E9C2B,Microsoft.Exchange.Management.SystemConfigurationTasks.MountDatabase

 

The error indicates that the DAG members must have quorum and automount consensus in order to mount databases.  Because DAC mode was enabled for the DAG, all of the following conditions must be met for automount consensus to be reached:

 

  • The node must be a member of a cluster.
  • The cluster must have quorum.
  • The node must be able to contact another member with a Datacenter Activation Coordination Protocol (DACP) bit set to 1, or it must be able to contact all other servers on the started Mailbox servers list.
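
The three conditions above can be sketched as a small decision function.  This is only an illustrative model, not how Active Manager is actually implemented: Test-AutomountConsensus, its parameters, and the Responded and DacpBit keys are hypothetical names chosen for this example, and a member counts as having responded only if it can answer the DACP bit inquiry.

```powershell
# Illustrative model only; Test-AutomountConsensus is a hypothetical helper,
# not an Exchange cmdlet.
function Test-AutomountConsensus {
    param(
        [bool]$IsClusterMember,          # this node is a member of a cluster
        [bool]$ClusterHasQuorum,         # the cluster currently has quorum
        [object[]]$OtherMembers = @()    # other servers on the started list
    )

    # Conditions 1 and 2: cluster membership and quorum.
    if (-not ($IsClusterMember -and $ClusterHasQuorum)) { return $false }

    # Only members that answered the DACP bit inquiry are considered.
    $responders = @($OtherMembers | Where-Object { $_.Responded })

    # Condition 3, first option: a responder advertises a DACP bit of 1.
    if (@($responders | Where-Object { $_.DacpBit -eq 1 }).Count -gt 0) { return $true }

    # Condition 3, second option: every other started server responded.
    return ($responders.Count -eq @($OtherMembers).Count)
}

# View from MBX-1 after the restart (assumed values for illustration):
# MBX-2 answers with a DACP bit of 0, MBX-3 does not answer at all.
$others = @(
    @{ Name = 'MBX-2'; Responded = $true;  DacpBit = 0 },
    @{ Name = 'MBX-3'; Responded = $false; DacpBit = 0 }
)
Test-AutomountConsensus -IsClusterMember $true -ClusterHasQuorum $true -OtherMembers $others
```

In this model the call returns False, because no responder advertises a DACP bit of 1 and one started server gave no answer.  Changing Responded to $true for MBX-3 makes the same call return True.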

 

When the DAG members in the primary datacenter were shut down, the remaining DAG member lost quorum.  In response to that Cluster service state change, the DACP bit of the third member changed to 0.  When the servers in the primary datacenter restarted, they were therefore unable to contact another DAG member with a DACP bit set to 1.  Reviewing the properties of the DAG, we saw that all three DAG members were on the started Mailbox servers list:

 

[PS] C:\>Get-DatabaseAvailabilityGroup -Identity DAG | fl name,startedmailboxservers,stoppedmailboxservers


Name                  : DAG
StartedMailboxServers : {MBX-3.exchange.msft, MBX-2.exchange.msft, MBX-1.exchange.msft}
StoppedMailboxServers : {}

 

The servers in the primary datacenter were able to contact the Microsoft Exchange Replication service on all servers on the started Mailbox servers list.  So why was automount consensus not reached?

 

When reviewing the status of servers in the cluster, we noted that the server in the remote datacenter was marked as down:

 

Import-Module FailoverClusters

Get-ClusterNode | fl name,state

Name  : mbx-1
State : Up

Name  : mbx-2
State : Up

Name  : mbx-3
State : Down

Why did MBX-3 report a state of Down?  Traditionally, when a lost quorum condition is encountered, we expect the Cluster service on the servers where quorum was lost to terminate.  Looking at the properties of the Cluster service, we see that the default action is to restart the service when it terminates.

 

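[Image: Cluster service properties showing the default action of restarting the service when it terminates]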

 

In reviewing the System log on MBX-3 for events that occurred at the time MBX-1 and MBX-2 were shut down, we saw the following events:

 

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          8/6/2012 10:03:04 AM
Event ID:      1177
Task Category: Quorum Manager
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      MBX-3.exchange.msft
Description:
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

 

Log Name:      System
Source:        Service Control Manager
Date:          8/6/2012 10:03:04 AM
Event ID:      7036
Task Category: None
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      MBX-3.exchange.msft
Description:
The Cluster Service service entered the stopped state.

 

We saw that the Cluster service acknowledged that there were no longer enough servers to maintain quorum, and that the Cluster service entered the stopped state gracefully; it did not terminate.  When MBX-1 and MBX-2 were gracefully shut down in this example, they communicated with all servers in the cluster, announcing that they were leaving.  In other words, the servers in the primary datacenter did not just unexpectedly disappear, as they would in the case of a network failure or other catastrophic failure.

 

Since MBX-3 was informed that the other servers were leaving, it determined that not enough votes would remain to satisfy quorum, and it gracefully stopped its Cluster service rather than terminating it.  When MBX-1 and MBX-2 were brought back online, they formed a cluster using their own votes (only 2 of 3 votes are necessary) and then began the process of determining automount consensus.  Since MBX-3 was a member of the cluster but did not have its Cluster service started, it could not respond to the DACP bit inquiry.  The condition that a server must contact all servers on the started Mailbox servers list of the DAG when no server advertises a DACP bit of 1 was therefore not met.
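
The vote arithmetic behind this can be sketched in a few lines.  This is a simplified model that assumes one vote per DAG member and no witness vote; Get-QuorumMajority and Test-HasQuorum are hypothetical helper names for this example, not cluster cmdlets.

```powershell
# Simplified model: one vote per node, no witness (an assumption for this sketch).
function Get-QuorumMajority {
    param([int]$TotalVotes)
    # A majority is more than half of the configured votes.
    return [int][math]::Floor($TotalVotes / 2.0) + 1
}

function Test-HasQuorum {
    param([int]$VotesPresent, [int]$TotalVotes)
    return ($VotesPresent -ge (Get-QuorumMajority -TotalVotes $TotalVotes))
}

Get-QuorumMajority -TotalVotes 3               # votes needed in a three-member DAG
Test-HasQuorum -VotesPresent 1 -TotalVotes 3   # MBX-3 on its own
Test-HasQuorum -VotesPresent 2 -TotalVotes 3   # MBX-1 and MBX-2 together
```

With three votes the majority is 2, so MBX-3 alone cannot keep quorum, while MBX-1 and MBX-2 can form a cluster without it.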

 

To resolve this, the administrator simply needs to start the Cluster service on MBX-3.  In most cases the databases will then mount automatically, because the criteria for automount consensus can now be satisfied.  Here is a sample showing databases automatically mounted after starting the Cluster service on MBX-3.

 

PS C:\> Get-ClusterNode | fl name,state


Name  : mbx-1
State : Up

Name  : mbx-2
State : Up

Name  : mbx-3
State : Up


[PS] C:\>Get-MailboxDatabase -Status | fl name,mounted


Name    : Mailbox Database 1252068500
Mounted : True

Name    : Mailbox Database 1757981393
Mounted : True

Name    : Mailbox Database 1370762657
Mounted : True

Name    : Mailbox Database 1511135053
Mounted : True
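
For reference, starting the Cluster service on MBX-3 as described above might look like the following.  This is a sketch rather than the exact commands used in this case: it assumes the FailoverClusters module is available and that the commands are run from a node whose Cluster service is already running.

```powershell
Import-Module FailoverClusters

# Start the Cluster service on the node that reports a state of Down.
Start-ClusterNode -Name mbx-3

# Confirm that every node now reports a state of Up.
Get-ClusterNode | fl name,state
```

Alternatively, the Cluster service can be started locally on MBX-3 through the Services console.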

To illustrate the difference, here is an example where MBX-1 and MBX-2 were powered off instead of being gracefully shut down.  The events on MBX-3 show that the servers left unexpectedly and the Cluster service was terminated due to a lost quorum condition.

 

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          8/6/2012 12:15:00 PM
Event ID:      1135
Task Category: Node Mgr
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      MBX-3.exchange.msft
Description:
Cluster node 'MBX-2' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          8/6/2012 12:15:00 PM
Event ID:      1135
Task Category: Node Mgr
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      MBX-3.exchange.msft
Description:
Cluster node 'MBX-1' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          8/6/2012 12:15:00 PM
Event ID:      1177
Task Category: Quorum Manager
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      MBX-3.exchange.msft
Description:
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Log Name:      System
Source:        Service Control Manager
Date:          8/6/2012 12:15:00 PM
Event ID:      7036
Task Category: None
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      MBX-3.exchange.msft
Description:
The Cluster Service service entered the stopped state.

Log Name:      System
Source:        Service Control Manager
Date:          8/6/2012 12:15:00 PM
Event ID:      7024
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      MBX-3.exchange.msft
Description:
The Cluster Service service terminated with service-specific error A quorum of cluster nodes was not present to form a cluster..

Log Name:      System
Source:        Service Control Manager
Date:          8/6/2012 12:15:00 PM
Event ID:      7031
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      MBX-3.exchange.msft
Description:
The Cluster Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.

Log Name:      System
Source:        Service Control Manager
Date:          8/6/2012 12:16:01 PM
Event ID:      7036
Task Category: None
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      MBX-3.exchange.msft
Description:
The Cluster Service service entered the running state.

Based on the events present, the Cluster service was terminated and the Service Control Manager issued a restart.  When MBX-1 and MBX-2 come back up, MBX-3 will successfully join the cluster.  Thus in this scenario, each DAG member can contact all servers on the started Mailbox servers list and receive a response.  Automount consensus can then be reached, and the databases will mount automatically.
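
The difference between the two event sequences can be summarized with a small classifier.  This is only a sketch based on the event IDs shown in this post: Get-ClusterServiceExitType is a hypothetical helper name, and the event IDs would need to be collected from the System log on the affected node first (for example with Get-WinEvent, which is an assumption about how the input is gathered).

```powershell
# Hypothetical helper: classifies how the Cluster service went down, based
# only on the event IDs seen in this post's two scenarios.
function Get-ClusterServiceExitType {
    param([int[]]$EventIds = @())

    # 1177: the Cluster service is shutting down because quorum was lost.
    $lostQuorum = $EventIds -contains 1177

    # 7024 / 7031: the Service Control Manager logged a termination,
    # and 7031 schedules the automatic restart.
    $terminated = ($EventIds -contains 7024) -or ($EventIds -contains 7031)

    if ($lostQuorum -and $terminated) { return 'Terminated' }
    if ($lostQuorum)                  { return 'GracefulStop' }
    return 'Unknown'
}

# Graceful shutdown of MBX-1 and MBX-2: the service stays stopped.
Get-ClusterServiceExitType -EventIds 1177,7036

# MBX-1 and MBX-2 powered off: the service is restarted automatically.
Get-ClusterServiceExitType -EventIds 1135,1135,1177,7036,7024,7031,7036
```

A result of GracefulStop corresponds to the case where the Cluster service must be started manually before automount consensus can be reached.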

 

========================================================

Datacenter Activation Coordination Series:

 

Part 1:  My databases do not mount automatically after I enabled Datacenter Activation Coordination (http://aka.ms/F6k65e)
Part 2:  Datacenter Activation Coordination and the File Share Witness (http://aka.ms/Wsesft)
Part 3:  Datacenter Activation Coordination and the Single Node Cluster (http://aka.ms/N3ktdy)
Part 4:  Datacenter Activation Coordination and the Prevention of Split Brain (http://aka.ms/C13ptq)
Part 5:  Datacenter Activation Coordination:  How do I Force Automount Concensus? (http://aka.ms/T5sgqa)
Part 6:  Datacenter Activation Coordination:  Who has a say?  (http://aka.ms/W51h6n)
Part 7:  Datacenter Activation Coordination:  When to run start-databaseavailabilitygroup to bring members back into the DAG after a datacenter switchover.  (http://aka.ms/Oieqqp)
Part 8:  Datacenter Activation Coordination:  Stop!  In the Name of DAG... (http://aka.ms/Uzogbq)
Part 9:  Datacenter Activation Coordination:  An error cause a change in the current set of domain controllers (http://aka.ms/Qlt035)

========================================================

Comments
  • I am very interested in learning about DAGs with DAC mode, and your blog gives me a very clear picture of it. Thank you very much for sharing it. We are looking forward to more real-world scenarios from you.

  • @Gangairyan... Thanks - I'm glad this is helpful. TIMMCMIC

  • What I don't understand is how the server got out of the "Stopped clustered mailbox" servers list in your scenario. Is there a possibility that the -ConfigurationOnly switch was run to take the server out of the "Stopped clustered mailbox servers" list while the Cluster service was unresponsive?

  • @Sam In this case it does not matter. Neither start nor stop was used in this example. What I was trying to highlight here - it is commonly assumed that if the replication service on a machine is accessible then it can cast a vote in this process. That is not true. The replication service has to be started - but the node itself must also be a member of a cluster for its vote to count. TIMMCMIC
