While working with a customer last week to resolve networking issues with their DAG, I asked whether they had installed the recommended update for it. They were not aware of the Exchange product group’s recommendation, so I thought I’d bubble this back up again:
The post is on the Exchange team blog site and mentions the following update:
KB2550886 - A transient communication failure causes a Windows Server 2008 R2 failover cluster to stop working
This hotfix is strongly recommended for all database availability groups that are stretched across multiple datacentres. For DAGs that are not stretched across multiple datacentres, this hotfix is good to have as well. The article describes a race condition and cluster database deadlock issue that can occur when a Windows failover cluster encounters a transient communication failure. There is a race condition within the reconnection logic of cluster nodes that manifests itself when the cluster has communication failures. When this occurs, it causes the cluster database to hang, resulting in quorum loss in the failover cluster.
As described on TechNet, a database availability group (DAG) relies on specific cluster functionality, including the cluster database. In order for a DAG to be able to operate and provide high availability, the cluster and the cluster database must also be operating properly.
Microsoft has encountered scenarios in which a transient network failure occurs (a failure of network communications for about 60 seconds) and, as a result, the entire cluster is deadlocked and all databases within the DAG are dismounted. Because it is not easy to determine which cluster node is actually deadlocked, if a failover cluster deadlocks as a result of the reconnect logic race, the only available course of action is to restart all members of the cluster to resolve the deadlock condition.
The problem typically manifests itself in the form of cluster quorum loss due to an asymmetric communication failure (when two nodes cannot communicate with each other but can still communicate with other nodes). If there are delays among other nodes in the receiving of cluster regroup messages from the cluster’s Global Update Manager (GUM), regroup messages can end up being received in unexpected order. When that happens, the cluster loses quorum instead of invoking the expected behaviour, which is to remove one of the nodes that experienced the initial communication failure from the cluster.
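To make the term concrete, here is a minimal sketch of what an asymmetric communication failure looks like and what the expected outcome is. It is not how the cluster service is implemented. The node names (MBX1 to MBX4), the helper functions, and the simple node-majority quorum rule (no witness) are all assumptions made for illustration.

```python
from itertools import combinations


def full_mesh(nodes):
    """Every node can talk to every other node (hypothetical healthy state)."""
    return {frozenset(pair) for pair in combinations(nodes, 2)}


def asymmetric_failures(nodes, links):
    """Return pairs that cannot talk to each other but can each still
    reach at least one other node in the cluster."""
    found = []
    for a, b in combinations(sorted(nodes), 2):
        if frozenset((a, b)) in links:
            continue  # this pair can still communicate
        others = nodes - {a, b}
        a_reaches_others = any(frozenset((a, o)) in links for o in others)
        b_reaches_others = any(frozenset((b, o)) in links for o in others)
        if a_reaches_others and b_reaches_others:
            found.append((a, b))
    return found


def has_quorum(survivors, total_nodes):
    """Assumed node-majority rule: more than half of all nodes remain."""
    return len(survivors) > total_nodes // 2


# Hypothetical four-node DAG in which only the MBX1 <-> MBX3 link is down.
nodes = {"MBX1", "MBX2", "MBX3", "MBX4"}
links = full_mesh(nodes) - {frozenset(("MBX1", "MBX3"))}

failures = asymmetric_failures(nodes, links)

# Expected behaviour: one node of the affected pair is removed,
# and the remaining nodes keep quorum.
survivors = nodes - {failures[0][1]}
print(failures, has_quorum(survivors, len(nodes)))
```

In this sketch, removing one node of the affected pair leaves three of four nodes, which is still a majority. The bug described above is that the cluster loses quorum entirely instead of reaching this state.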
Generally, this bug manifests when there is asymmetric latency (for example, where half of the DAG members have latency of 1 ms, while the other half of the DAG members have 30 ms latency) for two cluster nodes that discover a broken connection between the pair. If the first node detects a connection loss well before the second node, a race condition can occur:
If this issue does occur, the consequences for a DAG are severe: the cluster loses quorum, every database in the DAG is dismounted, and all members may need to be restarted. As a result, we recommend that you deploy this hotfix to all of your Mailbox servers that are members of a DAG, especially if the DAG is stretched across datacentres. This hotfix can also benefit environments running Exchange 2007 Single Copy Clusters and Cluster Continuous Replication environments.
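If you keep an inventory of installed updates per server, checking that every DAG member has the hotfix is a one-liner. The sketch below assumes you already have such an inventory (for example, gathered from the output of the Get-HotFix cmdlet on each server). The server names, the placeholder KB0000001 entry, and the function name are hypothetical.

```python
REQUIRED_HOTFIX = "KB2550886"


def members_missing_hotfix(inventory, hotfix=REQUIRED_HOTFIX):
    """Return the servers whose installed-update list lacks the hotfix.

    inventory maps a server name to an iterable of installed update IDs.
    """
    wanted = hotfix.upper()
    return sorted(
        server
        for server, installed in inventory.items()
        if wanted not in {kb.upper() for kb in installed}
    )


# Hypothetical inventory for a three-member DAG.
dag_inventory = {
    "MBX1": ["KB2550886", "KB0000001"],
    "MBX2": ["KB0000001"],
    "MBX3": ["kb2550886"],
}

missing = members_missing_hotfix(dag_inventory)
print(missing)
```

Any server returned here is a DAG member that still needs the hotfix deployed.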
In addition to fixing the issue described above, KB2550886 includes other important Windows Server 2008 R2 hotfixes that are also recommended for DAGs:
If you would like to have Microsoft Premier Field Engineering (PFE) visit your company and assist with the topic(s) presented in this blog post, then please contact your Microsoft Premier Technical Account Manager (TAM) for more information on scheduling and our varied offerings!
If you are not currently benefiting from Microsoft Premier support and you’d like more information about Premier, please email the appropriate contact below, and tell them how you got introduced!
For all other areas please use the US contact point.