In early August 2011, the Windows SE team released the following Knowledge Base (KB) article and accompanying software hotfix regarding an issue in Windows Server 2008 R2 failover clusters:
KB2550886 - A transient communication failure causes a Windows Server 2008 R2 failover cluster to stop working
The Exchange team has posted an article on its team blog that strongly recommends this hotfix for all database availability groups (DAGs) that are stretched across multiple datacenters. For DAGs that are not stretched across multiple datacenters, the hotfix is also good to have.
The article describes a race condition and cluster database deadlock that can occur when a Windows failover cluster encounters a transient communication failure. The race condition lies in the reconnection logic of the cluster nodes and manifests itself when the cluster has communication failures. When it occurs, the cluster database hangs, resulting in quorum loss in the failover cluster.
In addition to fixing the issue described above, KB2550886 includes other important Windows Server 2008 R2 hotfixes that are also recommended for DAGs:
http://support.microsoft.com/kb/2549472 - Cluster node cannot rejoin the cluster after the node is restarted or removed from the cluster in Windows Server 2008 R2
http://support.microsoft.com/kb/2549448 - Cluster service still uses the default time-out value after you configure the regroup time-out setting in Windows Server 2008 R2
http://support.microsoft.com/kb/2552040 - A Windows Server 2008 R2 failover cluster loses quorum when an asymmetric communication fail