Recently, on two separate occasions, I had to help resolve an issue where a member of an Exchange 2010 database availability group (DAG) failed to participate in the DAG's cluster communications and was therefore unable to bring any of its databases online. In both instances, this occurred after the server was rebooted. While each issue had a slightly different resolution, I am fairly confident that they are related. Since it took a while to isolate and resolve these issues, I thought I would share the experience.
Before I begin, in neither scenario did we lose quorum of the DAG. Also, the symptoms of both scenarios were nearly identical.
RESOLUTION #1
This one occurred within our lab running on Hyper-V. Based on Hyper-V's network summary output, I could see that the servers really were not communicating properly. Yes, they could ping each other and they could authenticate with the domain, but cluster communication was failing. The resolution was to configure the network settings consistently on all DAG members and to reset the Hyper-V network properties. This meant:
RESOLUTION #2
This one occurred in production. Ultimately we decided to change the IP address of the 'broken' DAG member and reboot the server again. This allowed the server to properly register its network connections with the cluster DB (ClusDB), after which all other nodes were able to talk to it properly. The DAG member then rejoined the DAG, and all databases were able to mount and/or replicate their copies successfully.
We also found that not all of the production DAG members had identical network settings (two DAG members did not have a REPL network configured). Per http://technet.microsoft.com/en-us/library/dd638104.aspx#NR, "each DAG member must have the same number of networks". We fixed the networks and updated the servers to include the recommended hotfixes: http://blogs.technet.com/b/dblanch/archive/2012/02/27/a-few-hotfixes-to-consider.aspx
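To make the "same number of networks" check concrete, here is a minimal Python sketch of how such a comparison could be done once you have an inventory of each member's DAG networks. It is only an illustration: the server names (EX01 to EX04), the network names, and the find_network_mismatches helper are all hypothetical, and the inventory is assumed to have been collected separately (for example by hand from the output of Get-DatabaseAvailabilityGroupNetwork). It treats the most common network layout as the expected one and reports the members that deviate from it.

```python
from collections import Counter


def find_network_mismatches(members):
    """Return the servers whose network set differs from the most common layout.

    `members` maps a server name to the set of network names configured on it.
    Both the names and the structure are illustrative assumptions, not
    something Exchange produces in this form.
    """
    # Count how many servers share each distinct set of networks.
    layouts = Counter(frozenset(nets) for nets in members.values())
    # Assume the layout used by the majority of servers is the intended one.
    expected = layouts.most_common(1)[0][0]

    report = {}
    for server, nets in sorted(members.items()):
        nets = frozenset(nets)
        if nets != expected:
            report[server] = {
                "missing": sorted(expected - nets),
                "extra": sorted(nets - expected),
            }
    return report


# Hypothetical inventory: EX03 has no REPL network configured.
inventory = {
    "EX01": {"MAPI", "REPL"},
    "EX02": {"MAPI", "REPL"},
    "EX03": {"MAPI"},
    "EX04": {"MAPI", "REPL"},
}

for server, diff in find_network_mismatches(inventory).items():
    print(server, "missing:", diff["missing"], "extra:", diff["extra"])
```

Using the majority layout as the baseline is just one possible choice; if you already know what the intended layout is, comparing every member against that explicit list would be more reliable.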
Questions/Answers

Why did changing the IP address of the DAG member work?
We are not exactly sure, but we believe that either a stale TCP route or something in the ClusDB was preventing any server with that IP address from joining the cluster.

Did you reboot all of the DAG member servers before or after changing the IP address?
No, we did not want to risk losing another server within the DAG (we had already lost 2 of the 12 members). We did, however, reboot all of the servers in the lab scenario.

Did you ever lose quorum of the DAG?
Nope.

Do you think that you could have prevented this?
Maybe. If we had applied all of the recommended hotfixes and confirmed that the network settings were identical on all DAG members, the servers might not have run into this issue. There may be other things causing this, but it is always recommended to resolve the known issues first.
Good Luck. Doug