There are many potential causes for Cluster communication failures, including:

  • Network latency
  • Network outages
  • Faulty drivers or network cards, including TCP offload issues
  • Misconfigured firewall rules
  • Security software such as anti-virus, intrusion detection, etc.

I was recently working with a customer on an issue where their Windows Server 2008 R2 eight-node Failover Cluster would randomly experience Cluster communication failures and the entire Cluster would go down. On the nodes, we would see nodes being removed from active Cluster membership and the Cluster service terminating.

In the Cluster.log file, we clearly saw a networking-related issue:

00003728.000017c8::2011/12/19-12:39:48.993 WARN  [NETFT] Failed to send keep-alive ioctl to NetFT: 0xd0000001
00003728.00001654::2011/12/19-12:39:49.507 WARN  [NETFT] Failed to send keep-alive ioctl to NetFT: 0xd0000001
00003728.00001654::2011/12/19-12:39:49.975 INFO  [Reconnector-] Reconnector from epoch 3 to epoch 4 waited 28.000 so far.
00003728.000017c8::2011/12/19-12:39:50.022 WARN  [NETFT] Failed to send keep-alive ioctl to NetFT: 0xd0000001
00003728.000017c8::2011/12/19-12:39:50.537 WARN  [NETFT] Failed to send keep-alive ioctl to NetFT: 0xd0000001
00003728.000047f8::2011/12/19-12:39:51.005 INFO  [Reconnector-] Connection attempt timed out.
00003728.000047f8::2011/12/19-12:39:51.052 WARN  [NETFT] Failed to send keep-alive ioctl to NetFT: 0xd0000001
00003728.00004150::2011/12/19-12:39:51.567 WARN  [NETFT] Failed to send keep-alive ioctl to NetFT: 0xd0000001
00003728.00004150::2011/12/19-12:39:51.988 INFO  [Reconnector-] Reconnector from epoch 3 to epoch 4 waited 30.000 so far.
00003728.000017c8::2011/12/19-12:39:52.081 WARN  [NETFT] Failed to send keep-alive ioctl to NetFT: 0xd0000001
00003728.000034f8::2011/12/19-12:39:52.175 INFO  [ACCEPT] 0.0.0.0:~3343~: Accepted inbound connection from remote endpoint :~48450~.
00003728.00003948::2011/12/19-12:39:52.175 INFO  [SV] Securing route from (:~3343~) to remote  (:~48450~).
00003728.00003948::2011/12/19-12:39:52.175 INFO  [SV] Got a new incoming stream from 48450~
00003728.00002a08::2011/12/19-12:39:52.206 ERR   node was pruned out by the membership manager (status = 5892), executing OnStop
00003728.00002a08::2011/12/19-12:39:52.206 INFO  [DM]: Shutting down, so unloading the cluster database.
00003728.00002a08::2011/12/19-12:39:52.206 WARN  [DM] Hive::DatabaseUnloadOnShutdown: Unable to grab the lock (it will not unload the hive)
00003728.00002a08::2011/12/19-12:39:52.206 ERR   FatalError is Calling Exit Process.
00003174.00002d4c::2011/12/19-12:39:52.409 WARN  [RHS] Cluster service has terminated.
00003b94.00000c18::2011/12/19-12:39:52.409 WARN  [RHS] Cluster service has terminated.
00003174.00002d4c::2011/12/19-12:39:52.409 INFO  [RHS] Exiting.
00003b94.00000c18::2011/12/19-12:39:52.409 INFO  [RHS] Exiting.

Looking at the System Event log, we found no evidence of the public or private networks failing.
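
As a quick aside for anyone who wants to pull the same Cluster.log from their own nodes: on Windows Server 2008 R2 the log is generated on demand. Here is a minimal sketch using the Failover Clustering PowerShell module (the destination folder is just an example):

# Generate Cluster.log on every node and copy the logs to one folder (run on any node)
Import-Module FailoverClusters
Get-ClusterLog -Destination C:\ClusterLogs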

We applied the following two hotfixes:

2552040 - A Windows Server 2008 R2 failover cluster loses quorum when an asymmetric communication failure occurs
http://support.microsoft.com/default.aspx?scid=kb;EN-US;2552040

2550886 - A transient communication failure causes a Windows Server 2008 R2 failover cluster to stop working
http://support.microsoft.com/default.aspx?scid=kb;EN-US;2550886

Yet still the Cluster communication failed.

We isolated the Cluster communication by removing the Exchange replication traffic from the private network.
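
In case you are wondering what that looks like on the Exchange side: on an Exchange 2010 DAG, replication can be disabled on the DAG network that maps to the private subnet from the Exchange Management Shell. This is only a hedged sketch; the DAG and network names below are placeholders, not the customer's actual configuration.

# Example only: stop DAG replication from using the private network (names are placeholders)
Set-DatabaseAvailabilityGroupNetwork -Identity "DAG1\PrivateNetwork" -ReplicationEnabled:$false
# Confirm which DAG networks still carry replication traffic
Get-DatabaseAvailabilityGroupNetwork | Format-Table Name,ReplicationEnabled,Subnets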

At that point, NIC teaming of the private network was no longer necessary since the private network was only hosting Cluster communications. We therefore broke the team and removed the NIC teaming software. We also ensured that the network drivers and firmware were at the latest and greatest.

Yet still the Cluster communication failed.

Since these were 10 GbE network cards, we disabled TCP offload within the operating system and on the network cards, per the following article:

951037 - Information about the TCP Chimney Offload, Receive Side Scaling, and Network Direct Memory Access features in Windows Server 2008
http://support.microsoft.com/default.aspx?scid=kb;EN-US;951037
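
For reference, the operating system side of that change can be made with netsh from an elevated prompt, along the lines of KB 951037. This is only a sketch; the network card vendor's offload settings in Device Manager also need to be disabled separately, and a reboot may be required for some settings to fully take effect.

# Disable TCP Chimney Offload, Receive Side Scaling, and NetDMA at the operating system level
netsh int tcp set global chimney=disabled
netsh int tcp set global rss=disabled
netsh int tcp set global netdma=disabled
# Verify the resulting global TCP settings
netsh int tcp show global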

Yet still the Cluster communication failed.

Despite seeing no signs of latency, we increased SameSubnetDelay to 2000 milliseconds and SameSubnetThreshold to 10, just in case there were momentary blips of latency that we were not catching in our traces and network analysis. For more information on these settings, please see the following blog: http://blogs.technet.com/b/askcore/archive/2010/02/12/windows-server-2008-failover-clusters-networking-part-1.aspx
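
If you need to make the same change, both values are common cluster properties and can be adjusted from any node. A minimal sketch with the Failover Clustering PowerShell module:

# Raise the heartbeat interval and the number of missed heartbeats tolerated
Import-Module FailoverClusters
(Get-Cluster).SameSubnetDelay = 2000       # milliseconds between heartbeats (default is 1000)
(Get-Cluster).SameSubnetThreshold = 10     # missed heartbeats before a node is considered down (default is 5)
Get-Cluster | Format-List SameSubnetDelay,SameSubnetThreshold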

Yet still the Cluster communication failed.

At this point, things were pretty hot. Their Exchange migration was not going well, and we were pretty much at the end of our rope.

Why was the Cluster communication still failing?

Why was there no sign of the private and public networks failing at the time of our Cluster communication failures?

I started looking outside of the typical causes of Cluster communication failures and ran across this:

"On a computer that is running Windows 7 or Windows Server 2008 R2, the network location profile that is selected changes unexpectedly from Domain to Public. Additionally, the firewall settings (these are determined by the network location profile) change to the settings that correspond to the Public network location profile. Therefore, some outgoing connections may be blocked, and some applications may be disconnected."

A light bulb went off. Angels started singing. I started jumping up and down doing my "happy dance". It all made so much sense!

I immediately requested my customer's Microsoft-Windows-NetworkProfile Operational event log from each of the nodes to check and see whether the network profile was flipping from Domain to Identifying to Public and back. For those of you not familiar with this event log, it resides in the following location in Event Viewer: Applications and Services Logs\Microsoft\Windows\NetworkProfile\Operational
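
If you want to check your own nodes, the same log can be pulled remotely with Get-WinEvent. A minimal sketch, assuming remote event log collection is allowed through the firewall and using example node names:

# Pull recent NetworkProfile events (connect/disconnect and identifying/identified transitions) from each node
$nodes = 'CONTOSONODE1','CONTOSONODE2'    # example node names
foreach ($node in $nodes) {
    Get-WinEvent -ComputerName $node -MaxEvents 200 -FilterHashtable @{
        LogName = 'Microsoft-Windows-NetworkProfile/Operational'
        Id      = 4001,4003,10000,10001
    } | Select-Object MachineName,TimeCreated,Id,Message
}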

It was definitely happening. The events were all over the place, very random, and some of the nodes were already in this faulty condition.

Looking at the other nodes, we saw the same thing happening over and over again, and it lined up with our previous Cluster outages.

12/12/2011    10:13:18 AM    Microsoft-Windows-FailoverClustering    1135    Cluster node 'CONTOSONODE1' was removed from the active failover cluster membership.

In the NetworkProfile log, just seconds before that, the network (which had been on the Domain Profile, as expected) disconnected:

12/12/2011    10:13:14 AM    Microsoft-Windows-NetworkProfile    10001    "Network Disconnected
    Name: contoso.com
    Desc: contoso.com
    Type: Managed
    State: Disconnected
    Category: Domain Authenticated"

But it then reconnected and changed to the Public Profile:

12/12/2011    10:13:24 AM    Microsoft-Windows-NetworkProfile    4001    Entered State: Identifying Network Interface Guid: {491C2D84-B062-41B2-805A-0905DC53976C}

12/12/2011    10:13:25 AM    Microsoft-Windows-NetworkProfile    10000    "Network Connected
    Name: Identifying...
    Desc: Identifying...
    Type: Unmanaged
    State: Connected,IPV4 (Local)
    Category: Public"

12/12/2011    10:13:26 AM    Microsoft-Windows-NetworkProfile    4003    Transitioning to State: Unidentified Network Interface Guid: {C83435F5-B9D8-464A-85F5-9054C3B92044}

12/12/2011    10:13:27 AM    Microsoft-Windows-NetworkProfile    10000    "Network Connected
    Name: Unidentified network
    Desc: Unidentified network
    Type: Unmanaged
    State: Connected,IPV4 (Local)
    Category: Public"

On this node, the network did change back to the Domain Profile and the Cluster Service started again. But some of the nodes would stay stuck on the Public Profile, and a reboot was required to bring the node back online.

12/12/2011    10:19:46 AM    Microsoft-Windows-NetworkProfile    10000    "Network Connected
    Name: contoso.com
    Desc: contoso.com
    Type: Managed
    State: Connected,IPV4 (Internet)
    Category: Domain Authenticated"

12/12/2011    10:21:22 AM    Service Control Manager    7036    The Cluster Service service entered the running state.

We had two options:

1) Open up port 3343 for Cluster communications on all networks (see the example firewall rules after this list).

2) Apply the following hotfix to all nodes and reboot:

2524478 - The network location profile changes from "Domain" to "Public" in Windows 7 or in Windows Server 2008 R2
http://support.microsoft.com/default.aspx?scid=kb;EN-US;2524478
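
For completeness, option 1 would look something like the following netsh rules, run from an elevated prompt on every node (the rule names are just examples). Cluster heartbeat traffic uses UDP port 3343, and intra-cluster communication uses TCP port 3343:

# Allow inbound Cluster traffic on port 3343 regardless of the active firewall profile
netsh advfirewall firewall add rule name="Failover Cluster (UDP 3343)" dir=in action=allow protocol=UDP localport=3343 profile=any
netsh advfirewall firewall add rule name="Failover Cluster (TCP 3343)" dir=in action=allow protocol=TCP localport=3343 profile=any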

My customer went with option 2. Their NetworkProfile Operational event logs have been clean ever since and their Cluster communications have not failed again.

Now you may be wondering how I found the hotfixes mentioned in this blog post, and that is a very good question. Some of it was just through some good ole Bing searches.
Additionally, we do this great thing for Failover Clusters and some of our other products as well: we create and update Knowledge Base articles with a list of recommended hotfixes for customers to proactively apply. I highly recommend checking these out and applying any hotfixes that fit (after some initial testing in a test environment, of course).

For Windows Server 2008 R2 (no service pack):

980054 - Recommended hotfixes and updates for Windows Server 2008 R2-based server clusters
http://support.microsoft.com/default.aspx?scid=kb;EN-US;980054

For Windows Server 2008 R2 SP1:

2545685 - Recommended hotfixes and updates for Windows Server 2008 R2 SP1 Failover Clusters
http://support.microsoft.com/default.aspx?scid=kb;EN-US;2545685
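
A quick way to see whether any of the hotfixes mentioned in this post are already installed on your nodes is Get-HotFix. A minimal sketch, using example node names:

# Check each node for the hotfixes discussed in this post
$nodes = 'CONTOSONODE1','CONTOSONODE2'            # example node names
$kbs   = 'KB2552040','KB2550886','KB2524478'
foreach ($node in $nodes) {
    Get-HotFix -Id $kbs -ComputerName $node -ErrorAction SilentlyContinue |
        Select-Object CSName,HotFixID,InstalledOn
}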

Happy Clustering!!

~ Charity Shelbourne