Failover Cluster Communication Failures

There are many potential causes for Cluster communication failures including:

  • Network latency
  • Network outages
  • Faulty drivers or network cards, including TCP offload issues
  • Misconfigured firewall rules
  • Security software such as anti-virus, intrusion detection, etc.

I recently worked with a customer whose eight-node Windows Server 2008 R2 Failover Cluster would randomly experience Cluster communication failures that took the entire Cluster down. On the nodes, we would see events such as these:

In the Cluster.log file, we clearly see a networking related issue:

00003728.000017c8::2011/12/19-12:39:48.993 WARN  [NETFT] Failed to send keep-alive ioctl to NetFT: 0xd0000001

00003728.00001654::2011/12/19-12:39:49.507 WARN  [NETFT] Failed to send keep-alive ioctl to NetFT: 0xd0000001

00003728.00001654::2011/12/19-12:39:49.975 INFO  [Reconnector-] Reconnector from epoch 3 to epoch 4 waited 28.000 so far.

00003728.000017c8::2011/12/19-12:39:50.022 WARN  [NETFT] Failed to send keep-alive ioctl to NetFT: 0xd0000001

00003728.000017c8::2011/12/19-12:39:50.537 WARN  [NETFT] Failed to send keep-alive ioctl to NetFT: 0xd0000001

00003728.000047f8::2011/12/19-12:39:51.005 INFO  [Reconnector-] Connection attempt timed out.

00003728.000047f8::2011/12/19-12:39:51.052 WARN  [NETFT] Failed to send keep-alive ioctl to NetFT: 0xd0000001

00003728.00004150::2011/12/19-12:39:51.567 WARN  [NETFT] Failed to send keep-alive ioctl to NetFT: 0xd0000001

00003728.00004150::2011/12/19-12:39:51.988 INFO  [Reconnector-] Reconnector from epoch 3 to epoch 4 waited 30.000 so far.

00003728.000017c8::2011/12/19-12:39:52.081 WARN  [NETFT] Failed to send keep-alive ioctl to NetFT: 0xd0000001

00003728.000034f8::2011/12/19-12:39:52.175 INFO  [ACCEPT] 0.0.0.0:~3343~: Accepted inbound connection from remote endpoint :~48450~.

00003728.00003948::2011/12/19-12:39:52.175 INFO  [SV] Securing route from (:~3343~) to remote  (:~48450~).

00003728.00003948::2011/12/19-12:39:52.175 INFO  [SV] Got a new incoming stream from 48450~

00003728.00002a08::2011/12/19-12:39:52.206 ERR   node was pruned out by the membership manager (status = 5892), executing OnStop

00003728.00002a08::2011/12/19-12:39:52.206 INFO  [DM]: Shutting down, so unloading the cluster database.

00003728.00002a08::2011/12/19-12:39:52.206 WARN  [DM] Hive::DatabaseUnloadOnShutdown: Unable to grab the lock (it will not unload the hive)

00003728.00002a08::2011/12/19-12:39:52.206 ERR   FatalError is Calling Exit Process.

00003174.00002d4c::2011/12/19-12:39:52.409 WARN  [RHS] Cluster service has terminated.

00003b94.00000c18::2011/12/19-12:39:52.409 WARN  [RHS] Cluster service has terminated.

00003174.00002d4c::2011/12/19-12:39:52.409 INFO  [RHS] Exiting.

00003b94.00000c18::2011/12/19-12:39:52.409 INFO  [RHS] Exiting.

00003054.00004764::2011/12/19-12:40:53.447 INFO 
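
When a Cluster.log runs to thousands of lines, it can help to tally the warnings and errors per component before reading it line by line. The following is only a rough sketch in Python, not a Microsoft tool: the regular expression is inferred from the excerpt above (process.thread IDs, timestamp, level, optional bracketed component), and the names LINE_RE, parse_line, summarize, and SAMPLE are made up for this illustration. Real logs may contain line shapes this pattern does not cover.

```python
import re
from collections import Counter

# Line layout inferred from the excerpt above (an assumption, not a spec):
#   <pid>.<tid>::<yyyy/mm/dd-hh:mm:ss.mmm> <LEVEL>  [<component>] <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>INFO|WARN|ERR)\s*"
    r"(?:\[(?P<component>[^\]]*)\])?:?\s*(?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines):
    """Count WARN and ERR lines per (level, component); '-' means no component tag."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["level"] in ("WARN", "ERR"):
            counts[(record["level"], record["component"] or "-")] += 1
    return counts


# A few lines copied from the excerpt above.
SAMPLE = [
    "00003728.000017c8::2011/12/19-12:39:48.993 WARN  [NETFT] Failed to send keep-alive ioctl to NetFT: 0xd0000001",
    "00003728.00001654::2011/12/19-12:39:49.507 WARN  [NETFT] Failed to send keep-alive ioctl to NetFT: 0xd0000001",
    "00003728.000047f8::2011/12/19-12:39:51.005 INFO  [Reconnector-] Connection attempt timed out.",
    "00003728.00002a08::2011/12/19-12:39:52.206 ERR   node was pruned out by the membership manager (status = 5892), executing OnStop",
    "00003728.00002a08::2011/12/19-12:39:52.206 WARN  [DM] Hive::DatabaseUnloadOnShutdown: Unable to grab the lock (it will not unload the hive)",
    "00003728.00002a08::2011/12/19-12:39:52.206 ERR   FatalError is Calling Exit Process.",
]

for key, count in summarize(SAMPLE).most_common():
    print(key, count)
```

In a log like this one, a tally dominated by NETFT warnings points at the networking layer rather than at storage or resources.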

 

Looking at the System Event log, there was no evidence of the public or private networks failing.

We applied the following two hotfixes:

2552040                A Windows Server 2008 R2 failover cluster loses quorum when an asymmetric communication failure occurs

http://support.microsoft.com/default.aspx?scid=kb;EN-US;2552040

2550886                A transient communication failure causes a Windows Server 2008 R2 failover cluster to stop working

http://support.microsoft.com/default.aspx?scid=kb;EN-US;2550886

Yet still the Cluster communication failed.

We isolated the Cluster communication by removing the Exchange replication traffic from the private network.

At that point, NIC teaming of the private network was no longer necessary since the private network was only hosting Cluster communications. We therefore broke the team and removed the NIC teaming software. We also ensured that the network drivers and firmware were at the latest and greatest.

Yet still the Cluster communication failed.

Since these were 10 Gb network cards, we disabled TCP offload both within the operating system and on the network cards, per the following article:

951037  Information about the TCP Chimney Offload, Receive Side Scaling, and Network Direct Memory Access features in Windows Server 2008

http://support.microsoft.com/default.aspx?scid=kb;EN-US;951037

Yet still the Cluster communication failed.

Although we saw no signs of latency, we increased SameSubnetDelay to 2000 milliseconds and SameSubnetThreshold to 10, just in case there were momentary blips of latency that we were not catching in our traces and network analysis. Please see the following blog for more information on this: http://blogs.technet.com/b/askcore/archive/2010/02/12/windows-server-2008-failover-clusters-networking-part-1.aspx
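
To put those numbers in perspective, here is a small arithmetic sketch in Python. It assumes the simple model that a node is treated as unreachable after SameSubnetThreshold consecutive heartbeats, sent every SameSubnetDelay milliseconds, are missed, and it assumes the Windows Server 2008 R2 defaults of 1000 milliseconds and 5 as I understand them from the blog linked above. The function name is made up for this illustration.

```python
def heartbeat_tolerance_seconds(delay_ms, threshold):
    """Approximate time without heartbeats before a node is considered down.

    Simplified model: tolerance = heartbeat interval * allowed missed heartbeats.
    """
    return delay_ms * threshold / 1000.0


# Assumed Windows Server 2008 R2 defaults: 1000 ms delay, threshold of 5.
default_tolerance = heartbeat_tolerance_seconds(1000, 5)

# The values we set on the customer's Cluster.
tuned_tolerance = heartbeat_tolerance_seconds(2000, 10)

print("default:", default_tolerance, "seconds")
print("tuned:  ", tuned_tolerance, "seconds")
```

Under that model the change roughly quadruples how long the Cluster waits before removing a node, which is why a failure that persisted afterward pointed away from brief latency spikes.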

Yet still the Cluster communication failed.

At this point, things were pretty hot. Their Exchange migration was not going well. We were pretty much at the end of our rope.

Why was the Cluster communication still failing?

Why was there no sign of the private and public networks failing at the time of our Cluster communication failures?

I started looking outside of the typical Cluster communication failures and ran across this:

"On a computer that is running Windows 7 or Windows Server 2008 R2, the network location profile that is selected changes unexpectedly from Domain to Public. Additionally, the firewall settings (these are determined by the network location profile) change to the settings that correspond to the Public network location profile. Therefore, some outgoing connections may be blocked, and some applications may be disconnected."

A light bulb went off. Angels started singing. I started jumping up and down doing my "happy dance". It all made so much sense!

I immediately requested my customer's Microsoft-Windows-NetworkProfile Operational event log from each of the nodes to check whether the network profile was flipping from Domain to Public and back again, with "Identifying" states in between. For those of you not familiar with this event log, it resides in the following location in Event Viewer: Applications and Services\Microsoft\Windows\NetworkProfile\Operational

It was definitely happening. The events were all over the place, very random, and there were some nodes already in this faulty condition.


Looking at the other nodes, we saw the same thing happening over and over again and it lined up with our previous Cluster outages.

12/12/2011    10:13:18 AM    Microsoft-Windows-FailoverClustering    1135    Cluster node 'CONTOSONODE1' was removed from the active failover cluster membership.

In the NetworkProfile log, the node was on the Domain profile, as it should be, when the network disconnected.

 12/12/2011    10:13:14 AM    Microsoft-Windows-NetworkProfile    10001    "Network Disconnected

                Name: contoso.com

                Desc: contoso.com

                Type: Managed

                State: Disconnected

                Category: Domain Authenticated

But then it changed to the Public profile.

 12/12/2011    10:13:24 AM    Microsoft-Windows-NetworkProfile     4001    Entered State: Identifying Network Interface Guid: {491C2D84-B062-41B2-805A-0905DC53976C}

12/12/2011    10:13:25 AM    Microsoft-Windows-NetworkProfile    10000    "Network Connected

                Name: Identifying...

                Desc: Identifying...

                Type: Unmanaged

                State: Connected,IPV4 (Local)

                Category: Public

12/12/2011    10:13:26 AM    Microsoft-Windows-NetworkProfile     4003    Transitioning to State: Unidentified Network Interface Guid: {C83435F5-B9D8-464A-85F5-9054C3B92044}

12/12/2011    10:13:27 AM    Microsoft-Windows-NetworkProfile    10000    "Network Connected

                Name: Unidentified network

                Desc: Unidentified network

                Type: Unmanaged

                State: Connected,IPV4 (Local)

                Category: Public

On this node, it did change back to the Domain profile and the Cluster Service started again. But some of the nodes would stay stuck on the wrong profile, and a reboot would be required to bring the node back online.

12/12/2011    10:19:46 AM    Microsoft-Windows-NetworkProfile    10000    "Network Connected

                Name: contoso.com

                Desc: contoso.com

                Type: Managed

                State: Connected,IPV4 (Internet)

                Category: Domain Authenticated

12/12/2011    10:21:22 AM    Service Control Manager    7036    The Cluster Service service entered the running state.
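
Lining up the two event logs by hand gets tedious across eight nodes. As a rough sketch only, the Python below shows one way to find NetworkProfile events that fall close to a FailoverClustering 1135 event. The EVENTS list is typed in from the entries above, the tuple layout and the 30-second window are arbitrary choices for this illustration, and the function names are made up.

```python
from datetime import datetime, timedelta


def parse_ts(date_str, time_str):
    """Parse the date and time columns as they appear in the exported events."""
    return datetime.strptime(f"{date_str} {time_str}", "%m/%d/%Y %I:%M:%S %p")


# (date, time, source, event ID, network category) taken from the entries above.
EVENTS = [
    ("12/12/2011", "10:13:14 AM", "NetworkProfile", 10001, "Domain Authenticated"),
    ("12/12/2011", "10:13:18 AM", "FailoverClustering", 1135, None),
    ("12/12/2011", "10:13:25 AM", "NetworkProfile", 10000, "Public"),
    ("12/12/2011", "10:13:27 AM", "NetworkProfile", 10000, "Public"),
    ("12/12/2011", "10:19:46 AM", "NetworkProfile", 10000, "Domain Authenticated"),
]


def profile_events_near(events, cluster_event_id=1135, window=timedelta(seconds=30)):
    """Return (offset_seconds, event_id, category) for each NetworkProfile event
    that falls within `window` of a node-removal event."""
    parsed = [(parse_ts(d, t), source, event_id, category)
              for d, t, source, event_id, category in events]
    removals = [ts for ts, _, event_id, _ in parsed if event_id == cluster_event_id]
    matches = []
    for removal_ts in removals:
        for ts, source, event_id, category in parsed:
            if source == "NetworkProfile" and abs(ts - removal_ts) <= window:
                offset = int((ts - removal_ts).total_seconds())
                matches.append((offset, event_id, category))
    return matches


for offset, event_id, category in profile_events_near(EVENTS):
    print(f"{offset:+d}s  event {event_id}  category {category}")
```

A negative offset means the profile event came before the node was removed from membership, which is the pattern we saw on these nodes.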

We had two options:

1) Open up port 3343 for Cluster communications on all networks.

2) Apply the following hotfix to all nodes and reboot:

2524478                The network location profile changes from "Domain" to "Public" in Windows 7 or in Windows Server 2008 R2

http://support.microsoft.com/default.aspx?scid=kb;EN-US;2524478

My customer went with option 2. Their NetworkProfile Operational event logs have been clean ever since and their Cluster communications have not failed again.

Now you may be wondering how I found the hotfixes mentioned in this blog post and that is a very good question. Some of it was just through some good ole Bing searches. Check out the first two results:


Additionally, Microsoft does a great thing for Failover Clusters and some of our other products as well: we create and update Knowledge Base articles with a list of recommended hotfixes for customers to proactively apply. I highly recommend checking these out and applying any hotfixes that fit (after some initial testing in a test environment, of course).

For Windows Server 2008 R2 (no service pack):

980054  Recommended hotfixes and updates for Windows Server 2008 R2-based server clusters

http://support.microsoft.com/default.aspx?scid=kb;EN-US;980054

For Windows Server 2008 R2 SP1:

2545685                Recommended hotfixes and updates for Windows Server 2008 R2 SP1 Failover Clusters

http://support.microsoft.com/default.aspx?scid=kb;EN-US;2545685

Happy Clustering!!

~ Charity Shelbourne

 

 

  • Well done, Charity Shelbourne, it is simply awesome!

    I've done multiple Exchange deployments and spent considerable time on MS support incidents trying to identify the fix for this. Every time, we were only told to increase the latency timers.

    What you found is something so basic, yet it was not figured out even at the top tiers of MS support until your post here led us to find and fix ours :)

    Hats off! You have something of a troubleshooting genius in you, I must say.

  • I rock!! Just kidding. :-) Seriously, your comment really made my day. I'm glad to hear my blog post will make an impact and I look forward to sharing more useful posts.

  • Hi Charity,

    That's an interesting story, but confusing at the same time.

    When the Failover Clustering feature gets installed, it creates an exception in Windows Firewall (also known as a Firewall Rule). That exception permits intra-cluster communications on UDP port 3343. (And, in fact, the same for TCP, which is never used for whatever reason.) And, what's most noticeable, by default those exceptions apply to all network profiles.

    I can showcase this by the following set of commands executed on an almost-vanilla Windows Server 2008 R2 SP1 box.

    PS C:\Users\artemp\Documents> netsh advfirewall firewall show rule name="Failover Clusters (UDP-In)" verbose

    No rules match the specified criteria.

    ____________________________________________________________________________

    PS C:\Users\artemp\Documents> Add-WindowsFeature -Name "Failover-Clustering"

    Success Restart Needed Exit Code Feature Result                              

    ------- -------------- --------- --------------                              

    True    No             Success   {Failover Clustering}                        

    ____________________________________________________________________________

    PS C:\Users\artemp\Documents> netsh advfirewall firewall show rule name="Failover Clusters (UDP-In)" verbose

    Rule Name:                            Failover Clusters (UDP-In)

    ----------------------------------------------------------------------

    Description:                          Inbound rule for Failover Clusters to allow internal cluster communication by the cluster virtual network adapter. [UDP 3343]

    Enabled:                              Yes

    Direction:                            In

    Profiles:                             Domain,Private,Public

    Grouping:                             Failover Clusters

    LocalIP:                              Any

    RemoteIP:                             Any

    Protocol:                             UDP

    LocalPort:                            3343

    RemotePort:                           3343

    Edge traversal:                       No

    Program:                              System

    InterfaceTypes:                       Any

    Security:                             NotRequired

    Rule source:                          Local Setting

    Action:                               Allow

    Ok.

    So it looks like your customer had explicitly edited this firewall exception later to apply only to Domain profile. That sounds quite strange to me, since I hardly expect to find an Active Directory Domain Controller on a Cluster Private network. So it's perfectly normal for that network(s) to be treated as Public by NLA and Windows Firewall.

    In fact, I had a very similar (yet slightly different) case last year. Here's the detailed description: social.technet.microsoft.com/.../3829.aspx

    And one more shameless plug. You just don't need Bing to search for clustering hotfixes. And, frankly, KB2545685 is quite useless. For instance, it has no idea about KB2524478 as well as of a whole bunch of similar cases.

    So please use either of the following lists.

    social.technet.microsoft.com/.../list-of-cluster-hotfixes-for-windows-server-2008-r2.aspx

    social.technet.microsoft.com/.../3153.aspx

  • Well, Mr. Pronichkin,

    it helped in our case for an obvious reason.

    We had different policies for the Domain profile pushed through Group Policy so that maintenance on these servers stays standard and is not subject to manual changes per server. The Public profile blocks most communication, for obvious reasons, and that is also set through Group Policy.

    The above article and the knowledge that profiles can flip automatically came to our rescue big time.

    Should it not?

  • @Shahid,

    Glad to know it helped you. Just want to clarify my point.

    Since you don't want to put a DC into Cluster Private networks, they will be detected as Public anyway. So you will need a firewall exception for that profile, either the default one (as described in my previous comment; please note that those intra-cluster communications are permitted by default on all profiles!) or one created manually. And provided you have those exceptions for all profiles, it does not really matter how it flips.

    The only case where it matters is if you explicitly change that network profile from Public to Private and (!) disable or edit the default firewall exception. This is a relatively uncommon case, judging by my experience, since we generally recommend against messing with default exceptions. (Though it is a perfectly good idea to enforce them with Group Policies and disable locally defined firewall rules.)

  • Shahid and Pronichkin,

    Thank you so much for your comments and interest! I love discussions such as these. Shahid, I am glad to hear that this helped you in your environment.

    Pronichkin, you do have some very valid points. Yes, Microsoft does not recommend changing the default firewall exceptions, default Cluster properties, etc. In short, we do not recommend changing the defaults at all unless you have a valid business justification.

    However, you are also missing a few points.

    1) This is a bug. Bugs do not necessarily follow expected logic.

    2) Not all of our customers use the built-in Windows Firewall. In my customer's case, they are using a third-party firewall. You are assuming that they have correctly defined the rules in whatever third-party software and/or hardware they use. With this bug, it is also possible that the network location changes and you land back on the correct network location. However, depending on your environment, the number of nodes, and so on, just the process that we go through when changing the network location, and the blip in network communication while we are authenticating, can be enough to cause the Cluster communications to fail.

    3) Hotfixes are not created for one off scenarios. There has to be a big enough impact to warrant our Product Group taking the time to create, test, and release a hotfix.

    4) Typically, Microsoft only recommends applying hotfixes if you are experiencing the problem. KB2545685 (and the other corresponding articles) list the hotfixes Microsoft recommends that you proactively apply to your cluster nodes regardless of whether you are currently seeing the symptoms described. Hotfixes make this list when we see a large number of customers impacted by the problem that the hotfix addresses. This is not meant to be a comprehensive list of all the hotfixes that apply to a Cluster. We would never recommend that a customer proactively apply every hotfix we release.

    Thank you for the great discussion!

    ~ Charity Shelbourne

  • Charity Shelbourne, I deal with cluster a fair amount and I enjoyed the blog post.  

    I do have one question, since you mentioned it.  Why doesn't Microsoft advocate proactively applying hotfixes?  

    I find it difficult to understand why Microsoft wouldn't want people to apply patches for known bugs in a product that get fixed.  Shouldn't Microsoft be proactive in pushing these to customers so that we have fewer problems to deal with?  Also, aren't 99% of hotfixes deployed in service pack releases anyway?  

    I would like to know Microsoft's take on this.  Thanks for your time and the blog post.  

  • @John_Mares

    There actually is a method to the madness. The reason we tend not to push these hotfixes is that they do not go through the same type of testing as a Service Pack does. So yes, many of these hotfixes get rolled into a Service Pack, but that Service Pack undergoes rigorous testing before we release it. While individual hotfixes are tested, they do not go through this same test matrix. Clustering itself tends to be even more complex. Since many hotfixes tend to be storage related, a hotfix to storport.sys could cause an issue with a third-party HBA driver. If we do see significant cases on a specific issue, then yes, the hotfix is typically made available via WU/MU/WSUS. One example was the 2003 SNP hotfix to disable this feature. Even then, I still see many customers who have this feature enabled without knowing about it.

  • I have the same error on a Windows 2012 Cluster (used for SQL 2012 AlwaysOn). Any suggestions?

  • @ Dean

    Which error? SQL Clustering is pretty unique from a typical cluster in how things work and are done. However, if you're seeing something on the Cluster side, I might be able to give you some pointers.
