Microsoft Enterprise Platforms Support: Windows Server Core Team
Welcome to the AskCore blog. Today, we are going to talk about nodes being removed from active Failover Cluster membership when the nodes are hosted on VMware ESX. I have documented node membership problems in a previous blog:
Having a problem with nodes being removed from active Failover Cluster membership?
http://blogs.technet.com/b/askcore/archive/2012/02/08/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership.aspx
This is a sample of the event you will see in the System Event Log in Event Viewer:
One specific problem that I have seen a few times lately is with the VMXNET3 adapters dropping inbound network packets because the inbound buffer is set too low to handle large amounts of traffic. We can easily find out if this is a problem by using Performance Monitor to look at the “Network Interface\Packets Received Discarded” counter.
Once you have added this counter, look at the Average, Minimum and Maximum numbers. If any of them is higher than zero, the adapter is discarding inbound packets and its receive buffer needs to be increased. This problem is documented in VMware's Knowledge Base:
Large packet loss at the guest OS level on the VMXNET3 vNIC in ESXi 5.x / 4.x
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2039495
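If you would rather check many nodes without opening Performance Monitor on each one, the same counter can be sampled from the command line with typeperf and the CSV output scanned for non-zero values. The following is only a minimal sketch, not an official tool: the function name adapters_with_discards, the host name NODE1, the adapter names, and the sample values are all made up for illustration, and it assumes the usual typeperf CSV layout (a header row of counter paths followed by one row per sample, with the timestamp in the first column).

```python
import csv
import io


def adapters_with_discards(typeperf_csv: str) -> dict:
    """Return {counter_path: highest_sample} for every counter column
    whose highest sampled value is above zero."""
    rows = list(csv.reader(io.StringIO(typeperf_csv.strip())))
    if len(rows) < 2:
        return {}  # no header or no samples
    header, samples = rows[0], rows[1:]
    flagged = {}
    for col in range(1, len(header)):  # column 0 holds the timestamp
        values = []
        for row in samples:
            if col >= len(row):
                continue  # blank line or status message, not a sample row
            try:
                values.append(float(row[col]))
            except ValueError:
                continue  # empty or non-numeric sample
        if values and max(values) > 0:
            flagged[header[col]] = max(values)
    return flagged


# Hypothetical output, for illustration only (not captured from a real system).
SAMPLE = r'''"(PDH-CSV 4.0)","\\NODE1\Network Interface(vmxnet3 Ethernet Adapter)\Packets Received Discarded","\\NODE1\Network Interface(vmxnet3 Ethernet Adapter _2)\Packets Received Discarded"
"01/01/2013 10:00:00.000","0.000000","12.000000"
"01/01/2013 10:00:01.000","0.000000","57.000000"
'''

if __name__ == "__main__":
    for counter, highest in adapters_with_discards(SAMPLE).items():
        print(f"{counter}: {highest:.0f} packets discarded")
```

To feed it real data, you could capture a few samples on a node with something like typeperf "\Network Interface(*)\Packets Received Discarded" -sc 5 and pass the captured text to the function. Any adapter it reports is a candidate for a larger receive buffer, as described in the VMware article above.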
I hope that this post helps you!
James Burrage
Senior Support Escalation Engineer
Windows High Availability Group
Nice tip, thanks for sharing James!
I tried to figure out what the recommended values are but couldn't find them in the VMware docs.
Can you point me to these values?
Wow, nice tip. I'm seeing huge numbers for packet drops on the replication network. The default is "not present" so what number should we start with? Thanks.
Let me add my thanks too. We have been experiencing failovers that we could not explain. I think this is the smoking gun we have been looking for.
We don't have the suggested values; you have to contact VMware for that guidance. Unfortunately, I don't know what the maximums are.
So does this problem only occur when the VM's vNIC is VMXNET3, and not with any other type such as e1000?
In a cluster with failover issues that has a separate heartbeat NIC and public NIC, if only the public NIC shows a value higher than zero, will increasing the buffer on that NIC fix the failover issues? Or does it only help if the heartbeat NIC is the one dropping packets?