Microsoft Enterprise Platforms Support: Windows Server Core Team
Welcome to the AskCore blog. Today, we are going to talk about nodes being removed from active Failover Cluster membership when the nodes are hosted on VMware ESX. I have documented node membership problems in a previous blog:
Having a problem with nodes being removed from active Failover Cluster membership?
http://blogs.technet.com/b/askcore/archive/2012/02/08/having-a-problem-with-nodes-being-removed-from-active-failover-cluster-membership.aspx
This is a sample of the event you will see in the System Event Log in Event Viewer:
One specific problem that I have seen a few times lately is with the VMXNET3 adapters dropping inbound network packets because the inbound buffer is set too low to handle large amounts of traffic. We can easily find out if this is a problem by using Performance Monitor to look at the “Network Interface\Packets Received Discarded” counter.
Once you have added this counter, look at the Average, Minimum and Maximum numbers. If any of them is higher than zero, the receive buffer for the adapter needs to be increased. This problem is documented in VMware's Knowledge Base:
Large packet loss at the guest OS level on the VMXNET3 vNIC in ESXi 5.x / 4.x
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2039495
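If you would rather script the check than read it off the Performance Monitor graph, here is a minimal sketch in Python. It assumes you have captured the counter to CSV text, for example with `typeperf "\Network Interface(*)\Packets Received Discarded" -sc 5`, and that the text uses the usual PDH-CSV layout (a header row starting with `(PDH-CSV`, then one row per sample). The function names and the sample data are made up for illustration.

```python
import csv
import io

# Counter name from the post; the leading backslash separates it from the instance.
COUNTER_SUFFIX = "\\Packets Received Discarded"


def max_discards_by_counter(pdh_csv_text):
    """Return {counter column name: highest sampled value} for the discard counter.

    Assumes PDH-CSV text: a header row whose first cell starts with "(PDH-CSV",
    followed by rows of timestamp plus one value per counter column.
    """
    rows = [row for row in csv.reader(io.StringIO(pdh_csv_text)) if row]
    header = next(row for row in rows if row[0].startswith("(PDH-CSV"))
    maxima = {}
    for row in rows:
        if row is header or len(row) != len(header):
            continue  # skip the header itself and any status lines
        for name, value in zip(header[1:], row[1:]):
            if not name.endswith(COUNTER_SUFFIX):
                continue
            try:
                sample = float(value)
            except ValueError:
                continue  # blank or non-numeric cell
            maxima[name] = max(maxima.get(name, 0.0), sample)
    return maxima


def adapters_with_discards(pdh_csv_text):
    """List the counter columns whose maximum is above zero."""
    maxima = max_discards_by_counter(pdh_csv_text)
    return sorted(name for name, value in maxima.items() if value > 0)


if __name__ == "__main__":
    # Hypothetical capture with two adapters; only "nic B" is discarding packets.
    sample = (
        r'"(PDH-CSV 4.0)","\\HOST\Network Interface(nic A)\Packets Received Discarded",'
        r'"\\HOST\Network Interface(nic B)\Packets Received Discarded"' "\n"
        r'"01/01/2024 10:00:00.000","0.000000","12.000000"' "\n"
        r'"01/01/2024 10:00:01.000","0.000000","15.000000"' "\n"
    )
    for name in adapters_with_discards(sample):
        print("Receive buffer may need raising for:", name)
```

Any adapter this lists is a candidate for a larger receive buffer, following the same "higher than zero" rule described above.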
I hope that this post helps you!
James Burrage
Senior Support Escalation Engineer
Windows High Availability Group
Nice tip, thanks for sharing James!
I tried to figure out what the recommended values are but couldn't find them in the VMware docs.
Can you point us to these values?
Wow, nice tip. I'm seeing huge numbers for packet drops on the replication network. The default is "not present" so what number should we start with? Thanks.
Let me add my thanks too. We have been experiencing failovers that we could not explain. I think this is the smoking gun we have been looking for.
We don't have the suggested values, you have to contact VMware for that guidance. I don't know what the maximums are unfortunately.
So this problem only occurs when the VM vNIC is VMXNET3, and not with any other type such as e1000?
In a cluster with failover issues, if you have a separate heartbeat NIC and public NIC and the public NIC shows a value higher than zero, will increasing the buffer on that NIC fix the failover issues, or does it only help if the discards are on the heartbeat NIC?
I would like to echo Robbie Foust's request: what are good values to start with?
The question of what to increase it to is difficult to answer, as all networks and environments are different. Some increase it by a little, some double the value, some need to go higher. This is a setting in the VMware network card driver, and even their article referenced in this blog does not state what you should increase it to. You could start by doubling the value and monitoring it. If dropped packets continue, increase it again. If the dropped packets appear to go away, leave it or lower it and keep monitoring. Unfortunately, we cannot say that a value of "x" will resolve the issue, as each environment is different.
I have been looking at my servers that have VMXNET3 installed and noticed all of them are set to "Not Present". So I began with the default value given in this VMware article:
Once I did this all my counters dropped to 0. Thanks for the info.
@MW the counters are cumulative and they reset to zero once the NICs are reset for any reason. Changing the value resets the NIC briefly, so it will always bring the counters down to zero. You still need to monitor them for discarded packets later.
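To illustrate the point about cumulative counters, here is a small Python sketch that takes periodic readings of the discard counter and works out how many packets were actually discarded over the monitoring window, treating any drop in the reading as a NIC reset. The function name and the readings are hypothetical. It also assumes the counter restarts from zero on a reset, as described above.

```python
def new_discards(readings):
    """Total packets discarded across a series of cumulative counter readings.

    A reading lower than the previous one is treated as a counter reset
    (for example after changing the buffer size), so counting restarts from zero.
    """
    total = 0
    previous = None
    for current in readings:
        if previous is not None:
            if current >= previous:
                total += current - previous
            else:
                total += current  # counter restarted from zero after a reset
        previous = current
    return total


if __name__ == "__main__":
    # Counter at 100, climbs to 120, NIC is reset, then stays at zero.
    print(new_discards([100, 120, 0, 0]))     # 20 discards, all before the reset
    # Same start, but discards continue after the reset.
    print(new_discards([100, 120, 0, 5, 9]))  # 29: the new buffer size is still too small
```

A reading of zero right after the change therefore says nothing on its own. What matters is whether the total keeps growing in the readings taken afterwards.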