When should I evict a cluster node?

I thought I’d post a quick blog on this topic, since we regularly run into cases where evicting a cluster node is used as a troubleshooting step. To be clear: evicting a node should NEVER be a primary troubleshooting step.

Evicting a node to try to resolve a cluster issue may dig you deeper into the hole and ultimately make the issue more complex than it was to begin with. As an example, say you originally started with a failover issue. You evict the node, but now you can't get it back into the cluster. Since you can no longer add the node back, you have a secondary issue that must be resolved before you can even address the original problem.

In my experience of working many cluster issues, I have never resolved an issue by evicting a node. The only times you should ever evict a node are in the following scenarios (a sketch of the eviction steps follows the list).

  • Replacing a node with different hardware.
  • Reinstalling the operating system.
  • Permanently removing a node from a cluster.
  • Renaming a node of a cluster.
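
If you do fall into one of those four scenarios, the eviction itself is simple. A minimal sketch using cluster.exe (the cluster, group, and node names here are made up; move any groups the node owns to another node first):

REM Check the state of every node before touching anything
cluster /cluster:MYCLUSTER node /status

REM Move the groups the node owns over to a surviving node
cluster /cluster:MYCLUSTER group "Cluster Group" /moveto:NODE1

REM Evict the node - only for the four scenarios above
cluster /cluster:MYCLUSTER node NODE2 /evict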

Let’s take a look at some very common scenarios where I’ve seen evicting a node used improperly.

Cluster service won’t start on node 2 of a cluster. Node 2 is evicted from the cluster. The original problem that kept the cluster service from starting is still there, but now that same problem also prevents node 2 from coming back into the cluster.
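
Instead of evicting, start with the service itself and capture the actual error. A hedged first pass on node 2 (clussvc is the Cluster service's service name; the rest is generic):

REM What state is the Cluster service in, and how is it configured?
sc query clussvc
sc qc clussvc

REM Try to start it and note the exact error it returns
net start clussvc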

Resources don’t fail over to node 2. Every time a failover occurs, the disks don’t come online and fail back to node 1. One of the nodes is evicted and then added back to the cluster. None of this addresses the disk issue, so the problem still remains.
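
Here again the node membership is not the problem; the disk resource is. A hedged look at the failing resource (the resource name "Disk Q:" is just an example):

REM List every resource and its current state, including the disks that fail
cluster resource /status

REM Dump the properties of the disk resource that won't come online
cluster resource "Disk Q:" /prop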

If the reason for the disk failure is an Error 2, the drives are not being seen properly by the evicted node. So when you try to add the evicted node back in and take the defaults, the join can fail with this error in CLCFGSRV.LOG:

Major Task ID: {B8C4066E-0246-4358-9DE5-25603EDD0CA0}
Minor Task ID: {3BB53C9E-E14A-4196-9066-5400FB8860C9}
Progress (min, max, current): 0, 1, 1
Description:
Checking that all nodes have access to the quorum resource
Status: 0x800713de
The quorum disk could not be located by the cluster service.
Additional Information:
For more information, visit Help and Support Services at
http://go.microsoft.com/fwlink/?LinkId=4441.
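
When you hit that 0x800713de status, verify that the node being added back can actually see the shared storage before retrying the join. One hedged check from that node (list disk is a read-only diskpart command):

REM Can the operating system on this node see the shared disks at all?
echo list disk | diskpart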

I could go on and on, but the point I am trying to make is that unless you fall into one of the four specific scenarios I mentioned, don't evict your cluster nodes. Your Microsoft Support Engineers will thank you, and your users will thank you.

Jeff Hughes
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Comments
  • Good post, Jeff. I'll be forwarding this one on to co-workers for sure, as I've seen evicting a node sometimes used as one of the first steps taken.

    Thanks

    Mike

  • Yes, the new interface is really cool..

    but it hung on me when I opened a PDF online...

    I don't know if it was because of Acrobat Reader

  • Evicted a node, and now the cluster service won't start; when I try to add the node back to the cluster I get "The computer xxxx is joined to a cluster".

    How do I fix this?

  • To fix this issue, all you have to do is set the server's cluster service to Disabled.
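
    For anyone hitting the same error, a minimal sketch of that fix, run on the evicted node that still thinks it is joined (clussvc is the Cluster service's service name; the /forcecleanup switch requires Windows Server 2003 SP1 or later):

    REM Stop the stale node from trying to act as a cluster member
    sc config clussvc start= disabled

    REM On Windows Server 2003 SP1 and later, you can also clear the leftover cluster state
    cluster node %COMPUTERNAME% /forcecleanup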

  • Great blog, and many thanks for writing it. I have the following questions, though, and would appreciate your input:

    1. What are the differences between Evict Node in Windows Server 2008 and Remove Node in SQL Server 2008 R2?

    2. What are the differences [extra steps, etc.] when adding back a) the evicted node and b) the removed node?

    3. Besides patching, what are other situations where one would have to remove a SQL Server node?

    I would appreciate it if you would shoot me an email at naprico@hotmail.com listing the URL of your response to these questions. Many thanks again.

    Best

    Naprico

  • How can you fix the following situation? There is a cluster with two nodes, A and B. Node A is up and running; the cluster service on node B can't start, with a message about signature problems on the quorum. I tried to force the cluster service to start without a quorum, but without success.

    This is a 2003 cluster.

  • What is going on is that when the cluster service starts, it always tries to join the existing cluster first.  If it cannot join, it tries to form the cluster.  In your case, however, the cluster is already running on the other node, which owns the quorum/witness disk.  The starting node tries to take the disk and cannot, since it is already owned, so the cluster service terminates and gives the error about the drive.  The focus of your troubleshooting should be why the node cannot join.  In almost all cases, it is because the nodes cannot communicate over port 3343.  Something is blocking the port.  It could be antivirus, firewall, etc.
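
    A hedged way to test that from the joining node (PortQry is Microsoft's free port-query tool; the node name NODEA is just an example):

    REM Test UDP 3343, the cluster heartbeat port, against the node that owns the cluster
    portqry -n NODEA -p udp -e 3343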

  • Or, it could be your security settings.  You would need to review the cluster log to see the errors when the join takes place.  If it is a 1722 or 1726, it is the connectivity issue I just explained.  If it is a 5, it is an access denied error.  If you click on my name in the tag list, you will find a blog I wrote about access denied errors and Cluster Administrator; it has the info you need on what to look for and how to fix it.
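
    If you want to pull those status codes out of the log quickly, a hedged one-liner (assuming the default Windows Server 2003 log location):

    REM Look for RPC connectivity failures (1722/1726) around the failed join
    findstr "1722 1726" %SystemRoot%\Cluster\cluster.log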

  • Hi, I am in a very critical situation. My secondary cluster server is not working: I am not able to see the quorum drive, the shared drive, or the solution services, but I can see them all from the primary cluster node. This happened when I tried to upgrade the solution without stopping the cluster services. Can you help me with how to re-add the secondary server to the cluster?

    Please help; this is a do-or-die situation for me.

  • @Dennis, I was able to add my server back into the cluster by disabling the cluster service as you instructed. @Jeff, really? It was a little late when I found your article. Perhaps you might want to revise it?

  • P.S. The LinkId=4441 URL is a dead link.