Microsoft Enterprise Platforms Support: Windows Server Core Team
The purpose of this post is to explain the supportability of removing resources from a Cluster Server. We have recently seen an increase in users manually deleting resources from the Cluster registry, and I want to make clear that this is unsupported by Microsoft. Doing so can cause problems with your Clusters, so I want to describe those problems as well as how to get out of the predicament you can end up in.
First, the ONLY supported ways of deleting a resource are through Cluster Administrator (Windows 2003), Failover Cluster Management (2008 and 2008 R2), CLUSTER.EXE, or PowerShell (2008 R2). The reason usually given for manually deleting a resource from the registry is that the user cannot get into the UI. This is where CLUSTER.EXE or PowerShell comes in. For example, say I have a resource called Johns Resource and I want to delete it. The command I would use is one of the following:
Cluster res "Johns Resource" /delete
Remove-ClusterResource "Johns Resource"
Using either command removes the resource from all entries on all nodes, including the quorum.
To break this down further, a resource in the Cluster appears in several locations in the Cluster hive and is referenced by a GUID.
HKEY_LOCAL_MACHINE\Cluster
    Resources
        C8d32427-7daa-4a94-ba85-850f5a920382 <<-- Johns Resource
    Groups
        28baec47-2589-49a9-aa7c-cc32b57e1875 <<-- the group name
            Contains <<-- all resources in the group here
What users have been doing is simply deleting the GUID under the Resources key only. However, the resource is still listed under the group. The GUID can also be listed under the HKEY_LOCAL_MACHINE\Cluster\Dependencies and HKEY_LOCAL_MACHINE\Cluster\Checkpoints registry keys, which need to be checked as well because the entries there are not removed either. Users who do this also manually delete the GUID on every node as well as on the quorum drive. Sometimes it takes a restart of the Cluster Service everywhere before the resource is finally gone. CLUSTER.EXE would have removed it right then and there, with no restarts necessary.
In a Windows 2003 Cluster, when you start the Cluster Service, you will see this in the Cluster Log:
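To illustrate why deleting only the key under Resources leaves the hive inconsistent, here is a minimal sketch in Python. It does not touch a real registry: the nested dictionary is a hypothetical stand-in for the Resources and Groups keys described above, and the function name `find_dangling_references` is made up for this example. It only shows the kind of leftover reference that the supported tools clean up for you.

```python
# Hypothetical in-memory model of part of the Cluster hive.
# This is an illustration only; it does not read or modify any registry.

def find_dangling_references(hive):
    """Return (group_guid, resource_guid) pairs where a group's Contains
    list still names a resource that has no key under Resources."""
    known = {guid.lower() for guid in hive.get("Resources", {})}
    dangling = []
    for group_guid, group in hive.get("Groups", {}).items():
        for res_guid in group.get("Contains", []):
            if res_guid.lower() not in known:
                dangling.append((group_guid, res_guid))
    return dangling


# State after someone deletes only the GUID under Resources by hand:
hive = {
    "Resources": {},  # the resource key is gone
    "Groups": {
        "28baec47-2589-49a9-aa7c-cc32b57e1875": {
            # the group still lists the deleted resource
            "Contains": ["C8d32427-7daa-4a94-ba85-850f5a920382"],
        },
    },
}

print(find_dangling_references(hive))
```

With the model above, the function reports the group that still points at the deleted resource. A complete check would also have to cover the Dependencies and Checkpoints keys, which this sketch leaves out.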
[FM] Group 28baec47-2589-49a9-aa7c-cc32b57e1875 contains Resource C8d32427-7daa-4a94-ba85-850f5a920382.
[FM] Creating resource C8d32427-7daa-4a94-ba85-850f5a920382
[FM] Initializing resource C8d32427-7daa-4a94-ba85-850f5a920382 from the registry.
[FM] Unable to open resource key C8d32427-7daa-4a94-ba85-850f5a920382, 2
[FM] DestroyResource: destroying C8d32427-7daa-4a94-ba85-850f5a920382
[DM] Deleting object C8d32427-7daa-4a94-ba85-850f5a920382
[FM] Failed to find resource C8d32427-7daa-4a94-ba85-850f5a920382 for group 28baec47-2589-49a9-aa7c-cc32b57e1875
When you open Cluster Administrator, there are no initial errors. However, if multiple resources in the same group are in this state, you could receive Error 1130 (Not enough server storage) and be unable to create any more resources in the group.
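If you want to check a Windows 2003 Cluster Log for this symptom, a small script can pull out the affected GUIDs. The following is a rough sketch in Python that assumes the log text contains the "Unable to open resource key" message exactly as worded in the excerpt above. The function name `missing_resource_keys` is hypothetical, and real logs carry extra prefixes (timestamps, process and thread IDs) that the pattern simply skips over.

```python
import re

# A GUID as it appears in the excerpt: 8-4-4-4-12 hexadecimal digits.
GUID = r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"

# Message wording is taken from the log excerpt above; treat it as an
# assumption for any other version of the log.
MISSING_KEY = re.compile(
    r"\[FM\] Unable to open resource key (" + GUID + r"), (\d+)"
)


def missing_resource_keys(log_text):
    """Return (guid, status) pairs for every resource key the Cluster
    Service reported it could not open."""
    return [(m.group(1), int(m.group(2))) for m in MISSING_KEY.finditer(log_text)]


sample = (
    "[FM] Initializing resource C8d32427-7daa-4a94-ba85-850f5a920382 from the registry.\n"
    "[FM] Unable to open resource key C8d32427-7daa-4a94-ba85-850f5a920382, 2\n"
    "[FM] DestroyResource: destroying C8d32427-7daa-4a94-ba85-850f5a920382\n"
)

print(missing_resource_keys(sample))
```

The status value 2 in the excerpt matches the "file not found" error code that also shows up in the Windows 2008 events, which is what you would expect when the key has been deleted by hand.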
In Windows Server 2008 (and R2) Clusters, the results are much different. The Cluster Service will show as started; however, the cluster will not form. In the System Event Log, you will see these errors:
Event ID: 7024
Source: Service Control Manager
Description: The Cluster Service terminated with service-specific error 2 (0x2).

Event ID: 1092
Source: FailoverClustering
Description: Failed to form Cluster ‘clustername’ with error code 2. Failover cluster will not be available.
In the Windows 2008 Cluster Log, you will see this:
WARN [DM] Key \Registry\Machine\Cluster does not appear to be loaded (status STATUS_OBJECT_NAME_NOT_FOUND(c0000034)
INFO [DM] Loading Hive, Key Cluster, FilePath C:\Windows\Cluster\CLUSDB
ERR [CORE] Node 1: exception caught ERROR_FILE_NOT_FOUND(2)' because of 'OpenSubKey failed.'
ERR Exception in the InstallState is fatal (status = 2)
ERR FatalError is Calling Exit Process.
These are the problems you can run into by manually removing, or "hacking", a resource out of the registry without removing it from all of its locations in the hive. This is also one of the reasons why it is an unsupported method for removing a resource from a Cluster. The whole point of Failover Clusters is high availability. By attempting the unsupported method above, you can cause downtime, which defeats the purpose of high availability.
John Marlin Senior Support Escalation Engineer Microsoft Enterprise Platforms Support
Thanks for this very useful information. I have a question on this: in a Windows Server 2003 two-node SQL Server cluster, if I am doing an activity like adding new disks to the servers, I will have to bring down the servers one at a time. Can a situation like the one you have described above occur after the upgrade activity?
How do I go about removing client access points on 2008 R2 clusters?
In Windows 2003, unfortunately, you would need to do this or assign the drives to only one node at a time. Assign the drives to the first node only, format them, and add them to the Cluster. You can then add the assignment to the other node(s) and test your failovers.
In Windows 2008 and beyond, this is not necessary, as all new drives added to the systems are in an offline state. So you can manually bring them online in Disk Management (or Server Manager) individually on each node. The reason is that we do not want multiple machines to have direct access to the drives while they are not clustered. This could lead to corruption of the drive(s).
Client Access Points are simply an IP Address resource and a Network Name resource. So you would delete the Name and then the IP Address, or delete the IP Address and the Cluster will automatically delete the Name based on the dependencies.
Currently I have a problem with a two-node cluster on Windows 2008. This cluster has two SQL instances (Services and Applications), SG1DB and SG2DB. Instance 1 was no longer needed, so I used SQL setup to remove it (using the Maintenance option and the remove node option, as instructed by MSFT help). I did that on the passive node first and it went successfully; the instance no longer shows up in SQL Server Configuration Manager. Then I did the same on the active node for Instance 1, but setup ended unexpectedly almost at the end of the process. The SQL error screen says something like:
The resource 'Bck_SG1DB' could not be moved from cluster group 'SQL Server (SG1DB)' to cluster group 'Available Storage'.
Error: There was a failure to call cluster code from provider. Exception message: Generic Failure.
Status code: 5015.
Description: The operation failed because either the specified cluster node is not the owner of the resource or the node is not a possible owner of the resource
So what I tried to do was delete it from the Failover Cluster Manager console by selecting it from within Services and Applications, but I was unable to do this because I got another error saying:
Could not move the resource to available storage. The operation failed because either the specified cluster node is not the owner of the resource or the node is not a possible owner of the resource.
I noticed that the SQL instance was removed from SQL Server Configuration Manager on both nodes. It looks to me like it only got stuck when setup tried to remove it from Failover Cluster Manager.
Can you help me to find a way to remove it?
If you bring up the properties of the 'Bck_SG1DB' resource, you will see the Possible Owners of the resource. Make sure that all nodes are listed here.