It’s been a while since my last post. That’s not for any dearth of interesting issues to blog about; rather, there have been TOO many issues, interesting and otherwise. Too many issues == too much work!

Nonetheless, this particular issue was interesting enough to make me take some time out, despite my busy schedule, and write about it.

On one node of a Windows Server 2008 SP2 Failover Cluster, the Cluster service would start, then terminate within a few seconds. If you run cluster log /g on that node and review C:\Windows\Cluster\Reports\cluster.log, you’ll see this:

 

00001928.00000f2c::2011/11/23-07:05:30.498 INFO  [RCM] Created monitor process 5828 / 0x16c4
00001d10.00001228::2011/11/23-07:05:31.496 INFO  [RHS] Initializing.
00001928.00000f2c::2011/11/23-07:05:31.590 INFO  [RCM] Created 32-bit monitor process 3800 / 0xed8
00001928.00000f2c::2011/11/23-07:05:35.068 WARN  [RCM] rcm::RcmMonitor::StartMonitor: Retrying...
00001928.00000f2c::2011/11/23-07:05:35.068 INFO  [RCM] Created 32-bit monitor process 5340 / 0x14dc
00001928.00000f2c::2011/11/23-07:05:38.656 WARN  [RCM] rcm::RcmMonitor::StartMonitor: Retrying...
00001928.00000f2c::2011/11/23-07:05:38.656 INFO  [RCM] Created 32-bit monitor process 9564 / 0x255c
00001928.00000f2c::2011/11/23-07:05:42.337 WARN  [RCM] rcm::RcmMonitor::StartMonitor: Retrying...
00001928.00000f2c::2011/11/23-07:05:42.337 INFO  [RCM] Created 32-bit monitor process 10092 / 0x276c
00001928.00000f2c::2011/11/23-07:05:45.956 WARN  [RCM] rcm::RcmMonitor::StartMonitor: Retrying...
00001928.00000f2c::2011/11/23-07:05:45.956 INFO  [RCM] Created 32-bit monitor process 7376 / 0x1cd0
00001928.00000f2c::2011/11/23-07:05:48.608 ERR   [CORE] Node 1: exception caught ERROR_SUCCESS(0)' because of 'Too many failures while attempting to start RHS process.'
00001928.00000f2c::2011/11/23-07:05:48.624 ERR   Exception in the InstallState is fatal (status = 0)
00001928.00000f2c::2011/11/23-07:05:48.624 ERR   FatalError is Calling Exit Process.
00001928.000027d4::2011/11/23-07:05:48.624 INFO  [CS] About to exit process...
00001d10.00001228::2011/11/23-07:05:48.640 WARN  [RHS] Cluster service has terminated.
00001d10.00001228::2011/11/23-07:05:48.671 INFO  [RHS] Exiting.

 

From the log snippet, we can see that the 64-bit RHS.exe was launched with Process ID 5828, but even after five attempts (the initial launch plus four retries), the Cluster service could not launch the 32-bit RHS.exe process.
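If you’d rather not eyeball a large cluster.log, the same reading can be sketched in a few lines of Python. This is only an illustration: the line layout is inferred from the excerpt above, and the function name summarize_rhs_startup and its summary keys are my own inventions, not anything the Cluster service provides.

```python
import re

# Layout inferred from the excerpt: <pid>.<tid>::<date>-<time> <LEVEL> <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<stamp>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>INFO|WARN|ERR)\s+(?P<msg>.*)$"
)
# "Created monitor process N" (64-bit) vs "Created 32-bit monitor process N"
MONITOR_RE = re.compile(r"Created (?P<bits>32-bit )?monitor process (?P<pid>\d+)")


def summarize_rhs_startup(lines):
    """Collect RHS launch attempts, retry warnings and ERR messages (hypothetical helper)."""
    summary = {"launch_64": [], "launch_32": [], "retries": 0, "errors": []}
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip blank lines or anything not in the expected layout
        msg = match.group("msg")
        monitor = MONITOR_RE.search(msg)
        if monitor:
            key = "launch_32" if monitor.group("bits") else "launch_64"
            summary[key].append(int(monitor.group("pid")))
        elif match.group("level") == "WARN" and "StartMonitor: Retrying" in msg:
            summary["retries"] += 1
        elif match.group("level") == "ERR":
            summary["errors"].append(msg)
    return summary


# A few lines copied from the excerpt above, used as sample input.
sample = [
    "00001928.00000f2c::2011/11/23-07:05:30.498 INFO  [RCM] Created monitor process 5828 / 0x16c4",
    "00001928.00000f2c::2011/11/23-07:05:31.590 INFO  [RCM] Created 32-bit monitor process 3800 / 0xed8",
    "00001928.00000f2c::2011/11/23-07:05:35.068 WARN  [RCM] rcm::RcmMonitor::StartMonitor: Retrying...",
    "00001928.00000f2c::2011/11/23-07:05:35.068 INFO  [RCM] Created 32-bit monitor process 5340 / 0x14dc",
    "00001928.00000f2c::2011/11/23-07:05:48.608 ERR   [CORE] Node 1: exception caught ERROR_SUCCESS(0)' "
    "because of 'Too many failures while attempting to start RHS process.'",
]

result = summarize_rhs_startup(sample)
print(result)
```

Several 32-bit launches, a matching run of retries and a closing ERR line is the pattern to look for. To run it against a real log, pass an open file handle instead of the sample list; the file’s encoding is something you would need to check on your own system.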

The RHS.exe process is critical to the functioning of the cluster, and if it cannot launch, the Cluster service cannot run. As this is a 64-bit version of Windows, we have a 64-bit version of RHS.exe for 64-bit Resource DLLs and a 32-bit version for 32-bit Resource DLLs. You can think of Resource DLLs as an interface between the Cluster service and the applications (SQL, Exchange, SAP) or components (IP Address, Network Name, Physical Disks) which have Cluster resources. These DLLs provide the Cluster service with “entry points” to control the applications and components through their Cluster resources. Examples of these entry points are Open, Close, Online and Offline.

Now we know that the Cluster service fails to start on this node because it cannot launch the 32-bit RHS.exe process. Our attention therefore turns to why the 32-bit RHS.exe fails to launch.

If the Cluster service fails to launch the 32-bit RHS.exe, would we be able to launch it manually? Let’s give it a shot.

We’ve got to remember that this is the 32-bit RHS.exe, which is in C:\Windows\SysWOW64, not C:\Windows\Cluster, which contains the 64-bit RHS.exe.

Open an elevated (Administrator) command prompt, change directory to C:\Windows\SysWOW64 and run RHS.exe. Immediately, we get this error:

 

[Screenshot: error pop-up that mentions dbghelp.dll]

 

There we go! The Cluster service is not able to launch it, and neither are we when we try manually.

Why am I excited? Because the pop-up message mentions a certain “dbghelp.dll”. The first place to look for this file would be C:\Windows\SysWOW64, because that’s where RHS.exe is as well.

When I browsed through the folder to look for dbghelp.dll, I found that the file was present, but there was also a dbghelp.dll.old file that caught my attention.

Dbghelp.dll was about 1,300 KB and dbghelp.dll.old was 780 KB. That’s interesting. Why? Because my lab Windows Server 2008 SP2 system has the 780 KB dbghelp.dll, but not the 1,300 KB file.
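Here is a small Python sketch of that side-by-side check, for comparing the copies on a problem node with those on a known-good system. Treat it as an illustration only: describe_copies is a name I made up, and the SHA-256 digest is an addition of mine (the original comparison relied on file size alone).

```python
import hashlib
from pathlib import Path


def describe_copies(directory, name="dbghelp.dll"):
    """Map each file called `name` or `name.*` in `directory` to (size in KB, SHA-256)."""
    found = {}
    for path in sorted(Path(directory).glob(name + "*")):
        if path.is_file():
            data = path.read_bytes()
            found[path.name] = (len(data) // 1024, hashlib.sha256(data).hexdigest())
    return found


# Only meaningful on the affected Windows node; skipped when the folder is absent.
syswow64 = Path(r"C:\Windows\SysWOW64")
if syswow64.is_dir():
    for file_name, (size_kb, digest) in describe_copies(syswow64).items():
        print(f"{file_name}: {size_kb} KB, sha256={digest}")
```

Running the same function on a lab machine at the same OS and service pack level gives you something to compare against; a copy whose size or digest matches nothing on the reference system deserves a closer look.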

Turns out that the Application team had a specific requirement to use the 1,300 KB dbghelp.dll while upgrading their clustered application, so they switched out the original file, put this one in and forgot about it. Their application may have been compatible with this new dbghelp.dll, but RHS.exe was not, and that’s why it failed to start.

Now that we know the root cause, resolving the issue is a matter of putting the original file back. Rename the 1,300 KB dbghelp.dll to something else, take ownership of the 780 KB dbghelp.dll.old file, give Administrators FULL CONTROL and rename this file to dbghelp.dll. Then change the ownership back to NT SERVICE\TrustedInstaller and set the Administrators’ permissions back to only Read and Read & Execute. PHEW! Issue resolved!
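For anyone who prefers the command line to the Explorer dialogs, the sequence above could be laid out roughly as follows. This is a dry-run sketch under my own assumptions: the post does not say which tools were used, so the choice of ren, takeown and icacls, the function name build_restore_plan and the “.app” suffix for the set-aside file are all mine. The script only prints the commands and never runs them.

```python
from pathlib import PureWindowsPath


def build_restore_plan(directory, replaced_suffix=".app"):
    """Return the restore steps as argument lists, in order (nothing is executed)."""
    folder = PureWindowsPath(directory)
    current = folder / "dbghelp.dll"       # the 1,300 KB file dropped in by the application team
    original = folder / "dbghelp.dll.old"  # the 780 KB file that shipped with the OS
    set_aside = "dbghelp.dll" + replaced_suffix  # hypothetical name for the set-aside copy
    return [
        # 1. Move the incompatible file out of the way.
        ["cmd", "/c", "ren", str(current), set_aside],
        # 2. Take ownership of the original file for the Administrators group.
        ["takeown", "/f", str(original), "/a"],
        # 3. Grant Administrators Full Control so the file can be renamed.
        ["icacls", str(original), "/grant", "Administrators:F"],
        # 4. Put the original file back under its real name.
        ["cmd", "/c", "ren", str(original), current.name],
        # 5. Hand ownership back to TrustedInstaller.
        ["icacls", str(current), "/setowner", "NT SERVICE\\TrustedInstaller"],
        # 6. Replace the Administrators grant with Read & Execute only.
        ["icacls", str(current), "/grant:r", "Administrators:(RX)"],
    ]


plan = build_restore_plan(r"C:\Windows\SysWOW64")
for step in plan:
    print(" ".join(step))
```

Review the printed commands and try them on a lab machine before going anywhere near a production node. They need an elevated command prompt, and the ACLs on your system may not match what I’ve assumed here.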