Applying an OpsMgr hotfix to an RMS cluster node? Some things to be aware of.
When you apply a SCOM hotfix to an RMS cluster, there are some issues and workarounds you need to be aware of. This is something I have seen several times in the field.
On any server or agent, the hotfix installer stops any discovered OpsMgr services, including the SDK, Config, and HealthService. This part is normal: it stops them so it can update the files (DLLs) in the hotfix payload, and it starts the services again when it is done. This all works well, except on RMS clusters.
The reason is that the hotfix installer is not 100% cluster-aware.
In an RMS cluster, the passive node has these three services stopped and set to Manual startup. On the active node, the OpsMgr services are also set to Manual startup, but they are running, because the Cluster service controls them. This is how a clustered service works. We should never stop a clustered service in Service Control Manager; instead, take the resource offline in Cluster Admin.
So I have two options… I can apply the hotfix to the Active Node… or the Passive node.
If I choose the active node, the hotfix installer will try to stop all the OpsMgr services, and this will cause the Cluster service to try to restart them, or eventually fail them over to the passive node, depending on your cluster configuration settings. Therefore, it is probably best to patch the passive node first, ensure the hotfix applied correctly, then move the cluster group and OpsMgr RMS group over to the freshly hotfixed node, and go patch the other one (now passive).
This works, but it is not 100% smooth. When we apply the hotfix to the passive node, the hotfix installer will try to start the services at the end of the process, even though they were not running previously. We do NOT want these services trying to run on the passive node, since it does not own the cluster disk resources. Any services that do start cannot do anything except log errors.
You will also see an error from the HealthService, which is not able to start. It appears that this service fails because it cannot access the disk resource, but the SDK and Config services WILL start.
What is worse is that the hotfix installer changes the service startup types to Automatic, which means these services will keep trying to run on the passive node across reboots.
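To make the cleanup concrete, here is a minimal sketch (not an official tool) that takes the observed startup type and state of each OpsMgr service on the passive node and lists the `sc.exe` commands that would put them back. The short service names `HealthService`, `OMSDK`, and `OMCFG` and the function name `passive_node_fixups` are my assumptions for illustration; confirm the real service names on your own nodes before running anything.

```python
# Short service names are assumptions; confirm them with `sc query` on your own nodes.
OPSMGR_SERVICES = ("HealthService", "OMSDK", "OMCFG")


def passive_node_fixups(services):
    """Return sc.exe commands that put a passive node back in its expected state.

    `services` maps a service name to a (start_type, state) pair, for example
    {"OMSDK": ("auto", "running")}. Values are compared case-insensitively.
    """
    commands = []
    for name in OPSMGR_SERVICES:
        if name not in services:
            continue  # service not reported on this node, nothing to fix
        start_type, state = (value.lower() for value in services[name])
        if state == "running":
            # The passive node should not be running any of these services.
            commands.append(f"sc.exe stop {name}")
        if start_type == "auto":
            # "demand" is how sc.exe spells the Manual startup type.
            commands.append(f"sc.exe config {name} start= demand")
    return commands


# Hypothetical state of a passive node right after the hotfix installer finishes.
after_hotfix = {
    "HealthService": ("auto", "stopped"),  # failed to start, but now Automatic
    "OMSDK": ("auto", "running"),
    "OMCFG": ("auto", "running"),
}

for command in passive_node_fixups(after_hotfix):
    print(command)
```

The sketch only builds the command strings, so you can review them before running anything on a cluster node.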
So – the guidance I have, for RMS clusters – is:
1. Patch the passive node (we will call this Node 2).
2. Click OK on the HealthService start failure error.
3. Ensure the hotfix applied by inspecting the DLL version(s) as documented in the KB.
4. Stop the running SDK and Config services on the passive node.
5. Set any OpsMgr services that were changed to Automatic BACK to Manual.
6. Move the cluster resource groups over to the freshly patched Node 2.
7. On Node 1 (now passive), apply the hotfix, and repeat the steps starting at Step 2 above.
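The order of operations above can be sketched as a small plan generator, which is handy if you want to print a checklist for a change record. The function name `rms_hotfix_plan` and the node names are hypothetical, and the sketch assumes a two-node cluster where the resource groups stay on the first patched node once both nodes are done.

```python
def rms_hotfix_plan(active_node, passive_node):
    """Return the ordered steps for hotfixing a two-node RMS cluster."""
    # Steps performed on whichever node is passive at the time.
    per_node = [
        "Apply the hotfix on {node}",
        "Click OK on the HealthService start failure error on {node}",
        "Check the DLL versions on {node} against the KB article",
        "Stop the running SDK and Config services on {node}",
        "Set OpsMgr services changed to Automatic back to Manual on {node}",
    ]
    # Patch the passive node first so the clustered services are left alone.
    steps = [step.format(node=passive_node) for step in per_node]
    # Fail over to the freshly patched node; the old active node becomes passive.
    steps.append(
        f"Move the cluster resource groups from {active_node} to {passive_node}"
    )
    # Repeat the same per-node steps on the node that is passive now.
    steps += [step.format(node=active_node) for step in per_node]
    return steps


for number, step in enumerate(rms_hotfix_plan("Node1", "Node2"), start=1):
    print(f"{number}. {step}")
```

Passing the nodes in explicitly keeps the sketch from guessing which node currently owns the cluster groups; you still verify that in Cluster Admin.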
NOTE: This is only applicable to OpsMgr-specific hotfixes. For OS hotfixes, you would follow your standard clustered OS hotfix routine.