(Or: How to patch an RMS cluster in 14 easy steps)
I had a previous blog post on this subject that I have since taken down. This has been a “fuzzy” area ever since Microsoft shipped OpsMgr 2007, mostly because there were no official guidelines or instructions on the “right” way to apply SCOM hotfixes to an RMS cluster.
This post won’t be “official” either, but it comes out of knowledge gained from successes and failures, and from discussions with the product group on the best way to accomplish this.
The recommended process is very similar to the R2 upgrade process for upgrading a clustered RMS from SP1 to R2. I must say I am not a fan of this process, because it forces downtime for the RMS. When patching Microsoft clusters in general, hotfixes should be applied to the passive node first; you then fail over the cluster and patch the other (now passive) node. This results in the least amount of downtime for the application, which is why we clustered it in the first place. This should be a design goal of any update.
However, SCOM hotfixes are not designed with this in mind. The challenge is that several SCOM hotfixes make changes to the database while being applied to the RMS, and they expect the SDK service to be running locally so they can do this. This FORCES us to always apply a SCOM hotfix to the active node (under the current design).
Updating the active node stops the services on that node. This might trigger a failover of the cluster resources, which we do NOT want to happen, so we must take steps to keep the failover from occurring. This is covered in detail below.
I really hope that, if we continue to support any clustered roles in SCOM, we design the hotfix process around always patching the passive node in the future, as that limits the downtime of an update to the smallest amount possible.
OK, enough about the process. Let’s get down to it.
In this example – I will be applying the OpsMgr R2 CU1 rollup hotfix to an OpsMgr R2 RMS Windows 2003 cluster. Any SCOM hotfix would follow the same process, unless there is specific guidance in the release notes for the hotfix. ALWAYS read the release notes.
High level overview:
- Log on to the active node (node1) of the RMS cluster
- Configure the cluster to not be able to fail over to the passive node (node2).
- Run the hotfix update on the active node (node1)
- Validate that the hotfix applied successfully to node1.
- Configure the cluster to be able to fail over to the passive node (node2).
- Move all the cluster resource groups to node2.
- Configure the cluster to not be able to fail over to node1.
- Reboot node1 if you were prompted to do so by the update.
- Log on to node2, and run the hotfix update on the newly active node (node2)
- Validate that the hotfix applied successfully to node2.
- Configure the cluster to be able to fail over to node1.
- Move all the cluster resource groups back to node1.
- Reboot node2 if you were prompted to do so by the update.
- Verify that you can fail the cluster groups to node2 and then back to node1.
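The core ordering rule in this overview (pause the passive node, patch the active node, resume, move the groups, repeat) can be sketched as a toy model. This is only an illustration in Python: the `TwoNodeCluster` class and its methods are hypothetical and do not talk to a real cluster, and the sketch leaves out the validation, reboot, and final failover-test steps.

```python
from dataclasses import dataclass, field


@dataclass
class TwoNodeCluster:
    """Toy model of a two-node cluster; hypothetical, not a real clustering API."""
    active: str
    passive: str
    paused: set = field(default_factory=set)
    log: list = field(default_factory=list)

    def pause(self, node):
        self.paused.add(node)
        self.log.append(f"pause {node}")

    def resume(self, node):
        self.paused.discard(node)
        self.log.append(f"resume {node}")

    def patch_active(self):
        # The hotfix must run on the active node, and the other node must
        # not be a possible failover target while the services are stopped.
        if self.passive not in self.paused:
            raise RuntimeError(f"pause {self.passive} before patching {self.active}")
        self.log.append(f"patch {self.active}")

    def move_groups(self):
        # Resource groups cannot move to a node that is still paused.
        if self.passive in self.paused:
            raise RuntimeError(f"resume {self.passive} before moving groups to it")
        self.active, self.passive = self.passive, self.active
        self.log.append(f"move groups to {self.active}")


def patch_both_nodes(cluster):
    """Run the pause / patch / resume / move cycle once per node."""
    for _ in range(2):
        cluster.pause(cluster.passive)
        cluster.patch_active()
        cluster.resume(cluster.passive)
        cluster.move_groups()
    return cluster.log


if __name__ == "__main__":
    for entry in patch_both_nodes(TwoNodeCluster("RMSCLN1", "RMSCLN2")):
        print(entry)
```

The two `RuntimeError` checks encode the two mistakes this process is designed to avoid: patching while a failover is still possible, and trying to move groups to a node that is still paused.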
In this example – my cluster nodes are RMSCLN1 and RMSCLN2. My RMS virtual name is RMSV1.
1. I log on to RMSCLN1. The first thing I do is fail everything over to RMSCLN2 and then back to RMSCLN1, because I like to make sure my cluster (or my customer’s cluster) is actually working correctly before making any changes. I have seen poorly managed clusters that were already broken, and the SCOM hotfix got blamed when they didn't work after the update. After this, I have validated that the virtual RMS (RMSV1) is currently running on RMSCLN1.
2. I need to configure the RMS to NOT be able to fail over to the passive node (RMSCLN2). There are two ways to do this: we can configure the “possible owners” of each clustered resource (see the R2 upgrade guide), or we can simply “pause” the passive node while working on the active node. I like the “pause” method, as it is much simpler and less work to accomplish the same goal. In Cluster Administrator, I simply right-click the passive node name and choose “Pause Node”:
Here is how it looks if your cluster is Server 2008: (I am using a SQL server here for example only)
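If you would rather script the pause (and the later resume) than click through the UI, here is a minimal Python sketch that only builds the command line. It assumes the `cluster.exe` command-line tool and its `cluster node <name> /pause` and `/resume` form; treat that syntax as an assumption and confirm it against the tool's built-in help on your own nodes. The `node_command` helper is a hypothetical name of my own.

```python
def node_command(node, action):
    """Build (but do not run) a cluster.exe command to pause or resume a node.

    Assumption: the 'cluster node <name> /pause|/resume' syntax is available
    on the cluster node where this will be executed.
    """
    if action not in ("pause", "resume"):
        raise ValueError(f"unsupported action: {action!r}")
    return ["cluster", "node", node, f"/{action}"]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    print(" ".join(node_command("RMSCLN2", "pause")))
    print(" ".join(node_command("RMSCLN2", "resume")))
```

To actually execute one of these, you could pass the list to `subprocess.run(cmd, check=True)` from an elevated prompt on a cluster node.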
3. OK, with RMSCLN2 paused, I begin the update on RMSCLN1. If this were a Server 2008 OS, I would open a command prompt “As an administrator” to run the update. This is critical. I also MUST run the original MSI that I downloaded, and nothing else. So I kick off SystemCenterOperationsManager2007-R2CU1-KB974144-X86-X64-IA64-ENU.MSI.
This is actually the R2 hotfix utility installer. Note: If you have previously installed this, you might get prompted to repair or remove. Simply REMOVE it, then kick it off again. We want to execute the server update process from the update window that comes up at the end of this installation. This is critical to a successful update, as there are post-update processes that run after the hotfix that will not run if you attempt ANY other method to start the update.
Install the hotfix installer to the DEFAULT LOCATION of: C:\Program Files (x86)\System Center 2007 R2 Hotfix Utility\
This will install the hotfix update files locally. It will run very quickly and you will be presented with the Software Update splash screen. This is where we MUST execute the update from:
Click “Run Server Update” to start the actual hotfix update.
DON'T CLICK anything else while the update process is running. Just calm yourself, and let the process finish. I won't say the SCOM update process is fragile, but I have seen many weird things happen, and I have a much higher success rate when I just let it finish and don't try to do other things while it completes. So JUST WAIT.
If all goes well, the update takes about 5 minutes (generally) and you will see this:
Click Finish. WAIT.
You might be prompted to restart the computer if a file could not be updated during the process. ALWAYS CHOOSE NO!
WAIT. There are some post install processes running that you need to wait just a bit for. You might see another dialogue box about a pending reboot required… click OK to close it, and wait just a bit afterwards. Let the “invisible” post install processes complete. This is usually done within 15 seconds.
Now (and only now after waiting a bit) – you can click “Exit” on the Software Update splash screen:
And then click “Close” on the Hotfix Utility installation complete screen:
4. The next step in the process is to validate that all necessary actions completed successfully on this node. The two things I like to check are that the core DLL files got updated and that the agent files got updated. For the DLLs, you only need to spot-check a few, not compare each DLL against the KB; some are not updated by design.
I like to browse the \Program Files\System Center Operations Manager 2007\ directory, sort by date, and spot-check a few of the DLLs listed in the KB:
Next up on validation is to make sure the agent files got updated, so that any subsequent agent push/repair/update actions always deliver the latest version. So we will inspect \Program Files\System Center Operations Manager 2007\AgentManagement\x86\ and \Program Files\System Center Operations Manager 2007\AgentManagement\amd64\
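The “sort by date and spot-check” step can also be done with a few lines of script. The sketch below is one possible way to do it in Python, not part of the hotfix tooling: it lists files in a folder whose modified date is on or after a cutoff you choose (for example, a date later than the files from your previous version). The function name, the cutoff date, and the glob patterns are my own assumptions.

```python
from datetime import datetime
from pathlib import Path


def files_modified_since(directory, cutoff, pattern="*.dll"):
    """Return sorted names of files matching `pattern` modified on/after `cutoff`."""
    hits = []
    for path in Path(directory).glob(pattern):
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if path.is_file() and modified >= cutoff:
            hits.append(path.name)
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical cutoff; pick one that separates old files from updated ones.
    cutoff = datetime(2009, 6, 1)
    base = Path(r"C:\Program Files\System Center Operations Manager 2007")
    if base.exists():
        for name in files_modified_since(base, cutoff):
            print(name)
        for arch in ("x86", "amd64"):
            # "*" because the agent folders hold more than just DLLs.
            print(arch, files_modified_since(base / "AgentManagement" / arch, cutoff, "*"))
```

This only narrows down what changed; you still compare a few of the listed files against the KB article by hand.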
Additionally, for validation, I would check the release notes for any other areas that are called out for validating, such as the registry, a SQL query, MP version updates, etc.
5. Now I need to allow failover to RMSCLN2, so I un-pause (resume) RMSCLN2.
Server 2008: (I am using a SQL server here for example only)
6. Next I need to fail over (move) all cluster resource groups to RMSCLN2. I use the UI for this, calling “Move Group” for the Cluster group and the RMS group. In 2008, you would use “Move this service or application to another node”.
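For those who script their failovers, the group moves might look something like this Python sketch. It only builds the command lines and assumes the `cluster group "<group>" /moveto:<node>` form of `cluster.exe`, which you should confirm on your own cluster. The group names below are placeholders: “Cluster Group” is the usual default name, and “RMS Group” is purely hypothetical, so substitute whatever your groups are actually called.

```python
def move_group_commands(groups, target_node):
    """Build (but do not run) one cluster.exe move command per resource group.

    Assumption: 'cluster group <name> /moveto:<node>' is the supported syntax.
    """
    return [["cluster", "group", group, f"/moveto:{target_node}"] for group in groups]


if __name__ == "__main__":
    # Placeholder group names; replace with the real names from your cluster.
    for cmd in move_group_commands(["Cluster Group", "RMS Group"], "RMSCLN2"):
        print(cmd)
```

Keeping each command as an argument list means group names with spaces need no extra quoting if you later hand them to `subprocess.run`.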
7. Now I need to pause RMSCLN1, the same way I paused RMSCLN2 in step 2.
8. The next step is to reboot RMSCLN1. However, before you do this, there is something you should understand about SCOM hotfixes. We actually mess up the services configuration: we set the System Center Data Access (SDK) and System Center Management Configuration (Config) services to “Automatic”. This is bad, because we DON'T want these services running in two places. In a cluster, all clustered services are set to Manual, and the cluster handles who is running what. So BEFORE you reboot Node1, you need to set these services back to Manual so they don't try to start on reboot. This is simply a bug: the SCOM hotfixes are not cluster aware.
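Setting the two services back to Manual can be done in the Services console, or scripted with `sc config`, where a start type of `demand` means Manual. The Python sketch below just builds those command lines. The short service names `OMSDK` and `OMCFG` are my assumption for the SDK and Config services; check the “Service name” field in each service's properties before relying on them.

```python
# Assumed short names for the SDK and Config services; verify on your server.
RMS_SERVICES = ("OMSDK", "OMCFG")


def set_manual_commands(services=RMS_SERVICES):
    """Build (but do not run) 'sc config' commands that set services to Manual.

    sc.exe expects 'start=' and its value as separate tokens, so they are
    separate items in the argument list.
    """
    return [["sc", "config", name, "start=", "demand"] for name in services]


if __name__ == "__main__":
    for cmd in set_manual_commands():
        print(" ".join(cmd))
```

Run the resulting commands on the node you just patched, before you reboot it.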
9. I log on to RMSCLN2 and follow the exact same update steps as documented in step 3 above: install from an elevated command prompt, install the hotfix utility and run the update in a single pass, and don't get click-happy.
10. I run the same validation steps from step 4 above on RMSCLN2.
11. I un-pause RMSCLN1 in the cluster configuration.
12. I move all cluster resource groups back to RMSCLN1.
13. I configure the SDK and CONFIG services back to MANUAL, and then reboot RMSCLN2.
14. At this point, after the second node completes its reboot, I like to perform one more failover routine. This is a good way to ensure, after making changes, that the cluster is fully functional, that we didn't forget anything, and that failover will work in the case of a future issue.
That’s it! :-)
Next, I would move on to any other required steps in the specific update/hotfix release notes. For example, in R2 CU1, I need to perform a SQL script update and then an MP import before moving on to applying the update to the rest of the servers. You can see the detailed steps about these, specific to CU1, in my post here: