Kevin Holman's System Center Blog

Posts in this blog are provided "AS IS" with no warranties, and confer no rights. Use of included script samples is subject to the terms specified in the Terms of Use.

How to apply a SCOM hotfix to a clustered RMS

(Or – How to patch a RMS Cluster in 14 easy steps)

 

I had a previous blog post on this subject that I have since taken down.  This has been a “fuzzy” area ever since Microsoft shipped OpsMgr 2007, mostly because there were no official guidelines or instructions on the “right” way to apply SCOM hotfixes to a RMS cluster.

This post also won’t be “official” but comes out of knowledge gained from successes and failures, and discussions with the product group on the best way to accomplish this.

 

The recommended process will be very similar to the R2 upgrade process for upgrading a clustered RMS from SP1 to R2.  I must say I am not a fan of this process… because it forces downtime for the RMS.  When patching Microsoft clusters in general, hotfixes should be applied to the passive node first; you then fail over the cluster and patch the other (now passive) node.  This results in the least amount of downtime for the application, which is why we clustered it in the first place.  This should be a design goal of any update.

However – SCOM hotfixes are not designed with this in mind.  The challenge is – that several SCOM hotfixes will make changes to the database, while being applied to the RMS.  They expect the SDK service to be running locally, so they can accomplish this.  This FORCES us to always apply a SCOM hotfix to the active node (under the current design). 

Updating the active node stops the SCOM services on whichever node is active at the time.  This might trigger a failover of the cluster resources, which we do NOT want to happen.  So we must take steps to keep the failover from occurring.  This will be covered in detail below.

I really hope, that if we continue with supporting any clustered roles in SCOM – that we design the hotfix process around always patching the passive node in the future… as that limits the downtime of an update to the smallest amount possible.

 

Ok – enough about the process – let's get down to it.

In this example – I will be applying the OpsMgr R2 CU1 rollup hotfix to an OpsMgr R2 RMS Windows 2003 cluster.  Any SCOM hotfix would follow the same process, unless there is specific guidance in the release notes for the hotfix.  ALWAYS read the release notes.

 

High level overview:

  1. Log on to the active node (node1) of the RMS cluster
  2. Configure the cluster to not be able to fail over to the passive node (node2).
  3. Run the hotfix update on the active node (node1)
  4. Validate that the hotfix applied successfully to node1.
  5. Configure the cluster to be able to fail over to the passive node (node2).
  6. Move all the cluster resource groups to node2.
  7. Configure the cluster to not be able to fail over to node1.
  8. Reboot node1 if you were prompted to do so by the update.
  9. Log on to node2, and run the hotfix update on the newly active node (node2)
  10. Validate that the hotfix applied successfully to node2.
  11. Configure the cluster to be able to fail over to node1.
  12. Move all the cluster resource groups back to node1.
  13. Reboot node2 if you were prompted to do so by the update.
  14. Verify that you can fail the cluster groups to node2 and then back to node1.
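
The overview above can also be written down as a small checklist generator, which is handy for pasting into a change record with your own node names filled in. This is only a sketch in Python: the function name `rms_patch_plan` is made up for illustration, and the wording of each step is paraphrased from the list above.

```python
from __future__ import annotations


def rms_patch_plan(node1: str, node2: str) -> list[str]:
    """Return the 14 overview steps, in order, for a two-node RMS cluster.

    node1 is the node that is active when the work starts.
    """
    return [
        f"Log on to the active node ({node1})",
        f"Block failover to the passive node ({node2})",
        f"Run the hotfix update on {node1}",
        f"Validate that the hotfix applied successfully on {node1}",
        f"Allow failover to {node2} again",
        f"Move all cluster resource groups to {node2}",
        f"Block failover to {node1}",
        f"Reboot {node1} if the update prompted for it",
        f"Log on to {node2} and run the hotfix update there",
        f"Validate that the hotfix applied successfully on {node2}",
        f"Allow failover to {node1} again",
        f"Move all cluster resource groups back to {node1}",
        f"Reboot {node2} if the update prompted for it",
        f"Verify the groups can fail over to {node2} and back to {node1}",
    ]


if __name__ == "__main__":
    # Print a numbered checklist using the node names from this post.
    for number, step in enumerate(rms_patch_plan("RMSCLN1", "RMSCLN2"), start=1):
        print(f"{number:>2}. {step}")
```

Keeping the node names as parameters makes it harder to mix up which node is paused and which is being patched halfway through the change.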

 

Detailed process:

 

In this example – my cluster nodes are RMSCLN1 and RMSCLN2.  My RMS virtual name is RMSV1.

 

1.  I log on to RMSCLN1.  The first thing I do is fail everything over to RMSCLN2, and then back to RMSCLN1, because I like to make sure my cluster (or customer’s cluster) is actually working correctly before we make any changes.  I have seen poorly managed clusters, which were broken, and the SCOM hotfix got blamed when they didn't work after the update.  Now – after this – I have validated that the virtual RMS (RMSV1) is currently running on RMSCLN1:

 image

 

2.  I need to configure the RMS to NOT be able to fail over to the passive node (RMSCLN2).  There are two ways to do this: we can configure the “possible owners” (see the R2 upgrade guide) of each clustered resource, OR we can simply “pause” the passive node while working on the active node.  I like the “pause” method as it is much simpler and less work to accomplish the same goal.  In Cluster Administrator, I simply right-click the passive node name, and choose “Pause Node”:

image image

 

Here is how it looks if your cluster is Server 2008:  (I am using a SQL server here for example only)

image image
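
For those who prefer the command line to the GUI, pausing and resuming a node can also be scripted. The sketch below assumes the `cluster.exe` tool that ships with Windows Server 2003 and 2008 clustering and its `node <name> /pause` and `/resume` switches; the helper names are hypothetical, and nothing is executed unless `dry_run` is turned off. Check the syntax with `cluster /?` on your own cluster before relying on it.

```python
import subprocess


def node_command(node, action):
    """Build a cluster.exe command that pauses or resumes one node."""
    if action not in ("pause", "resume"):
        raise ValueError(f"unsupported action: {action}")
    # Assumed syntax: cluster node <node name> /pause | /resume
    return ["cluster", "node", node, f"/{action}"]


def run(command, dry_run=True):
    """Print the command; only execute it when dry_run is False."""
    print(" ".join(command))
    if not dry_run:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Step 2: pause the passive node before patching the active one.
    run(node_command("RMSCLN2", "pause"))
    # Step 5: resume it once the active node has been validated.
    run(node_command("RMSCLN2", "resume"))
```

The dry-run default is deliberate: printing the command first lets you confirm the node name before anything touches the cluster.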

 

3.  Ok, with RMSCLN2 paused – I begin the update on RMSCLN1.  If this were a Server 2008 OS, I would open a command prompt “As an administrator” to run the update.  This is critical.  I also MUST run the original MSI that I downloaded, and nothing else.  So – I kick off SystemCenterOperationsManager2007-R2CU1-KB974144-X86-X64-IA64-ENU.MSI.

This is actually the R2 hotfix utility installer.  Note:  If you have ever previously installed this, you might get prompted to repair or remove.  Simply REMOVE it, then kick it off again.  We want to execute the server update process from the update window that comes up at the end of this installation.  This is critical to a successful update, as there are post-update processes that run after the hotfix that will not run if you attempt ANY other method to start the update.

Install the hotfix installer to the DEFAULT LOCATION of:  C:\Program Files (x86)\System Center 2007 R2 Hotfix Utility\

This will install the hotfix update files locally.  It will run very quickly and you will be presented with the Software Update splash screen.  This is where we MUST execute the update from:

image

 

Click “Run Server Update” to start the actual hotfix update.

DON'T CLICK anything else while the update process is going.  Just calm yourself, and let the process finish.  I won't say the SCOM update process is fragile, but I have seen so many weird things happen, and I have a much higher success rate when I just let it finish and don't try to do other things while it completes.  So JUST WAIT.

If all goes well, the update takes about 5 minutes (generally) and you will see this:

image

 

Click Finish.  WAIT.

You might be prompted to restart the computer if a file could not be updated during the process.  ALWAYS CHOOSE NO!

image

 

WAIT.  There are some post install processes running that you need to wait just a bit for.  You might see another dialogue box about a pending reboot required… click OK to close it, and wait just a bit afterwards.  Let the “invisible” post install processes complete.  This is usually done within 15 seconds.

Now (and only now after waiting a bit) – you can click “Exit” on the Software Update splash screen:

image

 

And then click “Close” on the Hotfix Utility installation complete screen:

image

 

 

4.  The next step in the process is to validate that all necessary actions completed successfully on this node.  The two things I like to check are that the core DLL files got updated (just spot-check a few; there is no need to compare each DLL against the KB, and some are not updated by design), and that the agent files got updated.

I like to browse the \Program Files\System Center Operations Manager 2007\ directory, sort by date, and spot-check a few of the DLLs listed in the KB:

image

 

Next up on validation – is to make sure the Agent files got updated, for any subsequent agent push/repair/update actions… to make sure the agents will always get the latest.  So – we will inspect \Program Files\System Center Operations Manager 2007\AgentManagement\x86\ and \Program Files\System Center Operations Manager 2007\AgentManagement\amd64\

Got it:

image

image

Additionally, for validation, I would check the release notes for any other areas that are called out for validating, such as possibly the registry, or a SQL query, or MP version updates, etc…
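
The "sort by date and spot-check" part of this validation can be approximated with a few lines of Python instead of Explorer. This is a minimal sketch, not a replacement for comparing against the KB: the function name `newest_dlls` is hypothetical, the `C:` drive letter is an assumption, and the script only lists modification times (it does not read file versions).

```python
from datetime import datetime
from pathlib import Path


def newest_dlls(directory, limit=10):
    """Return (file name, modified time) for the most recently modified DLLs."""
    dlls = [
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".dll"
    ]
    # Newest first, the same view as sorting by date in Explorer.
    dlls.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return [(p.name, datetime.fromtimestamp(p.stat().st_mtime)) for p in dlls[:limit]]


if __name__ == "__main__":
    # Assumed install location on the C: drive; adjust for your server.
    install_dir = Path(r"C:\Program Files\System Center Operations Manager 2007")
    if install_dir.is_dir():
        for name, modified in newest_dlls(install_dir):
            print(f"{modified:%Y-%m-%d %H:%M}  {name}")
    else:
        print(f"{install_dir} not found on this machine")
```

Pointing the same function at the AgentManagement\x86 and AgentManagement\amd64 folders covers the agent file check, although the files there may not all be DLLs, so the extension filter would need adjusting.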

 

5.  Now I need to allow failover to RMSCLN2, so I un-pause (resume) RMSCLN2.

 

Server 2003:

image

Server 2008:  (I am using a SQL server here for example only)

image

 

6.  Next I need to fail over (move) all cluster resource groups to RMSCLN2.  I use the UI for this – calling “Move Group” for the Cluster group, and the RMS group.  In 2008, you would use “Move this service or application to another node”.
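
The group move can be scripted as well. The sketch below assumes `cluster.exe` and its `group "<name>" /moveto:<node>` switch; the group names used here ("Cluster Group" and "RMS Group") are placeholders, so substitute whatever your groups are actually called in Cluster Administrator. The helper name is hypothetical and the example only prints the commands.

```python
import subprocess


def move_group_command(group, target_node):
    """Build a cluster.exe command that moves one resource group to a node."""
    # Assumed syntax: cluster group "<group name>" /moveto:<node name>
    return ["cluster", "group", group, f"/moveto:{target_node}"]


def move_groups(groups, target_node, dry_run=True):
    """Move each group in turn; print only unless dry_run is False."""
    for group in groups:
        command = move_group_command(group, target_node)
        print(subprocess.list2cmdline(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder group names; step 6 moves everything to RMSCLN2.
    move_groups(["Cluster Group", "RMS Group"], "RMSCLN2")
```

Passing the command as a list lets Python handle the quoting of group names that contain spaces.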

 

7.  Now – I need to Pause RMSCLN1 in the same method as I did previously for RMSCLN2 in step 2.

 

8.  The next step is to reboot RMSCLN1.  However, before you do this, there is something you should understand about SCOM hotfixes.  We actually mess up the services configuration: the update sets the System Center Data Access (SDK) and the System Center Management Configuration (Config) services to “Automatic”.  This is bad… because we DON'T want these services running in two places.  The cluster configuration is to set all clustered services to Manual, and let the cluster handle who is running what.  So – BEFORE you reboot RMSCLN1, you need to set these services back to Manual so they don't try to start on the reboot.  This is simply a bug in that the SCOM hotfixes are not cluster aware.
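
Setting the two services back to Manual can be done in the Services console, or from an elevated prompt with `sc config`. The sketch below builds those commands in Python. It assumes the short service names are `OMSDK` (System Center Data Access) and `OMCFG` (System Center Management Configuration); confirm the names on your own server (for example with `sc qc OMSDK`) before running anything. The helper names are hypothetical.

```python
import subprocess

# Assumed short names for the SDK and Config services; verify locally.
CLUSTERED_SERVICES = ("OMSDK", "OMCFG")


def set_manual_command(service):
    """Build an sc.exe command that sets a service's startup type to Manual."""
    # sc expects the option and its value as "start= demand" (note the space).
    return ["sc", "config", service, "start=", "demand"]


def reset_startup_types(services=CLUSTERED_SERVICES, dry_run=True):
    """Set each service to Manual; print only unless dry_run is False."""
    for service in services:
        command = set_manual_command(service)
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Run on the node that was just patched, BEFORE rebooting it.
    reset_startup_types()
```

The same check applies to the second node after it is patched, since the update resets the startup type there too.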

 

9.  I log on to RMSCLN2, and follow the exact same update steps as documented in step 3 above (install from an elevated command prompt, install the hotfix utility and run the update in a single pass, and don't get click-happy).  See the detailed steps from #3 above.

 

10.  I run the same validation steps from step #4 above on RMSCLN2

 

11.  I un-pause RMSCLN1 in the cluster configuration.

 

12.  I move all cluster resource groups back to RMSCLN1.

 

13.  I configure the SDK and CONFIG services back to MANUAL, and then reboot RMSCLN2.

 

14.  At this point, I like to go ahead and perform one more failover routine… after the second node completes its reboot.  This is a good step to ensure, after making changes, that the cluster is fully functional, that we didn't forget anything, and that failover will work in the case of a future issue.

 

That’s it!  :-)

 

Next – I would move on to any other required steps in the specific update/hotfix release notes.  For example – in R2 CU1, I need to perform a SQL script update, and then a MP import, before moving on to applying the update to the rest of the servers.  You can see the detailed steps about these, specific to CU1, in my post here:

 

http://blogs.technet.com/kevinholman/archive/2010/01/17/opsmgr-2007-r2-cu1-rollup-hotfix-ships-and-my-experience-installing-it.aspx

Comments
  • Great posting Kevin. Really fills up some gaps. Thanks!

  • Do we need to change the SDK, Config and Health Service to Automatic while applying hotfixes to RMS Active Node?

  • @Babulal Ghule -

    Absolutely not.  Just follow the steps as they are documented.

  • Awesome post Kevin. One suggestion, it would be nice if you could also indicate the time it took for you to perform this type of upgrade as it will give us SCOM engineers an idea how big of change window we should schedule in our environment. I understand there are other variables and every environment/company is unique, but just a ball park number could help us.

    Thanks

    MA

  • @Murad -

    The amount of time varies - customer to customer....   and I know this outage is very critical since there is a bounce of the services - the console/SDK is unavailable for a short time.

    I tell my customers to patch the RMS, plan for a minimum of 30 minutes and a maximum of 2 hours.  (provided there are no serious issues or outages)

    To update the entire management group - it depends on how good you are at running multiple steps at the same time, and how much infrastructure you have in the management group.  We performed an R2 and CU2 upgrade from SP1 at a very large customer recently, and we averaged 4 hours per management group to get both updates in place (not counting agents).

  • Hello,

    Does this apply only to the RMS server when clustered, or also to a clustered SQL Server?

    Only the SQL Server is clustered in my environment.

    Thanks,

    Dom
