Kevin Holman's System Center Blog

Posts in this blog are provided "AS IS" with no warranties and confer no rights. Use of included script samples is subject to the terms specified in the Terms of Use.

Applying an OpsMgr hotfix to a RMS Cluster node? Some things to be aware of.



 

EDIT 2/1/2010 

The process I came up with to patch RMS clusters was developed through trial and error, because there was never any direct guidance previously.  I am told that guidance is forthcoming, and my process below is incorrect due to how some hotfixes work.

<insert link here to correct RMS cluster hotfix process as documented by the product group>

---------------------------------------

 

When you apply a SCOM hotfix to a RMS cluster, you need to be aware of some issues, and some workarounds.  This is something I have seen several times in the field…

 

On any server/agent, the Hotfix installer will stop any discovered OpsMgr services, including the SDK, Config, and HealthService.  This part is normal.  It does this in order to update the files (DLLs) that are part of the hotfix payload, and then it will start the services again when complete.  This all works well, except on RMS clusters.
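Before running the installer, it can help to note which OpsMgr services exist on the node and what state they are in, so you can see afterwards exactly what the installer changed.  Below is a minimal sketch; HealthService, OMSDK, and OMCFG are the service names I believe SCOM 2007 uses for the Health, SDK, and Config services, so verify them in Services.msc on your own RMS nodes before relying on this.

# Sketch: record the state and startup type of the OpsMgr services on this node
# before applying the hotfix. The service names are assumptions - verify them first.
Get-WmiObject Win32_Service -Filter "Name='HealthService' OR Name='OMSDK' OR Name='OMCFG'" |
    Select-Object Name, State, StartMode |
    Format-Table -AutoSize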

 

The reason for this is that the Hotfix installer is not 100% cluster aware. 

In a RMS cluster… the passive node will have these three services stopped, and the services will be set to Manual startup.  On the active node, the OpsMgr services are also set to Manual startup, but the services are running, because the Cluster service controls these services now.  This is how a clustered service works, and we should not stop a clustered service in Service Control Manager; we really should take the resource offline in Cluster Administrator. 
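For reference, if you ever do need to stop the clustered RMS deliberately, the cluster-aware way is to take the group offline rather than stopping the services in Service Control Manager.  A rough sketch using cluster.exe follows; the group name "OpsMgr RMS Group" is only a placeholder, so list your groups first to find the real names.

# Sketch: take the clustered RMS offline/online via cluster.exe rather than
# stopping the services in Service Control Manager. The group name is a placeholder.
cluster group                                   # list all cluster groups and their owning nodes
cluster group "OpsMgr RMS Group" /offline       # take the RMS group (and its services) offline
cluster group "OpsMgr RMS Group" /online        # bring it back online when finished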

 

So I have two options… I can apply the hotfix to the Active Node… or the Passive node. 

If I choose the active node, the hotfix installer will try to stop all the OpsMgr services, and this will cause the Cluster service to try to restart them, or eventually fail them over to the passive node, depending on your Cluster configuration settings.  Therefore, it is probably best to patch the passive node first… ensure the hotfix applied correctly, then move the cluster group and OpsMgr RMS group over to the freshly hotfixed node… and go patch the other one (now passive).

This works, but it is not 100% smooth.  When we apply the hotfix to the passive node, the hotfix installer will try to start the services at the end of the process, even though they were not running previously.  We do NOT want these services trying to run on the passive node, since it does not own the cluster disk resources… so the services will start, but they cannot do anything except log errors.  

You will also see an error from the HealthService, which will fail to start.  It is apparent that this service fails because it cannot access the disk resource; the SDK and Config services, however, WILL start.

What is worse, the hotfix installer changes the service startup types to Automatic, which means these services will keep trying to run on the passive node across reboots.

 

So – the guidance I have, for RMS clusters – is (a rough sketch of steps 3 through 6 follows the list):

  1. Patch the passive node (we will call this Node 2)
  2. Click OK on the HealthService start failure error.
  3. Ensure the hotfix applied by inspecting the DLL versions as documented in the KB.
  4. Stop the running SDK and Config services on the passive node.
  5. Set any OpsMgr services that were changed to Automatic – BACK to Manual.
  6. Move the cluster resource groups over to the freshly patched Node 2.
  7. On Node 1 (now passive), apply the hotfix and repeat the steps starting at Step 2 above.
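For steps 3 through 6, the sketch below shows the kind of commands involved, run on the freshly patched passive node.  The DLL path and file name, the service names (HealthService, OMSDK, OMCFG), and the group and node names are all placeholders or assumptions; check the hotfix KB article and your own cluster before using anything like this.

# Sketch of steps 3-6 on the freshly patched passive node (Node 2).
# Everything named below is a placeholder/assumption - confirm the service names,
# the files listed in the hotfix KB, and your cluster group/node names first.

# Step 3: confirm the hotfix applied by checking a file version documented in the KB.
$dll = "C:\Program Files\System Center Operations Manager 2007\<file listed in the KB>.dll"
(Get-Item $dll).VersionInfo.FileVersion

# Step 4: stop the SDK and Config services that the installer started.
Stop-Service -Name OMSDK, OMCFG -ErrorAction SilentlyContinue

# Step 5: set any services the installer flipped to Automatic back to Manual.
foreach ($name in "HealthService", "OMSDK", "OMCFG") {
    Set-Service -Name $name -StartupType Manual
}

# Step 6: move the cluster groups over to the freshly patched node.
cluster group "Cluster Group" /moveto:NODE2         # placeholder group and node names
cluster group "OpsMgr RMS Group" /moveto:NODE2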

NOTE:  This is only applicable to OpsMgr-specific hotfixes.  For OS hotfixes, you would follow your standard clustered OS hotfix routine.

 

So, in summary: there is documentation coming soon that will have the correct process.

Until then, I would follow the EXACT same process as documented in the R2 upgrade guide for any hotfixes that need to be applied on a clustered RMS.

Comments
  • Hi

    Great article - it's a problem that gets very little coverage.

    My experience with OpsMgr hotfix installation in a cluster environment is to take the RMS cluster offline (the 3 services), install the hotfix on the active node, check the Automatic/Manual service issue, verify the services are still stopped (otherwise stop them), and verify the RMS cluster is still offline.  Then install the hotfix on the passive node, check the Automatic/Manual service issue again, verify the services are still stopped (otherwise stop them), and verify the RMS cluster is still offline.  Take it online and look in Event Viewer for errors.  After that, the MS, GW, and manually installed agents can be updated and agents approved.

  • The only thing I don't like about that is:

    1.  Your method incurs downtime for SCOM.  We should be able to patch SCOM with no downtime if we have a clustered RMS.  By patching the passive node in all cases, this is closer to how we patch the OS in a clustered situation.

    2.  I don't like taking cluster resources online by forcing the services to start.  This can have unpredictable results... and can potentially cause the cluster resources to fail to start on the node you are patching and fail over to the other node.  I played with that process, and that is how I came up with the process I documented.

    Not saying yours won't work - it will... it just seems like it doesn't have any pros over always patching the passive node?  The only pro I can see is that your way always ensures there is only one Config and SDK service running at any given time.

  • I certainly agree with your downtime issue.  But as OpsMgr is not cluster-aware, the primary goal is not to keep it up the whole time, but to ensure that the cluster is functional after hotfix patching.  I will try your scenario and see how it works; if the result is the same, you are right - no downtime.  It's nice to discuss it, because there isn't a lot of hotfix-on-cluster information out there and no recommendations from Microsoft.

  • Patch the active node and pause the passive node in the cluster so that SCOM does not fail over.  Then move SCOM to the other node and repeat the same process.

  • Why patch the active node?  I am curious.  The concept behind clustering is no downtime for the application, even while patching.  If you patch the active node, this is not the case.  What is the benefit of patching the active node over the steps I outlined in the article?

  • I have been testing three methods of patching Operations Manager 2007 hotfixes on a clustered RMS.

    The first is the one Derek mentioned, which patches the active node while pausing the passive node in the meantime.

    The second method is Kevin's, which follows the standard patching approach where the passive node is patched first.

    The third method, which I used to follow, patches the active node but takes the clustered RMS offline first.  From an Operations Manager administrator's perspective, the easiest way is Kevin's: no downtime, and no stand-alone OpsMgr Windows services colliding with the three clustered OpsMgr services, because the cluster controls them.

    From an Operations Manager user's perspective, Kevin's method is also best because of the lack of downtime (or only a little during the move).  I judged this by looking at the event logs; how these three patching methods affect the Operations Manager 2007 DB or other components, I have not tested.  Maybe somebody will.

  • The console push worked for about 2/3 of the "remotely manageable" agents on my system. I didn't get any failures when I pushed out the update and the 1/3 that I need to re-do don't show up in pending, so there are a bunch I have to do manually. There's no pattern on the systems that didn't get the update OK, either.

    I tried to run a "repair" on a couple of the affected systems, and that didn't do it.

    I tried the manual install, however, running setupupdateom.exe never worked for me. It'd start the splash screen and then bomb out with the dialog for the Windows Installer.

  • Never mind, I figured out the problem with the manual install; I forgot to run it from a command line as admin on the 2k8 systems.

    However, I still need to figure out why 1/3 of my systems didn't get the update deployed correctly.
