SCOM - Runbook for persisting Stale Heartbeat Alerts

One of the more common alerts that I see in SCOM environments is the ‘Health Service Heartbeat Failure’.  All too often the system still responds to ping, and yet the heartbeat alert remains.  The typical fix is to stop the health service on the problem system, delete the contents of the Health Service State folder for the agent, and restart the service.  In most cases this resolves the issue.  Since this is such a common occurrence, I decided to create a runbook in System Center Orchestrator to automate the fix.  For this to work you need to be running System Center Operations Manager and Orchestrator 2012, with the SC 2012 Operations Manager Integration Pack installed in Orchestrator.  Note: this will not resolve any actual problems with failing heartbeats; it will simply clear the cache and force the agent to attempt to update policy.
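
For reference, the manual fix described above can be sketched in a few lines of Python. This is only a rough outline, not part of the runbook: the function names are hypothetical, `HealthService` is my assumption for the agent's short service name, and the folder path is passed in by the caller because it varies by version.

```python
import shutil
import subprocess

# Assumption: short service name of the SCOM agent, as expected by sc.exe.
SERVICE_NAME = "HealthService"


def fix_steps(computer, state_folder):
    """Return the manual fix as an ordered list of (step, action) pairs."""
    host = "\\\\" + computer  # sc.exe addresses remote machines as \\name
    return [
        ("stop", ["sc", host, "stop", SERVICE_NAME]),
        ("clear", state_folder),
        ("start", ["sc", host, "start", SERVICE_NAME]),
    ]


def run_fix(computer, state_folder):
    """Execute the steps in order; requires Windows and admin rights on the agent."""
    for step, action in fix_steps(computer, state_folder):
        if step == "clear":
            shutil.rmtree(action)  # removes the folder and everything in it
        else:
            subprocess.run(action, check=True)
```

Keep in mind that `sc stop` returns as soon as the stop request is sent, so a real script would need to wait for the service to finish stopping before clearing the folder.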

Activities:

  • The first activity in the runbook is the ‘Monitor Alert’.  This will monitor for any occurrence of the alert in question.  This activity can be found in the SC 2012 Operations Manager section.  Click the button to the right of the Server > Connection field and select your SCOM Management Server from the list.  For testing purposes I like to trigger on both New Alerts and Updated Alerts.  This allows me to manually set an alert back to a New state so the runbook picks it up, instead of waiting for an alert to trigger.  For my runbook I created the following 3 filters:
    • Severity Equals Critical
    • Name Equals Health Service Heartbeat Failure
    • ResolutionState Equals New
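
To make the effect of the three filters concrete, here is a minimal Python sketch of the match they describe. It assumes an alert has to satisfy all three filters to trigger the runbook; the dictionary-based alert and the `matches` function are hypothetical stand-ins, not anything Orchestrator exposes.

```python
# The three filters from the Monitor Alert activity, as field -> required value.
FILTERS = {
    "Severity": "Critical",
    "Name": "Health Service Heartbeat Failure",
    "ResolutionState": "New",
}


def matches(alert, filters=FILTERS):
    """True only when the alert satisfies every filter (assumed AND semantics)."""
    return all(alert.get(field) == wanted for field, wanted in filters.items())


new_alert = {
    "Severity": "Critical",
    "Name": "Health Service Heartbeat Failure",
    "ResolutionState": "New",
}
closed_alert = dict(new_alert, ResolutionState="Closed")

print(matches(new_alert))     # the runbook would trigger on this one
print(matches(closed_alert))  # this one would be ignored
```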

  • The second activity is a simple ‘Run Program’ which will ping the system to make sure it is online.  This activity can be found in the System section.  Create a link from the ‘Monitor Alert’ to the ‘Run Program’ activity.  There should be a default include filter created on the link for Monitor Alert returns success.  In the ‘Run Program’ properties select the Command execution mode and enter the name of the computer the command should run from.  In this case I used the SCOM Management Server, since it should be able to talk to all SCOM Agents.  In the Command field we will subscribe to data from the Monitor Alert activity.  Type ping -4, right click to the right of the new text and select Subscribe > Published Data.  Ensure that the Monitor Alert activity is selected from the drop down at the top and select ‘MonitoringObjectDisplayName’ for the published data before clicking OK.  Click Finish to complete the activity configuration.
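
The command this activity ends up running is simply ping -4 followed by the computer name from the alert. As a rough Python equivalent (the function names and the sample host name are hypothetical):

```python
import subprocess


def ping_command(display_name):
    """Build the same command line the Run Program activity is given."""
    return ["ping", "-4", display_name]


def is_online(display_name):
    """Run the ping and treat exit code 0 as success."""
    result = subprocess.run(ping_command(display_name), capture_output=True)
    return result.returncode == 0


print(ping_command("agent01.contoso.local"))
```

One caveat with this kind of check: on Windows, ping can exit with code 0 even when the replies are "Destination host unreachable", so a successful result is a hint that the system is online rather than proof.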

  • The third activity is ‘Get Service Status’ which will simply return the state of the HealthService.  The following activities will depend upon the state of this service.  This activity can be found in the Monitoring section.  Create a link from the ‘Run Program’ to the ‘Get Service Status’.  The later steps refer to this activity as ‘Get HealthService Status’, so rename it accordingly.  In the properties of the new activity we will subscribe to data from the ‘Monitor Alert’ to get the computer.  Right click in the Computer field and select Subscribe > Published Data.  Select Monitor Alert from the Activity drop down and choose ‘MonitoringObjectDisplayName’.  The Service name will vary depending on whether you are running Operations Manager 2012 or 2012 R2.  You can select the button to the right of the field to browse services on the current system or you can simply type the name of the service.  In a 2012 environment the HealthService will be named ‘System Center Management’ and in an R2 environment it will be ‘Microsoft Monitoring Agent’.
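
If you want to check the same thing outside of Orchestrator, the sketch below queries the service with sc.exe and pulls the state out of its output. Treat it as an illustration under a few assumptions: `HealthService` is assumed to be the short service name behind both display names, the output is assumed to be in the usual English `sc query` layout, and the function names are made up.

```python
import re
import subprocess

# Display names from the text above, keyed by Operations Manager version.
DISPLAY_NAMES = {
    "2012": "System Center Management",
    "2012 R2": "Microsoft Monitoring Agent",
}

# Assumption: the short service name that sc.exe expects for the agent.
SERVICE_KEY = "HealthService"


def parse_state(sc_output):
    """Extract the state word (for example RUNNING) from `sc query` output."""
    match = re.search(r"STATE\s*:\s*\d+\s+(\w+)", sc_output)
    return match.group(1) if match else None


def service_state(computer):
    """Query the agent service on a remote computer (Windows, admin rights)."""
    result = subprocess.run(
        ["sc", "\\\\" + computer, "query", SERVICE_KEY],
        capture_output=True,
        text=True,
    )
    return parse_state(result.stdout)
```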

  • For the fourth activity we will actually create two different versions.  Depending on whether the previous activity showed that the HealthService was running or stopped, the runbook will choose one or the other.  Here we will create 2 versions of the ‘Start/Stop Service’ activity in the System section.
    • Rename the first ‘Start/Stop Service’ activity to ‘Start HealthService’.  Create a link from the ‘Get Service Status’ activity to the new ‘Start HealthService’ activity.  On the link properties change the Include filter to ‘Service status from Get HealthService Status equals Service stopped’.  In the properties for the new ‘Start HealthService’ activity click the action button for ‘Start service’.  Right click the Computer field, select Subscribe > Published Data, and select the Service computer from the ‘Get HealthService Status’ activity.  For the Service, enter the service name used in the previous activity (either Microsoft Monitoring Agent or System Center Management).

    • Rename the second ‘Start/Stop Service’ activity to ‘Stop HealthService’.  Create a link from the ‘Get Service Status’ activity to the new ‘Stop HealthService’ activity.  On the link properties change the Include filter to ‘Service status from Get HealthService Status equals Service running’.  In the properties for the new ‘Stop HealthService’ activity click the action button for ‘Stop service’.  Right click the Computer field, select Subscribe > Published Data, and select the Service computer from the ‘Get HealthService Status’ activity.  For the Service, enter the service name used in the previous activity (either Microsoft Monitoring Agent or System Center Management).
      (screenshots skipped since they are nearly the same as those shown above, except for the service state)
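
The two link filters amount to a simple branch on the service status. A small Python sketch of that decision, with a hypothetical `next_activity` function standing in for the links:

```python
# Link filter value -> activity the runbook continues with.
BRANCHES = {
    "service stopped": "Start HealthService",
    "service running": "Stop HealthService",
}


def next_activity(service_status):
    """Return the activity whose link filter matches, or None if neither does."""
    return BRANCHES.get(service_status.strip().lower())


print(next_activity("Service stopped"))
print(next_activity("Service running"))
```

Any other status (a service that is paused, for example) matches neither link in this sketch, so that path simply ends there.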

  • For the fifth activity we will delete the stale health state on the problem system.  The ‘Delete Folder’ activity can be found in the File Management section.  Create a link from the ‘Stop HealthService’ activity to this ‘Delete Folder’ activity.  Open the Details section, type \\ in the Path field, then right click immediately to the right of that text and select Subscribe > Published Data.  Select the ‘Get HealthService Status’ activity and choose Service computer.  To the right of this text, type the path to the Health Service State folder.  This will vary depending on the version of Operations Manager you are running.  In 2012 R2 the default path is ‘\c$\Program Files\Microsoft Monitoring Agent\Agent\Health Service State’.  Ensure the ‘Delete all files and sub-folders’ option is selected.
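
The Path field ends up holding three pieces joined together: the leading \\, the computer name from the published data, and the folder path on the administrative share. Sketched in Python (the function name and the AGENT01 computer name are placeholders):

```python
# Default 2012 R2 location from the text above; other versions differ.
STATE_FOLDER_2012_R2 = (
    r"\c$\Program Files\Microsoft Monitoring Agent\Agent\Health Service State"
)


def state_folder_unc(computer, folder=STATE_FOLDER_2012_R2):
    """Join the leading \\\\, the computer name, and the share path."""
    return "\\\\" + computer + folder


print(state_folder_unc("AGENT01"))
```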

  • For the sixth and final step we will repeat the Start Service step from earlier, this time after the folder has been deleted.  Create a new ‘Start/Stop Service’ activity and rename it to ‘Start HealthService’.  Create a link from the ‘Delete Folder’ activity to the new ‘Start HealthService’ activity, so that the service is only started once the stale state has been removed.  In the properties for the new ‘Start HealthService’ activity click the action button for ‘Start service’.  Right click the Computer field, select Subscribe > Published Data, and select the Service computer from the ‘Get HealthService Status’ activity.  For the Service, enter the service name used in the previous activity (either Microsoft Monitoring Agent or System Center Management).

For testing, make sure you have a current Heartbeat Alert and set its resolution state to something other than New, or manually stop the service on an agent system.  Then start the runbook and set the heartbeat alert resolution state back to New.  Monitor the runbook, the alert, and the SCOM Agent to ensure the process works as expected.  Good luck and I hope this helps with your recurring Heartbeat Alerts.