Are your agents restarting every 10 minutes? Are you sure?
Updated 6-22-2009 – This post applies to SP1 ONLY!!! This architecture has changed for R2. The version of this article updated for R2 is located here: http://blogs.technet.com/kevinholman/archive/2009/06/22/health-service-and-monitoringhost-thresholds-in-r2-how-this-has-changed-and-what-you-should-know.aspx
Here is something I have been seeing with more and more customers…. and I think everyone should take a look and consider this.
They have a decent percentage of agents where the HealthService is being programmatically restarted every 10-12 minutes (sometimes less often than that, but still very frequently).
This is being caused by one of a few workflows. By default, there is a monitor that watches the HealthService resources, and runs a script to bounce the Health Service anytime that service is consuming too many resources. This is good – because we don't want an issue with a SCOM agent to ever impact the available resources on a monitored agent.
The bad? Well, we don't (by default) alert when the script is called that restarts the HealthService. This means – this can be affecting you – and you really have no way of knowing this out of the box. Also – sometimes the restart fails…. and this leaves you with a handful of agents that generate a Heartbeat failure. When you look at the agent – it is fine, just the HealthService isn't running…. you start it, and everything goes back to normal… or so you think. If every day you have to respond to a few Heartbeat failures…. and you find the OS up – just the agent stopped… this might be the cause.
Here are the two monitors which target “Health Service” – and can restart the service:
Here are the two rules, targeting “Agent”:
Most often – I see it is the monitor causing the restart… on the Health Service Private bytes, but it could be any of the above – and you should consider any of these if you are impacted by this.
Note the default overrides. We allow the management servers to use up to 1.6GB of HealthService private bytes, and if you have the current Exchange 2007 MP, it includes an override which allows Exchange 2007 agents to use up to 600MB, up from the default of 100MB.
The Exchange 2007 MP was updated with this override, because this issue was already detected for large Exchange servers.
The problem is – that other large servers can potentially use more than 100MB.
I have been seeing this mostly on:
1. Large SQL database servers
2. Server 2008 domain controllers
3. DHCP servers with large scope counts
4. Exchange 2007 servers
5. Large Exchange 2003 Mailbox servers
6. IIS7 (Server 2008) Servers
Here is what ALL customers should implement… A rule that watches for the event, when the restart script is called.
Here is the event that gets created, in the OpsMgr event log on the agent, when this script tries to restart the HealthService:
Event Type: Warning
Event Source: Health Service Script
Event Category: None
Event ID: 6024
Date: 3/26/2009
Time: 9:22:33 AM
User: N/A
Computer: DC01
Description:
LaunchRestartHealthService.js : Launching Restart Health Service. Health Service exceeded Process\Handle Count or Private Bytes threshhold.
Here is the event created, when it is a problem with the MonitoringHost process:
Event Type: Warning
Event Source: Health Service Script
Event Category: None
Event ID: 6025
Date: 4/21/2009
Time: 5:23:41 AM
User: N/A
Computer: EX2CLN1
Description:
LaunchRestartHealthService.js : Launching Restart Health Service. Monitoring Host exceeded Process\Handle Count threshhold.
or
Event Type: Warning
Event Source: Health Service Script
Event Category: None
Event ID: 6026
Date: 3/26/2009
Time: 10:14:30 AM
User: N/A
Computer: DC01
Description:
LaunchRestartHealthService.js : Launching Restart Health Service. Monitoring Host exceeded Process\Private Bytes threshhold.
Here is the event after the restart is noted as a success:
Event Type: Information
Event Source: Health Service Script
Event Category: None
Event ID: 6062
Date: 3/26/2009
Time: 10:35:30 AM
User: N/A
Computer: DC01
Description:
RestartHealthService.js : Restarting Health Service. Service successfully restarted.
Here is the event after the restart failed for some reason:
Log Name: Operations Manager
Source: Health Service Script
Date: 5/18/2009 11:58:28 AM
Event ID: 6061
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: SQL3CLN1.opsmgr.net
Description:
RestartHealthService.js : Restarting Health Service. Failed to restart service.
So – in order to see if you are impacted by this – the simplest thing to do – is to create a custom rule – that alerts when these events happen.
Create a new rule – target “Windows Server Operating System” (or whatever is your standard). Look in the OpsMgr event log, with an expression of “Event ID Equals 6024” OR “Event ID Equals 6025” OR “Event ID Equals 6026” OR “Event ID Equals 6062” OR “Event ID Equals 6061”. You could just as easily write individual rules, one for each event ID, and use a different alert name for each.
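To make the OR expression concrete, here is a minimal Python sketch of the same matching logic. It is only an illustration and not how OpsMgr evaluates rules internally. The event IDs come from the sample events in this post; the function names `matches_rule` and `describe` are hypothetical.

```python
# Event IDs copied from the sample events in this post.
RESTART_LAUNCHED = {
    6024: "HealthService exceeded Handle Count or Private Bytes threshold",
    6025: "MonitoringHost exceeded Handle Count threshold",
    6026: "MonitoringHost exceeded Private Bytes threshold",
}
RESTART_SUCCEEDED = 6062
RESTART_FAILED = 6061


def matches_rule(event_id):
    """True if the event ID satisfies the OR expression of the alert rule."""
    return (
        event_id in RESTART_LAUNCHED
        or event_id == RESTART_SUCCEEDED
        or event_id == RESTART_FAILED
    )


def describe(event_id):
    """Return a short description for a matching event ID, else None."""
    if event_id in RESTART_LAUNCHED:
        return "restart launched: " + RESTART_LAUNCHED[event_id]
    if event_id == RESTART_SUCCEEDED:
        return "restart succeeded"
    if event_id == RESTART_FAILED:
        return "restart failed"
    return None
```

If you go with individual rules instead of one combined rule, each rule would simply match one of these IDs on its own.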
Once this is created – you will get an alert for any agent that is restarting, or failed to restart. This will tell you if you need to bump up the default values, for specific agents, or a group of agents, such as SQL servers, domain controllers, etc…. What I have found is that bumping this number up to 250MB will generally address most agents’ issues…. but you need to monitor this in your environment and see how much memory your HealthService.exe and/or MonitoringHost.exe processes need.
Next – create a view in the Monitoring Console – just for these alerts. New view – Alert View – for all alerts with a given Name. Use something that matches your alert name. Here is mine:
Now – watch this view for any alerts that come in….
What we see is – that I have many agents restarting from time to time… due to the default threshold not being enough. The best way to monitor this… is to watch and see which agents are affected, and then bump up their threshold via overrides. I would recommend bumping the number up in small increments, like 100MB at a time, and see where your “happy place” is.
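As a rough illustration of the "small increments" idea, the Python sketch below steps a threshold up 100MB at a time until it clears an observed peak. This is a thought aid and not a sizing formula from the product. It assumes you have already measured the peak private bytes of HealthService.exe or MonitoringHost.exe on the affected agent yourself; the function name and parameters are hypothetical.

```python
def suggest_threshold_mb(observed_peak_mb, current_mb=100, step_mb=100):
    """Step the threshold up by step_mb until it is above the observed peak.

    current_mb defaults to the 100MB default threshold mentioned in this post.
    observed_peak_mb is an assumption: a value you measured yourself.
    """
    if step_mb <= 0:
        raise ValueError("step_mb must be positive")
    threshold = current_mb
    while threshold <= observed_peak_mb:
        threshold += step_mb
    return threshold
```

For example, an agent peaking around 180MB would land on 200MB with the defaults. You would still want to watch the alert view after each change.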
This article specifically addresses OpsMgr SP1. I am not sure yet if this is going to need to be done in R2, so I will update this article when R2 releases.
As a side note – one of the symptoms I see is the OpsDB StateChangeEvent table being one of the largest tables. This happens because each restart of the agent causes state to be recalculated for every monitor on that agent. The agent sends this recalculated state on every restart, flooding the database with state data.
I’d recommend putting these rules in place to alert on this event, for ANY customer…. knowing is power.
As far as some recommended values…. the best thing is to find your own “happy place”… but as of this writing, I start with 250MB as an initial adjustment. You can create a group of Windows Computer objects that are affected, and simply add computers to this group that seem to need more private bytes. Use this custom group for your override on the HealthService and/or MonitoringHost workflows.
Recommendation examples:
I don't recommend overriding the Health Service Private Bytes Threshold Monitor “for all objects of type: Health Service”. I have seen this impact the Health Service on the management servers. There is an override for the management servers which should win as the more specific one in a conflict, but this doesn't always work in the field.
You can use groups of Windows Computer objects (Or groups of Health Service Instances if you so desire) for this override.
If we wanted to override this value – and increase it to 200MB for all agents, override for a group – and use the “Agent Managed Computer Group”:
You have to be VERY careful when doing the above – because this override will potentially conflict with other group overrides… if you want specific overrides for Exchange, DHCP, DNS, SQL, etc…
A better approach is like so:
Just make SURE that whatever groups you use – they don't have/share any of the same computers… or you can get some conflicting values here.
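One way to sanity check this before creating the overrides is to compare group membership lists for shared computers. The Python sketch below assumes you have exported each group's members to a plain list of computer names by whatever means you prefer; the group names in the usage note and the function name `find_shared_computers` are hypothetical.

```python
from itertools import combinations


def find_shared_computers(groups):
    """Return {(group_a, group_b): [shared computers]} for overlapping groups.

    groups maps a group name to an iterable of computer names.
    Names are compared case-insensitively.
    """
    normalized = {
        name: {computer.lower() for computer in members}
        for name, members in groups.items()
    }
    overlaps = {}
    # Compare every pair of groups exactly once.
    for a, b in combinations(sorted(normalized), 2):
        shared = normalized[a] & normalized[b]
        if shared:
            overlaps[(a, b)] = sorted(shared)
    return overlaps
```

Calling it with something like `{"SQL": ["SQL3CLN1", "DC01"], "DCs": ["dc01"]}` would report `dc01` as shared between the two groups. An empty result means no computer would receive two competing override values from these groups.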
I am attaching a management pack below. This MP contains simple rules as described above to alert on the HealthService restart, and a view which is scoped only to these alerts. You should keep an eye on these and take action on them. I added alert suppression on these rules – they will create a single alert for each identical computer and event ID, and increment repeat counts on the worst offenders. This is one warning alert you should NOT ignore.
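To show what the suppression does to the alert stream, here is a small Python sketch that collapses events into one alert per computer and event ID and ranks the worst offenders. It only models the behavior described above and is not the MP's implementation. The dictionary keys are hypothetical, and I am assuming the repeat count starts at zero for the first occurrence.

```python
from collections import Counter


def suppress(events):
    """Collapse events into one alert per (computer, event_id).

    events is a list of dicts with 'computer' and 'event_id' keys.
    Returns alerts sorted by repeat count, worst offenders first.
    """
    counts = Counter((e["computer"], e["event_id"]) for e in events)
    alerts = [
        # Assumption: the first occurrence creates the alert with a repeat
        # count of 0, and each identical event after that adds 1.
        {"computer": computer, "event_id": event_id, "repeat_count": n - 1}
        for (computer, event_id), n in counts.items()
    ]
    alerts.sort(key=lambda alert: alert["repeat_count"], reverse=True)
    return alerts
```

An agent that shows up at the top of this list with a high repeat count is a good candidate for a higher threshold override.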