I have written many articles in the past on HealthService restarts. A HealthService restart is when the agent breaches a pre-set threshold of memory use or handle count, and OpsMgr bounces the agent HealthService to try and correct the condition.
The Past:
Here are a few of the previous articles:
http://blogs.technet.com/kevinholman/archive/2009/03/26/are-your-agents-restarting-every-10-minutes-are-you-sure.aspx
http://blogs.technet.com/kevinholman/archive/2009/06/22/health-service-and-monitoringhost-thresholds-in-r2-how-this-has-changed-and-what-you-should-know.aspx
Generally – this is a good thing. We expect the agent to consume a limited amount of system resources, and if this limit is ever breached, we assume something is wrong, so we bounce the agent. The problem is that if an agent NEEDS more resources to do its job – it can get stuck in a bouncing loop every 10-12 minutes, which means there is very little monitoring of that agent going on. It can also harm the OpsMgr environment, because if this is happening on a large scale, we flood the OpsMgr database with state change events. You will also see the agent consume a LOT of CPU resources during the startup cycle – because each monitor has to initialize its state at startup, and all discoveries without a specific sync time will run at startup.
However, sometimes it is NORMAL for the agent to consume additional resources. (within reason)
The limits at OpsMgr 2007 RTM were set to 100MB of private bytes, and 2000 handles. This was enough for the majority of agents out there. Not all though, especially since the release of the Server 2008 OS, and the use of 64bit operating systems. Many server roles require some additional memory, because they run very large discovery scripts, or discover a very large instance space. Like DNS servers, because they discover and monitor so many DNS zones. DHCP servers, because they discover and monitor so many scopes. Domain controllers, because they can potentially run a lot of monitoring scripts and discover many AD objects. SQL servers, because they discover and monitor multiple DB engines, and databases. Exchange 2007 servers, etc…
What’s new:
At the time of this writing, two new management pack updates have been released. One for SP1, and one for R2. EVERY customer should be running these MP updates. I consider them critical to a healthy environment:
R2 MP Update version 6.1.7533.0
SP1 MP Update version 6.0.6709.0
What these MP updates do – is to synchronize both versions of OpsMgr to work exactly the same – and to bump up the resource threshold levels to a more typical amount. So FIRST – get these imported if you don't have them. Yes, now. This alone will solve the majority of HealthService restarts in the wild. These set the Private Bytes threshold to 300MB (up from 100MB), and the Handle Count threshold to 6000 (up from 2000) for all agents. This is a MUCH better default setting than we had previously.
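To make the effect of the new defaults concrete, here is a minimal Python sketch of the threshold check. It is only an illustration: the function and constant names are hypothetical, it treats "over the limit" as strictly greater than (an assumption), and it does not model how the real monitors sample their counters.

```python
# Hypothetical model of the default thresholds described above.
# Names are illustrative only; this is not how OpsMgr implements the check.
MB = 1024 * 1024

OLD_PRIVATE_BYTES_LIMIT = 100 * MB   # OpsMgr 2007 RTM default
OLD_HANDLE_COUNT_LIMIT = 2000        # OpsMgr 2007 RTM default
NEW_PRIVATE_BYTES_LIMIT = 300 * MB   # default after the MP update
NEW_HANDLE_COUNT_LIMIT = 6000        # default after the MP update


def breaches(private_bytes, handle_count,
             private_bytes_limit=NEW_PRIVATE_BYTES_LIMIT,
             handle_count_limit=NEW_HANDLE_COUNT_LIMIT):
    """Return the names of the limits a sample exceeds (empty list = OK).

    Assumption: a breach means strictly greater than the limit.
    """
    exceeded = []
    if private_bytes > private_bytes_limit:
        exceeded.append("private_bytes")
    if handle_count > handle_count_limit:
        exceeded.append("handle_count")
    return exceeded


# An agent using 150 MB and 2500 handles: fine under the new defaults...
sample = (150 * MB, 2500)
within_new = breaches(*sample)
# ...but over both of the old RTM limits.
over_old = breaches(*sample, OLD_PRIVATE_BYTES_LIMIT, OLD_HANDLE_COUNT_LIMIT)
```

In this sketch, the same agent that would have been bounced under the RTM limits is left alone under the updated defaults, which is why importing the MP update is the first step.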
How can I make it better?
I’m glad you asked! Well, there are two things you can do, to enhance your monitoring of this very serious condition.
- Add alerting to a HealthService Restart so you can detect this condition while it still exists.
- Override these monitors to higher thresholds for specific agents/groups.
Go to the Monitoring pane, Discovered Inventory, and change target type to “Agent”.
Select any agent present – and open Health Explorer.
Expand Performance > Health Service Performance > Health Service State.
This is an aggregate rollup monitor. If you look at the properties of this top level monitor – you will see the recovery script to bounce the HealthService is on THIS monitor…. it will run in response to ANY of the 4 monitors below it which might turn Unhealthy.
So – we DON'T want to set this monitor to also create the alerts. Because – this monitor can only tell us that “something” was beyond the threshold. We actually need to set up alerting on EACH of the 4 monitors below it – so we will know if it is a problem with the HealthService or MonitoringHost, and either memory (private bytes) or Handle Count.
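The relationship between the rollup and its four children can be sketched in a few lines of Python. This is a toy model under stated assumptions: the monitor labels below are made up from the process and counter names (they are not the real monitor display names), and `evaluate` is a hypothetical helper.

```python
from itertools import product

# The two processes and two counters named in the text.
PROCESSES = ("HealthService", "MonitoringHost")
COUNTERS = ("Private Bytes", "Handle Count")

# Four unit monitors: one per process/counter pair.
# These labels are illustrative, not the actual monitor names.
UNIT_MONITORS = [f"{p} {c}" for p, c in product(PROCESSES, COUNTERS)]


def evaluate(unhealthy):
    """Model the rollup for a set of unhealthy unit monitors.

    The restart recovery lives on the aggregate, so ANY unhealthy child
    triggers it. Alerts are raised per child, so the cause stays visible.
    """
    unknown = set(unhealthy) - set(UNIT_MONITORS)
    if unknown:
        raise ValueError(f"unknown monitors: {sorted(unknown)}")
    return {
        "restart_health_service": bool(unhealthy),
        "alerts": [m for m in UNIT_MONITORS if m in unhealthy],
    }


result = evaluate({"MonitoringHost Handle Count"})
```

Here the restart fires regardless of which child went unhealthy, but only the per-child alert says that it was MonitoringHost handles rather than HealthService memory.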
First thing – is to inspect the overrides on each monitor, to make sure you haven't already adjusted this in the past. ANY specific overrides LESS than the new default of 300MB and 6000 handles should be deleted. (The Exchange MP has a sealed override of 5000 handles and this is fine.)
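If you export or write down your existing overrides, the cleanup rule above is easy to express. The following Python sketch assumes a made-up record layout (`source`, `parameter`, `value`, `sealed`) and made-up sample data; it does not read anything from OpsMgr.

```python
# New defaults from the MP update.
DEFAULTS = {"private_bytes_mb": 300, "handle_count": 6000}


def stale_overrides(overrides):
    """Return overrides that should be deleted.

    An override is stale when it is lower than the new default.
    Sealed overrides (such as the Exchange MP one) are left alone.
    """
    return [
        o for o in overrides
        if not o["sealed"] and o["value"] < DEFAULTS[o["parameter"]]
    ]


# Hypothetical inventory of overrides found on the four monitors.
overrides = [
    {"source": "Old tuning MP", "parameter": "private_bytes_mb",
     "value": 200, "sealed": False},
    {"source": "Exchange MP", "parameter": "handle_count",
     "value": 5000, "sealed": True},
    {"source": "DNS servers group", "parameter": "private_bytes_mb",
     "value": 600, "sealed": False},
]

to_delete = [o["source"] for o in stale_overrides(overrides)]
```

With this sample data only the old 200MB override is flagged: the sealed Exchange override is kept, and the 600MB group override is above the default so it stays.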
What I like to do – is to add an override, “For all objects of Class”. Enable “Generates Alert”. I also ensure that the default value for “Auto-Resolve Alert” is set to False. It is critical that auto-resolve is not set to True for this monitor, because we will just close the alert on every agent restart and the alert will be worthless. What this will do – is generate an alert and never close it, anytime this monitor is unhealthy. I need to know this information so I can be aware of very specific agents that might require a higher value.
Repeat this for all 4 monitors.
One thing to keep in mind – if you ever need to adjust this threshold for specific agents that are still restarting – 600MB of private bytes (double the default) is generally a good setting. It is rare to need more than this – unless you have a very specific MP or application that guides you to set this higher for a specific group of agents.
Also – be careful overriding this value across the board… because Management Servers also have a “HealthService” and you could inadvertently set this to be too low for them. Generally – the default settings are very good now – and you should only be changing this for very specific agents, or a very specific group of agents.
Now – you can use these alerts to find any problem agents out there. I strongly recommend setting this up for any management group out there. You NEED to know when agents are restarting on their own.