I have written many articles in the past on HealthService restarts. A HealthService restart is when the agent breaches a pre-set threshold of memory use or handle count, and OpsMgr bounces the agent HealthService to try to correct the condition.
Generally – this is a good thing. We expect the agent to consume a limited amount of system resources, and if that limit is ever breached, we assume something is wrong, so we bounce the agent. The problem is that if an agent NEEDS more resources to do its job – it can get stuck in a bouncing loop every 10-12 minutes, which means there is very little monitoring of that agent going on. It can also harm the OpsMgr environment, because if this happens on a large scale, we flood the OpsMgr database with state change events. You will also see the agent consume a LOT of CPU during the startup cycle – because each monitor has to initialize its state at startup, and all discoveries without a specific sync time will run at startup.
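To make those mechanics concrete, here is a minimal sketch of what the check-and-bounce amounts to. This is NOT the actual OpsMgr recovery (that logic runs inside the agent itself); it is just an illustration, assuming Windows, administrator rights, and the third-party psutil package:

```python
# A toy sketch of the check-and-bounce behavior described above.
# NOT the real OpsMgr recovery; just an illustration of the logic.
import subprocess
import psutil

PRIVATE_BYTES_LIMIT = 100 * 1024 * 1024  # OpsMgr 2007 RTM default: 100MB
HANDLE_COUNT_LIMIT = 2000                # OpsMgr 2007 RTM default: 2000

def agent_process():
    """Find the agent's HealthService.exe process, if it is running."""
    for proc in psutil.process_iter(["name"]):
        if (proc.info["name"] or "").lower() == "healthservice.exe":
            return proc
    return None

proc = agent_process()
if proc is not None:
    private_bytes = proc.memory_info().private  # Windows-only psutil field
    handles = proc.num_handles()                # Windows-only psutil call
    if private_bytes > PRIVATE_BYTES_LIMIT or handles > HANDLE_COUNT_LIMIT:
        # The recovery boils down to this: bounce the service and hope the
        # condition clears. If the agent genuinely needs more resources, it
        # breaches again 10-12 minutes later and the loop repeats.
        subprocess.run(["net", "stop", "HealthService"], check=True)
        subprocess.run(["net", "start", "HealthService"], check=True)
```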
However, sometimes it is NORMAL for the agent to consume additional resources (within reason).
The limits at OpsMgr 2007 RTM were set to 100MB of private bytes and 2000 handles. This was enough for the majority of agents out there – but not all, especially since the release of the Server 2008 OS and the move to 64-bit operating systems. Many server roles require additional memory, because they run very large discovery scripts or discover a very large instance space. Like DNS servers, because they discover and monitor so many DNS zones. DHCP servers, because they discover and monitor so many scopes. Domain controllers, because they can potentially run a lot of monitoring scripts and discover many AD objects. SQL servers, because they discover and monitor multiple DB engines and databases. Exchange 2007 servers, etc…
At the time of this writing, two new management pack updates have been released: one for SP1, and one for R2. EVERY customer should be running these MP updates. I consider them critical to a healthy environment:
R2 MP Update version 6.1.7533.0
SP1 MP Update version 6.0.6709.0
What these MP updates do is synchronize both versions of OpsMgr to work exactly the same – and bump the resource thresholds up to a more typical amount. So FIRST – get these imported if you don't have them. Yes, now. This alone will solve the majority of HealthService restarts in the wild. These set the Private Bytes threshold to 300MB (up from 100MB) and the Handle Count threshold to 6000 (up from 2000) for all agents. This is a MUCH better default setting than we had previously.
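If you want to spot-check what a given agent is actually consuming against these new defaults, a rough sketch like the following will print the four counters the monitors watch. Again, this assumes Windows, administrator rights, and the third-party psutil package:

```python
# Spot-check the four counters the monitors watch against the new defaults.
import psutil

NEW_DEFAULTS = {
    "Private Bytes": 300 * 1024 * 1024,  # 300MB, up from 100MB
    "Handle Count": 6000,                # up from 2000
}
WATCHED = ("healthservice.exe", "monitoringhost.exe")

for proc in psutil.process_iter(["name"]):
    name = (proc.info["name"] or "").lower()
    if name in WATCHED:
        readings = {
            "Private Bytes": proc.memory_info().private,  # Windows-only field
            "Handle Count": proc.num_handles(),           # Windows-only call
        }
        for counter, value in readings.items():
            status = "OVER" if value > NEW_DEFAULTS[counter] else "ok"
            print(f"{proc.info['name']} (pid {proc.pid}) {counter}: {value} [{status}]")
```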
How can I make it better?
I’m glad you asked! Well, there are two things you can do to enhance your monitoring of this very serious condition.
Go to the Monitoring pane, Discovered Inventory, and change target type to “Agent”.
Select any agent present – and open Health Explorer.
Expand Performance > Health Service Performance > Health Service State.
This is an aggregate rollup monitor. If you look at the properties of this top-level monitor – you will see the recovery script to bounce the HealthService is on THIS monitor…. it will run in response to ANY of the 4 monitors below it turning Unhealthy.
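To see why that matters for alerting, here is a toy illustration of the rollup behavior (the monitor names below are paraphrased, not the exact MP display names):

```python
# Toy illustration of the aggregate rollup: the recovery on the top-level
# monitor fires when ANY of the four child monitors goes Unhealthy, but the
# rollup state alone cannot tell you WHICH counter was breached.
HEALTHY, UNHEALTHY = "Healthy", "Unhealthy"

children = {
    "HealthService Private Bytes": HEALTHY,
    "HealthService Handle Count": UNHEALTHY,  # a single breach is enough
    "MonitoringHost Private Bytes": HEALTHY,
    "MonitoringHost Handle Count": HEALTHY,
}

rollup = UNHEALTHY if UNHEALTHY in children.values() else HEALTHY
print(f"Health Service State rollup: {rollup}")  # the recovery runs on this

# Only the child monitors carry the detail you need for a meaningful alert:
for name, state in children.items():
    if state == UNHEALTHY:
        print(f"  breached: {name}")
```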
So – we DON'T want to set this monitor to also create the alerts, because this monitor can only tell us that “something” was beyond the threshold. We actually need to set up alerting on EACH of the 4 monitors below it – so we will know whether the problem is with the HealthService or the MonitoringHost, and whether it is memory (Private Bytes) or Handle Count.
First thing – inspect the overrides on each monitor, to make sure you haven't already adjusted this in the past. ANY specific overrides LESS than the new defaults of 300MB and 6000 handles should be deleted. (The Exchange MP has a sealed override of 5000 handles, and this is fine.)
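As a quick mental model for that triage (the override list below is made-up sample data, not pulled from the SDK):

```python
# Triage rule for existing overrides, using made-up sample data.
# Sealed overrides (like the Exchange MP's) cannot be deleted anyway.
MB = 1024 * 1024
NEW_DEFAULTS = {"Private Bytes": 300 * MB, "Handle Count": 6000}

existing_overrides = [
    ("Private Bytes", 200 * MB, False),  # below the new 300MB default
    ("Handle Count", 5000, True),        # the Exchange MP's sealed override
    ("Private Bytes", 600 * MB, False),  # deliberate bump for a busy agent
]

for counter, value, sealed in existing_overrides:
    if sealed:
        verdict = "leave it (sealed)"
    elif value < NEW_DEFAULTS[counter]:
        verdict = "delete it"
    else:
        verdict = "keep it"
    print(f"{counter} override = {value}: {verdict}")
```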
What I like to do – is to add an override, “For all objects of class”, and enable “Generates Alert”. I also ensure that the default value for “Auto-Resolve Alert” is set to False. It is critical that Auto-Resolve is not set to True for this monitor, because we would just close the alert on every agent restart and the alert would be worthless. What this will do – is generate an alert, and never close it, anytime this monitor goes unhealthy. I need to know this information so I can be aware of the specific agents that might require a higher value.
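Here is a tiny simulation of why Auto-Resolve must stay False (a sketch of the behavior, not the actual SCOM alert engine):

```python
# Why Auto-Resolve must be False on these monitors: every recovery restart
# flips the monitor back to Healthy, and an auto-resolving alert closes
# itself each time, leaving no trail of which agents are looping.
def simulate_restart_loops(auto_resolve, loops=5):
    alert_open = False
    repeat_count = 0
    for _ in range(loops):
        # The monitor breaches its threshold: Unhealthy, alert raised/repeated.
        if alert_open:
            repeat_count += 1
        else:
            alert_open = True
        # The recovery bounces the agent; the monitor comes back Healthy.
        if auto_resolve:
            alert_open = False  # the alert closes on every restart
    return alert_open, repeat_count

print(simulate_restart_loops(auto_resolve=True))   # (False, 0): nothing to see
print(simulate_restart_loops(auto_resolve=False))  # (True, 4): the loop is visible
```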
Repeat this for all 4 monitors.
One thing to keep in mind – if you ever need to adjust this threshold for specific agents that are still restarting – 600MB of private bytes (double the default) is generally a good setting. It is rare to need more than this – unless you have a very specific MP or application that guides you to set this higher for a specific group of agents.
Also – be careful overriding this value across the board… because Management Servers also have a “HealthService” and you could inadvertently set this to be too low for them. Generally – the default settings are very good now – and you should only be changing this for very specific agents, or a very specific group of agents.
Now – you can use these alerts to find any problem agents out there. I strongly recommend setting this up for any management group out there. You NEED to know when agents are restarting on their own.
The old overrides wouldn't be functional. The new values would still apply. Deleting them would be less confusing, though. :)
Hi Kevin. I've been using these suggestions for a while now, and have noticed that on our domain controllers the restart is launched, but I never get the corresponding 6062 event saying the restart was successful. On all other agents I get both the restart and success events.
What can be done to troubleshoot why the agents try to restart but actually don't on our domain controllers?
Thanks for the suggestions. I just recently installed the latest Operations Manager MP and it cleared up a slew of flopping agents. Once again, thanks for blogging.
Is the HealthService private bytes threshold of 1500 MB appropriate for the RMS of a large environment? My RMS is routinely going over the threshold and back under. Should I override the threshold to a larger number? What value? Is the normal "do not continually exceed" value a function of the number of agents or management servers?
I don't hear that a lot. What is going over 1.5GB – the HealthService private bytes? On a large RMS with lots of memory, this wouldn't surprise me. It is very normal for the Config and SDK services to go over 2GB… I don't often see the HealthService or MonitoringHost.exe processes get that big, though…
How many agents? How much RAM in the RMS? What are your primary MPs? I would have no issue bumping this up, or even turning it off, for your RMS…
The monitor is looking at the "Process\Private Bytes" counter for the "HealthService.exe" process, which on our RMS is steadily around 1.7GB. We have increased the threshold to 2GB, which it has not exceeded for a few days now. We have lots of agents (6500) and the RMS has 24GB. We have a bunch of MPs, but we have tuned them to keep the chatter down. AD is the noisiest, of course.
I am all for alerting if this value is unhealthy. It seems that the amount of memory the HealthService and the other key SCOM processes use is going to expand as demand increases, but I would really want to alert if the amount of memory they are using is causing problems. If we turn off this threshold, are we relying on other monitors to trigger when things get really bad?
This monitor makes sense for agents, since it triggers the HealthService to get restarted. Agents should not use many local resources, and restarting an agent's health service is not that disruptive.
Restarting the HealthService on an RMS is a different issue. It is worth restarting if it is stuck, but not if it is just really busy.
It sounds like your recommendation is to turn off this threshold for at least the RMS. Maybe 1.5GB is OK for regular management servers. Right?
6500 agents – yeah… it sounds like your deductions and plan are solid. From a health perspective… I have never heard of issues when the process is consuming too much memory on the RMS – only of it becoming unresponsive, and that manifests itself in the consoles not connecting, or config not generating, or unhealthy events which are alerted on.