Kevin Holman's System Center Blog

Posts in this blog are provided "AS IS" with no warranties, and confer no rights. Use of included script samples is subject to the terms specified in the Terms of Use.

The new and improved guide on HealthService Restarts. Aka – agents bouncing their own HealthService

I have written many articles in the past on HealthService restarts.  A HealthService restart is when the agent breaches a pre-set threshold of memory (Private Bytes) or handle count, and OpsMgr bounces the agent HealthService to try and correct the condition.

The Past:

Here are a few of the previous articles:

http://blogs.technet.com/kevinholman/archive/2009/03/26/are-your-agents-restarting-every-10-minutes-are-you-sure.aspx

http://blogs.technet.com/kevinholman/archive/2009/06/22/health-service-and-monitoringhost-thresholds-in-r2-how-this-has-changed-and-what-you-should-know.aspx

 

Generally – this is a good thing.  We expect the agent to consume a limited amount of system resources, and if that limit is ever breached, we assume something is wrong, so we bounce the agent.  The problem is that if an agent NEEDS more resources to do its job – it can get stuck in a bouncing loop every 10-12 minutes, which means there is very little monitoring of that agent going on.  It can also harm the OpsMgr environment, because if this is happening on a large scale, we flood the OpsMgr database with state change events.  You will also see the agent consume a LOT of CPU during each startup cycle – because every monitor has to initialize its state at startup, and all discoveries without a specific sync time will run at startup.
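
If you want to see whether this is already happening at scale, a rough way is to count state change events per monitored entity straight from the OperationsManager database.  The Python/pyodbc sketch below is just an illustration – the table and column names are assumed from the OpsMgr 2007-era schema, and the SQL server name is a placeholder – so verify both against your own environment before running it.

    # Rough sketch: count state change events per monitored entity in the
    # OperationsManager database to spot agents stuck in restart loops.
    # Table/column names are assumed from the OpsMgr 2007-era schema - verify first.
    import pyodbc

    # Placeholder connection details - point SERVER at your OpsMgr database server.
    conn = pyodbc.connect(
        "DRIVER={SQL Server};SERVER=YourSqlServer;"
        "DATABASE=OperationsManager;Trusted_Connection=yes"
    )

    query = """
    SELECT TOP 20 bme.DisplayName, COUNT(sce.StateId) AS StateChanges
    FROM StateChangeEvent sce WITH (NOLOCK)
    JOIN State s WITH (NOLOCK) ON sce.StateId = s.StateId
    JOIN BaseManagedEntity bme WITH (NOLOCK)
        ON s.BaseManagedEntityId = bme.BaseManagedEntityId
    WHERE sce.TimeGenerated > DATEADD(DAY, -7, GETUTCDATE())
    GROUP BY bme.DisplayName
    ORDER BY COUNT(sce.StateId) DESC
    """

    for name, changes in conn.cursor().execute(query):
        print(f"{changes:8d}  {name}")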

 

However, sometimes it is NORMAL for the agent to consume additional resources.  (within reason)

The limits at OpsMgr 2007 RTM were set to 100MB of Private Bytes, and 2000 handles.  This was enough for the majority of agents out there.  Not all, though – especially since the release of the Server 2008 OS and the use of 64-bit operating systems.  Many server roles require some additional memory, because they run very large discovery scripts, or discover a very large instance space.  Like DNS servers, because they discover and monitor so many DNS zones.  DHCP servers, because they discover and monitor so many scopes.  Domain controllers, because they can potentially run a lot of monitoring scripts and discover many AD objects.  SQL servers, because they discover and monitor multiple DB engines and databases.  Exchange 2007 servers, etc…

 

What’s new:

At the time of this writing, two new management pack updates have been released.  One for SP1, and one for R2.  EVERY customer should be running these MP updates.  I consider them critical to a healthy environment:

R2 MP Update version 6.1.7533.0

SP1 MP Update version 6.0.6709.0

What these MP updates do – is synchronize both versions of OpsMgr to work exactly the same – and bump up the resource threshold levels to a more typical amount.  So FIRST – get these imported if you don't have them.  Yes, now.  This alone will solve the majority of HealthService restarts in the wild.  These set the Private Bytes threshold to 300MB (up from 100MB), and the Handle Count threshold to 6000 (up from 2000) for all agents.  This is a MUCH better default setting than we had previously.
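
Under the covers, these thresholds are just Private Bytes and Handle Count limits on the agent processes (HealthService.exe and MonitoringHost.exe).  If you want to spot-check how close a particular agent runs to the new defaults, here is a small illustrative sketch – NOT the actual monitor implementation, just an approximation in Python with psutil, run locally on a Windows agent:

    # Illustrative sketch only - approximates what the threshold monitors watch:
    # Private Bytes and Handle Count for the agent processes (Windows-only psutil fields),
    # using the same default limits described above for both processes.
    import psutil

    PRIVATE_BYTES_LIMIT = 300 * 1024 * 1024   # 300MB default after the MP updates
    HANDLE_COUNT_LIMIT = 6000                 # 6000 handle default after the MP updates

    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] in ("HealthService.exe", "MonitoringHost.exe"):
            try:
                private_bytes = proc.memory_info().private   # 'private' is Windows-only
                handles = proc.num_handles()                  # Windows-only in psutil
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
            flag = ""
            if private_bytes > PRIVATE_BYTES_LIMIT or handles > HANDLE_COUNT_LIMIT:
                flag = "  <-- over the default threshold"
            print(f"{proc.info['name']:20s} pid={proc.pid:6d} "
                  f"private={private_bytes // (1024 * 1024)}MB handles={handles}{flag}")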

 

How can I make it better?

I’m glad you asked!  Well, there are two things you can do to enhance your monitoring of this very serious condition. 

  1. Add alerting to a HealthService Restart so you can detect this condition while it still exists.
  2. Override these monitors to higher thresholds for specific agents/groups.

Go to the Monitoring pane, Discovered Inventory, and change target type to “Agent”. 

Select any agent present – and open Health Explorer.

Expand Performance > Health Service Performance > Health Service State.


 

This is an aggregate rollup monitor.  If you look at the properties of this top level monitor – you will see the recovery script that bounces the HealthService is on THIS monitor… it will run in response to ANY of the 4 monitors below it turning Unhealthy.

 


 

So – we DON'T want to set this monitor to also create the alerts.  Because this monitor can only tell us that “something” was beyond the threshold.  We actually need to set up alerting on EACH of the 4 monitors below it – so we will know whether the problem is with the HealthService or the MonitoringHost, and whether it is memory (Private Bytes) or Handle Count.

First thing – is to inspect the overrides on each monitor, to make sure you haven't already adjusted these in the past.  ANY specific overrides LESS than the new defaults of 300MB and 6000 handles should be deleted.  (The Exchange MP has a sealed override of 5000 handles, and this is fine.)

What I like to do – is to add an override, “For all objects of class”.  Enable “Generates Alert”.  I also ensure that the default value for “Auto-Resolve Alert” is set to false.  It is critical that auto-resolve is not set to True for this monitor, because we would just close the alert on every agent restart and the alert would be worthless.  What this will do – is generate an alert, and never close it, anytime this monitor is unhealthy.  I need to know this information so I can be aware of the very specific agents that might require a higher value.


 

Repeat this for all 4 monitors.

 

One thing to keep in mind – if you ever need to adjust this threshold for specific agents that are still restarting – 600MB of Private Bytes (double the default) is generally a good setting.  It is rare to need more than this – unless you have a very specific MP or application that guides you to set this higher for a specific group of agents.

Also – be careful overriding this value across the board… because Management Servers also have a “HealthService” and you could inadvertently set this to be too low for them.  Generally – the default settings are very good now – and you should only be changing this for very specific agents, or a very specific group of agents.

Now – you can use these alerts to find any problem agents out there.  I strongly recommend setting this up for any management group out there.  You NEED to know when agents are restarting on their own.
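
As a quick local check on a suspect agent, the restart activity also shows up in the agent's Operations Manager event log.  The sketch below just shells out to the built-in wevtutil tool from Python.  The event IDs are assumptions on my part (6062 – the restart succeeded – comes up in the comments on this post; 6024 is what I'd expect for the restart being launched) – adjust them to whatever your agents actually log.

    # Sketch: list recent HealthService restart-related events from the local
    # "Operations Manager" event log using the built-in wevtutil command.
    # Event IDs are assumptions (6024 = restart launched, 6062 = restart succeeded) -
    # verify them against your own agents before relying on this.
    import subprocess

    query = "*[System[(EventID=6024 or EventID=6062)]]"
    result = subprocess.run(
        [
            "wevtutil", "qe", "Operations Manager",
            "/q:" + query,   # XPath filter on event IDs
            "/c:25",         # last 25 matching events
            "/rd:true",      # newest first
            "/f:text",       # human-readable output
        ],
        capture_output=True,
        text=True,
    )
    print(result.stdout or result.stderr)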

Comments
  • The old overrides wouldn't be functional. The new values would still apply. Deleting them would be less confusing though :)

  • Hi Kevin.  I've been using these suggestions for a while now, and have noticed on our domain controllers the restart is launched, but I never get the corresponding 6062 event that the restart was successful.  On all other agents I get both the restart and success events.

    What can be done to troubleshoot why the agents try to restart but actually don't on our domain controllers?

  • Thanks for the suggestions. I just recently installed the latest Operations Manager MP and it cleared up a slew of flopping agents. Once again, thanks for blogging.

  • Is the HealthService Private Bytes threshold of 1500 MB appropriate for the RMS of a large environment?  My RMS is routinely going over the threshold and back under.  Should I override the threshold to a larger number?  What value?  Is the normal not-to-continually-exceed value a function of the number of agents or management servers?

  • @Ted -

    I don't hear that a lot.  What is going over 1.5GB?  HealthService or Private Bytes?  On a large RMS with lots of memory - this wouldn't surprise me.  It is very normal for the config and SDK to go over 2GB... I don't often see the HealthService or MonitoringHost.exe processes get that big though...

    How many agents?  How much RAM in the RMS?  What are your primary MPs?  I would have no issue bumping this up, or even turning it off for your RMS...

  • The monitor is looking at the "Process\Private Bytes" counter for the "HealthService.exe" process which on our RMS is steadily around 1.7GB.  We have increased the threshold to 2GB which it has not exceeded for a few days, now.  We have lots of agents (6500) and the RMS has 24GB.  We have a bunch of MPs, but we have tuned them to keep the chatter down.  AD is the noisiest, of course.

    I am all for alerting if this value is unhealthy.  It seems that the amount of memory the HealthService and the other key SCOM processes use is going to expand as demand increases, but I would really want to alert if the amount of memory they are using is causing problems. If we turn off this threshold, are we relying on other monitors to trigger when things get really bad?

    This monitor makes sense for agents since it triggers the healthservice to get restarted.  Agents should not use many local resources and restarting an agent's health service is not that disruptive.

    Restarting the HealthService on an RMS is a different issue.  It is worth restarting if it is stuck, but not if it is just really busy.

    It sounds like your recommendation is to turn off this threshold for at least the RMS.  Maybe 1.5GB is OK for regular management servers.  Right?

  • 6500 agents - yeah... it sounds like your deductions and plan are solid.  From a health perspective... I have never heard of issues when the process is consuming too much memory on the RMS... only unresponsiveness - and this manifests itself in the consoles not connecting, or config not generating, or unhealthy events which are alerted on.

  • This is pretty interesting stuff. Is there anything out there that describes how the agent handles 'handles' ( :-) )? Generally, is there anything for Heap and handle management when looking at these issues?

  • @PhilE - generally, we simply track handle count for the agent processes, and kill the agent if we detect too many, which historically has been a sign of a leak. Then these leaks get fixed either from the OS side or the agent side. We tend not to troubleshoot beyond identifying the root cause of a leak and then requesting a hotfix.

  • What can I do to help identify a Process\Handle Count leak? It appears there is one on 2012 R2 DCs.

  • Our environment is only about 250 server agents and we are seeing healthservice.exe typically run at 2GB. Are there any queries or scripts to see if any MP might be misconfigured for a 1-second refresh?

  • @Matt - 2GB for healthservice on the agents - or on the management server? On the management server that's fine. On the agents, that would be very bad.

  • Kevin,
    2GB is only on the Management Server. So that would be OK? I just put an override on that object and increased the private bytes to 2.5GB and called it good.

  • Kevin,

    After migrating to 2012 R2, I noticed a frequent grey agent issue on MSX 2010 servers – a large, complex MSX environment with all servers running Win 2K8R2 Ent.
    The MSX servers are suffering the multi-role store.exe memory issue (add more RAM and more gets consumed), which hasn't helped the agent. The agent is fighting the OS and many other processes for what's left of the memory.
    I have the Windows MP process termination rule turned on with an email alert in order to pick up crashed processes. Occasionally the Microsoft Monitoring Agent generated alerts for service crashes on the MSX servers. I blamed this primarily on the memory performance issue with the server, which is partly true. But I never experienced this with 2007 R2.
    I was using the same MPs for MSX and Windows as the old 2007 R2 MG, with the same overrides, so I thought it was a bit odd that things would be worse in 2012 R2.
    I remembered this article, checked my notes and checked what working values I had for the 2007 R2 overrides. When I went to override with the same in 2012 R2, I discovered the default values for handle count and memory are now much higher than the override values I had used for 2007 R2.

    I checked the MSX server which generated the process terminated alert; sometimes the service was running, sometimes it was not, and so I had a grey agent.
    I haven't looked at the restart JS that runs when these thresholds are met, but if I were to write it I would try a graceful shutdown first, and if that failed within a certain time I would terminate the process – something I have done with Argent scripts in the past.
    When I look at the history for agents on MSX for SCOM 2007 R2, there were more frequent restarts but no grey agent problem. I also never received alerts for the agent being terminated.
    What I suspect is the problem: on these poorly performing servers, when the JS script tries a graceful shutdown/restart of the agent, the agent is doing so much active work as a result of these new, higher threshold values that it cannot shut down in time, so the JS script terminates it. Occasionally either the script or Windows service recovery is starting the agent service.

    This went a long way to helping the agent perform better regardless of the server's performance.
    http://blogs.msdn.com/b/rslaten/archive/2014/04/22/operations-manager-health-service-restarts-due-to-exceeding-handle-count-threshold.aspx

    Also, I put in a more recent SQL MP – this same problem was affecting some Win2k8R2/SQL 2008 R2 servers. The same fix in the link above seems to make a difference.

    Increasing these threshold values is not always a good thing. While the H/W has improved, it doesn't mean the engineers have. :P

  • @Jason -

    The default values are up because modern operating systems simply use more memory, and the processes in SCOM also need more memory on specific servers which have a huge instance space. I saw this same issue all the time in SCOM 2007 R2 as well – grey agents upon a failed restart of the agent. The KEY issue to solve here is: WHY IS THE AGENT LEAKING memory or handles? That is the nature of what's causing the restart in the first place. Grey agents can easily be automated/remediated with the built-in recoveries for HB failures. The key thing to solve is to stop, or greatly slow, the leak. We have done this on all other OS's except Windows Server 2008 R2 SP1. This OS version with the agent still has issues that have never been resolved, IMHO. However, applying all the hotfixes I propose on my site, in conjunction with the two from Russ Slaten's blog, does the best. The problem is – those on Russ's blog are a kernel update. That's a big deal and not something to take lightly, which is the only reason I haven't yet added it to my hotfix site. Most people simply deal with setting the value to something that slows the leak to where the agent isn't restarting too often, and deal with occasional HB failures. The best course of action right now, I'd agree – is to apply all the hotfixes and stop the leak. I just wish it didn't take swatting this annoying fly with a sledgehammer to do it.
