Kevin Holman's System Center Blog

Posts in this blog are provided "AS IS" with no warranties, and confer no rights. Use of included script samples is subject to the terms specified in the Terms of Use.

HealthService restarts – still a challenge in OpsMgr 2012.


 

Way back in the day I wrote about an issue where the SCOM agent can, in some cases, consume more memory, handles, or other resources than is typical.  When this occurs, we restart the agent to kill any “runaway” processes.  Read about this here:

 

http://blogs.technet.com/b/kevinholman/archive/2009/12/21/the-new-and-improved-guide-on-healthservice-restarts-aka-agents-bouncing-their-own-healthservice.aspx

 

 

One of the things I have noticed is that on many of my servers these thresholds are breached on a regular basis, mostly because the monitoringhost.exe processes need more than the default of 300 MB of RAM (private bytes).

 

The issue is that you will likely have NO idea this is happening.  We don’t generate any alerts for this by default; we simply “fix the problem” by creating a state change and then running a response script to bounce the agent.  The bad part is that you could have agents in a constant restart loop.
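
To make the restart loop concrete, here is a minimal Python sketch of a consecutive-samples threshold check. It is an illustration only, not how the agent actually implements the monitor: the function name `count_restarts` and the `consecutive` parameter are hypothetical, and the number of samples required is an assumption rather than a value taken from this post.

```python
from typing import Iterable

# Default private bytes threshold cited in this post: 300 MB.
DEFAULT_PRIVATE_BYTES_THRESHOLD = 300 * 1024 * 1024  # 314572800 bytes


def count_restarts(samples: Iterable[int], threshold: int, consecutive: int) -> int:
    """Count how many times `consecutive` over-threshold samples occur in a row.

    Each completed run stands in for one state change followed by one agent
    restart. `consecutive` is a hypothetical parameter for illustration.
    """
    restarts = 0
    run = 0
    for value in samples:
        if value > threshold:
            run += 1
            if run == consecutive:
                restarts += 1
                run = 0  # the restart clears the process, so counting starts over
        else:
            run = 0  # any sample at or below the threshold breaks the run
    return restarts
```

With a process that steadily needs about 350 MB, every run of samples ends in another restart at the default threshold, while a 600 MB threshold never trips. That is the silent loop described above.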

In SCOM 2012 – I still recommend making the following changes via overrides:  Open the “Operations Manager > Agent Details > Agents by Version” view in the console:

image

 

Open Health Explorer for one of the agents.  Here is an example of an agent that has been bouncing on a regular basis:

image

 

I recommend the following:

Private Bytes monitors should be set to a threshold of 629145600 bytes  (600 MB, double the default of 300 MB)

Handle Count monitors should be set to 15,000  (the default of 6,000 is too low)

In addition, on each monitor:

Override Generate Alert to True (to generate alerts)

Override Auto-Resolve to False (even though the default is False, this must be set explicitly to keep these alerts from auto-closing, so you can see them and their repeat count)

Override Alert severity to Information (to keep from ticketing on these events)
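
As a quick sanity check on the numbers above, here is a small Python sketch that derives the byte values and gathers the recommended settings in one place. The dictionary keys are hypothetical labels chosen for readability; they are not the parameter names shown in the console's override dialog.

```python
MB = 1024 * 1024  # bytes per megabyte (binary), matching the 314572800 default


def mb_to_bytes(megabytes: int) -> int:
    """Convert megabytes to the byte value the threshold field expects."""
    return megabytes * MB


DEFAULT_PRIVATE_BYTES = mb_to_bytes(300)               # 314572800
RECOMMENDED_PRIVATE_BYTES = 2 * DEFAULT_PRIVATE_BYTES  # 629145600

# Hypothetical key names; the values are the ones recommended in this post.
RECOMMENDED_OVERRIDES = {
    "private_bytes_threshold": RECOMMENDED_PRIVATE_BYTES,
    "handle_count_threshold": 15000,
    "generate_alert": True,
    "auto_resolve": False,
    "alert_severity": "Information",
}
```

Deriving the threshold from a megabyte figure avoids typing a nine-digit number by hand, which is an easy place to drop or transpose a digit.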

--------------------

Override EACH monitor “for all objects of another class” and choose the “Agent” class.

image

image

 

This is a good configuration:

image

image

image

image

 

 

As a refresher: this will be common on any monitored system that discovers a large number of instances, such as Exchange, DNS, SQL servers, SCVMM, large web servers, etc.

Comments

  • Kevin,
    Hope you are doing well.

    I am facing the above problem with my management servers, and the health state of the servers is critical because of this alert.
    Steps taken for this alert:

    Applied an override for the classes management server, management server agent.

    Parameter name – Agent performance monitor type (Consecutive Samples) – Threshold – default value – 314572800 Effective Value – 1610612736.

    Even after changing the threshold values, the state change is still happening. Could you please help me fix this issue?
