
Health Service and MonitoringHost thresholds in R2 – how this has changed and what you should know

In a previous post, I described how the agent HealthService will bounce on a regular basis, how to alert on that, and how to change the thresholds.  Please see that post for details on SP1.

 

Now, in R2, much of how this works has changed.  That said, the core challenges from SP1 still exist in R2:

1.  The default agent threshold is 100MB for the HealthService and MonitoringHost processes.  That is too low for many typical agents in production environments.

2.  When we bounce the agent for using more than 100MB, we do this silently, and do not alert.  If your agents are constantly restarting in a loop, you will never know.

 

Let's take a look at R2:

 

First off, let's examine the HealthService class.  There are two monitors located at Health Service > Entity Health > Performance > Health Service Performance > Health Service State: the Handle Count Threshold monitor and the Private Bytes Threshold monitor.


 

Health Service Handle Count Threshold:  The default threshold is 2000 handles for an agent.  There are built-in overrides for the Management servers, bumping this number up to 10,000 handles.  Also, the new Native Exchange 2007 MP bumps the threshold up to 5000 for Exchange 2007 computers.  It is common that you *might* have to bump up this threshold for SOME agents.  It is also common to bump the Management Server threshold up to 20,000 or even 50,000 if yours is constantly using more handles but is otherwise stable.


 

 

Health Service Private Bytes Threshold:  The default is 100MB for all agents.  There is a threshold override for Management servers to use up to 1.6GB.  The new Native Exchange 2007 MP bumps the threshold to 600MB for Exchange 2007 servers.  This is the monitor that needs the most attention!  The default of 100MB is not enough for many server roles, especially if hosted on the Server 2008 OS.  You will likely need to override this monitor for the groups of Windows Computer objects that are affected.  Your agents will potentially be in a perpetual restart loop until this is done.  Here is an example of mine, with a few overrides in place for SQL computers and DNS servers:

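[Screenshot: overrides on the Health Service Private Bytes Threshold monitor for SQL computers and DNS servers]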
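To make the override idea concrete, here is a minimal Python sketch of how an effective Private Bytes threshold might be worked out for one computer. It is only a model for reasoning about overrides, not how OpsMgr evaluates them internally. The group names, the `effective_threshold_mb` function, and the "largest matching override wins" rule are all assumptions made for illustration. The 100MB, 600MB, and 1.6GB figures are the defaults described above, and the 250MB entries for SQL and DNS are hypothetical custom overrides.

```python
DEFAULT_THRESHOLD_MB = 100  # default for all agents, per the post

# Overrides keyed by a (hypothetical) group name. The first two values come
# from the post; the SQL and DNS entries are made-up custom overrides.
OVERRIDES_MB = {
    "Management Servers": 1600,    # 1.6GB, written as 1600MB for simplicity
    "Exchange 2007 Servers": 600,
    "SQL Computers": 250,          # hypothetical custom override
    "DNS Servers": 250,            # hypothetical custom override
}


def effective_threshold_mb(groups):
    """Return the threshold for a computer that belongs to `groups`.

    Assumption: when several overrides match, the largest one wins.
    Real override precedence in OpsMgr is more involved than this.
    """
    matching = [OVERRIDES_MB[g] for g in groups if g in OVERRIDES_MB]
    return max(matching, default=DEFAULT_THRESHOLD_MB)


if __name__ == "__main__":
    print(effective_threshold_mb([]))                 # no override applies
    print(effective_threshold_mb(["SQL Computers"]))  # custom override applies
```

A computer that is in none of the override groups falls back to the 100MB default, which is exactly the case that tends to end up in a restart loop.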

 

 

Now, in R2, the MonitoringHost workflows have changed from rules to monitors.  These are located under the “Agent” class, at Agent > Entity Health > Performance > Health Service Performance > Health Service State.  They are named Monitoring Host Handle Count Threshold and Monitoring Host Private Bytes Threshold.

 


 

***Note:  In this view you will also see the HealthService monitors, but these are inherited from the Health Service class.  This is because there is a dependency rollup that rolls up the Health Service State of the HealthService to the Agent.  I will explain why in a moment.

 

 

Monitoring Host Handle Count Threshold:  The default threshold is 2000 handles for an agent.  There are built-in overrides for the Management servers, bumping this number up to 10,000 handles.  Also, the new Native Exchange 2007 MP bumps the threshold up to 5000 for Exchange 2007 computers.  It is common that you *might* have to bump up this threshold for SOME agents.  It is also common to bump the Management Server threshold up to 20,000 or even 50,000 if yours is constantly using more handles but is otherwise stable.


 

 

Monitoring Host Private Bytes Threshold:  The default is 100MB for all agents.  There is a threshold override for Management servers to use up to 1.6GB.  The new Native Exchange 2007 MP bumps the threshold to 600MB for Exchange 2007 servers. 


 

 

Now, in R2 – all four of these monitors roll their health state up to the Aggregate Roll-up monitor under the agent class, named “Health Service State”.

 


 

If we look at the properties of that aggregate monitor, we can see the recovery action to restart the HealthService is now on this monitor.  Therefore – if ANY of the 4 monitors below it are in a critical state – they will roll up to this monitor, which will launch a script to bounce the HealthService on the agents.

 


 

 

This script, when it executes, will log an event 6024 in the OpsMgr event log on the agent, stating that it is restarting the HealthService.

 

***NOTE – the text used in the event log is not technically accurate, in that it always states “Health Service exceeded Process\Handle Count or Private Bytes threshold.”  It could be an issue with the Monitoring Host – NOT the HealthService, and this event might mislead you in troubleshooting.  So just know that a 6024 event is a generic restart event – you need to look at the individual monitor state change history in Health Explorer to properly investigate.
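The rollup and recovery behavior described above can be sketched in a few lines of Python. This is only a model of the logic, assuming a simple worst-of rollup across the four unit monitors. The `Health` enum and the `rollup`, `critical_monitors`, and `should_restart` functions are hypothetical names, not part of any OpsMgr API. The point is that the restart decision only sees the aggregate state, so you have to inspect the individual monitors to find the real cause.

```python
from enum import IntEnum


class Health(IntEnum):
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def rollup(states):
    """Worst-of rollup: the aggregate takes the worst child state (assumption)."""
    return max(states.values(), default=Health.HEALTHY)


def critical_monitors(states):
    """Names of the unit monitors that are actually critical."""
    return sorted(name for name, s in states.items() if s == Health.CRITICAL)


def should_restart(states):
    """The recovery fires when the aggregate monitor is critical."""
    return rollup(states) == Health.CRITICAL


if __name__ == "__main__":
    # Illustrative states only: here the MonitoringHost is the culprit.
    states = {
        "Health Service Handle Count Threshold": Health.HEALTHY,
        "Health Service Private Bytes Threshold": Health.HEALTHY,
        "Monitoring Host Handle Count Threshold": Health.HEALTHY,
        "Monitoring Host Private Bytes Threshold": Health.CRITICAL,
    }
    print(should_restart(states))
    print(critical_monitors(states))
```

In this example the HealthService gets bounced even though both HealthService monitors are healthy, which is why the 6024 event text alone can mislead you.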

 

 

So – to summarize the changes from SP1 to R2:

  1. The MonitoringHost threshold rules are now standard monitors.
  2. The Health Service monitors roll up to the Agent “Health Service State” rollup monitor.
  3. The Health Service State rollup monitor has a recovery which runs a script to bounce the HealthService when it is in a critical state.
  4. We still do not alert by default when the script bounces your agent, so you need to create a rule to look for this, or alert on the state change of the monitor.
  5. You will still likely need to adjust the threshold of the Health Service Private Bytes monitor for many of your agents.

 

 

Let's talk about #4 above:  You need to know when your agents are getting bounced, especially if they are caught in a loop of bouncing.

You have a few choices here, but I like to either:

  1. Create an alert rule, target “Agent”, Event ID 6024, in the OpsMgr event log.
  2. Override the “Health Service State” rollup monitor, to “Generates Alert = True”.

Either one of those will tell you when the monitor state change results in bouncing the agent’s Health Service.
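Once you are collecting those 6024 events, the next question is whether an agent is stuck in a loop. Here is a rough Python sketch of that check, assuming you have already exported the 6024 event timestamps for one agent by whatever means you prefer. The `is_restart_loop` function, the one-hour window, and the limit of three restarts are hypothetical choices for illustration, not OpsMgr settings.

```python
from datetime import datetime, timedelta


def is_restart_loop(restart_times, window=timedelta(hours=1), max_restarts=3):
    """Return True if more than `max_restarts` restarts fall inside any window."""
    times = sorted(restart_times)
    start = 0
    for end, t in enumerate(times):
        # Slide the left edge forward until the span fits inside `window`.
        while t - times[start] > window:
            start += 1
        if end - start + 1 > max_restarts:
            return True
    return False


if __name__ == "__main__":
    # Illustrative timestamps only: four restarts within half an hour.
    base = datetime(2009, 6, 22, 12, 0)
    events = [base + timedelta(minutes=m) for m in (0, 10, 20, 30)]
    print(is_restart_loop(events))
```

An occasional restart is probably nothing to worry about. Several restarts inside a short window usually mean the threshold is too low for that agent.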

 

With regard to #5 above:  You will likely need to adjust this default threshold for many agents.  From my previous blog post on this topic – I have been seeing that mostly on the following types of servers:

    1. Large SQL database servers
    2. Server 2008 domain controllers
    3. DHCP servers with large scope counts
    4. DNS servers with large zone counts
    5. Exchange 2007 servers
    6. Large Exchange 2003 Mailbox servers
    7. IIS7 (Server 2008) Servers
    8. Proxy agents that perform special agent-less monitoring (Nworks/Vmware, etc…)

Create groups for these server types, and override the default threshold for this monitor for those groups.  In general, I have found bumping to 250MB resolves most agent issues, but some special cases could need much more.
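If you have peak Private Bytes numbers for your agents, a quick way to decide who goes in which group is to sort them against the two thresholds mentioned above. The sketch below is hypothetical: the `classify` function, the agent names, and the sample values are made up for illustration, and it assumes a monitor trips once usage reaches the threshold.

```python
DEFAULT_THRESHOLD_MB = 100
SUGGESTED_THRESHOLD_MB = 250  # the value the post found resolves most issues


def classify(peak_private_mb):
    """Sort agents by which threshold their peak Private Bytes fits under."""
    result = {"default_ok": [], "needs_override": [], "special_case": []}
    for agent, peak in sorted(peak_private_mb.items()):
        if peak < DEFAULT_THRESHOLD_MB:
            result["default_ok"].append(agent)
        elif peak < SUGGESTED_THRESHOLD_MB:
            result["needs_override"].append(agent)
        else:
            result["special_case"].append(agent)  # may need much more
    return result


if __name__ == "__main__":
    # Made-up sample data, in MB, purely for illustration.
    sample = {"FILE01": 60, "SQL01": 180, "DC01": 140, "PROXY01": 400}
    for bucket, agents in classify(sample).items():
        print(bucket, agents)
```

Anything in the last bucket is one of the special cases that needs a threshold well above 250MB, so look at those agents individually.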

Published Monday, June 22, 2009 11:52 PM by kevinhol