
Are your agents restarting every 10 minutes? Are you sure?

**Updated 6-22-2009 – This post applies to SP1 ONLY!!!  This architecture has changed for R2.  The version of this article updated for R2 is located here:  http://blogs.technet.com/kevinholman/archive/2009/06/22/health-service-and-monitoringhost-thresholds-in-r2-how-this-has-changed-and-what-you-should-know.aspx

Here is something I have been seeing with more and more customers…. and I think everyone should take a look and consider this.

A decent percentage of their agents have the HealthService being programmatically restarted every 10-12 minutes (sometimes less often than every 12 minutes, but still very frequently).

This is being caused by one of a few workflows.  By default, there is a monitor that watches the HealthService resources, and runs a script to bounce the Health Service anytime that service is consuming too many resources.  This is good – because we don't want an issue with a SCOM agent to ever impact the available resources on a monitored server.

The bad?  Well, we don't (by default) alert when the script is called that restarts the HealthService.  This means – this can be affecting you – and you really have no way of knowing this out of the box.  Also – sometimes the restart fails…. and this leaves you with a handful of agents that generate a Heartbeat failure.  When you look at the agent – it is fine, just the HealthService isn't running…. you start it, and everything goes back to normal… or so you think.  If every day you have to respond to a few Heartbeat failures…. and you find the OS up – just the agent stopped… this might be the cause.

Here are the two monitors which target “Health Service” – and can restart the service:

image

Here are the two rules, targeting “Agent”:

image

Most often – I see it is the monitor causing the restart… on the Health Service Private bytes, but it could be any of the above – and you should consider any of these if you are impacted by this.

Note the default overrides.  We allow the management servers to use up to 1.6GB of HealthService privatebytes, and if you have the current Exchange 2007 MP – we have an override which allows Exchange 2007 agents to use up to 600MB, up from the default which is 100MB.

The Exchange 2007 MP was updated with this override, because this issue was already detected for large Exchange servers.
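
To make the default and its overrides concrete, here is a minimal Python sketch of how the effective Private Bytes threshold could be modeled. This is only an illustration, not how OpsMgr evaluates overrides internally: the role names and both functions are hypothetical, the 1.6GB figure is treated as 1600MB for simplicity, and the comparison looks at a single sample (the post does not describe how the real monitor samples).

```python
# Illustrative model only. Threshold values come from the post; the role
# names and functions are hypothetical and are not an OpsMgr API.
MB = 1024 * 1024

DEFAULT_THRESHOLD = 100 * MB            # SP1 default for agents
ROLE_OVERRIDES = {
    "management_server": 1600 * MB,     # "up to 1.6GB" (approximated as 1600MB)
    "exchange_2007": 600 * MB,          # override shipped in the Exchange 2007 MP
}


def private_bytes_threshold(role: str) -> int:
    """Return the Private Bytes threshold (in bytes) that applies to a role."""
    return ROLE_OVERRIDES.get(role, DEFAULT_THRESHOLD)


def would_restart(role: str, private_bytes: int) -> bool:
    """True when a single sample is above the threshold for that role."""
    return private_bytes > private_bytes_threshold(role)
```

With this model, a SQL server whose HealthService uses 150MB falls under the 100MB default and would be bounced, while an Exchange 2007 server at the same usage would not.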

The problem is – that other large servers can potentially use more than 100MB. 

I have been seeing this mostly on:

1.  Large SQL database servers

2.  Server 2008 domain controllers

3.  DHCP servers with large scope counts

4.  Exchange 2007 servers

5.  Large Exchange 2003 Mailbox servers

6.  IIS7 (Server 2008) Servers

Here is what ALL customers should implement…  A rule that watches for the event, when the restart script is called.

Here is the event that gets created, in the OpsMgr event log on the agent, when this script tries to restart the HealthService:

Event Type:         Warning
Event Source:      Health Service Script
Event Category:   None
Event ID:             6024
Date:                    3/26/2009
Time:                   9:22:33 AM
User:                    N/A
Computer:            DC01
Description:
LaunchRestartHealthService.js : Launching Restart Health Service. Health Service exceeded Process\Handle Count or Private Bytes threshhold.

Here is the event created, when it is a problem with the MonitoringHost process:

Event Type:          Warning
Event Source:       Health Service Script
Event Category:    None
Event ID:              6025
Date:                     4/21/2009
Time:                     5:23:41 AM
User:                      N/A
Computer:             EX2CLN1
Description:
LaunchRestartHealthService.js : Launching Restart Health Service. Monitoring Host exceeded Process\Handle Count threshhold.

or

Event Type:          Warning
Event Source:       Health Service Script
Event Category:    None
Event ID:              6026
Date:                    3/26/2009
Time:                   10:14:30 AM
User:                    N/A
Computer:            DC01
Description:
LaunchRestartHealthService.js : Launching Restart Health Service. Monitoring Host exceeded Process\Private Bytes threshhold.

Here is the event after the restart is noted as a success:

Event Type:            Information
Event Source:         Health Service Script
Event Category:     None
Event ID:                6062
Date:                      3/26/2009
Time:                      10:35:30 AM
User:                       N/A
Computer:              DC01
Description:
RestartHealthService.js : Restarting Health Service. Service successfully restarted.

Here is the event after the restart failed for some reason:

Log Name:      Operations Manager
Source:        Health Service Script
Date:          5/18/2009 11:58:28 AM
Event ID:      6061
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      SQL3CLN1.opsmgr.net
Description:
RestartHealthService.js : Restarting Health Service. Failed to restart service.

So – in order to see if you are impacted by this – the simplest thing to do – is to create a custom rule – that alerts when these events happen.

Create a new rule – target “Windows Server Operating System”, (or whatever is your standard).  Look in the OpsMgr event log, with an expression of “Event ID Equals 6024” OR “Event ID Equals 6025” OR “Event ID Equals 6026” OR “Event ID Equals 6062” OR “Event ID Equals 6061”.    You could also just as easily write individual rules – one for each… and use a different name for the alert.

image
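
If it helps to see the rule expression as logic, here is a minimal Python sketch of the same "Event ID equals any of these" match. The event IDs and their meanings are taken from the sample events above; the dictionary and function names are made up for illustration and have nothing to do with how the rule is actually authored in the console.

```python
# Event IDs and meanings come from the sample events in this post.
# The names below are illustrative only.
RESTART_EVENTS = {
    6024: "restart launched: Health Service over Handle Count or Private Bytes threshold",
    6025: "restart launched: Monitoring Host over Handle Count threshold",
    6026: "restart launched: Monitoring Host over Private Bytes threshold",
    6062: "restart succeeded",
    6061: "restart failed",
}


def matches_rule(event_id: int) -> bool:
    """Equivalent of the OR expression: true for any of the five event IDs."""
    return event_id in RESTART_EVENTS


def describe(event_id: int) -> str:
    """Human-readable meaning of an event ID, for naming separate alerts."""
    return RESTART_EVENTS.get(event_id, "not a restart event")
```

The `describe` mapping is the kind of thing you would use if you chose the one-rule-per-event approach and wanted a different alert name for each.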

Once this is created – you will get an alert for any agent that is restarting, or failed to restart.  This will tell you if you need to bump up the default values, for specific agents, or a group of agents, such as SQL servers, domain controllers, etc….  What I have found, is that bumping this number up to 250MB will generally address most agents’ issues…. but you need to monitor this in your environment and see how much memory your HealthService.exe and/or MonitoringHost.exe processes need.

Next – create a view in the Monitoring Console – just for these alerts.  New view – Alert View – for all alerts with a given Name.  Use something that matches your alert name.  Here is mine:

image

Now – watch this view for any alerts that come in….

image

What we see is – that I have many agents restarting from time to time… due to the default threshold not being enough.  The best way to monitor this… is to watch and see which agents are affected, and then bump up their threshold via overrides.  I would recommend bumping the number up in small increments, like 100MB at a time, and see where your “happy place” is.
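
As a rough illustration of the "small increments" idea, here is a Python sketch that raises a threshold in 100MB steps until it clears an observed peak. The function name is hypothetical, and the observed peak is assumed to be a number you measured yourself (for example from the Private Bytes performance counter of the process); this is a sketch of the arithmetic, not a sizing tool.

```python
MB = 1024 * 1024


def next_threshold(current: int, observed_peak: int, step: int = 100 * MB) -> int:
    """Raise `current` in whole steps until it is above the observed peak.

    All values are in bytes. If the current threshold already exceeds the
    observed peak, it is returned unchanged.
    """
    threshold = current
    while threshold <= observed_peak:
        threshold += step
    return threshold
```

For an agent at the 100MB default whose HealthService peaks around 230MB, this lands on 300MB. You would still want to keep watching the alert view afterwards.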

This article specifically addresses OpsMgr SP1.  I am not sure yet if this is going to need to be done in R2, so I will update this article when R2 releases.

As a side note – one of the symptoms I see, is when the OpsDB StateChangeEvent table is one of the largest tables.  This happens because the constant restart of the agent causes state to be recalculated for every monitor on the agent.  It sends this recalculated state on every restart of the agent, flooding the database with state data.

I’d recommend putting these rules in place to alert on this event, for ANY customer…. knowing is power.

As far as some recommended values…. the best thing is to find your own “happy place”… but as of this writing, I start with 250MB as an initial adjustment.  You can create a group of Windows Computer objects that are affected, and simply add computers to this group that seem to need more privatebytes.  Use this custom group for your override on the HealthService and/or Monitoring host workflows.

 

Recommendation examples:

I don't recommend overriding the Health Service Private Bytes Threshold Monitor “for all objects of type: Health Service”.  I have seen this impact the Health Service on the management servers – even though there is an override for the management servers which should be more specific in a conflict case – but this doesn't always work in the field.

You can use groups of Windows Computer objects (Or groups of Health Service Instances if you so desire) for this override.

If we wanted to override this value – and increase it to 200MB for all agents, override for a group – and use the “Agent Managed Computer Group”:

image

 

You have to be VERY careful when doing the above – because this override will potentially conflict with other group overrides… if you want specific overrides for Exchange, DHCP, DNS, SQL, etc…

A better approach is like so:

image

 

Just make SURE that whatever groups you use – they don't have/share any of the same computers… or you can get some conflicting values here.
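
One way to sanity check this is to export the membership of each override group and look for computers that appear in more than one. Here is a minimal Python sketch of that check. The group names and members are made-up examples, and getting the membership out of OpsMgr is left to you; the function only compares sets you hand it.

```python
from itertools import combinations


def find_shared_computers(groups: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """Return every pair of groups that share members, with the shared members.

    `groups` maps a group name to the set of computer names in that group.
    """
    overlaps = {}
    for name_a, name_b in combinations(sorted(groups), 2):
        shared = groups[name_a] & groups[name_b]
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Hypothetical example data: SQL2 was added to two override groups.
example_groups = {
    "SQL override group": {"SQL1", "SQL2"},
    "DC override group": {"DC01"},
    "Exchange override group": {"EX1", "SQL2"},
}
conflicts = find_shared_computers(example_groups)
```

An empty result means no computer can receive two different group overrides for the same monitor.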

 

I am attaching a management pack below.  This MP contains a simple rule as described above to alert on the HealthService restart, and contains a view which is scoped only to these alerts.  You should keep an eye on these and take action on them.  I added alert suppression on these rules – they will create a single alert for each identical computer and event ID, and increment repeat counts on the worst offenders.  This is one warning alert you should NOT ignore.
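
To show what that suppression amounts to, here is a small Python sketch: one alert per (computer, event ID) pair, with a repeat count that goes up each time the same pair shows up again. The function names are hypothetical, and I am assuming the repeat count starts at 0 for the first occurrence. This models the behavior described above; it is not the management pack's implementation.

```python
def suppress_alerts(events):
    """Collapse (computer, event_id) events into one alert per identical pair.

    Returns a dict mapping (computer, event_id) to a repeat count, where the
    first occurrence counts as 0 (an assumption made for this sketch).
    """
    alerts = {}
    for computer, event_id in events:
        key = (computer, event_id)
        if key in alerts:
            alerts[key] += 1   # same computer and event ID: bump repeat count
        else:
            alerts[key] = 0    # first occurrence creates the alert
    return alerts


def worst_offenders(alerts, top=3):
    """Return the alerts with the highest repeat counts, highest first."""
    return sorted(alerts.items(), key=lambda item: item[1], reverse=True)[:top]
```

Sorting the view by repeat count gives you the same "worst offenders" list, which is where I would start with overrides.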

Published Thursday, March 26, 2009 7:50 PM by kevinhol
Attachment(s): Custom.Agent.Restart.Alert.MP.zip

Comments

# re: Are your agents restarting every 10 minutes? Are you sure?

Wednesday, April 22, 2009 1:09 PM by Layne

Kevin, any idea why when trying to override the Health Service Private Bytes Threshold monitor you cannot override it for a group of computer objects that you've created if you choose override "for a group"?  The only groups that appear in the list to choose from are groups that are created by management packs, etc., not any user created groups.  In order to override for a group you've created you have to choose override  "for all objects of another type", view all targets, and then choose the group you've created.

Conversely, the rule for Monitoring Host Private Bytes Threshold will allow you to override for a group that you've created if you choose "override for a group".

Great articles, keep up the good work.

# re: where is my group?

Wednesday, April 22, 2009 1:27 PM by kevinhol

This is a bug in SP1 - when you try and do this from the Authoring pane.

Open Health Explorer - create the override there, and you will see your group.  :-(

# re: Are your agents restarting every 10 minutes? Are you sure?

Sunday, May 24, 2009 9:57 PM by DilipManchala

Hi Kevin, As suggested by you in the above blog I have made changes Health Service Private bytes to 200MB for a particular SCOM agent. But I still see my agent not getting restarted.

I am running the nworks application on this particular computer and also see my healthservicestore.edb size growing 220 MB. Is this normal??

I also see the below errors in the event log on the same agent machine.

Event ID:4506

Event Source:HealthService

Description:

Data was dropped due to too much outstanding data in rule "many" running for instance "many" with id:"many" in management group "XXXXXX".

# re: Nworks

Tuesday, May 26, 2009 8:56 AM by kevinhol

The Nworks MP causes an agent to act as a proxy - and load workflows and collect data for potentially a HUGE number of machines.  Therefore it is common that this agent will need more memory for this.

This is documented in the Nworks documentation I believe.

I would disable these monitors for those instances - and measure how much they *consume* and how fast they consume it - and stop bouncing them.

If you must set a value - I would start at 600-800MB for privatebytes.... and monitor the consumption closely.

# re: Are your agents restarting every 10 minutes? Are you sure?

Wednesday, May 27, 2009 3:16 PM by JHBoricua

I've created the custom rules and the custom view per your post, but the 'Source' column on my alerts show "Microsoft(R) Windows(R) Server 2003, Enterprise Edition" rather than the actual server name as shown on your screenshot. How did you get it to show the server name in the Source column of your custom view?

# re: servername in the view?

Wednesday, May 27, 2009 4:08 PM by kevinhol

You need to personalize the view - and add "Path" next to "source".

This is true for any alert view - depending on the target class of an alert, the FQDN will either be in Source or Path.... not always both and not consistent.

# re: Are your agents restarting every 10 minutes? Are you sure?

Tuesday, June 09, 2009 6:39 AM by snajgel

When working with multihomed agents you need to do the override in the other management group as well.

# re: Multi-homed agents

Tuesday, June 09, 2009 9:44 AM by kevinhol

Yep - I just ran into that yesterday with a customer.  He had some restarting all the time - because they were multi-homed with his pre-prod management group.

# re: Are your agents restarting every 10 minutes? Are you sure?

Tuesday, June 09, 2009 3:55 PM by snajgel

Do you think they restart because they are multi-homed?

# re: are they restarting because they are multi-homed?

Tuesday, June 09, 2009 4:39 PM by kevinhol

No - they are restarting because BOTH management groups are monitoring the SAME healthservice process... and the lower value from either MG will bounce the service.

Overrides will need to be kept in synch for each management group - for this.

# re: R2

Tuesday, June 09, 2009 4:40 PM by kevinhol

By the way - this has changed quite a bit in R2 - when I have some time - I am going to document how that works.... and how it is different than SP1.

# re: Are your agents restarting every 10 minutes? Are you sure?

Wednesday, October 21, 2009 6:57 PM by Dominique

The agents do not restart, but I get the "MonitoringHost.exe Handle Count Threshold Alert Message". How can I measure the actual value used, so I can override with something close to it?

Thanks,

Dom
