Kevin Holman's System Center Blog

Posts in this blog are provided "AS IS" with no warranties, and confer no rights. Use of included script samples is subject to the terms specified in the Terms of Use.

Fixing troubled agents

Sometimes agents will not “talk” to the management server upon initial installation, and sometimes an agent becomes unhealthy long after working fine.  Maintaining agent health is an ongoing task in any OpsMgr admin’s life.

This post is NOT an “end to end” manual of all the factors that influence agent health, but that is something I am working on for a later time.  There are many factors in an agent’s ability to communicate and work as expected.  A few key areas that commonly affect this are:

  • DNS name resolution (Agent to MS, and MS to Agent)
  • DNS domain membership (disjointed)
  • DNS suffix search order
  • Kerberos connectivity
  • Kerberos SPNs accessible
  • Firewalls blocking 5723
  • Firewalls blocking access to AD for authentication
  • Packet loss
  • Invalid or old registry entries
  • Missing registry entries
  • Corrupt registry
  • Default agent action accounts locked down/out (HSLockdown)
  • HealthService Certificate configuration issues.
  • Hotfixes required for OS Compatibility
  • Management Server rejecting the agent
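Two of the items above, DNS name resolution from the agent to the management server and firewalls blocking 5723, can be spot-checked from the agent machine.  The following is only a minimal sketch in Python, not part of OpsMgr: the function name is made up for illustration, and it tests nothing about Kerberos, certificates, or the reverse (MS to agent) direction.

```python
import socket


def check_management_server(host, port=5723, timeout=5.0):
    """Return (resolved_addresses, port_open) for a management server.

    resolved_addresses is empty when the name does not resolve (DNS issue).
    port_open is False when the TCP connection fails (firewall, service down).
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return [], False  # name resolution failed, nothing to connect to

    addresses = sorted({info[4][0] for info in infos})
    try:
        # Plain TCP connect only; this does not speak the agent protocol.
        with socket.create_connection((host, port), timeout=timeout):
            return addresses, True
    except OSError:
        return addresses, False
```

For example, `check_management_server("ms01.example.com")` (a hypothetical server name) returns the addresses the name resolved to and whether a TCP connection to 5723 succeeded.  A successful connection only shows that the port is reachable, not that the management server will accept the agent.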

 

How do you detect agent issues from the console?  The problem might be that the agents are not showing up in the console at all!  It might be a manual install that never shows up in Pending Actions.  Or a push deployment that stays stuck in Pending Actions and never shows up under “Agent Managed”.  Or even an agent that does show up under “Agent Managed” but never shows as being monitored, never returning agent version data, etc.

 

One of the BEST things you can do when faced with an agent health issue is to look on the agent, in the OperationsManager event log.  This is a fairly verbose log that will almost always give you a good hint as to the trouble with the agent.  That is ALWAYS one of my first steps in troubleshooting.

 

Another way of examining agent health is through the built-in views in OpsMgr.  In the console, there is a view located at the following:

[Image: location of the view in the console]

 

 

This view is important – because it gives us a perspective of the agent from two different points:

1.  The perspective of the agent monitors running on the agent, measuring its own “health”.

2.  The perspective of the “Health Service Watcher”, which is the agent being monitored from a management server.

 

If any of these are red or yellow, that is an excellent place to start.  This should be an area that your level 1 support for Operations Manager checks DAILY.  We should never have a high number of agents that are not green here.  If we do, this is indicative of an unhealthy environment, or of the admin team not adhering to best practices (such as keeping up with hotfixes, using maintenance mode correctly, etc.).

Use Health Explorer on these views – to drill down into exactly what is causing the Agent, or Health Service Watcher state to be unhealthy.

 

Now…. the following are some general steps to take to “fix” broken agents.  These are not in definitive order.  The order of steps really comes down to what you find when looking at the logs after taking these steps.

 

  • Start the HealthService on the agent.  You might find the HealthService is just not running.  This should not be common or systemic.  Consider enabling the recovery for this condition to restart the HealthService on Heartbeat failure.  However – if this is systemic – it is indicative of something causing your HealthService to restart too frequently, or administrators stopping SCOM.  Look in the OpsMgr event log for verification.

 

  • Bounce the HealthService on the agent.  Sometimes this is all that is needed to resolve an agent issue.  Look in the OpsMgr event log after a HealthService restart, to make sure it is clean with no errors.

 

  • Clear the HealthService queue and config (manually).  To do this, stop the HealthService, delete the “\Program Files\System Center Operations Manager 2007\Health Service State” folder, and then start the HealthService.  This removes the agent config file and the agent queue files.  The agent starts up with no configuration, so it will resort to the registry to determine which management server to talk to.  From the registry, it will find out whether it is AD integrated or, if not, which fixed management server to talk to.  That server name is stored under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups\PROD1\Parent Health Services\, in the \<#>\NetworkName string value (PROD1 is the management group name in this example).  The agent will contact the management server, request config, receive config, download the appropriate management packs, apply them, run the discoveries, send up discovery data, and repeat the cycle for a little while.  This is very much what happens on a new agent during initial deployment.

 

  • Clear the HealthService queue and config (from the console).  When looking at the above view (or any state view or discovered inventory view which targets the HealthService or Agent class) there is a task in the actions pane: “Flush Health Service State and Cache”.  This performs a very similar action to the manual one above, as a console task.  It will only work on an agent that is somewhat responsive.  If it does not work, you need to perform the flush manually, because the agent is really broken from communication with the management server.  This task will never complete, and will not return success, because the task is cut off from its own workflow as the queue is flushed.

 

  • “Repair” the agent from the console.  This is done from the Administration pane – Agent Managed.  You should not run a repair on any AD-integrated agent – as this will break the AD integration and assign it to the management server that ran the repair action.  A “repair” technically just reinstalls the agent in a push fashion, just like an initial agent deployment.  It will also apply/reapply any agent related hotfixes in the management server’s \Program Files\System Center Operations Manager 2007\AgentManagement\ directories.

 

  • Reinstall the agent (manually).  This would be for manual installs or when push/repair is not possible.  This section is where the combination of options gets a little tricky.  When you are at this point… where you have given up, I find just going all the way with a brute force reinstall is the best way.  This means performing the following steps:
    • Uninstall the agent via Add/Remove Programs.
    • Run the Operations Manager Cleanup Tool CleanMom.exe or CleanMOM64.exe.  This is designed to make sure that the service, files, and all registry entries are removed.
    • Ensure that the agent’s folder is removed at:  \Program Files\System Center Operations Manager 2007\
    • Ensure that the following registry keys are deleted:
      • HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager
      • HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HealthService
    • Reboot the agent machine (if possible)
    • Delete the agent from Agent Managed in the OpsMgr console.  This will allow a new HealthService ID to be detected and is sometimes a required step to get an agent to work properly, although not always required.
    • Now that the agent is gone cleanly from both the OpsMgr console and the agent operating system, manually reinstall the agent.  Keep it simple: install it using a named management server/management group, and use Local System for the agent action account (this removes any common issues with a low-privilege domain account, and with AD integration if used).  If it works correctly, you can always reinstall again using a low-privilege account or AD integration.
    • Remember to import certificates at this point if you are using those on the individual agent.
    • As always – look in the OperationsManager event log…. this will tell you if it connected, and is working, or if there is a connectivity issue.
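Before reinstalling, it can help to confirm that the folder and registry keys listed in the steps above are really gone.  The sketch below is one possible way to check, written in Python and relying on the standard Windows `reg query` command; the C: drive letter and the function names are assumptions for illustration, not something the cleanup tool provides.

```python
import os
import subprocess

# Locations named in the reinstall steps. The C: drive letter is an assumption.
AGENT_FOLDER = r"C:\Program Files\System Center Operations Manager 2007"
REGISTRY_KEYS = [
    r"HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager",
    r"HKLM\SYSTEM\CurrentControlSet\Services\HealthService",
]


def registry_key_exists(key):
    """True if `reg query` finds the key (it exits with 0 on success)."""
    result = subprocess.run(
        ["reg", "query", key],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_leftovers(folder=AGENT_FOLDER, keys=REGISTRY_KEYS,
                   folder_exists=os.path.isdir,
                   key_exists=registry_key_exists):
    """Return the agent folder and registry keys that still exist."""
    leftovers = []
    if folder_exists(folder):
        leftovers.append(folder)
    leftovers.extend(key for key in keys if key_exists(key))
    return leftovers
```

Running `print(find_leftovers())` from an elevated prompt on the agent machine should print an empty list once the uninstall and cleanup steps have done their job.  The two checker functions are parameters so the logic can be tried out without touching a real registry.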

 

To summarize: there are many things that can cause an agent issue, and many methods to troubleshoot.  At a very general level, my typical steps are:

  1. Review OpsMgr event log on agent
  2. Bounce HealthService
  3. Bounce HealthService clearing \Health Service State folder.
  4. Complete brute force reinstall of the agent.

If an external issue is the cause (DNS, Kerberos, Firewall) then these steps likely will not help you, but those causes should be evident from the OpsMgr event log.
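Step 3 in the summary (stop the HealthService, delete the Health Service State folder, start the HealthService again) could be scripted roughly as follows.  This is a hedged sketch in Python, not a supported tool: the C: drive letter, the function names, and the decision to ignore a failed `net stop` (the service may already be stopped) are all assumptions.

```python
import shutil
import subprocess
from pathlib import Path

# Folder named in the post. The C: drive letter is an assumption.
DEFAULT_STATE_DIR = Path(
    r"C:\Program Files\System Center Operations Manager 2007"
    r"\Health Service State"
)


def run_command(args):
    """Run a command and return its exit code."""
    return subprocess.run(args).returncode


def flush_health_service_state(state_dir=DEFAULT_STATE_DIR, run=run_command):
    """Stop HealthService, delete its state folder, start it again.

    Returns (removed, started): whether the folder existed and was deleted,
    and whether the start command reported success.
    """
    state_dir = Path(state_dir)
    # The stop result is ignored: the service may already be stopped.
    run(["net", "stop", "HealthService"])
    removed = False
    try:
        if state_dir.exists():
            shutil.rmtree(state_dir)
            removed = True
    finally:
        # Always try to start the service again, even if the delete failed.
        started = run(["net", "start", "HealthService"]) == 0
    return removed, started
```

It needs to run from an elevated prompt on the agent itself.  The `run` parameter exists so the sequence can be exercised against a scratch folder with a stand-in command runner before pointing it at a real agent.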

 

Also – make sure you see my other posts on agent health and troubleshooting during deployment:

Console based Agent Deployment Troubleshooting table

Agent discovery and push troubleshooting in OpsMgr 2007

Getting lots of Script Failed To Run alerts? WMI Probe Failed Execution? Backward Compatibility?

Agent Pending Actions can get out of synch between the Console, and the database

Which hotfixes should I apply?

Comments
  • That's a very good article, Kevin.

    I have seen one more problem, where agents are hung in one of the SQL tables, especially during new installations.

    I have seen that once you delete that information from the tables, then you could install the agent again fine.

    Thanks for the great article once again

  • We really try hard to come up with ways to solve the problem without resorting to editing a SQL table directly.... doing so is really unsupported and should only be done under the direct guidance (or should I say order) of PSS in a case with Microsoft.  There are a few circumstances where that seems to be the only recourse... but we should exhaust all other options first.

  • Thanks Kevin, very good blog.  Can you provide any additional advice on, or reasons why, an agent's health turns grey?  We get this a lot.  The agents are multi-homed; could this be having an impact?

  • This Blog is very useful.  As to other potential issues with grey agents, check out this kb  support.microsoft.com/.../2288515.

  • Simply Log on to DC and run the following commands

    1. hslockdown /L

    you will see NT Authority\system is in denied state

    Then run the command to bring it in allowed state

    hslockdown /A "NT AUTHORITY\System"

    Cheers

    Saad

  • hi Kevin,

    Great article.  I have used this advice a few times to help with agent issues.  However, I have come across an issue I am having a hard time with.  I have an agent deployed and the agent is showing healthy in the Agent State view.  This particular agent is on a Windows 2008 R2 server.  For some reason the discovery of this Windows 2008 server is not working.  I have other Windows 2008 servers that are working fine.  The agent knows enough to tell that it is on a Windows server, but all of the OS-specific monitors are not active.  The logs show nothing.  I am at a loss here.  I have cleared the cache and repaired the agent.  Any help is appreciated.  Thanks.

  • Useful information Kevin,

    I'm also in a situation where 1-2 agents are not healthy. While checking, I found that the config.xml file is not updated, even though I cleared the cache and allowed the system to recreate the Health Service State folder; that failed to update the xml file too. I've also noticed that Temp folders are not getting created on these agents. I've reinstalled the agent as well. The agent goes into a greyed state after a while, even if I restart the service.

    In the event log I see a lot of entries such as: Rule/Monitor "Microsoft.SystemCenter.LearningModule.FailedInitialization.Alert" cannot be initialized and will not be loaded, and many more similar to this.

    Any help is appreciated.

  • Kevin,

    My problem lies on the Root Management Server. Absolutely everything is running with no issues, but for some reason the server is greyed out. I can restart the service and it is okay for a few minutes, then it goes right back into the greyed out status... Operationally everything functions correctly, but it just never looks good to see the RMS greyed out... Any ideas?

  • Jayson, your issue may be related to SPNs. We just wrestled with it and finally got it right.

    www2.wolzak.com/.../15-the-opsmgr-connector-could-not-connect-to-msomhsvcrms01local

    mymomexperience.blogspot.com/.../ops-manager-2007-agent-not-connecting.html

  • After upgrading System Center Essentials 2007 with the latest OS Management Pack, the owner’s agent of the Hyper-V cluster became grayed out.

    If I change the cluster current host server to other server, it becomes grayed out and the previous one (which was the current host server before) becomes healthy again.

  • How to Flush the Health Service State and Cache on multiple machines at a time? any command line utility available?

  • "The problem might be that they are not showing up in the console at all!"....any suggestions for diagnosing this problem? This particular agent also has no "OperationsManager". A few of the other logs are there, but many that I typically see in a client are missing. This was a manual installation of the ccm client.

  • I have this problem and I tried all the steps that I know, but the problem still exists. Can anyone help me? I am using SCOM 2012.
    "The System Center Management Health Service 1EC09CB7-1B1E-EAC9-D15A-D2C927046DE2 running on host xxx-xx-xxx.Root.net and serving management group with id {0407FB6F-896A-7389-EA01-D60C72ABBD5A} is not healthy. Some system rules failed to load."
