At random, you might see a single MonitoringHost.exe process on an agent consuming 100% CPU (or 50%, or 25%, depending on how many cores you have). The process will stay at this level and will not recover. If you restart the OpsMgr HealthService, the problem goes away, and might not return for days or even weeks.
This particular symptom might be due to an MSXML spinlock issue. This is a core Windows OS issue, and there is a hotfix available, which I have on my HOTFIX LINK.
The KB is 968967:
“The CPU usage of an application or a service that uses MSXML 6.0 to handle XML requests reaches 100% in Windows Server 2008, Windows Vista, Windows XP Service Pack 3, or other systems that have MSXML 6.0 installed”
I have seen that most customers are affected by this issue from time to time. I have seen it very commonly in my lab, on Server 2008 Domain controllers, and my Server 2008 Hyper-V hosts…
A note on patching Server 2008:
When you go to download this hotfix for a Server 2008 machine, it is not obvious which hotfix to get. Here is the list of all available fixes:
For patching Server 2008 – you need to download the “Windows Vista” hotfix – in either x86 or x64, depending on your OS version:
Monitoring for this condition:
You can easily write a threshold monitor targeting Agent or Health Service to track the Process \ % Processor Time counter for MonitoringHost, and set it to alert when multiple consecutive samples are above a defined threshold.
Here is an example of creating this monitor:
Authoring Pane > Monitors > New Unit Monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Consecutive Samples over Threshold.
Give it a custom name that follows your documented custom Monitor naming standard, target “Health Service”, and put this under Performance rollup.
Hit the “Select” button (in SP1, select “Browse”). In the perf counter picker, choose a server with an installed agent, choose the Object “Process”, the Counter “% Processor Time”, and the Instance “MonitoringHost”, and click OK.
Since there are multiple MonitoringHost processes… we will add a Wildcard to the Instance name in the monitor…. this will monitor ANY MonitoringHost process for high CPU. Set the Interval to every 1 minute.
For the number of consecutive samples, and threshold… that is up to you. For me – I will say that if I detect a single MonitoringHost process using more than 50% CPU, over all 5 consecutive samples (5 minutes) then I consider that bad:
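The consecutive-samples logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how OpsMgr implements it; the function name and its parameters are hypothetical, and the defaults mirror the 50% / 5 samples example.

```python
def breaches_threshold(samples, threshold=50.0, consecutive=5):
    """Return True if `consecutive` samples in a row are above `threshold`.

    `samples` is a sequence of CPU percentages for ONE process,
    taken at a fixed interval (one per minute in the example above).
    """
    run = 0  # length of the current streak of samples above the threshold
    for value in samples:
        # A sample at or below the threshold resets the streak.
        run = run + 1 if value > threshold else 0
        if run >= consecutive:
            return True
    return False


if __name__ == "__main__":
    stuck = [12, 97, 99, 100, 98, 99]      # pinned for five minutes
    spiky = [80, 90, 95, 70, 10, 85, 90]   # busy, but it recovers
    print(breaches_threshold(stuck))
    print(breaches_threshold(spiky))
```

Requiring an unbroken streak is what separates a stuck process from one that is just briefly busy: a single sample under the threshold resets the count.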
At this point, you can simply alert on the condition, or even try to add a recovery script that will bounce the health service. Bouncing the HealthService when one of the processes is using all the CPU is not always reliable, especially from a “NET STOP & NET START” type command. I have found it more reliable to just kill the MonitoringHost process in this condition and allow it to respawn, but your mileage may vary.
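As a rough sketch of the "kill only the stuck process" idea, the snippet below picks the busiest MonitoringHost instance above the threshold and builds a `taskkill` command for that one PID. It only builds the command and does not run anything. The function name and the sample PIDs are made up for illustration, and how you collect the per-process CPU figures is left open.

```python
def kill_command_for_busiest(processes, threshold=50.0):
    """Build a taskkill command for the busiest process above `threshold`.

    `processes` is a list of (pid, cpu_percent) pairs, one per running
    MonitoringHost.exe instance. Returns None if nothing is over the
    threshold, so healthy instances are left alone.
    """
    over = [(pid, cpu) for pid, cpu in processes if cpu > threshold]
    if not over:
        return None
    pid, _cpu = max(over, key=lambda item: item[1])
    # /PID selects the process by ID, /F forces termination.
    return ["taskkill", "/PID", str(pid), "/F"]


if __name__ == "__main__":
    # Hypothetical snapshot: three MonitoringHost processes, one stuck.
    snapshot = [(1200, 3.0), (3344, 99.0), (5120, 1.5)]
    print(kill_command_for_busiest(snapshot))
```

Targeting a single PID leaves the other MonitoringHost processes running, which matches the approach of killing the offender and letting it respawn instead of restarting the whole service.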
Thanks Kevin. Yes, we are seeing this and currently have a PSS case open. Only problem is KB968967 requires SP2, we are only on SP1 and can't upgrade at the moment. So, threshold monitor it is. But, I'm having some problems choosing the Object and Counter in trying to target the Monitoringhost.exe process. If you could provide a little more specific guidance for creating this monitor, it would help us out greatly.
Why/How is KB968967 for SP2 only?
There appears to be a version for Server 2008 RTM and/or SP2?
Apparently the SP1 version wasn't available for download, even though it said it was. Our pss case engineer was able to get this resolved on the Microsoft side. Thanks
Hey Kevin, the monitor is working correctly. Meaning, it's generating the alerts, but I'm not receiving the email alerts via our subscription. In my subscription I have the following parameters selected:
Check mark on "Raised by any instance of a specific class" to this I have added "Health Service".
Check mark on "created by specific rules or monitors (e.g. sources)" to this I have added the newly created Rule "MonitoringHost.exe process CPU monitor"
Any idea why this subscription is not working? Have I missed something in the configurations?
Hi Kevin, I think the System.Performance.ConsecutiveSamplesCondition does not work well with multiple instances. From my experience with SCOM 2007 R2, ALL the instances have to be over the threshold.
Interesting point on that one - in that case it would be better to make two or three rules - one for each possible instance of MH.
From most of what I have worked with - this might cause the monitor to trigger - but flip flop when another instance is not above the threshold. Good point.
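To make the multi-instance concern concrete, here is a small sketch that keeps a separate consecutive-sample count per instance, so a quiet instance cannot reset the streak of a busy one. This is not the OpsMgr condition detection itself, just an illustration; the function name is hypothetical and the instance names follow the usual perf counter style (MonitoringHost, MonitoringHost#1).

```python
from collections import defaultdict


def instances_in_breach(samples, threshold=50.0, consecutive=5):
    """Return the set of instances whose latest streak meets `consecutive`.

    `samples` is an ordered sequence of (instance_name, cpu_percent)
    pairs, with samples from different instances interleaved.
    """
    runs = defaultdict(int)  # per-instance streak of samples above threshold
    for instance, value in samples:
        runs[instance] = runs[instance] + 1 if value > threshold else 0
    return {name for name, run in runs.items() if run >= consecutive}


if __name__ == "__main__":
    # Five sampling rounds: one instance pinned, the other idle.
    interleaved = []
    for _ in range(5):
        interleaved.append(("MonitoringHost", 100.0))
        interleaved.append(("MonitoringHost#1", 2.0))
    print(instances_in_breach(interleaved))
```

If all samples fed one shared counter instead, every idle sample from MonitoringHost#1 would reset the streak and the pinned instance would never reach five in a row.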
Actually I have a script that does the monitoring of CPU for multiple instances and that I use when I know multiple instances exist. It allows me to do the monitoring with only 1 monitor and without discovery.
would you mind sharing how you're monitoring for CPU utilization for multiple instances of MonitoringHost.exe?
Is it a vbscript? powershell? We're running into the issue described and want to get monitoring in place for this behavior until we can apply the hotfix.
looks like this will point me in the right direction for now
Hi, I see that loads of people are having the same issue with high CPU utilization. I have the same problem, and if I check the solutions, they all refer to updating MSXML6 to SP1. Problem is... I already have SP2 loaded and am still getting the same issue. The workaround I have put in place for now is to set process affinity so that the server at least responds in the interim until I find a better solution. Has anyone had a similar issue?
There aren't any common issues where a MH process spikes to 100% and stays there, unless it is the MSXML spinlock.
The only other situations I am aware of are when you have some bad MPs or really sick machines.
When you say "high cpu utilization" you need to be more specific. Which process (or processes) is spiking, is it going to 100%, how long does it stay there, what's in the OpsMgr event logs, etc.? What is the OS of the agent, what MPs are loaded against it, what is the machine's role in life, etc.?
A dump of the process can be analyzed when it is in this condition to determine what's eating the CPU.
I am new to OpsMgr, so I don't understand the suggestion to create 2 or 3 rules, one per MH instance. Do I need to reconfigure the original monitor I built? Will this monitor run against all servers in my environment? Last question: I have Windows 2003 servers with MSXML 6 SP2; does KB 968967 apply in this case?
We are experiencing the issue as described in KB968967. It has been applied to a Windows 2008 server successfully (have seen no issue in 2 days). I have a Windows 2003 R2 Enterprise x64 Edition Service Pack 2 server running SQL that has the MonitoringHost.exe issue. The MSXML 6 is SP2 (version 6.20.2003.0), noted with (KB973686). If I try to apply the hotfix, it states that it is an older version. Does it need to be applied anyway? Doesn’t the SP2 version have the spinlock fix? In the meantime, I am having to daily rename the health service state. Our SQL admin would like to set a priority on the MonitoringHost.exe process. Could that be another workaround?
We had this issue even running the MSXML6 hotfix and VBScript 5.7 while still running the converted Exchange MP. (CPU Spikes were 8-20 seconds long, but causing CPU resource contention on the monitored systems and caused service impact.) After much debugging, I got the issue down to the Execute: Test-Mailflow* cmdlets. As soon as I turned both internal and remote execution off, the CPU spikes stopped completely.
Why on EARTH are you running that old conversion MP? That thing is BAD news and no amount of tuning will make up for the enhancements in the SP1 native MP.