Kevin Holman's System Center Blog

Posts in this blog are provided "AS IS" with no warranties, and confer no rights. Use of included script samples is subject to the terms specified in the Terms of Use.

Getting lots of Script Failed To Run alerts? WMI Probe Failed Execution? Backward Compatibility Script Error?

In OpsMgr 2007, your most common alert is likely not an MP-based alert from a technology management pack at all. It could be a built-in alert that a script failed, or that WMI could not be accessed.  This is because when WMI is broken on a machine, almost EVERYTHING fails to execute properly on that agent.

At a recent health check at a customer site, we found the top 5 alerts in their environment (by cumulative repeat count) were:

  • WMI Probe Module Failed Execution
  • Service Check Data Source Module Failed Execution
  • Backward Compatibility Script Error
  • Script or Executable Failed to run
  • Service Check Probe Module Failed Execution

Sometimes these alerts are normal: the server is busy, or someone rebooted it without first putting it into maintenance mode so that the workflows could unload gracefully.

However, a high repeat count on these alerts typically indicates that something is seriously broken on the agent(s).  Most of the time, the failure is in WMI.  Many customers get frustrated with these script errors and see them as “false alerts”, because they don't know how to resolve the root cause: the alert only tells you “this action broke”, not why.  It is critical that you examine these alerts anyway, because they point to something seriously wrong with an agent, such as broken WMI, a cscript problem, or an OS issue.  If you ignore them or disable them, you will never know that monitoring is not functioning 100%.

Generally – here is how I attack script/WMI failures.

1.  If the repeat count is 0 or 1, I ignore these as random failures, and close the alerts from time to time.

2.  If the repeat count is very high, then something is wrong with the agent, and it needs remediation on the agent OS.  Investigate the OpsMgr event log on the agent for Warning/Critical events, to see if a lot of workflows are failing due to this issue.
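The triage rule above can be sketched in a few lines of Python. This is only an illustration: the post says "very high" without giving a number, so the cutoff of 50 is an assumption, and so are the middle "watch" bucket and the `Alert` fields (they are hypothetical, not an OpsMgr API).

```python
from dataclasses import dataclass

# Assumed cutoff: the post only says "very high", so 50 is a placeholder to tune.
HIGH_REPEAT_THRESHOLD = 50


@dataclass
class Alert:
    name: str          # alert name as shown in the console
    agent: str         # agent the alert was raised for
    repeat_count: int  # repeat count reported on the alert


def triage(alert: Alert, high: int = HIGH_REPEAT_THRESHOLD) -> str:
    """Sort an alert into a bucket based on its repeat count."""
    if alert.repeat_count <= 1:
        return "close"        # random one-off failure, close it from time to time
    if alert.repeat_count >= high:
        return "investigate"  # agent likely needs remediation on the OS
    return "watch"            # assumed middle bucket, not from the post


def agents_to_investigate(alerts: list[Alert], high: int = HIGH_REPEAT_THRESHOLD) -> list[str]:
    """Return agents with at least one high-repeat alert, worst repeat count first."""
    worst: dict[str, int] = {}
    for alert in alerts:
        if triage(alert, high) == "investigate":
            worst[alert.agent] = max(worst.get(alert.agent, 0), alert.repeat_count)
    return sorted(worst, key=lambda agent: worst[agent], reverse=True)
```

Feeding it an export of the open alerts gives a short list of agents to look at first, which matches the "attack the high repeat counts first" approach.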

 

The FIRST thing I do is see if WMI is responsive.  I run WBEMTEST and connect to “root\cimv2”.  I then hit “Query” and execute “select * from win32_operatingsystem” to see whether it returns results or an error.  Next, I look at the namespace from the alert in SCOM; perhaps it is “root\MicrosoftDNS” or “root\CCM”.  Then I connect to that namespace and try to run the query from the alert that is failing.

If EITHER of the above connections/queries fails, then I know what's wrong: WMI has a core issue, and I punt this to my platform or application team to fix.  Sometimes it needs a MOF recompile; sometimes the WMI service or the OS needs to be bounced.
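The same two-step check can be scripted instead of clicked through in WBEMTEST. The sketch below is one possible approach, not the method used in this post: it assumes Windows PowerShell is present on the server and shells out to `Get-WmiObject`. The function names, the 60-second timeout, and the returned strings are all made up for illustration.

```python
import subprocess


def build_wmi_command(namespace: str, query: str) -> list[str]:
    """Build a PowerShell command line that runs one WQL query in one namespace."""
    # Inside PowerShell single-quoted strings, a single quote is escaped by doubling it.
    ns = namespace.replace("'", "''")
    wql = query.replace("'", "''")
    script = f"Get-WmiObject -Namespace '{ns}' -Query '{wql}' -ErrorAction Stop"
    return ["powershell.exe", "-NoProfile", "-NonInteractive", "-Command", script]


def wmi_responds(namespace: str, query: str, timeout: int = 60, runner=subprocess.run) -> bool:
    """Return True if the query exits cleanly and prints something, False otherwise."""
    try:
        result = runner(
            build_wmi_command(namespace, query),
            capture_output=True, text=True, timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False  # hung query, or PowerShell could not be started
    return result.returncode == 0 and bool(result.stdout.strip())


def diagnose(alert_namespace: str, alert_query: str, runner=subprocess.run) -> str:
    """Step 1: basic root\\cimv2 query. Step 2: the namespace and query from the alert."""
    if not wmi_responds(r"root\cimv2", "select * from win32_operatingsystem", runner=runner):
        return "core WMI failure"
    if not wmi_responds(alert_namespace, alert_query, runner=runner):
        return "alert namespace/query failure"
    return "WMI responds"
```

The `runner` parameter exists only so the logic can be exercised without a Windows box; on a real server you would leave it at the default and pass the namespace and query copied from the alert.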

If these all appear to work correctly, or the problem is resolved after a WMI service bounce but re-appears later, check out the following.

There are many things you can do to resolve/remediate these issues.  Here is a list of the most common fixes:

1.  Apply http://support.microsoft.com/kb/933061.  This resolves a LOT of WMI issues on the Windows 2003 OS and should be one of your first steps.  It applies to x86 or x64 Windows Server 2003 SP1 or SP2.

2.  Registry modification for WMI buffer thresholds (see below)

"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\WBEM\CIMOM\Low Threshold On Events (B)" to 35000000 (default is 10000000)
"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\WBEM\CIMOM\High Threshold On Events (B)" to 70000000 (default is 20000000)

The registry modification to WMI buffers increases the amount of objects that WMI can hold before injecting sleep delays to the WMI service.
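When many servers are involved, it helps to know which ones still have the old thresholds. Here is a small Python sketch of that comparison, using only the value names and numbers listed above. The function name and the shape of the `current` dictionary are assumptions; the sketch takes no position on the registry value type and simply converts whatever was read to an integer.

```python
CIMOM_KEY = r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\WBEM\CIMOM"

# value name -> (default, recommended), in bytes, as listed in this post
THRESHOLDS = {
    "Low Threshold On Events (B)": (10_000_000, 35_000_000),
    "High Threshold On Events (B)": (20_000_000, 70_000_000),
}


def pending_changes(current: dict) -> dict:
    """Return {value name: recommended} for each threshold still below the recommendation.

    `current` maps value names to what was read from the registry (int or
    numeric string). A missing name is treated as being at the default.
    """
    changes = {}
    for name, (default, recommended) in THRESHOLDS.items():
        value = int(current.get(name, default))
        if value < recommended:
            changes[name] = recommended
    return changes


if __name__ == "__main__":
    # Example: a server where only the low threshold has been raised so far.
    seen = {"Low Threshold On Events (B)": "35000000"}
    for name, target in pending_changes(seen).items():
        print(f"{CIMOM_KEY}\\{name} -> {target}")
```

On Windows, the `current` dictionary could be filled with the standard-library `winreg` module (for example `winreg.QueryValueEx`); that part is left out so the comparison stays runnable anywhere.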

3.  Apply http://support.microsoft.com/kb/955360.  This updates Windows Script Host (cscript) to version 5.7.  It resolves script timeouts and problems with multiple scripts running at the same time, and it has been seen to lessen the impact of VBScripts consuming a LOT of CPU during execution.  It applies to x86 or x64 Windows Server 2003 SP1 or SP2, and is a very good hotfix for DNS servers, DHCP servers, and Domain Controllers.

Making these three modifications should resolve the majority of systemic issues out there, unless WMI is completely corrupt/unresponsive and needs repair.  Sometimes, rebooting a server, or bouncing WMI will temporarily resolve these issues as well, if you cannot apply the fixes immediately.

If you have applied all three of the above and are still seeing a systemic repeat of a WMI query/script failure, the next step is to run the failing query directly against its namespace in WBEMTEST.  I’d like to hear about any experiences here.

Comments
  • Thanks for this post. I'm going to try this. It will take a couple of days to find out if it works.

  • Hi Kevin,

    I get quite a lot of alerts for "Script or Executable Failed to run" on almost every server. There are two types of such alerts; please see below:

    The process started at 5:11:48 PM failed to create System.Discovery.Data, no errors detected in the output. The process exited with 4294967295 Command executed: "C:\WINNT\system32\cscript.exe" /nologo "GetServerNames.vbs" {A18B826D-DFA9-DC21-F94C-68A8A95ADD4C} {86302D45-B0E8-BD9E-F57C-1580AC05ADEB} deuntp008.wwy.wrigley.net Working Directory: C:\Program Files\System Center Operations Manager 2007\Health Service State\Monitoring Host Temporary Files 158\1517\ One or more workflows were affected by this. Workflow name: Microsoft.Office.Sharepoint.Server.2007.MOSS.Server.Discovery Instance name: DEUNTP008 Instance ID: {86302D45-B0E8-BD9E-F57C-1580AC05ADEB} Management group: MSMOM

    The Event Policy for the process started at 8:07:43 PM has detected errors in the output. The 'StdErr' policy expression: \a+ matched the following output: C:\Program Files\System Center Operations Manager 2007\Health Service State\Monitoring Host Temporary Files 33\433\Collect_Server_Information.vbs(328, 3) Active Directory: An invalid directory pathname was passed Command executed: "C:\WINNT\system32\cscript.exe" /nologo "Collect_Server_Information.vbs" cnguap013 Working Directory: C:\Program Files\System Center Operations Manager 2007\Health Service State\Monitoring Host Temporary Files 33\433\ One or more workflows were affected by this. Workflow name: Collect_Server_Information Instance name: cnguap013 Instance ID: {F949270A-16BA-29B1-428F-63BCB20BA774} Management group: MSMOM

    I think it's impossible for me to install the hotfix on every server. Is there something I can do in SCOM 2007 SP1 instead?

  • I get quite a few of these as well, even after R2 was installed; 2003 as well as 2008 servers. Most of them seem to be related to Health VBS scripts.

    Does anybody have any ideas to get these corrected?

    Is this indicative of a true server performance issue or a matter of how the MPs were written? If it's the MPs, I would have expected MS to have known about this and corrected it by now.

  • Larry - did you read my section on how to troubleshoot these?

    Have you discovered a pattern?  

    What is a health VBS?  Can you give some examples of exact script names?

    Did you have high repeat counts, or low?  Is WMI healthy on the server?

    Have you updated the hotfixes I mentioned on 2003?

    I can't tell from your post what steps you have taken to troubleshoot these, or specifically which scripts and repeat counts are involved.  I'll try to help if I can.

  • Is it just me, or does the hotfix ONLY apply to i386? How can I update WSH to version 5.7 for 64bit OS Windows 2003?

  • Click the "show for all platforms" in the download - there is one for x86 and one for x64.

  • Awesome document! I merely stumbled upon this. I don't know how many searches I have done on WMI issues that came up with completely trivial responses: "Restart WMI", "Reboot the system", etc. The explanation of how to test proved to be a huge help, and the registry mods I've never seen before in any documentation.

  • Kevin, this is right on and it helps to have the process roughed in.

    Is there a script one could run on the ill server which would either fix or at least diagnose the problem?  I have too many agents with these sorts of errors to fix them by hand.

    The flow of these events into our SCOM environment is affecting the overall performance of SCOM.  Fixing the individual problem servers is an ongoing and apparently never-ending process.  Is there a way to set the event collection to run less frequently, or to summarize the data flowing into SCOM?  I am considering disabling the "Collect WMI Probe Module Events" rule.  What kinds of reports, rules, monitors, etc. would "break" if the event data stops flowing in?

  • @Ted -

    I seriously doubt these alerts themselves can have THAT much impact on overall SCOM performance.  I have been talking with the product group on this topic - and there are some enhancements scheduled that will quiet these down and reduce the alert traffic for this stuff in OM12.

    That said - there is no magic bullet.  MOST of the time there is a problem with cscript or WMI.  You can start by applying the hotfixes mentioned in this article, then attacking the alerts with really high repeat counts.  Most of the time it's WMI.  My thought is to create a monitor that runs a script that executes a WMI query.  If the output does not equal what's expected, then turn the monitor red: WMI is busted.  Something simple like that will help a lot more than all these misc errors.

    Again - focus on high repeat count ones first....

  • I have some machines that run very heavy processes for several hours each day.  I have noticed those machines report many of these errors, more of the script failures than WMI issues, and they are all time stamped during the heavy resource use times.  Is there a way to keep the scripts and WMI queries from running while the server is low on available resources to see if this quiets down the failures?

    Great Post, Thanks!

  • James - I would run procmon and capture the cscript.exe processes, recording which scripts are running, and when.  Armed with this data, you can work on solving the problem, as you might not even need all that stuff running all the time, or as frequently.

    Additionally, you could place the instances of classes which have script workflows targeting them into scheduled maintenance mode for the known busy periods, accepting the fact that whatever these scripts do, they won't be doing it during these busy times.

  • Can anyone help me create a simple (basic) SCOM monitor or rule using a WMI event alert? Every time I try to make one, it fails.

    A basic WMI query is requested

  • Hi Kevin, it's my first time on your awesome blog. (Excuse my poor English, I'm French Canadian.)

    I know this is a pretty old post, but I was wondering if there is any fix for these types of errors on 64-bit Windows Server 2008 SP2.

    The 5 alerts above are also the top 5 at my site. I monitor a lot of Windows 2008 servers.

    Thanks.

  • Hi Patrick -

    The fixes are the same for any OS: look at the alert and what the alert description gives as the reason, then look at how often it is happening.  From this, you should be able to determine if this is a transient issue, or an indication of a serious server health problem where scripts aren't running, or WMI is unhealthy.

  • Can we create a WMI health monitor to clearly identify the root cause?  And then ideally suppress the other alerts....
