Kevin Holman's System Center Blog

Posts in this blog are provided "AS IS" with no warranties, and confers no rights. Use of included script samples are subject to the terms specified in the Terms of Use.

How to monitor a process on a multi-CPU agent using ScaleBy

The business need:

It is a very common request to monitor a process on a given set of servers, and collect that data for reporting, or monitor it for a given threshold.

One thing you might notice when monitoring performance counters is that not all counters in perfmon behave the way you might assume.

For instance, I want to monitor “how much CPU a process is using”.  Perhaps we wish to monitor the sqlservr.exe process on our SQL servers?

This is easy – because Perfmon already has a Performance Object, Counter, and Instance for that.  In perfmon, we would use:

Process > % Processor Time > sqlservr

image

Easy enough!

So, we can quite easily create a performance threshold monitor and a performance collection rule using this.  Let’s say we set the monitor to alert any time the sqlservr.exe process consumes more than 80% of the CPU, sustained for 5 minutes.

The issue:

However, quite quickly we might notice erratic behavior from our monitor and rule.  The monitor is generating TONS of alerts from almost all our SQL servers, and then quickly closing them… essentially flip-flopping.  When we check the performance data we have collected, we see the process is using up to 800% CPU!!!  So – thinking something is wrong with OpsMgr – we inspect a busy SQL server in perfmon directly… but observe the exact same behavior:

image

As you can see – this process is using almost 400% CPU.  Why?  How is this possible?

This is because the Process performance counters in Windows are not multi-CPU aware.  When a server has 4 CPUs (like the one above does), a process can use more than one CPU at a time, provided it is running multiple threads.  This way, it can use up to 100% of each CPU or core (logical processor), so a process on a 4-processor server can register up to 400% on that counter.  If a process is really only consuming 20% of the total CPU, that will show up as 80% on a 4-core machine.  Think about today’s hardware… many boxes have 16 cores these days, which would register as 320% processor utilization for something really only using 20% of the total CPU.
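The arithmetic here can be sketched in a few lines of Python. This is only an illustration of the math, not something OpsMgr runs, and the function name `normalize_process_cpu` is made up for this example.

```python
def normalize_process_cpu(raw_percent, logical_processors):
    """Convert a raw per-process '% Processor Time' sample to percent of total CPU.

    The raw counter can reach 100 * logical_processors, so dividing by the
    processor count puts the value back on a 0-100 scale.
    """
    if logical_processors < 1:
        raise ValueError("logical_processors must be at least 1")
    return raw_percent / logical_processors


# The two cases from the text: 20% of total CPU shows up as 80 on a
# 4-core machine and as 320 on a 16-core machine.
print(normalize_process_cpu(80, 4))    # 20.0
print(normalize_process_cpu(320, 16))  # 20.0
```

A reading of 400 on the 4-processor server above would come out as 100, meaning the process has every logical processor fully busy.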

As you can see – this causes a BIG problem for monitoring processes.  As an IT Pro – you need to know when a process is consuming more than (x) percent of the *total system resources*…. and every server will likely have a different number of processors.

The solution:

In OpsMgr R2 – a new XML-based function was created to help resolve this challenge.  This is known as <ScaleBy>.

The <ScaleBy> function essentially gives you the ability to take the numeric value collected by a workflow and divide it by another number (an integer).

I can input a fixed value here, in integer form, or I can input a variable.  For the variable, I can actually pull data from discovered properties of monitoring classes.  This is GREAT in this instance, because we already discover the number of Processors a Windows Computer has.  We can use this discovered data, along with this <ScaleBy> function, to fix our monitors and collection rules that need a little massaging to the data we get from perfmon.

Here are the Windows Computer class properties:

image

Let’s walk through an example using the authoring console.

  • Open the Authoring console.
  • Create a new empty management pack.
  • Go to Health Model, Monitors, right click and create a new monitor. 
  • Windows Performance > Static Thresholds > Consecutive Samples.
  • Give your workflow an ID, Display Name, and choose a good target class which will contain your process.  I will use Windows Server Operating System for example purposes, but you want to always try and choose a target class that will have your process counter in perfmon.
  • Select System.Health.PerformanceState as the parent Monitor:

image

  • Browse a SQL server for the process object you will need – or type in the relevant data.  I will set my monitor to sample every minute.  For a monitor, this data is not collected and inserted into the database – the samples are kept on the agent and checked against the threshold… so we can monitor the process at a MUCH higher sample rate than we would ever use for a performance collection rule.

image

  • I set my monitor to change state when 5 consecutive samples have all been over 80% CPU:

image

  • Click finish – then open the properties of the monitor you just created.  Go to the configuration tab.  Here are all the typical configurable items in a performance monitor workflow.

image

  • However – we need to add one more – the <ScaleBy> function.

We have to do this in XML – as no UI was added for this capability.  Click “Edit” on the configuration tab, which will pop up the XML of this configuration.

We are going to add a single line after <Frequency> which will be this line:

<ScaleBy>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/LogicalProcessors$</ScaleBy>

What this does – is tell the workflow to take the numeric value received from perfmon, and then divide by the numeric value that is a property of the Windows Computer class for number of logical processors.  Then take THIS calculated output and use that for collection or threshold evaluation.

Here is my finished XML snippet:

<Configuration>
  <ComputerName>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
  <CounterName>% Processor Time</CounterName>
  <ObjectName>Process</ObjectName>
  <InstanceName>sqlservr</InstanceName>
  <AllInstances>false</AllInstances>
  <Frequency>60</Frequency>
  <ScaleBy>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/LogicalProcessors$</ScaleBy>
  <Threshold>80</Threshold>
  <Direction>greater</Direction>
  <NumSamples>5</NumSamples>
</Configuration>
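
As a rough model of what this configuration asks for, the Python sketch below divides each raw sample by the ScaleBy value and reports a breach only when the last NumSamples scaled values are all above the Threshold. It is an assumption about the behavior described in this post, not the actual monitor module, and the function name and sample values are invented for illustration.

```python
def breaches_threshold(raw_samples, scale_by, threshold=80, num_samples=5):
    """Return True if the last num_samples scaled samples all exceed threshold.

    Mirrors the settings in the snippet: ScaleBy is the divisor, Threshold
    is 80, Direction is "greater", NumSamples is 5.
    """
    if len(raw_samples) < num_samples:
        return False  # not enough consecutive samples yet
    recent = raw_samples[-num_samples:]
    return all(sample / scale_by > threshold for sample in recent)


# Hypothetical one-minute samples of the raw sqlservr counter on a 4-CPU box.
raw = [200, 210, 190, 220, 205]

print(breaches_threshold(raw, scale_by=1))  # True: unscaled, every sample is "over 80"
print(breaches_threshold(raw, scale_by=4))  # False: really only 47-55% of total CPU
```

Without the divisor, a process using about half the machine trips the monitor, which is the flip-flopping described earlier.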

Now – the authoring console was not updated to fully understand this new function, so you might see an error for this.  Simply hit ignore.

Your new monitor configuration now looks like this:

image

You can do the exact same operation on a performance collection rule as well to “normalize” this counter into something that makes more sense for reporting.

Another use for this is a counter that is reported in bytes… when you want it reported in megabytes.  You could hard code a <ScaleBy> of 1000000 (one million).  That way – if you wanted to report on how many megabytes a process was consuming over time… instead of showing 349,000,000 (bytes) on a chart, you can show a simple 349 megabytes.  That XML would simply be:

<ScaleBy>1000000</ScaleBy>

Ok… I hope this made some sense…. this is a valuable method to normalize some perfmon data that might not be in what I call “human format”.  Keep in mind – you can ONLY use this XML functionality on an R2 management group, and it will only be understood by an R2 agent.

You can quickly go back to your previously written process monitors, and add this single line of XML really easily, using your XML editor of choice.

One last thing I want to point out…  some of the previously delivered MPs that Microsoft shipped might be impacted by this issue.  For instance – in the current ADMP version 6.0.7065.0 there is a monitor “AD_CPU_Overload.Monitor” (AD Processor Overload (lsass) Monitor) which does not take into account the number of logical processors.  This is often one of the MOST noisy monitors in my customer environments, especially on a busy domain controller.  This is simply because MOST DCs have more than one CPU – and this skews the ability for this monitor to work.  The issue is – they could not add this <ScaleBy> functionality to this MP – because that would make the ADMP R2-only… which we don't want to do.

You have two workarounds for SP1 management groups:  monitor processes using a script that queries WMI for the number of CPUs and handles the math itself (ugly), OR create groups of all Windows Computers based on their number of logical processors (easy) and then override these monitor thresholds with values appropriate for each processor count.
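
For the group-based workaround, the override value for each group works out to the percentage of total CPU you care about multiplied by that group's processor count. The Python sketch below shows that arithmetic only; it assumes one group per logical processor count, and the function name is made up for this example.

```python
def raw_threshold_for_group(target_percent_of_total, logical_processors):
    """Raw counter threshold equal to target_percent_of_total of the whole machine."""
    if logical_processors < 1:
        raise ValueError("logical_processors must be at least 1")
    return target_percent_of_total * logical_processors


# Override values for an "80% of total CPU" alert, one per group.
for cpus in (1, 2, 4, 8, 16):
    print(cpus, raw_threshold_for_group(80, cpus))
# 1 -> 80, 2 -> 160, 4 -> 320, 8 -> 640, 16 -> 1280
```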

For R2 customers – I recommend disabling this monitor in the ADMP – and replacing it with a custom one that utilizes the <ScaleBy> functionality.

Comments
  • Thanks man! Good posting. This explains a lot!

  • Is it possible to apply the same for a performance collector rule; I want the lsass processor time value also scaled by in the performance data

  • Absolutely.

  • Thank you for info on the <ScaleBy> tag, however, I'm not finding where I can add that tag to a rule. Am I missing something?

  • How to import or enable this custom monitor? and how to verify whether it is enabled or not?

  • Kevin, How to Edit Rule to include <scaleby> ?  I am trying to get datafile size in MB.

  • Kevin, I have one more question. How can we create a report in Excel/PDF format for SQL Server database size? These performance views don't serve much purpose... sometimes.

  • 2 notes about this:

    1: I put <ScaleBy> at the "end" of the configuration, the authoring console hated that, so I manually edited the XML, the RMS wouldn't import it. Apparently this has to be in a specific place, so I put it after <Frequency> and it works just fine

    2: For the alert description, if you want the value of the counter to show up, you need to use $Data/Context/SampleValue$ (I think the authoring console puts in $Data/Context/Value$ which is wrong)

  • Hello,

    Did anyone figure out how to edit the rule to include <ScaleBy>? I have SCOM 2007 R2 and do not see “Edit” on the configuration tab.  I tried to export the MP, add the ScaleBy line, and reimport the MP, but it still doesn't work.  I still see DB size in KB instead of MB

    Thanks in advance

  • Hello. Is the problem solved in SCOM 2012? Thanks for the help.

  • @Ruben - this is an OS artifact of how the perfmon counters work, so SCOM 2012 will not natively change anything about this.

  • Hi, I am using SCOM 2012 SP1 but I do not see the Configuration edit option as posted above in the blog. Any help please?
