Kevin Holman's System Center Blog

Posts in this blog are provided "AS IS" with no warranties, and confer no rights. Use of included script samples is subject to the terms specified in the Terms of Use.

Writing monitors to target Logical or Physical Disks


This is something a LOT of people make mistakes on – so I wanted to write a post on the correct way to do this, using a very common target as an example.

When we write a monitor for something like “Processor\% Processor Time\_Total” and target “Windows Server Operating System”…. everything is very simple.  “Windows Server Operating System” is a single instance target…. meaning there is only ONE “Operating System” instance per agent.  “Processor\% Processor Time\_Total” is also a single instance counter…. using ONLY the “_Total” instance for our measurement.  Therefore – your performance unit monitors for this example work just like you’d think.

However – Logical Disk is very different.  A given agent will often have MULTIPLE instances of “Logical Disk”, such as C:, D:, E:, F:, etc…   We must write our monitors to take this into account.

For this reason – we cannot monitor a Logical Disk perf counter, and use “Windows Server Operating System” as the target.  The only way this would work, is if we SPECIFICALLY chose the instance in perfmon.  I will explain:

Bad example #1:

I want to monitor for the perf counter Logical Disk\% Free Space\<All Instances> so that I can get an alert when any logical disk is below 50% in free space.

I create a new monitor > unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold. 

image

I target a generic class, such as “Windows Server Operating System”.

I choose the perf counter I want – and select all instances:

image

And save my monitor.

The problem with this workflow – is that we targeted a multi-instance perf counter at a single-instance target.  This workflow will load on all Windows Server Operating Systems, and parse through all discovered instances of the counter.  If an agent only has ONE instance of “Logical Disk” (C:) then this monitor will work as expected…. if the C: drive does not have enough free space, the monitor turns red and alerts – no issues.  HOWEVER… if an agent has MULTIPLE instances of logical disks, C:, D:, E:, AND those disks have different threshold results… the monitor will “flip-flop” as it examines each instance of the counter.  For example, if C: is running out of space, but D: is not… the workflow will examine C:, turn red, generate an alert, then immediately examine D:, and turn back to green, closing the alert.
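To make the flip-flop concrete, here is a minimal Python sketch of one monitor with a single health state being fed every counter instance in turn. It is only a model of the behavior described above, not how SCOM is implemented; the readings, the function name and the order of instances are all made up for illustration.

```python
# Made-up free-space readings (percent) from one agent, in the order the
# monitor happens to receive them. Instance names mirror perfmon.
samples = [("C:", 12.0), ("_Total", 55.0), ("D:", 15.0), ("E:", 80.0)]
THRESHOLD = 50.0  # Bad example #1 alerts below 50% free space


def run_single_state_monitor(samples, threshold):
    """Feed every instance into ONE health state and record each flip."""
    state = "healthy"
    changes = []
    for instance, free_pct in samples:
        new_state = "unhealthy" if free_pct < threshold else "healthy"
        if new_state != state:
            # Each flip is a state change; each flip to unhealthy is an alert.
            changes.append((instance, state, new_state))
            state = new_state
    return changes


changes = run_single_state_monitor(samples, THRESHOLD)
for instance, old, new in changes:
    print(f"{instance}: {old} -> {new}")
```

With these made-up numbers, one pass over four instances produces four state changes on a single Operating System object, and the same thing repeats on every sample interval.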

This is SERIOUS.  This will FLOOD your environment with state changes, and alerts, every minute, from EVERY Operating System.

A quick review of Health Explorer will show what is happening:

This monitor went “unhealthy” and issued an alert at 10:20:58AM for the C: instance:

image

Then went “healthy” in the same SECOND from the _Total Instance:

image

Then flipped back to unhealthy, at the same time – for the D: instance.

image

 

I think you can see how bad this is.  I find this condition all the time, even in “mature” SCOM implementations… it just happens when someone creates a simple perf threshold monitor but doesn't understand the class model, or multi-instance perf counters.  In an environment with only 500 monitored agents – I can generate over 100,000 state changes – and 50,000 alerts, in an HOUR!!!!
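As a rough sanity check on those numbers, the arithmetic below breaks the hourly totals down per agent. It uses only the figures quoted above; the variable names are mine.

```python
agents = 500
state_changes_per_hour = 100_000
alerts_per_hour = 50_000

# How many state changes each agent contributes.
per_agent_per_hour = state_changes_per_hour / agents
per_agent_per_minute = per_agent_per_hour / 60

# One alert for every red/green round trip.
state_changes_per_alert = state_changes_per_hour / alerts_per_hour

print(f"{per_agent_per_hour:.0f} state changes per agent per hour")
print(f"{per_agent_per_minute:.1f} state changes per agent per minute")
print(f"{state_changes_per_alert:.0f} state changes per alert")
```

That works out to only a few flips per agent per minute, which is what a flip-flopping monitor does on each sample, and two state changes per alert fits the pattern of every alert being opened by a red flip and closed by the green flip that follows.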

 

Ok – lesson learned – DON'T target a single-instance class, using a multi-instance perf counter.  So – what should I have used?  Well, in this case – I should use something like “Windows Server 2008 Logical Disk”.  But we can still screw that up!  :-)

Bad example #2:

I want to monitor for the perf counter Logical Disk\% Free Space\<All Instances> so that I can get an alert when any logical disk is below 20% in free space.

I create a new monitor > Unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold.

image

I have learned from my mistake in Bad Example #1, so I target a more specific class, such as “Windows Server 2008 Logical Disk”.

I choose the perf counter I want – and select all instances:

image

And save my monitor.

Ack!  The SAME problem!  Why????

The problem is – now, instead of each Operating System instance loading this monitor, and then parsing and measuring each instance, now EACH INSTANCE of logical disk is doing the SAME THING.  This is actually WORSE than before…. because the number of monitors loaded is MUCH higher, and will flood me with even more state changes and alerts than before.
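Here is a small Python model of why this is worse, assuming three discovered disks and made-up readings. Every disk object loads its own copy of the monitor, and every copy still evaluates ALL counter instances. This is a sketch of the behavior described above, not SCOM's actual implementation.

```python
# Hypothetical discovered Logical Disk objects on one agent.
disks = ["C:", "D:", "F:"]

# Made-up free-space readings (percent) for one sample interval.
samples = [("C:", 60.0), ("D:", 10.0), ("F:", 70.0)]
THRESHOLD = 20.0  # Bad example #2 alerts below 20% free space


def count_state_changes(monitor_targets, samples, threshold):
    """Each target runs its own monitor, but none filters by instance."""
    flips_per_target = {}
    for target in monitor_targets:
        state = "healthy"
        flips = 0
        for _instance, free_pct in samples:  # instance name is ignored
            new_state = "unhealthy" if free_pct < threshold else "healthy"
            if new_state != state:
                flips += 1
                state = new_state
        flips_per_target[target] = flips
    return flips_per_target


per_disk = count_state_changes(disks, samples, THRESHOLD)
per_os = count_state_changes(["Operating System"], samples, THRESHOLD)
print("targeting each disk:", per_disk, "total", sum(per_disk.values()))
print("targeting the OS:   ", per_os, "total", sum(per_os.values()))
```

In this model the F: monitor goes red because of the D: reading and green again because of the F: reading, and so do the C: and D: monitors, so the same two flips are multiplied by the number of disks.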

Now if I look at Health Explorer – I will likely see MULTIPLE disks have gone red, and are “flip-flopping” and throwing alerts like never before.

image

 

When you dig into Health Explorer – you will see that they are being turned Unhealthy – and it isn't even their drive letter!  I will examine the F: drive monitor:

I can see it was turned unhealthy because of the free space threshold hit on the D: drive!

image

and then flipped back to healthy due to the available space on the C: instance:

image

This is very, very bad.  So – what are we supposed to do???

 

We need to target the specific class (Windows Server 2008 Logical Disk) AND then use a Wildcard parameter, to match the INSTANCE name of the perf counter to the INSTANCE name of the “Logical Disk” object.  Make sense?  For example – match up the “C:” perf counter instance – to the “C:” Device ID of the Logical Disk discovered in SCOM.  This is actually easier than it sounds:

 

Good example:

 

I want to monitor for the perf counter Logical Disk\% Free Space\<All Instances> so that I can get an alert when any logical disk is below 20% in free space.

I create a new monitor > Unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold.

image

I have learned from my mistake in Bad Example #1, so I target a more specific class, such as “Windows Server 2008 Logical Disk”.

I choose the perf counter I want – and INSTEAD of selecting all instances, I learn from my mistake in Bad Example #2.  This time I will UNCHECK the “All Instances” box, and use the “fly-out” on the right of the “Instance:” box:

image

 

This fly-out will present wildcard options, which are discovered properties of the Windows Server 2008 Logical Disk class.  You can see all of these if you view that class in Discovered Inventory.  What we need to do now – is use Discovered Inventory to find a property that matches the perfmon instance name.  In perfmon – we see the instance names are “C:” or “D:”.

image

In Discovered Inventory – looking at the Windows Server 2008 Logical Disk, I can see that “Device ID” is probably a good property to match on:

image

 

So – I choose “Device ID” from the fly-out, which inserts this parameter wildcard, so that the monitor on EACH DISK will ONLY examine the perf data from the INSTANCE in perfmon that matches the disk drive letter.

image

 

The wildcard parameter is actually something like this:

$Target/Property[Type="MicrosoftWindowsLibrary6172210!Microsoft.Windows.LogicalDevice"]/DeviceID$

This is simply a reference to the “Device ID” property, along with the class and MP that define it.
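The effect of the wildcard can be sketched in Python as follows. Each disk's monitor keeps only the counter instance whose name equals that disk's Device ID and ignores everything else, including _Total. The `LogicalDisk` type, the `evaluate` function and the readings are hypothetical stand-ins for illustration, not SCOM objects or APIs.

```python
from dataclasses import dataclass


@dataclass
class LogicalDisk:
    device_id: str  # stands in for the discovered "Device ID" property


# Made-up free-space readings (percent) for one sample interval.
samples = [("C:", 60.0), ("D:", 10.0), ("F:", 70.0), ("_Total", 45.0)]
THRESHOLD = 20.0


def evaluate(disk, samples, threshold):
    """Return this disk's state using ONLY its own counter instance."""
    for instance, free_pct in samples:
        if instance == disk.device_id:  # the wildcard match
            return "unhealthy" if free_pct < threshold else "healthy"
    return None  # no counter instance matched this disk


states = {d.device_id: evaluate(d, samples, THRESHOLD)
          for d in (LogicalDisk("C:"), LogicalDisk("D:"), LogicalDisk("F:"))}
print(states)
```

Each disk now gets exactly one reading per interval, so its state changes only when its own free space crosses the threshold.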

 

Now – no more flip-flopping, no more statechangeevent floods, no more alert storms opening and closing several times per second.

 

 

You can use this same process for any multi-instance perf object.  I have a (slightly less verbose) example using SQL server HERE.

 

To determine if you have already messed up…. you can look at “Top 20 Alerts in an Operational Database, by Alert Count” and “Historical list of state changes by Monitor, by Day:” which are available on my SQL Query List.  These should indicate lots of alerts, and monitor flip-flop, and should be investigated.
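If you have exported state changes to a file instead, the same "by monitor, by day" idea can be sketched in a few lines of Python. This is not one of the SQL queries mentioned above and does not use the real database schema; the row layout, monitor names and dates are made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical export rows: (monitor name, time of the state change).
rows = [
    ("Disk Free Space Monitor", datetime(2010, 1, 5, 10, 20, 58)),
    ("Disk Free Space Monitor", datetime(2010, 1, 5, 10, 20, 58)),
    ("Disk Free Space Monitor", datetime(2010, 1, 5, 10, 21, 58)),
    ("CPU Monitor", datetime(2010, 1, 5, 11, 0, 0)),
]


def state_changes_by_monitor_by_day(rows):
    """Count state changes per (monitor, day), noisiest first."""
    counts = Counter((name, ts.date()) for name, ts in rows)
    return counts.most_common()


top = state_changes_by_monitor_by_day(rows)
for (name, day), count in top:
    print(f"{day}  {count:>5}  {name}")
```

A monitor that sits at the top of this list with hundreds or thousands of changes per day is a good candidate for the flip-flop problem described in this post.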

Comments
  • Excellent information, thank you Kevin.

  • Why does the green state sometimes become gray even though it is still healthy?

  • Saved the bacon again Kevin! Thankyou

    JB

  • Hi Kevin,

    I have a Drive D:\ of 2 GB and i have 2 LUNS attached as D:\Log and D:\Data

    will i be able to monitor free space on LUNS using same process.

    Rakesh

  • Hi Kevin,

    I have followed the same steps and have set up the threshold as 20%. I was hoping that I would be getting alerts when free space is less than 20 % but I am getting alerts only when free space is above 20%. How Can i correct that.

  • Kevin,

    I tried to use deviceid$ instead of all instances, but the monitor would not detect any mounted disks under a mount point, which is driving me nuts.

    Do you have any ideas?  BTW, I'm a big fan of your blog, which has helped me a lot in my daily work.

    Thanks,

    Jack

  • Thank you very much, great work kevin

  • Hi Kevin - Great blog....

    We cannot seem to get 'Physical Disk - Availability' working....

    I've turned on monitoring physical disks as described here - technet.microsoft.com/.../dd262052.aspx

    ...afterwards, all we see within Health Explorer of any given server, is that for the Disk 0, 1, 2 etc only the 'Performance' element is monitored. The 'Availability' element is not checked. Can you help?

    Thanks

  • Great work Kevin, can we apply the same methodology for CPU monitoring?

  • Hi Kevin
    Thanks for your great post.
    I Have HP-UX servers to monitor and everything is fine until we change the servers disks. when we change a disk "Logical Disk Health" monitors State goes Critical. I changed the discovery interval and no luck. what am i missing?

  • Great post Kevin !!. I have created the monitor exactly as above. Availability is not checked in health explorer for any agent. can you please help?

  • Hi Kevin
    Thanks for your post, Keep up the good work!
