Kevin Holman's System Center Blog

Posts in this blog are provided "AS IS" with no warranties, and confer no rights. Use of included script samples is subject to the terms specified in the Terms of Use.

Self Tuning Thresholds - love and hate

Note: The Exchange Management Pack referenced in this post has since been updated, and many of the issues below have been addressed in the updated MP:  http://blogs.technet.com/kevinholman/archive/2008/08/22/updated-exchange-2003-mp-released-version-6-0-6387-0.aspx

 

 

Self-tuning thresholds are a new concept for OpsMgr.  They are awesome because they "learn" what is normal for a performance counter, and alert when the value falls outside of the learned baseline.  This is great when a performance counter varies widely from company to company, and we don't know a good static setting.

The problem is that if the counter being monitored varies widely on a regular basis, these monitors are extremely noisy, and they generate the very flood of alerts and state changes that they were designed to control.

There are a couple of good blog posts on these already.  Probably the best one I have read is here:  http://ops-mgr.spaces.live.com/blog/cns!3D3B8489FCAA9B51!183.entry   We will be referring to it several times.

One of the complaints about self-tuning thresholds is that the numbers shown for the baseline don't tell us anything about the actual counter values.  This is true.  The numbers come from an internal algorithm, so people see "2.81" or "3.31" and don't understand what it has to do with the performance counter.

First - let's take a look at the basic components of a self-tuning threshold (STT).  We will create a new unit monitor.  Under Windows performance, self-tuning thresholds, we have several types to choose from.  The most common are going to be 2-state or 3-state baselining, depending on how many states we want.  For this example we will choose 2-state baselining.

Let's give it a name, choose Windows Server as the target, and choose Performance as the parent monitor.  To keep this simple, let's choose Processor\% Processor Time\_Total as our performance Object\Counter\Instance.  Set the interval to 1 minute.

image

Now we get to adjust the business cycle.  I'm picking one day for this example.  Typically you would choose a week, especially if your server behaves differently on different days of the week.

We can choose how many business cycles to wait before alerting.... most of the time 1 business cycle is fine.

On "Sensitivity" we have a nice slider from "Low" to "High".  In general, we will be choosing a low sensitivity for our custom rules.  Low = fewest alerts, widest baseline range.  I will explain the numeric values for each setting later.

image

On the "Configure Health" screen, values within the envelope will be Healthy, and values above the envelope will generate a state change and (optionally) an alert.

Groovy.  So - what did we just really create?  Well certainly - we created a monitor: 

image

But if we look at rules, on the same target.... we also created some rules:

image

One of the rules simply collects the performance data.  The other collects signature data.  Both run on the same frequency we specified earlier.

So now, on to the most important thing - the numbers.  When we created our 2-state baselining monitor, we pretty much accepted all defaults, except that we picked low sensitivity.  To see these numbers, create an override for all objects of type, and you can see what the defaults and "Low" equal:

image

So "inner" is 4.01 while "outer" is 4.51.  We will look at these numbers more later.  This is important, because we will use them to adjust and override other counters.

Also - on the signature collection rule that was created, a sensitivity value was set:

image

So, let's try to find out how each setting affects these numbers, to better understand them.

I created 5 Self-Tuning 2-state baseline monitors.... each with a different sensitivity setting.... starting with low:

Low:  Inner: 4.01  Outer: 4.51   Rule Sensitivity:  4.01

Low-Mid:  Inner: 3.77  Outer: 4.27   Rule Sensitivity:  3.77

Mid:  Inner: 3.29   Outer: 3.79   Rule Sensitivity:  3.29

Mid-High: Inner: 2.81  Outer: 3.31  Rule Sensitivity:  2.81

High:  Inner: 2.57  Outer: 3.07  Rule Sensitivity:  2.57

 

That gives us a good reference to use when tuning these rules.  We can see that the default inner sensitivity ranges from 2.57 to 4.01, and outer ranges from 3.07 to 4.51.  The larger the numbers, the less sensitive the baseline range, and therefore the fewer alerts.  The difference between the inner and outer numbers is always .5.
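To keep these numbers handy while tuning, here is a minimal Python sketch that records the five presets listed above and checks the .5 gap between inner and outer. The names `PRESETS`, `preset_gap` and `rule_sensitivity` are made up for illustration; nothing here talks to OpsMgr.

```python
# Sensitivity presets as listed in this post: (inner, outer) per slider position.
# The names below are illustrative only; they are not OpsMgr identifiers.
PRESETS = {
    "Low":      (4.01, 4.51),
    "Low-Mid":  (3.77, 4.27),
    "Mid":      (3.29, 3.79),
    "Mid-High": (2.81, 3.31),
    "High":     (2.57, 3.07),
}


def preset_gap(name):
    """Return outer minus inner for a preset, rounded to two decimals."""
    inner, outer = PRESETS[name]
    return round(outer - inner, 2)


def rule_sensitivity(name):
    """In the list above, the rule sensitivity matches the monitor's inner value."""
    return PRESETS[name][0]


if __name__ == "__main__":
    for name in PRESETS:
        print(name, PRESETS[name], "gap:", preset_gap(name))
```

The rounding in `preset_gap` is only there to hide floating point noise when subtracting the two values.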

To tune these self-tuning alerts, we simply need to adjust these values on the performance signature rule and on the corresponding baselining monitor.

Here is a list of some very common noisy STT's - taken from the link above:

ALERT=Information Store Transport Temp Table is outside the calculated baseline
RULE=Baseline Collection Rule for Information Store temp table number of entries (Rules, of type Exchange Queue)
MONITOR=IS Transport Temp Table Monitor (Exchange Queue, Entity Health, Performance)
 
ALERT=Mailbox Store Send Queue is outside the calculated baseline
RULE=Baseline Collection Rule for Mailbox Store Send Queue Length (Rules, of type Exchange Queue)
MONITOR=MB Store Send Queue Monitor (Exchange Queue, Entity Health, Performance)
 
ALERT=SMTP Local queue is outside calculated baseline
RULE=Baseline Collection Rule for SMTP Server Local Queue (Rules, of type Exchange Queue)
MONITOR=SMTP Local Queue Monitor (Exchange Queue, Entity Health, Performance)
 
ALERT=SMTP Messages in the Queue Directory is outside calculated baseline
RULE=Baseline Collection for SMTP Message Queue Directory (Rules, of type Exchange Queue)
MONITOR=SMTP Message Queue Directory Monitor (Exchange Queue, Entity Health, Performance)
 
ALERT=SMTP Remote Queue is outside the calculated baseline
RULE=Baseline Collection Rule for SMTP Server Remote Queue Length (Rules, of type Exchange Queue)
MONITOR=SMTP Remote Queue Monitor (Exchange Queue, Entity Health, Performance)
 
ALERT=SMTP Remote Retry Queue is outside the calculated baseline
RULE=Baseline Collection Rule for SMTP Server Remote Retry Queue Length (Rules, of type Exchange Queue)
MONITOR=SMTP Remote Retry Queue Monitor (Exchange Queue, Entity Health, Performance)
 
ALERT=IS Virtual Bytes is outside the calculated baseline
RULE=Baseline Collection Rule for IS Virtual Bytes (Rules, of type Exchange IS Service)
MONITOR=IS Virtual Bytes Monitor (Exchange IS Service, Entity Health, Performance)
 
ALERT=Number of RPC requests is outside the calculated baseline
RULE=Baseline Collection Rule for IS RPC Requests (Rule, of type Exchange IS Service)
MONITOR=IS RPC Requests Monitor (Exchange IS Service, Entity Health, Performance)

 

What we see is that most of the default STTs in the management packs are set to "Mid-High" sensitivity, or an inner of 2.81 and outer of 3.31.  This is likely too sensitive, and needs to be adjusted.  Essentially, start by bumping both values up to the next set of numbers, moving from Mid-High to Mid, Low-Mid, or Low as needed.
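As a rough sketch of "bump up to the next set of numbers", the Python below walks the presets from most to least sensitive and returns the first one with a larger inner value than the current setting. `LADDER` and `next_less_sensitive` are hypothetical names for this example only, and the values are the ones measured earlier in this post.

```python
# Slider positions from most to least sensitive, with the values from this post.
# All names here are illustrative; nothing below calls into OpsMgr.
LADDER = [
    ("High",     2.57, 3.07),
    ("Mid-High", 2.81, 3.31),
    ("Mid",      3.29, 3.79),
    ("Low-Mid",  3.77, 4.27),
    ("Low",      4.01, 4.51),
]


def next_less_sensitive(inner):
    """Return the first preset whose inner value is larger than `inner`.

    Returns None when the current value is already at or beyond "Low".
    """
    for name, preset_inner, preset_outer in LADDER:
        # Small tolerance so a value equal to a preset moves to the next one.
        if preset_inner > inner + 1e-9:
            return name, preset_inner, preset_outer
    return None


if __name__ == "__main__":
    # A monitor shipped at Mid-High (inner 2.81) would move to Mid.
    print(next_less_sensitive(2.81))
```

If the alert is still noisy after the change, call it again with the new inner value to get the next step down in sensitivity.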

Here are the steps from the above blog post... with a few changes:

 Steps to resolve: (perform all of these steps for each Alert in your environment which needs to be tuned)

  1. Find the rule that applies to the alert. (To find the rules, it’s easiest to change the scope to filter by the two target types that we need, which are Exchange Queue and Exchange IS Service. Both of these are available when you click on Scope and choose the option to view all targets. Then find the rules whose names start with “Baseline Collection”. This scopes it down to about 17 rules versus over 6000.) The names of each of these rules are listed above. Disable the rule (Right-click on the rule, Overrides, Disable the rule, For all objects of type: Exchange Queue, click Yes to accept).
  2. Change the rule sensitivity to 3.29 (Right-click on the rule, Overrides, Override the rule, For all objects of type: Exchange Queue, check the Sensitivity parameter and set it to 3.29 (or higher if needed), click OK).
  3. Find the monitor that applies to the alert. This can be found by searching or scoping to the type of object identified for the rule. Disable the monitor (Right-click on the monitor, Overrides, Disable the monitor, For all objects of type: Exchange Queue, click Yes to accept).
  4. Change the monitor inner sensitivity to 3.29 (Right-click on the monitor, Overrides, Override the monitor, For all objects of type: Exchange Queue, check the Inner Sensitivity parameter and set it to 3.29 if it’s not already set to that value, click OK).
  5. Change the monitor outer sensitivity to 3.79 (Right-click on the monitor, Overrides, Override the monitor, For all objects of type: Exchange Queue, check the Outer Sensitivity parameter and set it to 3.79 if it’s not already set to that value, click OK).
  6. Re-enable the monitor. (Right-click on the monitor, click on Overrides Summary, delete the override that says Type, Exchange Queue, Enabled, False).
  7. Go back to the rule identified in step #1 and re-enable the rule. (Right-click on the rule, click on Overrides Summary, delete the override that says Type, Exchange Queue, Enabled, False).
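Since these seven steps get repeated for every noisy alert, it can help to write them out as a checklist per rule/monitor pair. The sketch below only builds the text of the steps in the order given above; `tuning_plan` is a made-up helper, and the overrides themselves are still made by hand in the console.

```python
def tuning_plan(rule, monitor, target, inner, outer):
    """Return the override steps from this post as an ordered list of strings.

    This only builds a checklist. It does not connect to OpsMgr or change
    any overrides.
    """
    if outer <= inner:
        raise ValueError("outer sensitivity must be larger than inner")
    return [
        f"Disable rule '{rule}' for all objects of type {target}",
        f"Override rule '{rule}': Sensitivity = {inner}",
        f"Disable monitor '{monitor}' for all objects of type {target}",
        f"Override monitor '{monitor}': Inner Sensitivity = {inner}",
        f"Override monitor '{monitor}': Outer Sensitivity = {outer}",
        f"Re-enable monitor '{monitor}' (delete the Enabled=False override)",
        f"Re-enable rule '{rule}' (delete the Enabled=False override)",
    ]


if __name__ == "__main__":
    plan = tuning_plan(
        rule="Baseline Collection Rule for SMTP Server Local Queue",
        monitor="SMTP Local Queue Monitor",
        target="Exchange Queue",
        inner=3.29,
        outer=3.79,
    )
    for number, step in enumerate(plan, start=1):
        print(f"{number}. {step}")
```

The rule and monitor names in the usage example are taken from the list of noisy STTs earlier in this post.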

 

NOTE:

The "outer" sensitivity does not matter.  It is an early design leftover, and does not have an impact.  Only the inner sensitivity makes a difference in tuning.  There has been some conflicting information in the newsgroups, but this information has been verified with the dev team.

The only requirement on the outer is that it be a larger number than the inner.  So when adjusting, focus on bumping the inner in .5 increments, and just make sure the outer is any number higher than the inner, such as .1 higher.
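Put as arithmetic, the note above boils down to "inner plus .5, outer anywhere above inner". Here is a small Python sketch of that; `bump` is a hypothetical helper, and the .5 step and .1 margin are simply the example figures from the note.

```python
def bump(inner, step=0.5, outer_margin=0.1):
    """Raise the inner sensitivity by `step` and place outer just above it.

    Only inner affects alerting according to the note above; outer just
    has to be the larger of the two numbers.
    """
    new_inner = round(inner + step, 2)
    new_outer = round(new_inner + outer_margin, 2)
    return new_inner, new_outer


if __name__ == "__main__":
    inner, outer = bump(2.81)
    print("inner:", inner, "outer:", outer)
```

Starting from the Mid-High default of 2.81, one bump gives an inner of 3.31 and an outer of 3.41.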

 

In Summary:

1.  Not all counters are good candidates for STT’s based on the performance counter pattern.

2.  Some of our built in STT’s are a bit on the sensitive side and should be tuned.  If the alert noise is high - start by tuning - lower the sensitivity.

3.  Some of our built in STT’s are targeting a perf counter that is not a good candidate for an STT (e.g. an SMTP queue, or any perf counter that is often “zero value” when healthy).

4.  There is no simple way to view the learned baseline of an STT…. the “show baseline” in graph view does not display a range.

5. Any time a customer is not happy with the results of an STT monitor, they should simply create a static threshold monitor.  This is very basic and provides the best solution.  If you can't tune the noise out of an STT, or you NEED to know at what threshold an alert will be generated, then simply turn off the STT and create an identical static threshold monitor of the "average" or "consecutive samples above" type.
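To show the difference between the two static threshold types mentioned in point 5, here is a minimal Python sketch of the general idea. It is not how OpsMgr implements these monitors; the function names, sample values, and window sizes are all assumptions for the example.

```python
def average_breach(samples, threshold, window):
    """True when the mean of the last `window` samples exceeds `threshold`."""
    if len(samples) < window:
        return False
    recent = samples[-window:]
    return sum(recent) / window > threshold


def consecutive_breach(samples, threshold, count):
    """True when the last `count` samples are all above `threshold`."""
    if len(samples) < count:
        return False
    return all(value > threshold for value in samples[-count:])


if __name__ == "__main__":
    cpu = [20, 95, 30, 92, 96, 97]  # made-up % Processor Time samples
    print("average over 3 above 90:", average_breach(cpu, 90, 3))
    print("3 consecutive above 90:", consecutive_breach(cpu, 90, 3))
    print("4 consecutive above 90:", consecutive_breach(cpu, 90, 4))
```

Either way, the threshold is a number you picked yourself, so you always know exactly when the alert will fire.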

 

Comments
  • ADMP: http://blogs.technet.com/cliveeastwood/archive/2008/03/17/operations-manager-2007-active-directory-management-pack-admp-6-0-6278-3-update-released-to-the-web.aspx

  • Great post! I am getting all of these warnings and they have become quite a  mess.

    However, following your steps, when I try to input the value 3.79 or any 3. something Ops Mgr changes it to 379. I can input 3,79 but even the default value says 3.31 so why can't I enter that?

    Quite frustrating!

    Thanks again for the great article/guide.

    /Sebastian

  • I haven't seen that.  Is this on SP1 or RTM?  Can you give me the name of a specific Monitor that you are having the issue with?  What locale are you using?

  • Hi Kevin

    Great blog .. problem is that it still doesn't really tell us the mechanics of how the baselining works.

    What is the algorithm? If I have a series of values, how can I tell what the 'acceptable' boundaries are for the next value? What does the sensitivity value actually represent, e.g. standard deviation? It is all smoke and mirrors at present and Microsoft seem unwilling (or unable) to provide some transparency.

    Cheers

    Graham

  • How important is it to disable the rule and monitor before making the change?  There are so many of these noisy rules to correct that removing the disable/enable step for each rule & monitor definitely makes it a lot easier if it's not required.

  • I have followed through the self-tuning baseline steps you stated in this blog, all the way to the Low sensitivity baseline values. The alerts stopped temporarily, and now they have started again. Today alone, I have received more than 30 alerts.

    Any help with this would be appreciated.

    Thanks

  • Some performance counters will NEVER be a good candidate for a self tuning threshold.  The learning process, and subsequent alerting, does not work well for any counter that routinely goes to zero for a period of time, then comes off zero.  For any counter that is still noisy after adjusting sensitivity... you should disable the self tuning rules and monitors, and create a simple threshold monitor based on a static value - either using an average value, or a consecutive samples above threshold value.

  • Awesome post.  Removed tons of frustration I was having over the noise, and also a good illustration of the Override feature instead of our typical deleting and recreating.

  • Great post.

    The item mentioned about the dots (by Sebastian Haraldsson) is what we encounter as well.

    We use OpsMgr SP1 and have been changing the locale (NL, US) but that didn't make a difference.

    Any updates?

    Regards, Walter.

  • I had the same problem with dots as decimal separators. After setting regional settings as: thousands sep.: "," and decimal sep.: ".", I could set decimal values with .

    I hope this can help you.

    Gustavo.

  • Hi Kevin

    Great post. I have a question for you regarding this post. We are seeing a lot of noise from these baseline alerts. When I follow your instructions to disable, override and re-enable those rules, I am a little confused about one thing.

    When you disable a rule or a monitor, it is saved to the default management pack. If you set an override after the disable, then your value is going to stay in the default management pack. Is that correct? Please explain.

  • I think you might be misreading the instructions, not sure.

    When you disable a rule or a monitor, you *NEVER* right click and choose "disable".

    Following the instructions above... you (Right-click on the monitor, Overrides, Disable the monitor for all objects of type: Exchange Queue, click yes to accept)

    This makes you pick a management pack.  I did not tell you which MP to save your temporary overrides to, because it is understood that you would just choose one.  The Exchange MP guide specifically instructs you to create an override MP just for Exchange overrides, and so this post assumes you will be using those best practices.

    Regardless, even if you DID save them to the default MP - the instructions have you delete the override that you applied to the rule.

  • Hi Kevin,

    I assume the learned baseline values are stored temporarily during self tuning cycles, if this is the case, do you know where the values are held and can it be queried?

    Regards,

    Andy

  • I don't know of a simple way to read the learned baselines - no.

  • Hello Kevin. Can I set sensitivity to 5, for example? Or must its biggest value be 4.01? Thank you
