Kevin Holman's System Center Blog

Posts in this blog are provided "AS IS" with no warranties, and confer no rights. Use of included script samples is subject to the terms specified in the Terms of Use.

Tuning tip – turning off some over-collection of events



We often think of tuning OpsMgr in terms of reducing “Alert Noise”: disabling rules that generate alerts we don't care about, or modifying monitor thresholds to make alerts more actionable for our specific environment.

However, one area of OpsMgr that often goes overlooked is event overcollection. This has a real cost: collected events create LAN/WAN traffic, agent overhead, OpsDB size bloat, and especially Data Warehouse size bloat. I have worked with customers whose data warehouse was over one third event data... and they had ZERO requirement for that data, nor did they want it. They were paying for disk storage and backups, plus added time and resources on the framework, all for data they cared nothing about.

MOST of these events are collected by default OpsMgr collection rules, enabled out of the box, from the “System Center Core Monitoring” MP. They are events like “config requested”, “config delivered”, and “new config active”. They might be interesting, but there is no advanced analysis included that uses them to detect a problem. In small environments they are usually not a big deal. But in environments with large agent counts, these events can account for a LOT of data, and they provide little value unless you are doing something advanced to analyze them. I have yet to see a customer who did that.

At a high level – here is how I like to review these events:

  1. Run the Most Common Events query against your OpsDB.
  2. Create a “My Workspace” view for each event that has a HIGH event count.
  3. Examine the event details for value to YOU.
  4. View the rule that collected the event.
    1. Does the rule also alert or do anything special, or does it simply collect the event?
    2. Do you think the event is required for any special reporting you do?
  5. Create an override to disable the rule, and save it in an override MP dedicated to the rule's source management pack.
  6. Continue to the next event in the query output, and evaluate it.

 

So, what I like to do is run the “Most Common Events” query against the OpsDB, examine the top events, and consider disabling the rules that collect them:

Most common events by event number and publisher name:

-- Run against the Operations database (OpsDB); EventAllView does not exist in the Data Warehouse.
SELECT top 20 Number as EventID, COUNT(*) AS TotalEvents, Publishername as EventSource
FROM EventAllView eav with (nolock)
GROUP BY Number, Publishername
ORDER BY TotalEvents DESC
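A few lines of script can turn the query's rows into percentages, which shows how concentrated your event volume is. This is a minimal sketch that assumes you copy the (EventID, TotalEvents, EventSource) rows in by hand. `event_shares` and its `grand_total` parameter are hypothetical and not part of OpsMgr. Without `grand_total`, the percentages are relative only to the rows you pass in (for example, just the top 20). To get percentages of all collected events, pass the result of `SELECT COUNT(*) FROM EventAllView` as `grand_total`.

```python
from typing import List, Optional, Tuple

# One row of the "Most common events" query: (EventID, TotalEvents, EventSource)
Row = Tuple[int, int, str]
# Output row: (EventID, EventSource, TotalEvents, share %, cumulative share %)
ShareRow = Tuple[int, str, int, float, float]


def event_shares(rows: List[Row], grand_total: Optional[int] = None) -> List[ShareRow]:
    """Rank rows by TotalEvents and add each row's share and running share.

    grand_total (hypothetical parameter): total number of events in the OpsDB.
    When omitted, the sum of the supplied rows is used as the denominator.
    """
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    row_sum = sum(count for _, count, _ in ordered)
    total = row_sum if grand_total is None else grand_total
    if total <= 0 or total < row_sum:
        raise ValueError("grand_total must be positive and at least the sum of the rows")

    result: List[ShareRow] = []
    running = 0
    for event_id, count, source in ordered:
        running += count
        result.append((event_id, source, count,
                       100.0 * count / total, 100.0 * running / total))
    return result


if __name__ == "__main__":
    # First three rows of a reader's output posted in the comments of this post.
    sample = [(1206, 1155157, "HealthService"),
              (117, 136169, "nworksSource"),
              (21024, 38788, "OpsMgr Connector")]
    for event_id, source, count, share, cumulative in event_shares(sample):
        print(f"{event_id:>6} {source:<20} {count:>9} {share:6.2f}% {cumulative:6.2f}%")
```

The cumulative column makes it easy to see how few event IDs account for most of the volume.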

The trick is to run this query periodically and examine the most common events for YOUR environment. The easiest way to judge the value of these events is to create a new Events view in My Workspace for each one, and then look at the event data and the rule that collected it. I will use the common event 21024 as an example.

 


What we can see is that this is a very typical event, and there is likely no real value in collecting and storing it in the OpsDB or the Warehouse.

Next, I examine the rule, looking at the Data Source section and the Response section. The purpose here is to get a good idea of where this collection rule is looking, which events it collects, and whether there is also an alert in the Response section. If there is an alert in the Response section, I assume the event is important and will generally leave the rule enabled.

If the rule simply collects the event (no alerting), is not used in any reports that I know about (a rare condition), and I have determined the event provides little to no value to me, I disable it. You will find you can disable most of the top consumers in the database.
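The keep-or-disable decision above can be written down as a small checklist function. This can be handy for recording why each rule was kept or overridden. It is a hypothetical sketch: `CollectionRule`, its fields, and `review` are illustrative names. You fill in the facts by hand after reading the rule's Data Source and Response sections in the console; nothing here queries OpsMgr.

```python
from dataclasses import dataclass


@dataclass
class CollectionRule:
    """Facts gathered by hand from a rule's properties (hypothetical structure)."""
    name: str
    alerts: bool           # the Response section also generates an alert
    used_in_reports: bool  # a report you know about depends on these events
    event_has_value: bool  # your own judgement after viewing the event data


def review(rule: CollectionRule) -> str:
    """Apply the keep/disable criteria from the text to one rule, in order."""
    if rule.alerts:
        return "keep: rule also alerts"
    if rule.used_in_reports:
        return "keep: events are used in reports"
    if rule.event_has_value:
        return "keep: event judged valuable"
    return "disable: override the rule for all objects of class"


if __name__ == "__main__":
    # Hypothetical rule name; substitute the rule you are reviewing.
    rule = CollectionRule("example event collection rule",
                          alerts=False, used_in_reports=False, event_has_value=False)
    print(rule.name, "->", review(rule))
```

The order of the checks mirrors the text: an alerting rule is kept regardless of anything else.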

 

Here is why I consider it totally cool to disable these uninteresting event collection rules:

  • If an event is really important, there will be a different, alert-generating rule to fire an alert.
  • These events fill the databases and agent queues with unimportant information, and add to agent load and network traffic.
  • When troubleshooting a real issue, we would examine the agent's event log; we wouldn't search through the database for collected events.
  • Reporting on events is really slow because events cannot be aggregated, so views and reports don't work well with them.
  • If we find we do need one later, we simply remove the override.

 


So I create an override in my “Overrides – System Center Core” MP, and disable this rule “for all objects of class”.

Here are some very common event IDs whose corresponding event collection rules I generally end up disabling:

1206
1210
1215
1216
10102
10401
10403
10409
10457
10720
11771
21024
21025
21402
21403
21404
21405
29102
29103

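If you paste your own query output into a script, you can quickly split it into events that appear in the list above and events you have not reviewed before. This is a minimal sketch with hypothetical names (`COMMONLY_DISABLED`, `triage`). A match only makes the rule a candidate for review, not an automatic disable. For example, a flood of 1206 events points to a root cause worth fixing first.

```python
from typing import Iterable, List, Tuple

Row = Tuple[int, int, str]  # (EventID, TotalEvents, EventSource) from the query

# Event IDs listed above whose collection rules are often disabled.
COMMONLY_DISABLED = frozenset({
    1206, 1210, 1215, 1216, 10102, 10401, 10403, 10409, 10457, 10720,
    11771, 21024, 21025, 21402, 21403, 21404, 21405, 29102, 29103,
})


def triage(rows: Iterable[Row]) -> Tuple[List[Row], List[Row]]:
    """Split rows into (candidates to review for disabling, events to investigate).

    Input order is preserved within each list.
    """
    candidates: List[Row] = []
    investigate: List[Row] = []
    for row in rows:
        if row[0] in COMMONLY_DISABLED:
            candidates.append(row)
        else:
            investigate.append(row)
    return candidates, investigate
```

Anything in the second list is an event you should open in a My Workspace view and evaluate from scratch.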
 

I don't recommend that everyone disable all of these rules. I recommend you periodically view your top 10 or 20 events and then review them for value. Just knocking out the top 10 events will often free up 90% of the space that events were consuming.

The above events are the ones I run into at most of my customers, and I generally turn these off, as we get no value from them. You might find you have other events as your top consumers. I recommend you review them methodically, in the same manner as above, and then revisit this every month or two to see if anything has changed.
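Because the idea is to revisit the list every month or two, it can help to keep each run's output and compare it with the previous one. This is a minimal sketch that assumes you save each run's (EventID, TotalEvents, EventSource) rows yourself. `top_keys` and `new_top_events` are hypothetical helpers. They key on (EventID, EventSource), since the query groups by both columns.

```python
from typing import List, Tuple

Row = Tuple[int, int, str]  # (EventID, TotalEvents, EventSource)
Key = Tuple[int, str]       # (EventID, EventSource)


def top_keys(rows: List[Row], top_n: int) -> List[Key]:
    """Return the (EventID, EventSource) keys of the top_n rows by TotalEvents."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(event_id, source) for event_id, _, source in ordered[:top_n]]


def new_top_events(previous: List[Row], current: List[Row], top_n: int = 20) -> List[Key]:
    """Keys in the current top_n that were not in the previous run's top_n."""
    seen = set(top_keys(previous, top_n))
    return [key for key in top_keys(current, top_n) if key not in seen]
```

Only the new entries need a fresh review; events already evaluated on a previous pass can be skipped.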

I’d also love to hear about any other events you see as top consumers that aren't in my list above. SOME events are created by scripts (conversion MPs), and unfortunately you cannot do much about those, because you would have to disable the script to stop them. I’d be happy to give feedback on those, or add any new ones to my list.

Comments
  • Hi Kevin,

    Thanks for sharing this information. To give you some feedback on the events I see in our environment:

    Top1 (1.8 Million! Events) - EventID 31707 (Error monitoring parent directory. Directory = %SMS_INSTALL_DIR_PATH%)

    followed by events 1501, 10409, 21024 and 10403, with about 200k each. So maybe 31707 is an issue in other environments too.

    Regards Marco

  • Thanks Marco -

    The 31707 is a known issue, caused by not configuring your SMS MP according to the guide. There is a variable for the SMS logs path in the MP, and you need to set this variable on ALL your SMS servers. I would STRONGLY recommend you set this up correctly; otherwise you aren't monitoring your logs, and you are flooding OpsMgr with these events.

    I don't have any 1501 events. What are they, when you create a view to look at them?

    The others are known issues, and I would disable them.

    Event 1501 is from the DHCP Scope Monitoring rule, which collects the address status. From the rule's Product Knowledge:

    Summary

    This rule collects the following DHCP related information:

    DHCP superscopes and scopes

    DHCP superscope and scope relationships

    DHCP superscopes and scope utilization

    Caution:

    Disabling this rule prevents the DHCP server superscope and scope monitoring and reports from functioning.

    I searched the XML of all the current DHCP MPs, and 1501 is not in them. What DHCP MP are you using, what OS version is your DHCP server, and what is the EXACT name and target of the rule or monitor that is inserting the 1501?

    I ran your little query, and this is the result:

    TotalEvents  EventID  EventSource
    1155157      1206     HealthService
    136169       117      nworksSource
    38788        21024    OpsMgr Connector
    15032        29102    OpsMgr Config Service
    14846        29103    OpsMgr Config Service
    14481        21025    OpsMgr Connector
    13551        1210     HealthService
    13144        74       nworksSource
    12354        77       nworksSource
    10575        10378    Health Service Modules
    9824         72       nworksSource
    9737         68       nworksSource
    6154         89       nworksSource
    5689         10376    Health Service Modules
    5614         10403    Health Service Modules
    4505         1102     HealthService
    3783         10102    Health Service Modules
    2355         31901    Health Service Modules
    2248         6022     Health Service Script
    2225         31902    Health Service Modules

    The Top 5 matches your favorites :-)

    The nworksSource is from the VMware MP by Veeam, will start checking these out.

    Cheers,

    Serge

  • Re:  Serge

    So here is an example where collecting too many events might be a good thing.  :-(   You are hammered with event 1206. This is bad. However, we don't have any good alerting to "detect" this condition, so analyzing your event flooding might be the only way to detect it. A 1206 is:  Rule/Monitor "%2", running for instance "%3" with id:"%4" failed, got unloaded and reached the failure limit that prevents automatic reload. Management group "%1".   A completely healthy management group will have ZERO 1206 events.

    You should create a view for this event and try to determine whether you have a systemic problem with an MP, a rule, or just sick machines all over the place. This isn't good, but it might just be a badly written event. I have never seen that one so high before. So it STILL isn't valuable to collect the 1206 event, as it simply fills up your DB, but you DO want to fix its root cause, so I would not turn this off until you are no longer seeing it happen so much. Or, create an alert-generating rule for this event and enable alert suppression.

    Re:  117 - I would determine if Nworks really needs this event.

    Re:  21024, 21025, 29102, 29103, 1210.... I would turn those off.

  • Hi Kevin,

    I've checked and figured out the 1206.

    Apparently 1 (ONE!) server was going ballistic a couple of days ago. Unfortunately it was an nWorks Virtual Infrastructure Collector. These servers collect all info on VM hosts & guests. I saw all kinds of events like this one:

    Rule/Monitor "nworks.VMware.VEM.VC2Alarm.VMGUEST.CPU.toRed", running for instance "_Total" with id:"{C5AC8DDB-DE26-A276-9177-1D9E5D854400}" failed, got unloaded and reached the failure limit that prevents automatic reload.

    The 117 is also an interesting one :-)

    According to Veeam: This is intended as an update "hint" to the mom/scom MP. This event drives the performance data consumer in the MP.

    The description contains this kind of info: SV110 Performance data for 'VMDiskProperties' class published in WMI

    Guess I'm gonna drop the guys at nworks a couple of questions.

    Cheers,

    Serge

  • Hi Kevin,

    Regarding the 1501 events: we currently have about 170 DHCP servers running Windows Server 2003 included in our SCOM monitoring.

    The exact rule name is "DHCP Scope Monitoring", and the rule target is "Microsoft Windows 2003 DHCP Servers Installation". The MP is V6.0.5000.33, probably a rather old version.

  • Re: DHCP

    Yes, that's a very old MP. That makes sense now.

    I would normally say go upgrade that MP... but if you are happy with the monitoring it provides, you might just keep it. The current updated native DHCP MP 6.0.6452.0 has some significant monitoring limitations, due to some advanced monitoring that it performs, and I am not 100% sure those limitations are present in the conversion MP. I just don't know. Like I said, if you are happy, I'd probably stick with it.

    When I run this query against the DW, I receive the error message below:

    Msg 208, Level 16, State 1, Line 1

    Invalid object name 'EventAllView'.

  • To:  Dinesh

    That is because this query is not for the warehouse database.  It is for the Operations database.

    We see a high number of Event ID 10409 events, generated by the rule "Collect WMI Probe Module Events". This rule also collects many other events. How can I disable event collection for Event ID 10409 only?

  • Generally - if that event is high - you have a problem that requires investigation - either a bad MP or some very sick agents.

    That said, you cannot disable event collection for a single event when the collecting rule has multiple events in its data source. The only way to do that is to disable the rule, then recreate it and leave out any IDs you don't want. That said, if you aren't using events in troubleshooting on a regular basis, why not just turn off the whole rule? As long as it doesn't also ALERT, that is.

  • Hello Kevin,

    I ran your query and got:

    EventID  TotalEvents  EventSource
    31707    62549        Health Service Modules
    11771    46004        Health Service Modules
    1199     34200        Health Service Script
    10401    29589        Health Service Modules
    7000     21774        Service Control Manager
    1112     15782        Health Service Script
    21024    13758        OpsMgr Connector
    10409    12846        Health Service Modules
    29103    11987        OpsMgr Config Service
    29102    11985        OpsMgr Config Service
    21025    11942        OpsMgr Connector
    1210     11012        HealthService
    9100     10634        Health Service Modules
    6001     7777         DNS
    10403    6918         Health Service Modules
    1740     4352         ConfigMgr 2007 Monitor State Message Summary Tasks
    1077     3870         W3SVC
    10375    2904         Health Service Modules
    1135     2896         Health Service Script
    21405    2467         Health Service Modules

    So I have 31707 on top, like Marco. What exactly are you referring to when you say "The 31707 is a known issue - from you not configuring your SMS MP according to the guide.  There is a variable for the SMS logs path in the MP - and you need to set this variable on ALL your SMS servers.  I would STRONGLY recommend you set this up correctly - otherwise you arent monitoring your logs, and you are flooding opsmgr with these events."?

    I will review the documentation, but three of us already did and could not find the issue... it was the holidays, so maybe we were too tired :) Otherwise, what should I do with the other events?

    Thanks

    Dom

  • Dom:

    As for the other events, you should follow exactly what the blog post says: create views for them in "My Workspace", look at them, and see whether they indicate a big problem or are something you just want to turn off.

    Several of the ones in your list are ones I turn off collection rules for.  The others that are not in my list... I would investigate.
