There have been a couple of good articles briefly covering this topic, and you might have read them. I will reference some below. Config churn is, basically, when your RMS is stuck in an almost never-ending loop of generating config. This can be caused by “less than optimized” management packs, by pushing agents all the time, or by injecting major changes into a management group, such as overrides, custom rules and monitors, or updated management packs. By examining this topic in depth, we will restate some already known best practices for maintaining a healthy management group, and get some deeper knowledge of why they are best practices in the first place.
Any time you push agents, create rules and monitors, or set overrides for widespread classes, you can create a config update on the RMS that must be sent down to ALL agents in the management group. For small management groups (under 500 agents) this is generally not a big deal and processes rather quickly. For large management groups (over 1000 agents), this can cause high resource utilization on the RMS and the SQL database, in terms of CPU, memory, and disk I/O. This can impact data insertion and console performance during these times. For these reasons, we like to keep those activities to a minimum during working hours, and schedule major changes in an off-hours maintenance window.
What about “less than optimized” management packs? What does that mean? Well, this means management packs that you might be using, that have poorly written discoveries.
We have long known that a worst practice in management pack development is to write a discovery that discovers instances of a class, and give those instances properties that are likely to change frequently. Here is a write-up from OpsManJam on the topic: LINK
Ok… wait… Whaaaaat?
Let me put that in English:
Say we have a discovery for a Logical Disk. This will discover any logical disk, like C:, D:, E:, Q:, etc. When we write the discovery for a logical disk, we can add properties to that discovery. These are attributes of the discovered instances. So, in this case, let's say we decided to add “Size” of the disk as a property, and “Free Space” as a property. And for the discovery frequency, we will run this discovery every hour, looking for new disks.
“Size” is an excellent property for the Logical Disk class. We like to know the size of the disks, and we can use this property to group them if needed. “Size” of a logical disk is not something that we would expect to change very often.
“Free Space” is a horrible property for the Logical Disk class. Free space is something that will likely change, just a small amount even, between each run of the discovery. Free space is a property that is likely to change frequently, therefore – it should NOT be used in a discovery.
Make sense?
Ok – so… what's the big deal?
Well, the agent will run almost all discoveries that it knows about when the health service starts up (like when you bounce the service, or after a reboot). It will always send this discovery data to the management server. (This is another reason why agents restarting all the time is very bad.) After that, it will run them based on the “Interval” frequency specified on the discovery. Sometimes this is as frequent as once per hour, sometimes as long as once per day. When the discovery runs, the agent will inspect the discovery data that it gets, and compare it to the last discovery data it sent to the management server. If nothing changed, the agent drops the discovery data and does nothing. If anything changed in the values of the discovery data, it will re-submit the new data to the management server, which will submit this data to the database. The RMS will detect the change, and will have to recalculate (regenerate) configuration. You will see this on the RMS as a 21025 event:
Log Name: Operations Manager
Source: OpsMgr Connector
Date: 9/27/2009 11:51:49 PM
Event ID: 21025
Task Category: None
Level: Information
Keywords: Classic
User: N/A
Computer: OMRMS.opsmgr.net
Description:
OpsMgr has received new configuration for management group PROD1 from the Configuration Service. The new state cookie is "D7 9B A4 BE 00 90 CF 13 35 B5 9B 5F 3B 14 FF 78 D6 13 9A 2D "
The 21025 event isn’t really “bad”. It simply means the config service did its job: it regenerated its configuration file from the database data, and wrote it to:
\Program Files\System Center Operations Manager 2007\Health Service State\Connector Configuration Cache\<MGNAME>\OpsMgrConnector.Config.xml
The problem comes when this config file gets large (as in large agent count environments) and when the “Config Instance Space” (the total number of discovered objects) is large. Recalculating this config can then have a significant impact on the disk where the file exists on the RMS, use lots of memory and CPU on the RMS for the config service, and use significant disk I/O on the SQL database.
If the RMS is in a perpetual cycle of recalculating config, and sending these config updates to all agents, the performance of the whole management group is impacted.
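To make the “Size” versus “Free Space” point concrete, here is a toy Python sketch of the compare-and-resubmit behavior described above. It is not the agent's real code: the function names, the snapshot data, and the dictionary comparison are all assumptions made up for illustration. It only models the idea that unchanged discovery data is dropped while changed data is sent again.

```python
# Toy model only. The real comparison happens inside the agent's health
# service; run_discovery and simulate are hypothetical names.

def run_discovery(disks, properties):
    """Return discovery data: the chosen property values for each disk."""
    return {name: {p: values[p] for p in properties}
            for name, values in disks.items()}

def simulate(snapshots, properties):
    """Count how many discovery runs would re-submit data upstream."""
    last_sent = None
    submissions = 0
    for disks in snapshots:
        data = run_discovery(disks, properties)
        if data != last_sent:
            # First run, or some value changed: data is submitted again.
            submissions += 1
            last_sent = data
        # Otherwise the data is dropped and nothing is sent.
    return submissions

# Three hourly runs (made-up numbers): size is constant, free space drifts.
snapshots = [
    {"C:": {"Size": 100, "FreeSpace": 40.0}},
    {"C:": {"Size": 100, "FreeSpace": 39.8}},
    {"C:": {"Size": 100, "FreeSpace": 39.5}},
]

stable = simulate(snapshots, ["Size"])
noisy = simulate(snapshots, ["Size", "FreeSpace"])
print("submissions with Size only:", stable)
print("submissions with Size and FreeSpace:", noisy)
```

With the stable property only the first run submits anything. Once the drifting property is included, every run submits, and each submission is a potential config recalculation on the RMS.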
Daniele Grandini of Quae Nocent Docent is pretty much the “godfather” of good information researching the 21025 event. Read his 3-part series on config churn here:
http://nocentdocent.wordpress.com/2009/07/09/troubleshooting-21025-events-wrap-up/
So – what can I do if I think I have too much config churn?
The biggest problem, and the cause of the most frequent config updates, is management packs with noisy discoveries. However, let's wrap up all the issues that can cause it, and what you can do:
- New agents. Discover/install/approve new agents in bulk and off-hours.
- Overrides. Set overrides during off-hours, or create override MP’s in a lab, then sync them to production management groups during scheduled off-hours windows.
- Custom rules and monitors. Create these during off-hours, or create using the authoring console, test in a lab, then import to production during off-hours.
- Newly discovered instances. For instance, someone adds a new disk, or SQL database, or DNS zone, to an existing agent. There is not much we can do about this, except set the expectation that such changes be made during off-hours.
- Management packs with noisy discovery properties. See below.
Ok, the remainder of this article will touch on the last item: management packs with noisy discovery properties.
How can I tell which discoveries are noisy?
Daniele Grandini has put together good queries for this, at this link: http://nocentdocent.wordpress.com/2009/05/23/how-to-get-noisy-discovery-rules/
I will repost these (slightly modified) below:
/* Top Noisy Rules in the last 24 hours */
select ManagedEntityTypeSystemName, DiscoverySystemName, count(*) As 'Changes'
from
(select distinct
MP.ManagementPackSystemName,
MET.ManagedEntityTypeSystemName,
PropertySystemName,
D.DiscoverySystemName, D.DiscoveryDefaultName,
MET1.ManagedEntityTypeSystemName As 'TargetTypeSystemName', MET1.ManagedEntityTypeDefaultName 'TargetTypeDefaultName',
ME.Path, ME.Name,
C.OldValue, C.NewValue, C.ChangeDateTime
from dbo.vManagedEntityPropertyChange C
inner join dbo.vManagedEntity ME on ME.ManagedEntityRowId=C.ManagedEntityRowId
inner join dbo.vManagedEntityTypeProperty METP on METP.PropertyGuid=C.PropertyGuid
inner join dbo.vManagedEntityType MET on MET.ManagedEntityTypeRowId=ME.ManagedEntityTypeRowId
inner join dbo.vManagementPack MP on MP.ManagementPackRowId=MET.ManagementPackRowId
inner join dbo.vManagementPackVersion MPV on MPV.ManagementPackRowId=MP.ManagementPackRowId
left join dbo.vDiscoveryManagementPackVersion DMP on DMP.ManagementPackVersionRowId=MPV.ManagementPackVersionRowId
AND CAST(DefinitionXml.query('data(/Discovery/DiscoveryTypes/DiscoveryClass/@TypeID)') AS nvarchar(max)) like '%'+MET.ManagedEntityTypeSystemName+'%'
left join dbo.vManagedEntityType MET1 on MET1.ManagedEntityTypeRowId=DMP.TargetManagedEntityTypeRowId
left join dbo.vDiscovery D on D.DiscoveryRowId=DMP.DiscoveryRowId
where ChangeDateTime > dateadd(hh,-24,getutcdate())
) As #T
group by ManagedEntityTypeSystemName, DiscoverySystemName
order by count(*) DESC
and
/* Modified properties in the last 24 hours */
select distinct
MP.ManagementPackSystemName,
MET.ManagedEntityTypeSystemName,
PropertySystemName,
D.DiscoverySystemName, D.DiscoveryDefaultName,
MET1.ManagedEntityTypeSystemName As 'TargetTypeSystemName', MET1.ManagedEntityTypeDefaultName 'TargetTypeDefaultName',
ME.Path, ME.Name,
C.OldValue, C.NewValue, C.ChangeDateTime
from dbo.vManagedEntityPropertyChange C
inner join dbo.vManagedEntity ME on ME.ManagedEntityRowId=C.ManagedEntityRowId
inner join dbo.vManagedEntityTypeProperty METP on METP.PropertyGuid=C.PropertyGuid
inner join dbo.vManagedEntityType MET on MET.ManagedEntityTypeRowId=ME.ManagedEntityTypeRowId
inner join dbo.vManagementPack MP on MP.ManagementPackRowId=MET.ManagementPackRowId
inner join dbo.vManagementPackVersion MPV on MPV.ManagementPackRowId=MP.ManagementPackRowId
left join dbo.vDiscoveryManagementPackVersion DMP on DMP.ManagementPackVersionRowId=MPV.ManagementPackVersionRowId
AND CAST(DefinitionXml.query('data(/Discovery/DiscoveryTypes/DiscoveryClass/@TypeID)') AS nvarchar(max)) like '%'+MET.ManagedEntityTypeSystemName+'%'
left join dbo.vManagedEntityType MET1 on MET1.ManagedEntityTypeRowId=DMP.TargetManagedEntityTypeRowId
left join dbo.vDiscovery D on D.DiscoveryRowId=DMP.DiscoveryRowId
where ChangeDateTime > dateadd(hh,-24,getutcdate())
ORDER BY MP.ManagementPackSystemName, MET.ManagedEntityTypeSystemName
Wow – that returned a LOT of discoveries running all the time! What can I do?
- Don't import too many MP’s! The FIRST line of defense – is NOT to import ANY management packs into a management group that you don't absolutely need RIGHT THEN. Management packs are constantly updated, and by the time you have an actual SLA in a technology area – there will likely be a newer, better MP available for it. The biggest mistake many customers make is to import any available MP for a technology that they have internally. They end up with a FLOOD of alerts, big fat databases, slow consoles, and lots of weird errors. MP’s should be transitioned slowly, one at a time – tuning and resolving as you go.
- Disable the noisy discoveries. Probably not a great solution, unless they discover objects that you really don't care about – but there are other objects in the MP that you DO want to monitor.
- Increase the discovery interval. This means, essentially, changing any “bad” discoveries to run only once per day.
- Add a “sync time” override to the discovery, if possible. This option is not available unless the MP author of the discovery exposed it. What this will do is cause all the agents to ONLY run the discovery at a distinct and specified time every day (say, 1AM). This might cause a lot of discovery data to flood in at one time, but since it will all come in at the same time, it won't cause constant config churn all throughout the day.
- Re-write the discovery. If this is a custom MP – rewrite the discovery/MP, and remove that property which changes too often.
- Make sure your hardware and software are optimized for scalability. On your RMS, it is good to place your config file on fast disks, especially in large environments. I have worked with very large customers who were experiencing config churn but had zero ill effects, because their RMS disk I/O was on a 4 spindle RAID10 with 15K spindles, CPU and memory were really good, and their SQL database disk I/O for the OpsDB was excellent with plenty of breathing room. I have also worked with smaller agent counts where config churn had a serious impact, mostly due to the RMS config file being placed on the same RAID spindle set as the OS and pagefile, using only 2 older 10,000 RPM disks. The SQL disk I/O was also just borderline for their agent count.
- Re-run the queries periodically, especially after importing or upgrading a management pack in your management group. If you have a large agent count environment, this “instance space change” report should be part of your testing and evaluation of a new MP when you bring it into your lab.
Some very common discoveries I have seen that have properties that change very frequently are listed below. I often recommend these be overridden to run once per day (86,400 seconds). All frequencies in the table are intervals in seconds.
| Discovery Display Name | Discovery Target Class | Discovered Type | Default frequency (seconds) | Modified frequency (seconds) |
| --- | --- | --- | --- | --- |
| Discover File Groups and Files | SQL 2005 DB Engine | SQL 2005 DB File | 7260 | 86400 |
| Discover File Groups and Files | SQL 2005 DB Engine | SQL 2005 DB File Group | 7260 | 86400 |
| Discover SQL 2000 Databases | SQL 2000 DB Engine | SQL 2000 DB | 1800 | 86400 |
| Discover Databases for a Database Engine | SQL 2005 DB Engine | SQL 2005 DB | 7200 | 86400 |
| DNS 2003 Component Discovery | DNS 2003 Server | DNS 2003 Zone | 21600 | 86400 |
| DNS 2008 Component Discovery | DNS 2008 Server | DNS 2008 Zone | 21600 | 86400 |
| Windows Internet Information Services Base Classes Discovery Rule | IIS 2003 Server Role | IIS FTP Site | 3600 | 86400 |
| Windows Internet Information Services Base Classes Discovery Rule | IIS 2000 Server Role | IIS NNTP Virtual Server | 3600 | 86400 |
The above is just a sample. You should examine the output of the queries above and see what is impacting your management group the most.
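As a rough back-of-the-envelope check on what the once-per-day override buys you, the small Python sketch below converts the default intervals from the table into scheduled runs per agent per day. The grouping labels are shortened by me, and the calculation ignores the extra runs that happen whenever the health service restarts, so treat the numbers as an approximation.

```python
SECONDS_PER_DAY = 86400

# Default intervals (seconds) copied from the table above; labels shortened.
default_intervals = {
    "SQL 2005 file groups and files": 7260,
    "SQL 2000 databases": 1800,
    "SQL 2005 databases": 7200,
    "DNS 2003/2008 component discovery": 21600,
    "IIS base classes discovery": 3600,
}

def runs_per_day(interval_seconds):
    """Scheduled discovery runs per agent per day for a given interval."""
    return SECONDS_PER_DAY / interval_seconds

for name, interval in default_intervals.items():
    before = runs_per_day(interval)
    after = runs_per_day(SECONDS_PER_DAY)  # overridden to once per day
    print(f"{name}: {before:.1f} runs/day by default, {after:.1f} after override")
```

Every one of those scheduled runs is a chance for a changed property to be re-submitted, so cutting the runs also caps how often a noisy discovery can trigger a config update from a given agent.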