We often think of tuning OpsMgr in terms of reducing "alert noise": disabling rules that generate alerts we don't care about, or modifying thresholds on monitors to make alerts more actionable for our specific environment.
However, one area of OpsMgr that often goes overlooked is event overcollection. This has a real cost, because collected events create LAN/WAN traffic, agent overhead, OpsDB size bloat, and especially data warehouse size bloat. I have worked with customers whose data warehouse was more than one-third event data, with ZERO requirement for it. They were paying for disk storage and backup expense, plus added time and load on the framework, all for data they cared nothing about.
MOST of these events are enabled out of the box and are default OpsMgr collection rules from the "System Center Core Monitoring" MP. These events are items like "config requested", "config delivered", and "new config active". They might be interesting, but no advanced analysis is included to use them to detect a problem. In small environments they are usually not a big deal, but in large agent count environments these events can account for a LOT of data and provide little value unless you are doing something advanced with them. I have yet to see a customer who did that.
At a high level, here is how I like to review these events: run the "Most Common Events" query against the OpsDB, examine the top events, and consider disabling the corresponding event collection rules.
Most common events by event number and event publishername:
SELECT TOP 20 Number AS EventID, COUNT(*) AS TotalEvents, PublisherName AS EventSource
FROM EventAllView eav WITH (NOLOCK)
GROUP BY Number, PublisherName
ORDER BY TotalEvents DESC
The trick is to run this query periodically and examine the most common events for YOUR environment. The easiest way to determine the value of these events is to create a new Events view in My Workspace for each event, then look at the event data and the rule that collected it. (I will use a common event, 21024, as an example.)
What we can see is that this is a very typical event, and there is likely no real value in collecting and storing it in the OpsDB or warehouse.
Next, I examine the rule: the Data Source section and the Response section. The purpose is to get a good idea of where the collection rule is looking, what events it collects, and whether there is also an alert in the response section. If there is an alert in the response section, I assume the rule is important and will generally leave it enabled.
If the rule simply collects the event (no alerting), is not used in any reports that I know about (a rare condition), and I have determined the event provides little to no value, I disable it. You will find you can disable most of the top consumers in the database.
Here is why I consider it totally cool to disable these uninteresting event collection rules:
Here is an example of this one:
So – I create an override in my “Overrides – System Center Core” MP, and disable this rule “for all objects of class”.
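For reference, a disable override in an unsealed MP looks roughly like the fragment below. This is a sketch only: the override ID, the Context class, and the Rule reference are placeholders, not the actual rule names, which you would look up in your own MP.

```xml
<!-- Sketch of a "disable for all objects of class" override.
     ID, Context, and Rule values here are placeholders. -->
<RulePropertyOverride ID="DisableEventCollectRule.Override"
                      Context="Windows!Microsoft.Windows.Computer"
                      Enforced="false"
                      Rule="SC!Microsoft.SystemCenter.SomeEventCollectionRule"
                      Property="Enabled">
  <Value>false</Value>
</RulePropertyOverride>
```

The console creates exactly this kind of element for you; inspecting the unsealed override MP in XML is just a good way to audit what you have disabled.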
Here are some very common event ID’s that I will generally end up disabling their corresponding event collection rules:
1206, 1210, 1215, 1216, 10102, 10401, 10403, 10409, 10457, 10720, 11771, 21024, 21025, 21402, 21403, 21404, 21405, 29102, 29103
I don't recommend everyone disable all of these rules. I recommend you periodically view your top 10 or 20 events and then review them for value. Just knocking out the top 10 events will often free up 90% of the space they were consuming.
The above events are the ones I run into at most of my customers, and I generally turn these off, as we get no value from them. You might find you have other events as your top consumers. I recommend you review them in the same manner as above: methodically. Then revisit this every month or two to see if anything has changed.
I'd also love to hear if you have other top-consumer events that aren't on my list above. SOME events are created from script (converted MPs) and unfortunately you cannot do much about those, because you would have to disable the script to fix them. I'd be happy to give feedback on those, or add any new ones to my list.
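If warehouse bloat is your main concern, the same review can be run against the data warehouse. A sketch, assuming the standard Event.vEvent view in the DW schema:

```sql
/* Most common events in the data warehouse, by event number */
SELECT TOP 20 EventDisplayNumber, COUNT(*) AS TotalEvents
FROM Event.vEvent
GROUP BY EventDisplayNumber
ORDER BY TotalEvents DESC
```

Comparing the OpsDB and DW results side by side usually shows the same handful of collection rules driving the bulk of both.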
It has been a well-known requirement for most customers to be able to create groups of Windows Computers that also contain the corresponding Health Service Watcher objects. This was needed for alert notification subscriptions, so that different teams could receive alert notifications filtered by group, but also include alerts from the watcher, such as Heartbeat Failure and Computer Unreachable. There are several articles on this, but I will reference a very popular one on Tim's site:
http://www.scom2k7.com/dynamic-computer-groups-that-send-heartbeat-alerts/
Essentially, we needed to add an extra membership rule to the XML that would also add any Health Service Watcher objects that have a relationship to the Windows Computer objects already in the group. We did this with the following XML:
<MembershipRule>
  <MonitoringClass>$MPElement[Name="SC!Microsoft.SystemCenter.HealthServiceWatcher"]$</MonitoringClass>
  <RelationshipClass>$MPElement[Name="MicrosoftSystemCenterInstanceGroupLibrary!Microsoft.SystemCenter.InstanceGroupContainsEntities"]$</RelationshipClass>
  <Expression>
    <Contains>
      <MonitoringClass>$MPElement[Name="SC!Microsoft.SystemCenter.HealthService"]$</MonitoringClass>
      <Expression>
        <Contained>
          <MonitoringClass>$MPElement[Name="Windows!Microsoft.Windows.Computer"]$</MonitoringClass>
          <Expression>
            <Contained>
              <MonitoringClass>$Target/Id$</MonitoringClass>
            </Contained>
          </Expression>
        </Contained>
      </Expression>
    </Contains>
  </Expression>
</MembershipRule>
However, what if we ONLY want a group of Health Service Watcher objects, and NOT the Windows Computers, but we wish to base the HSW membership on another group of Windows Computers? This is useful if we want to create availability reports for a group of Windows Computers, but need to base the report on a specific up/down monitor, and not on anything related to Windows Computer objects.
Here is a code example of exactly that:
In this sample, we will create a simple group of Windows Computers whose names start with "DB". Then we will create another group containing only HSW objects, corresponding to the computers in the first group.
<ManagementPack ContentReadable="true" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <Manifest>
    <Identity>
      <ID>grouptest</ID>
      <Version>1.0.0.8</Version>
    </Identity>
    <Name>grouptest</Name>
    <References>
      <Reference Alias="MSCIGL">
        <ID>Microsoft.SystemCenter.InstanceGroup.Library</ID>
        <Version>6.1.7221.0</Version>
        <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
      </Reference>
      <Reference Alias="SC">
        <ID>Microsoft.SystemCenter.Library</ID>
        <Version>6.1.7221.0</Version>
        <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
      </Reference>
      <Reference Alias="Windows">
        <ID>Microsoft.Windows.Library</ID>
        <Version>6.1.7221.0</Version>
        <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
      </Reference>
      <Reference Alias="Health">
        <ID>System.Health.Library</ID>
        <Version>6.1.7221.0</Version>
        <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
      </Reference>
      <Reference Alias="System">
        <ID>System.Library</ID>
        <Version>6.1.7221.0</Version>
        <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
      </Reference>
    </References>
  </Manifest>
  <TypeDefinitions>
    <EntityTypes>
      <ClassTypes>
        <ClassType ID="grouptest.compgroup" Accessibility="Internal" Abstract="false" Base="SC!Microsoft.SystemCenter.ComputerGroup" Hosted="false" Singleton="true" />
        <ClassType ID="grouptest.SQLWatchers" Accessibility="Internal" Abstract="false" Base="MSCIGL!Microsoft.SystemCenter.InstanceGroup" Hosted="false" Singleton="true" />
      </ClassTypes>
    </EntityTypes>
  </TypeDefinitions>
  <Monitoring>
    <Discoveries>
      <Discovery ID="grouptest.DiscoverSQLServersComputerGroup" Enabled="true" Target="grouptest.compgroup" ConfirmDelivery="true" Remotable="true" Priority="Normal">
        <Category>Discovery</Category>
        <DiscoveryTypes>
          <DiscoveryRelationship TypeID="SC!Microsoft.SystemCenter.ComputerGroupContainsComputer" />
        </DiscoveryTypes>
        <DataSource ID="GP" TypeID="SC!Microsoft.SystemCenter.GroupPopulator">
          <RuleId>$MPElement$</RuleId>
          <GroupInstanceId>$MPElement[Name="grouptest.compgroup"]$</GroupInstanceId>
          <MembershipRules>
            <MembershipRule>
              <MonitoringClass>$MPElement[Name="Windows!Microsoft.Windows.Computer"]$</MonitoringClass>
              <RelationshipClass>$MPElement[Name="SC!Microsoft.SystemCenter.ComputerGroupContainsComputer"]$</RelationshipClass>
              <Expression>
                <RegExExpression>
                  <ValueExpression>
                    <Property>$MPElement[Name="Windows!Microsoft.Windows.Computer"]/PrincipalName$</Property>
                  </ValueExpression>
                  <Operator>MatchesWildcard</Operator>
                  <Pattern>DB*</Pattern>
                </RegExExpression>
              </Expression>
            </MembershipRule>
          </MembershipRules>
        </DataSource>
      </Discovery>
      <Discovery ID="grouptest.DiscoverSQLWatchers" Enabled="true" Target="grouptest.SQLWatchers" ConfirmDelivery="true" Remotable="true" Priority="Normal">
        <Category>Discovery</Category>
        <DiscoveryTypes>
          <DiscoveryRelationship TypeID="MSCIGL!Microsoft.SystemCenter.InstanceGroupContainsEntities" />
        </DiscoveryTypes>
        <DataSource ID="GP" TypeID="SC!Microsoft.SystemCenter.GroupPopulator">
          <RuleId>$MPElement$</RuleId>
          <GroupInstanceId>$MPElement[Name="grouptest.SQLWatchers"]$</GroupInstanceId>
          <MembershipRules>
            <MembershipRule>
              <MonitoringClass>$MPElement[Name="SC!Microsoft.SystemCenter.HealthServiceWatcher"]$</MonitoringClass>
              <RelationshipClass>$MPElement[Name="MSCIGL!Microsoft.SystemCenter.InstanceGroupContainsEntities"]$</RelationshipClass>
              <Expression>
                <Contains>
                  <MonitoringClass>$MPElement[Name="SC!Microsoft.SystemCenter.HealthService"]$</MonitoringClass>
                  <Expression>
                    <Contained>
                      <MonitoringClass>$MPElement[Name="grouptest.compgroup"]$</MonitoringClass>
                    </Contained>
                  </Expression>
                </Contains>
              </Expression>
            </MembershipRule>
          </MembershipRules>
        </DataSource>
      </Discovery>
    </Discoveries>
  </Monitoring>
  <LanguagePacks>
    <LanguagePack ID="ENU" IsDefault="true">
      <DisplayStrings>
        <DisplayString ElementID="grouptest">
          <Name>Group Test</Name>
          <Description />
        </DisplayString>
        <DisplayString ElementID="grouptest.compgroup">
          <Name>SQL Servers Computer Group</Name>
        </DisplayString>
        <DisplayString ElementID="grouptest.DiscoverSQLServersComputerGroup">
          <Name>Discovery for SQL Servers Computer Group</Name>
        </DisplayString>
        <DisplayString ElementID="grouptest.DiscoverSQLWatchers">
          <Name>Discovery for SQL Health Service Watchers Group</Name>
          <Description />
        </DisplayString>
        <DisplayString ElementID="grouptest.SQLWatchers">
          <Name>SQL Health Service Watchers Group</Name>
        </DisplayString>
      </DisplayStrings>
    </LanguagePack>
  </LanguagePacks>
</ManagementPack>
The key to this is the specific reference of the other group – shown here:
<MembershipRules>
  <MembershipRule>
    <MonitoringClass>$MPElement[Name="SC!Microsoft.SystemCenter.HealthServiceWatcher"]$</MonitoringClass>
    <RelationshipClass>$MPElement[Name="MSCIGL!Microsoft.SystemCenter.InstanceGroupContainsEntities"]$</RelationshipClass>
    <Expression>
      <Contains>
        <MonitoringClass>$MPElement[Name="SC!Microsoft.SystemCenter.HealthService"]$</MonitoringClass>
        <Expression>
          <Contained>
            <MonitoringClass>$MPElement[Name="grouptest.compgroup"]$</MonitoringClass>
          </Contained>
        </Expression>
      </Contains>
    </Expression>
  </MembershipRule>
</MembershipRules>
There have been a couple of good articles briefly covering this topic; you might have read them. I will reference some below. Config churn is basically when your RMS is in an almost never-ending loop of generating config. This can be caused by "less than optimized" management packs, pushing agents all the time, or injecting major changes into a management group, such as overrides, custom rules and monitors, or importing updated management packs. By examining this topic in depth, we will restate some known best practices for maintaining a healthy management group, and gain some deeper knowledge as to why they are best practices in the first place.
Any time you push agents, or create rules, monitors, or overrides for widespread classes, you can trigger a config update on the RMS that must be sent down to ALL agents in the management group. For small management groups (under 500 agents) this is generally not a big deal and processes rather quickly. For large management groups (over 1000 agents), this can cause high resource utilization on the RMS and SQL database, in terms of CPU, memory, and disk I/O, which can impact data insertion and console performance. For these reasons, we like to keep these activities to a minimum during working hours and schedule major changes in an off-hours maintenance window.
What about “less than optimized” management packs? What does that mean? Well, this means management packs that you might be using, that have poorly written discoveries.
We have long known that a worst practice in management pack development is a discovery that populates instance properties which are likely to change frequently. Here is a write-up from OpsManJam on the topic: LINK
Ok… wait… Whaaaaat?
Let me put that in English:
Say we have a discovery for a logical disk. This will discover any logical disk, like C:, D:, E:, Q:, etc. When we write the discovery for a logical disk, we can add properties to that discovery; these are attributes of the discovered instances. In this case, let's say we decided to add the "Size" of the disk as a property, and "Free Space" as a property, and for the discovery frequency, we will run this discovery every hour, looking for new disks.
"Size" is an excellent property for the Logical Disk class. We like to know the size of the disks, and we can use this property to group them if needed. The "Size" of a logical disk is not something we would expect to change very often.
"Free Space" is a horrible property for the Logical Disk class. Free space will likely change, even if just by a small amount, between each run of the discovery. A property that is likely to change frequently should NOT be used in a discovery.
Make sense?
Ok – so… what's the big deal?
Well, the agent will run almost all discoveries that it knows about when the health service starts up (such as when you bounce the service, or after a reboot), and it will always send this discovery data to the management server. (This is another reason why agents restarting all the time is very bad.) Then, it will run them based on the "Interval" frequency specified on the discovery; sometimes this is as frequent as once per hour, sometimes as long as once per day. When the discovery runs, the agent inspects the discovery data it gets and compares it to the last discovery data it sent to the management server. If nothing changed, the agent drops the discovery data and does nothing. If anything changed in the values of the discovery data, it re-submits the new data to the management server, which submits it to the database. The RMS will detect the change and will have to recalculate (regenerate) configuration. You will see this on the RMS as a 21025 event:
Log Name: Operations Manager
Source: OpsMgr Connector
Date: 9/27/2009 11:51:49 PM
Event ID: 21025
Task Category: None
Level: Information
Keywords: Classic
User: N/A
Computer: OMRMS.opsmgr.net
Description: OpsMgr has received new configuration for management group PROD1 from the Configuration Service. The new state cookie is "D7 9B A4 BE 00 90 CF 13 35 B5 9B 5F 3B 14 FF 78 D6 13 9A 2D "
The 21025 event isn't really "bad"; it simply means the config service did its job: it regenerated its configuration file from the database data and wrote it to \Program Files\System Center Operations Manager 2007\Health Service State\Connector Configuration Cache\<MGNAME>\OpsMgrConnector.Config.xml. The problem comes when this config file gets large (as in large agent count environments) and when the "config instance space" (the total number of discovered objects) is large. Recalculating this config can have a significant impact on the disk where the file lives on the RMS, use lots of memory and CPU on the RMS for the config service, and use significant disk I/O on the SQL database.
If the RMS is in a perpetual cycle of recalculating config, and sending these config updates to all agents…. the performance of the management group is impacted.
Daniele Grandini of Quae Nocent Docent is pretty much the "godfather" of good information on the 21025 event. Read his three-part series on config churn here:
http://nocentdocent.wordpress.com/2009/07/09/troubleshooting-21025-events-wrap-up/
So – what can I do if I think I have too much config churn?
The biggest cause of frequent config updates is management packs with noisy discoveries. However, let's recap all the issues that can cause churn, and what you can do:
Ok – the remainder of this article will touch on #5.
How can I tell which discoveries are noisy?
Daniele Grandini has put together a good query on this, from his link: http://nocentdocent.wordpress.com/2009/05/23/how-to-get-noisy-discovery-rules/
I will repost these (slightly modified) below:
/* Top Noisy Rules in the last 24 hours */
select ManagedEntityTypeSystemName, DiscoverySystemName, count(*) As 'Changes'
from
(select distinct
   MP.ManagementPackSystemName,
   MET.ManagedEntityTypeSystemName,
   PropertySystemName,
   D.DiscoverySystemName,
   D.DiscoveryDefaultName,
   MET1.ManagedEntityTypeSystemName As 'TargetTypeSystemName',
   MET1.ManagedEntityTypeDefaultName 'TargetTypeDefaultName',
   ME.Path,
   ME.Name,
   C.OldValue,
   C.NewValue,
   C.ChangeDateTime
 from dbo.vManagedEntityPropertyChange C
 inner join dbo.vManagedEntity ME on ME.ManagedEntityRowId=C.ManagedEntityRowId
 inner join dbo.vManagedEntityTypeProperty METP on METP.PropertyGuid=C.PropertyGuid
 inner join dbo.vManagedEntityType MET on MET.ManagedEntityTypeRowId=ME.ManagedEntityTypeRowId
 inner join dbo.vManagementPack MP on MP.ManagementPackRowId=MET.ManagementPackRowId
 inner join dbo.vManagementPackVersion MPV on MPV.ManagementPackRowId=MP.ManagementPackRowId
 left join dbo.vDiscoveryManagementPackVersion DMP on DMP.ManagementPackVersionRowId=MPV.ManagementPackVersionRowId
   AND CAST(DefinitionXml.query('data(/Discovery/DiscoveryTypes/DiscoveryClass/@TypeID)') AS nvarchar(max)) like '%'+MET.ManagedEntityTypeSystemName+'%'
 left join dbo.vManagedEntityType MET1 on MET1.ManagedEntityTypeRowId=DMP.TargetManagedEntityTypeRowId
 left join dbo.vDiscovery D on D.DiscoveryRowId=DMP.DiscoveryRowId
 where ChangeDateTime > dateadd(hh,-24,getutcdate())
) As #T
group by ManagedEntityTypeSystemName, DiscoverySystemName
order by count(*) DESC
and
/* Modified properties in the last 24 hours */
select distinct
  MP.ManagementPackSystemName,
  MET.ManagedEntityTypeSystemName,
  PropertySystemName,
  D.DiscoverySystemName,
  D.DiscoveryDefaultName,
  MET1.ManagedEntityTypeSystemName As 'TargetTypeSystemName',
  MET1.ManagedEntityTypeDefaultName 'TargetTypeDefaultName',
  ME.Path,
  ME.Name,
  C.OldValue,
  C.NewValue,
  C.ChangeDateTime
from dbo.vManagedEntityPropertyChange C
inner join dbo.vManagedEntity ME on ME.ManagedEntityRowId=C.ManagedEntityRowId
inner join dbo.vManagedEntityTypeProperty METP on METP.PropertyGuid=C.PropertyGuid
inner join dbo.vManagedEntityType MET on MET.ManagedEntityTypeRowId=ME.ManagedEntityTypeRowId
inner join dbo.vManagementPack MP on MP.ManagementPackRowId=MET.ManagementPackRowId
inner join dbo.vManagementPackVersion MPV on MPV.ManagementPackRowId=MP.ManagementPackRowId
left join dbo.vDiscoveryManagementPackVersion DMP on DMP.ManagementPackVersionRowId=MPV.ManagementPackVersionRowId
  AND CAST(DefinitionXml.query('data(/Discovery/DiscoveryTypes/DiscoveryClass/@TypeID)') AS nvarchar(max)) like '%'+MET.ManagedEntityTypeSystemName+'%'
left join dbo.vManagedEntityType MET1 on MET1.ManagedEntityTypeRowId=DMP.TargetManagedEntityTypeRowId
left join dbo.vDiscovery D on D.DiscoveryRowId=DMP.DiscoveryRowId
where ChangeDateTime > dateadd(hh,-24,getutcdate())
ORDER BY MP.ManagementPackSystemName, MET.ManagedEntityTypeSystemName
Wow – that returned a LOT of discoveries running all the time! What can I do?
Some very common discoveries I have seen that have properties that change very frequently are listed below. I often recommend these be overridden to run once per day (86400 seconds), or once per week (604800 seconds) if the problem is serious or still present when running once per day (large agent counts).
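As a sketch, such an interval override in an unsealed MP looks roughly like the fragment below. The override ID, Context class, and Discovery reference here are placeholders; you would substitute the actual discovery you are extending.

```xml
<!-- Sketch: extend a noisy discovery to run once per day.
     ID, Context, and Discovery values are placeholders. -->
<DiscoveryConfigurationOverride ID="ExtendNoisyDiscovery.Override"
                                Context="Windows!Microsoft.Windows.Computer"
                                Enforced="false"
                                Discovery="SomeMP!Some.Noisy.Discovery"
                                Parameter="IntervalSeconds">
  <Value>86400</Value>
</DiscoveryConfigurationOverride>
```

This is the same XML the console produces when you override "Interval Seconds" on a discovery; seeing it in the raw MP makes it easy to audit all your interval overrides in one place.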
The top noisy MPs with bad discoveries I find in customer environments are almost ALWAYS some combination of the following:
The above is just a sample; you should examine the output of the query above and see what is impacting your management group the most.
**Note on the SQL MP – There is a new SQL MP which resolved config churn for that MP, version 6.1.314.36 and later. This new MP no longer churns on the database class properties for DB and log size. I strongly recommend upgrading to this version of the SQL MP. See: http://blogs.technet.com/b/kevinholman/archive/2010/08/16/sql-mp-version-6-1-314-36-released-adds-support-for-sql-2008r2-and-many-other-changes.aspx
**Note on the DNS MP – There is a new DNS MP which resolved config churn for that MP, version 6.0.7000.0 and later. This new MP no longer churns on the PrimaryServer and SerialNumber class properties. I strongly recommend upgrading to this version of the DNS MP. See: http://blogs.technet.com/b/kevinholman/archive/2011/02/24/dns-mp-update-ships-support-for-dns-on-windows-server-2008-r2-and-many-fixes.aspx
Note some deeper level information on this topic:
What is the maximum value I can set a discovery frequency to? Supposedly, the MAX value is 2419200 seconds, which is 4 weeks. Normally, discoveries should not have to be stretched out so long; only do this if they are creating a problem. Setting this to 4 weeks essentially negates the discovery, which is no big deal for something already discovered. However, for something like SQL databases, that means it might take 4 weeks to start monitoring a new database. That is not good. There is a workaround for using extended frequencies and still discovering new items: when you restart the HealthService of an agent, it will immediately run all discoveries that apply to it that don't have a sync time set. This means that, as a workaround to the workaround, you can simply restart the agent's health service if you add a new database or IIS website and need monitoring sooner than the max frequency allows.
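The restart workaround above can be done from an elevated PowerShell prompt on the agent (HealthService is the OpsMgr agent service name):

```powershell
# Restart the OpsMgr agent health service; on startup it immediately runs
# all applicable discoveries that do not have a sync time set.
Restart-Service HealthService
```

Obviously, do this during a window where a brief gap in monitoring for that agent is acceptable.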
RMS Churn: When a discovery property change comes in for an instance that is hosted by an agent, the RMS creates new config and sends it to that agent. This is a normal process, but we want to keep it from happening too frequently. It isn't terribly expensive unless the number of instances hosted by the agent is very high (as in: a typical agent might have 40 instances, but a SQL server with 1000 databases has 1040 instances).
Next up: what if the discovery property change occurs on an instance that is a member of a group? This is worse, because it now causes a config recalc for the agent AND for the RMS. The RMS has to evaluate group population membership, since it hosts the group instances and one or more of the contained instances changed, which might affect group membership. For instance, if the SQL database size property changes, this is no big deal, UNLESS you have created groups of SQL databases somewhere in the management group and the changed database is a member of one or more of them. This causes the RMS to update its own config.
Lastly, when a discovery property change comes in for an instance of a class that is hosted by the RMS, this causes the RMS to completely recalculate its own config as well and update its local health service config file. This is very expensive, and these instances should be given top consideration when fixing their discoveries, or extending them to reduce the issue. The most common ones I see are the DNS Domain, DNS Zone, and AD Connection objects, which I have highlighted in red above. Changes to these instances are VERY expensive, because these are logical instances not hosted by any SINGLE agent, so they get hosted by the RMS. When they change, the RMS is forced to regenerate its own config. This will be evident as a LARGE number of 21025 events in the RMS OpsMgr event log. Generally, we only want to see this file updated when necessary; two to three times per hour is ideal. However, if you are running the DNS management pack or the ADMP, you are likely seeing this event every few MINUTES. These DNS discoveries should be evaluated and overridden.
Other items hosted by the RMS are groups. When group membership changes – this impacts RMS performance. This is due to the fact that the RMS hosts the group instances, and the relationships to what each group contains. When group membership changes – the RMS generates new config. This will also show up as a 21025 Event in the RMS OpsMgr event log. So if you have tackled the discoveries from MP’s changing frequently – the next thing to look at is groups. If you have a large management group, and you think this might be impacting you – one of the things you can do is to slow down the group populator module. By default – this runs every 30 seconds.
We have a registry setting to make group calculation run less often, lowering the performance hit on the database. With a less frequent setting, group calculation polls the database less often, with the understanding that the latency of group membership discovery will increase:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\GroupCalcPollingIntervalMilliseconds
The default is 30000 milliseconds (30 seconds). You can create this new DWORD value to control the setting.
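As a sketch, creating the value from PowerShell on the RMS (the 120000 value here is just an example of a less aggressive interval, not a recommendation):

```powershell
# Create the DWORD and slow group calc polling from 30 to 120 seconds (example value)
New-ItemProperty -Path "HKLM:\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0" `
    -Name GroupCalcPollingIntervalMilliseconds -PropertyType DWord -Value 120000
```

Restart the RMS health service for the change to take effect, and watch 21025 event frequency before and after to judge the impact.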
If you want to see all the instance types hosted by the RMS, run this query against the Operations Database:
DECLARE @RelationshipTypeId_Manages UNIQUEIDENTIFIER
SELECT @RelationshipTypeId_Manages = dbo.fn_RelationshipTypeId_Manages()
SELECT bme.FullName, dt.TopLevelEntityName, dt.BaseEntityName, dt.TypedEntityName
FROM BaseManagedEntity bme
RIGHT JOIN
( SELECT HBME.BaseManagedEntityId AS HS_BMEID,
         TBME.FullName AS TopLevelEntityName,
         BME.FullName AS BaseEntityName,
         TYPE.TypeName AS TypedEntityName
  FROM BaseManagedEntity BME WITH(NOLOCK)
  INNER JOIN TypedManagedEntity TME WITH(NOLOCK) ON BME.BaseManagedEntityId = TME.BaseManagedEntityId AND BME.IsDeleted = 0 AND TME.IsDeleted = 0
  INNER JOIN BaseManagedEntity TBME WITH(NOLOCK) ON BME.TopLevelHostEntityId = TBME.BaseManagedEntityId AND TBME.IsDeleted = 0
  INNER JOIN ManagedType TYPE WITH(NOLOCK) ON TME.ManagedTypeID = TYPE.ManagedTypeID
  LEFT JOIN Relationship R WITH(NOLOCK) ON R.TargetEntityId = TBME.BaseManagedEntityId AND R.RelationshipTypeId = @RelationshipTypeId_Manages AND R.IsDeleted = 0
  LEFT JOIN BaseManagedEntity HBME WITH(NOLOCK) ON R.SourceEntityId = HBME.BaseManagedEntityId
) AS dt ON dt.HS_BMEID = bme.BaseManagedEntityId
WHERE FullName LIKE '%RMSNAME%'
ORDER BY TypedEntityName
Change "RMSNAME" above to your RMS name. You will see that most results are groups, but you might be surprised at everything that is hosted by the RMS.
Jalasoft recently updated their network device simulator, which is useful for testing/demo of OpsMgr network monitoring capabilities.
You can download the simulator here:
http://www.jalasoft.com/xian/snmpsimulator
This article will walk through the setup, configuration, and initial monitoring.
You will need a computer or VM (Windows 2003 or above, including Win7 or Win8, apparently). Then you will need to add multiple IP addresses: one for each device you want to simulate:
In the example above – 10.10.10.20 is the primary IP for my server. Network devices will be simulated on 10.10.10.21 through 10.10.10.25
Run Setup.exe and install the defaults, the Agent Service and Simulator Console.
Provide a service account in order to run the simulator as a service (a new and much needed feature!)
Select the IP address that is the primary IP for the server.
When install is complete – open the Device Simulator console.
Connect to the agent on your primary IP.
Click the + to add a new device.
Let's add a Cisco router:
On the first secondary IP:
And leave defaults for SNMP (V2 and “public”)
Now let's add additional devices, such as switches, firewalls, etc.
When done – click the Green arrow to save the config.
Next up – we need to give each device a DNS A record so that SCOM can discover it. In AD DNS, create new A records with associated PTR records, and give each device a name:
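If you prefer to script the DNS records, the Server 2012 DnsServer module can do it. This is a sketch; the zone name and device names here are hypothetical placeholders matching my lab:

```powershell
# Hypothetical zone and hostnames; creates an A record plus the matching PTR
# for each simulated device IP.
Add-DnsServerResourceRecordA -ZoneName "opsmgr.net" -Name "router1" `
    -IPv4Address "10.10.10.21" -CreatePtr
Add-DnsServerResourceRecordA -ZoneName "opsmgr.net" -Name "switch1" `
    -IPv4Address "10.10.10.22" -CreatePtr
```

Run this on (or remoted to) your DNS server; repeat for each simulated device IP.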
Once you have added the DNS records in AD – we are ready to discover the devices in SCOM:
Administration > Network Management > Discovery Rules. Run the discovery wizard and discover network devices.
Give the discovery rule a name, choose a management server to run the discovery, and select a resource pool to monitor the network devices
(Hint – you should always create a dedicated resource pool for monitoring network devices, even if you only have a single management server. This allows you to scale these out to dedicated servers in the future without making any other changes)
Choose Explicit discovery.
Create a Run As account for the “public” SNMP community string. Select it:
Add in each device and select the appropriate community string Run As account:
Then choose to run the discovery manually:
And click “Create”, and leave the box checked to “Run the network discovery rule”
In the console – you can see the discovery rule and the status:
In the event log of the management server that runs the discovery – you will soon see network discovery events:
Once this is complete – you should see the network devices in the console views:
You can run Health Explorer and view the out-of-the-box monitoring:
Or look at the network node and summary dashboards to view summary and historical data
This may be old news, but it is a handy reference for OpsMgr admins, when asked to monitor for specific events from security event logs:
Windows Server 2003: http://technet.microsoft.com/en-us/library/cc163121.aspx
Windows Server 2008: http://www.microsoft.com/download/en/details.aspx?id=17871
Windows Server 2008 R2: http://www.microsoft.com/download/en/details.aspx?id=21561
A time may come when you need to migrate your existing DHCP services to new servers/hardware. Windows Server 2012 ships with PowerShell cmdlets to make this a simple transition.
You can read about the process here: http://blogs.technet.com/b/teamdhcp/archive/2012/09/11/migrating-existing-dhcp-server-deployment-to-windows-server-2012-dhcp-failover.aspx
I have two DHCP servers (Windows Server 2012) in a failover configuration, leveraging the new capabilities in failover DHCP for Server 2012, which you can read about here: http://technet.microsoft.com/en-us/library/jj200226.aspx. I will be migrating these to Windows Server 2012 R2 DHCP servers.
I start by installing the DHCP server role on my new 2012 R2 DHCP servers. Then, a quick configure using the Post install wizard from server manager to authorize the DHCP servers in AD.
Next up – I need to export the DHCP server configuration in PowerShell, from the old server. In this case, I will be migrating from two DHCP servers (DC1, DC2) to the new ones (DC01, DC02).
Create a folder C:\export on the new primary DHCP server (DC01). Open an administrator PowerShell session. Run the following command to export remotely from the old primary DHCP server:
Export-DhcpServer –ComputerName DC1.opsmgr.net -Leases -File C:\export\dhcpexp.xml –verbose
Next up, we need to create a backup path for the DHCP server database on the new DHCP server, DC01. Create a folder C:\dhcp\backup. Then, we can import the old DHCP server configuration using the following command:
Import-DhcpServer –ComputerName DC01.opsmgr.net -Leases –File C:\export\dhcpexp.xml -BackupPath C:\dhcp\backup\ -Verbose
The last import we need to run, is to import the server configuration ONLY to the secondary, or failover DHCP server. First, on DC02 (the new failover DHCP server) create a backup folder at C:\dhcp\backup. Then, go back to DC01 where you have the local export files, and run the following command to import server config to DC02:
Import-DhcpServer –ComputerName DC02.opsmgr.net –File C:\export\dhcpexp.xml –ServerConfigOnly –verbose –BackupPath C:\dhcp\backup\
At this point, we have imported the server configuration to BOTH new DHCP servers, and we have imported all the lease and scope data to the new primary DHCP server. What we need to do in order to complete the configuration, is to set up the failover configuration on the new pair. This is covered here: http://technet.microsoft.com/en-us/library/hh831385.aspx#failover_1
On DC01, open the DHCP control applet, and right click IPv4 (all scopes) or specific scopes, and click “Configure Failover”
Step through the wizard, and choose “Load balance" mode.
Provide a shared secret for the DHCP servers to authenticate with each other for replication.
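As an alternative to the wizard, the same failover relationship can be created with the DHCP Server PowerShell cmdlets. The scope ID, relationship name, and shared secret below are illustrative for this example environment:

```powershell
# Create a 50/50 load balance failover relationship between the two new servers.
# ScopeId, Name, and SharedSecret are example values - adjust for your environment.
Add-DhcpServerv4Failover -ComputerName DC01.opsmgr.net `
    -PartnerServer DC02.opsmgr.net `
    -Name "DC01-DC02-failover" `
    -ScopeId 10.10.10.0 `
    -LoadBalancePercent 50 `
    -SharedSecret "MySharedSecret"
```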
You will now see the failover configuration data for each scope:
Open the DHCP applet on the secondary failover DHCP server, and you should see the replicated scope and lease information:
You can de-activate the scopes on the old DHCP server. After testing and functional approval, you can remove DHCP services from the legacy DHCP server computers.
There is a new Base OS MP version 6.0.6972.0 available here: http://www.microsoft.com/en-us/download/details.aspx?id=9296
Be very careful updating to this new version – there are multiple changes and potential issues you should plan for and test with, that might impact your existing environments. I will discuss them below.
I previously wrote about the last MP update HERE and HERE. Then I wrote about some issues in the MPs with Logical Disk monitoring HERE. Additionally, there were some problems with the network monitoring utilization scripts HERE. All of these items have been addressed (somewhat) in this latest MP update.
First – lets cover the list of updates from the guide:
Changes in This Update
• Updated the Cluster shared volume disk monitors so that alert severity corresponds to the monitor state.
• Fixed an issue where the performance by utilization report would fail to deploy with the message “too many arguments specified”.
• Updated the knowledge for the available MB monitor to refer to the Available MB counter.
• Added discovery and monitoring of clustered disks for Windows Server 2008 and above clusters.
• Added views for clustered disks.
• Aligned disk monitoring so that all disks (Logical Disks, Cluster Shared Volumes, Clustered disks) now have the same basic set of monitors.
• There are now separate monitors that measure available MB and %Free disk space for any disk (Logical Disk, Cluster Shared Volume, or Clustered disk).
Note : These monitors are disabled by default for Logical Disks, so you will need to enable them if you want to use them in place of the default Logical Disk monitor for free space.
• Updated display names for all disks to be consistent, regardless of the disk type.
• The monitors generate alerts when they are in an error state. A warning state does not create an alert.
• The monitors have a roll-up monitor that also reflects disk state. This monitor does not alert by default. If you want to alert on both warning and error states, you can have the unit monitors alert on warning state and the roll-up monitor alert on error state.
• Fixed an issue where network adapter monitoring caused high CPU utilization on servers with multiple NICs.
• Updated the Total CPU Utilization Percentage monitor to run every 5 minutes and alert if it is three consecutive samples above the threshold.
• Updated the properties of the Operating System instances so that the path includes the server name it applies to so that this name will show up in alerts.
• Disabled the network bandwidth utilization monitors for Windows Server 2003.
• Updated the Cluster Shared Volume monitoring scripts so they do not log informational events.
• Quorum disks are now discovered by default.
• Mount point discovery is now disabled by default.
Notes: This version of the Management Pack consolidates disk monitoring for all types of disks as mentioned above. However, for Logical Disks, the previous Logical Disk Free Space monitor, which uses a combination of Available MB and %Free space, is still enabled. If you prefer to use the new monitors (Disk Free Space (MB) Low, Disk Free Space (%) Low), you must disable the Logical Disk Free Space monitor before enabling the new monitors. The default thresholds for the Available MB monitor are not changed; the warning threshold (which will not alert) is 500MB and the error threshold (which will alert) is 300MB. This will cause alerts to be generated for small disk volumes. Before enabling the new monitors, it is recommended to create a group of these small disks (using the disk size properties as criteria for the group), and override the threshold for available MB.
Ok, sounds good. But what does all that mean to me?
I will summarize the fundamental changes below:
1. Disk discovery and monitoring has changed. We now will UNDISCOVER any “Logical Disks” that are hosted by a Windows Server 2008 R2 cluster, and REDISCOVER those as a new entity of the “Cluster Disk” class. This discovery only pertains to Windows Server 2008 R2 and later; it does not affect Server 2008 and older clusters.
There are now THREE types of disks we will discover and monitor:
Logical Disks include disks that are not part of/hosted by a cluster, and include disks with a drive letter, and any disks without a drive letter (which are discovered as mount points).
Cluster Disks include any disk that is hosted by a Microsoft Cluster as a shared resource, but not a specific Cluster Shared Volume.
Cluster Shared Volumes are a specific type of cluster disk that is leveraged by Hyper-V clusters for placement of virtual machines.
For most customers, the impact will be if you have placed any instance or group specific overrides for your cluster disks, these will no longer apply, as these disks are going to be re-discovered as a new entity of a new class, “Cluster Disk”. This new class will have entirely different monitoring targeting it, described below.
However, this is a GOOD thing! In the past, if you had a disk that was part of a cluster, it was undiscovered and rediscovered on each NODE when a failover occurred. If you did overrides for the disk while it was on one node, your changes would no longer apply when it failed over to another node, because it was literally discovered as a different disk! (basemanagedentity) This is now resolved – the disk will retain the same BaseManagedEntityId (its unique GUID under the covers in SCOM) as it moves from node to node. It is also now “hosted” by the cluster, and not the Operating System class.
I put together a state dashboard that demonstrates these different disk types:
There are also distinct views for these that ship inside the management pack:
Another point to make here – is that the Mount Point discovery, which has been enabled in all previous Base OS MP’s, is now DISABLED. This means you will no longer discover mount points by default. You can enable this via override if you want mount point discovery, or selectively enable it only for specific servers that you know host a mount point that you wish to monitor.
Our mount point discovery is a bit misleading. We don’t only discover mount points – we use the mount point discovery to discover ANY disk that does not have a drive letter assigned. For instance, you may have noticed on your Server 2008 R2 machines that you discovered a 100MB logical disk.
These 100MB disks are the System Reserved partitions used by BitLocker to hold the boot loader. Once you upgrade to the new MP version – new mounted disks (non-clustered disks with no drive letter) will no longer be discovered, as this discovery is disabled by default. This will NOT remove the previously discovered disks, however. Neither will running Remove-DisabledMonitoringObject.

The reason that Remove-DisabledMonitoringObject does NOT remove these discovered disks is that it will only remove objects if there is an explicit *override* for a discovery, disabling it. If we change the default configuration of a discovery to disabled, the cmdlet has no impact. So if you want to remove these from your management group, you simply need to add an explicit override disabling the mount point discovery, and THEN run the cmdlet.

Keep in mind – doing this will undiscover ALL your mounted disks, possibly including real mount points if you have those. As there is ZERO value in discovering and monitoring these 100MB disks, I’d recommend disabling the mounted disk discovery with an explicit override, then creating instance specific or group specific overrides for the servers that DO host a mounted disk.
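In OpsMgr 2012, the cleanup can be run from the Operations Manager Shell once the explicit override is in place. The display name filter below is an assumption – verify you have located the correct discovery in your environment before removing anything:

```powershell
# Locate the mount point discovery (confirm the display name in your environment):
Get-SCOMDiscovery -DisplayName "*Mount Point*" | Format-Table DisplayName, Enabled

# After creating an explicit override that disables the discovery,
# remove the objects it previously discovered:
Remove-SCOMDisabledClassInstance
```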
2. Logical Disk free space monitoring, along with Cluster Disk and Cluster Shared Volume monitoring has changed. Here are the details:
The default configuration of the “Logical Disk Free Space” monitor is largely UNCHANGED from MP version 6.0.6958.0, which I wrote about HERE. This was done to create the lowest possible impact on you, the admin, who is using this monitor, and likely already has many overrides and has implemented this alert into any ticketing systems. There were many complaints that this monitor (once it was modified to allow for consecutive samples) no longer generated alerts that contained free space and MB free in the alert description. This is still the case in this version – the monitor was not modified. This monitor will also generate alerts for warning state AND critical state, which is NOT a good thing. When a single monitor generates alerts on both warning and critical state, a *new* alert is *not* generated when the monitor changes from warning to critical. We simply modify the existing alert from warning to critical (if it exists in an open state). This modification will NOT generate a new notification subscription, nor will it route the alert to a connector subscription set with a filter for “critical” severity alerts, because it has already been inspected and watermarked. For this reason I never recommend using three state monitors and alerting on a warning and a critical state.
However, another complaint we often got was that customers didn’t understand how this monitor worked, in that we inspect BOTH % free threshold AND MB free threshold, and BOTH conditions need to be met before we will change the state of the monitor and generate an alert. This is a very good design, because it helps cut out the majority of noise and remains flexible for disks of different sizes. That said, many customers would say “I just want a simple monitor to alert on % free ONLY, or MB free ONLY…” which was easier for them to understand. Therefore, we have added THREE new monitors for disk space monitoring of logical disks.
These new monitors are disabled by default, to allow customers to choose if they want to implement them. What we have done is to create two new Unit monitors, one for % free and one for MB free. Then place both of these under an aggregate rollup monitor.
If enabled, the customer can pick if they want only %, or only MB free, or both, via overrides. These new Unit monitors also provide a richer alert description as seen below:
The disk F: on computer computer1.domain.com is running out of disk space. The value that exceeded the threshold is 28 free Mbytes.
The disk F: on computer computer1.domain.com is running out of disk space. The value that exceeded the threshold is 4% free space.
Additionally, if the customer DOES want alerts on warning state for these monitors, they can enable this, and additionally enable alerting on the Aggregate rollup monitor above, to issue critical alerts only. This way, you can have unique alerting for a warning state, but if any monitor is critical, we can roll up health and generate a NEW alert for critical state, which can be used to send a notification or send to a ticketing system.
As you can see, a lot of thought went into this new design, trying to make the new format fit as many customer requested scenarios as possible. You essentially have three options now:
For Cluster Disks, and Cluster Shared Volume disks – both of those are using the new format for free disk space monitoring:
Based on this, I’d recommend considering and testing a move of your logical disk free space monitoring over to the new style as well, to have a consistent experience. I welcome your feedback on this point.
***Note – if you enable the new Logical Disk free space monitors, the MB Free monitor will go into a critical state for any Logical disk that is under 2GB (non-system) or 500MB (system). This means if you have any tiny disks, such as the 100MB bitlocker disks, this monitor will alert on all of those disks, potentially creating a large number of alerts. I’d recommend undiscovering those 100MB disks (see #1 above) or create a dynamic group of disks in your override MP, based on “size is less than a specific numerical size”, and use this group to disable free space monitoring.
3. The previous “Cluster Shared Volume” MP, which was “Microsoft.Windows.Server.ClusterSharedVolumeMonitoring.mp”, has a new display name of “Windows Server Cluster Disks Monitoring”, and the new classes for Cluster disks mentioned above are included in this MP. So if you didn’t import it previously because you weren't using Hyper-V Cluster Shared Volumes, you need this MP now to discover and monitor clustered disks.
4. We have disabled the Network Utilization scripts by default on Server 2003, and fixed them for Server 2008 to make them consume fewer resources. I wrote about this previously HERE. This should now be addressed, so if you previously disabled these, but want that counter for alerting or perf collection, you can consider enabling it. It should REMAIN disabled for Windows 2003, as there is an issue with Netman.dll which causes services to crash.
5. The “Total CPU Utilization Percentage” monitor was changed. In previous management packs, it would inspect the value every 2 minutes, and if the AVERAGE of 5 samples for “CPU Queue length” AND “% Processor Time” were over their default thresholds, we would generate an alert. Now, we inspect the value every 5 minutes, and if the AVERAGE of 3 samples for both counters is over the thresholds, then an alert is generated. I am told this change was made on customer request, I have to assume to spread the evaluation out over a longer time span…. not really sure. Seems fairly insignificant.
Known Issues/Things to remember:
1. Which MP’s to import: This MP update contains the following files:
Don’t import management packs that you don’t need or use.
Don’t import the BPA management pack if you don’t want to see alerts for this new feature.
Don’t import the Microsoft.Windows.Server.Reports.mp if your back-end SQL is still running SQL 2005; this MP is supported on SQL 2008 and newer only. It will cause your reporting to break if you import this MP and your management group leverages SQL 2005 on the back-end.
DO import the Microsoft.Windows.Server.ClusterSharedVolume.mp because this contains the discovery and monitoring for Cluster Disks, not just Cluster Shared Volumes. If you don’t import this your monitoring of clustered disks will disappear.
2. The knowledge for the Total CPU Utilization Percentage is incorrect – the monitor was updated to a default value of 3 samples but the knowledge still reflects 5 samples.
3. There are no free space perf collection rules for “Cluster Disks”. We have multiple performance collection rules for Logical Disks, and for Cluster Shared Volumes, however there are none for the new Cluster Disks class. If you want performance reports on free space, disk latency, idle time, etc, you will need to create these.
4. Perf collection and disk monitoring for cluster disks and CSVs only works when the resource group hosting the disks is on the same node that is hosting the cluster name (quorum) resource. If the disk’s resource group is running on a different node than the cluster name itself, perf collection and monitoring will cease.
Grooming of the OpsDB in OpsMgr 2012 is very similar to OpsMgr 2007. Grooming is called once per day at 12:00am, by the rule “Partitioning and Grooming”. You can search for this rule in the Authoring space of the console, under Rules. It is targeted at the “All Management Servers Resource Pool” and is part of the System Center Internal Library.
It calls the “p_PartitioningAndGrooming” stored procedure. This SP calls two other SP's: p_Partitioning and then p_Grooming
p_Partitioning inspects the table PartitionAndGroomingSettings, and then calls the SP p_PartitionObject for each object in the PartitionAndGroomingSettings table where "IsPartitioned = 1" (note - we partition event and perf into 61 daily tables - just like MOM 2005/SCOM 2007)
The PartitionAndGroomingSettings table:
The p_PartitionObject SP first identifies the next partition in the sequence, truncates it to make sure it is empty, and then updates the PartitionTables table in the database, to update the IsCurrent field to the next numeric table for events and perf. It also sets the current time as the partition end time in the previous “is current” row, and sets the current time in the partition start time of the new “is current” row. Then it calls the p_PartitionAlterInsertView sproc, to make new data start writing to the “new” current event and perf table.
To review which tables you are writing to - execute the following query: select * from partitiontables where IsCurrent = '1'
A select * from partitiontables will show you all 61 event and perf tables, and when they were used. You should see a PartitionStartTime updated every day – around midnight (time is stored in UTC in the database). If partitioning is failing to run, then we won't see this date changing every day.
Ok - that's the first step of the p_PartitioningAndGrooming sproc - Partitioning. Now - if that is all successful, we will start grooming!
The p_Grooming is called after partitioning is successful. One of the first things it does - is to update the InternalJobHistory table. In this table - we keep a record of all partitioning and grooming jobs. It is a good spot check to see what's going on with grooming. To have a peek at this table - execute a select * from InternalJobHistory order by InternalJobHistoryId DESC
The p_Grooming sproc then calls p_GroomPartitionedObjects
p_GroomPartitionedObjects will first examine the PartitionAndGroomingSettings and compare the “days to keep” column value, against the current date, to figure out how many partitions to keep vs groom. It will then inspect the partitions (tables) to ensure they have data, and then truncate the partition, by calling p_PartitionTruncate. A truncate command is just a VERY fast and efficient way to delete all data from a table without issuing a highly transactional DELETE command. The p_GroomPartitionedObjects sproc will then update the PartitionAndGroomingSettings table with the current time, under the GroomingRunTime column, to reflect when grooming last ran.
Next - the p_Grooming sproc continues, by calling p_GroomNonPartitionedObjects.
p_GroomNonPartitionedObjects is a short, but complex sproc – in that it calls all the individual sprocs listed in the PartitionAndGroomingSettings table where IsPartitioned = 0. The following stored procedures are present in my database as non-partitioned data:
Now, for the above sprocs, each one could potentially return a success or failure. They will also likely call additional sprocs for specific tasks. You can see, the rabbit hole is deep. This is just an example of the complexity involved in self-maintenance and grooming. If you are experiencing a grooming failure of any kind, and the error messages involve any of the above stored procedures when you execute p_PartitioningAndGrooming manually, you should open a support case with Microsoft for troubleshooting and resolution.

The theory is that each of the above procedures grooms a specific non-partitioned dataset. Under NORMAL circumstances, each should be able to complete in a reasonable time frame. The challenge becomes evident when you have something go wrong, like alert storms, state change event storms from monitors flip-flopping, lots of performance signature data from using self-tuning threshold monitors, or huge amounts of pending SDK datasource data from large Exchange 2010 environments or other MP’s that might leverage this.

Grooming non-partitioned data is slow, and highly resource intensive and transactional. These are specific delete statements from tables directly, often combined with creating temp tables in TempDB. Having a good pre-sized, high performance TempDB can help, as will ensuring you have plenty of transaction log space for the database, and having the disk subsystem offer as many IOPS as possible. http://technet.microsoft.com/en-us/library/ms175527(v=SQL.105).aspx
Next - the p_Grooming sproc continues, by updating the InternalJobHistory table, to give it a status of success (StatusCode of 1 = success, 2 = failed, 0 appears to mean never completed).
If you ever have a problem with grooming - or need to get your OpsDB database size under control - simply reduce the data retention days, in the console, under Administration > Settings > Database Grooming. To start with, I recommend setting all of these to just 2 days, from the default of 7. This keeps your OpsDB under control until you have time to tune all the noise from the MP's you import. So just reduce this number, then open up query analyzer and execute: EXEC p_PartitioningAndGrooming. When it is done, check the job status by executing: select * from InternalJobHistory order by InternalJobHistoryId DESC. The last groom job should be present, and successful. The OpsDB size should be smaller, with more free space. And to validate, you can always run my large table query, found at: Useful Operations Manager 2007 SQL queries
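The manual grooming check described above can be collected into a single T-SQL batch to run against the OpsDB (column names follow the tables discussed earlier; verify against your database version):

```sql
-- Run against the OperationsManager database after reducing retention days.
EXEC p_PartitioningAndGrooming

-- Verify the job completed (StatusCode 1 = success, 2 = failed):
SELECT * FROM InternalJobHistory ORDER BY InternalJobHistoryId DESC

-- Confirm the grooming timestamps were updated for each dataset:
SELECT ObjectName, DaysToKeep, GroomingRunTime
FROM PartitionAndGroomingSettings WITH (NOLOCK)
```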
There are several examples in blogs on how to create a generic text log rule to monitor for a local text file (Unicode, ASCII, or UTF8).
This will be a step-by-step example of doing the same, however, using this to monitor the log file on a remote UNC path instead of a local drive. This is useful when we want to monitor a file/files on a NAS or a share that is hosted by a computer without an agent.
This is a bit unique… instead of applying this rule to ALL systems that might have a specific logfile present in a specific directory – we are going to target this rule to only ONE agent. This agent will monitor the remote fileshare, similar to the concept of a “Watcher Node” for a synthetic transaction. Therefore we will be creating this rule disabled, and enabling it only for our “Watcher”.
In the Ops console – select the Authoring pane > Rules.
Right click Rules, and select Create a new rule. We will choose the Generic Text Log for this example:
Choose the appropriate MP to save this new custom rule to, and click Next.
For this rule name – I will be using “Company Name – Monitor remote logfile rule”
Set the Rule Category to “Alert”
For the target – I like to use “Windows Server Operating System” for generic rules and monitors.
UNCHECK the box for “Rule is enabled”
Click Next.
The directory will be the UNC path. Mine is “\\VS2\Software\Temp”
The pattern will be the logfile(s) you want to monitor. We can use a specific file, such as “logfile.log” or a wildcard, such as “*.log”.
You should not check the “UTF8” box unless you know the logfile to be UTF8 encoded.
On the event expression, click Insert for a new line. Essentially – log file monitors look at each new line in a logfile as one object to read, and this is represented by “Params/Param[1]” This “Parameter 1” is the entire line in the logfile, and is the only value that is valid for this type of monitor – so just type/paste that in the box for Parameter Name.
Since we want to search the logfile line for a specific word, the Operator will be “Contains”.
For the value – this can be the word you are looking for in the line, that you want to alert on. For my example, I will use the word “failed”.
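To visualize what the expression does, here is a small PowerShell sketch (the sample lines and search word are illustrative): each new line written to the log is evaluated as Param[1], and the rule fires whenever that line contains the value.

```powershell
# Each new line appended to the logfile is read as Param[1].
$newLines = @(
    "2013-01-01 10:00:01 service started",
    "2013-01-01 10:00:05 connection failed to host VS2"
)

# Operator "Contains" with value "failed" - evaluated once per line:
$matched = $newLines | Where-Object { $_ -like "*failed*" }

# Every matching line would raise its own alert, which is why
# alert suppression (configured later) matters.
```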
On the alert screen – we can customize the alert name if desired, set the severity and priority, and build a better Alert Description. If you are using SP1 – the default alert description is blank. If you are using R2 – the default alert description is “Event Description: $Data/EventDescription$” HOWEVER – this is an invalid event variable for this type of event (logfile)…. so we need to change that right away. I keep a list of common alert description strings HERE
For this – I will recommend the following alert description. Feel free to customize to make good sense out of your alert:
Logfile Directory : $Data/EventData/DataItem/LogFileDirectory$ Logfile name: $Data/EventData/DataItem/LogFileName$ String: $Data/EventData/DataItem/Params/Param[1]$
Click “Create” to create the rule.
Find the rule you just created in the console – right click it and choose “Properties”. On the Configuration tab, under responses (to the right of “Alert”) click Edit.
Click the “Alert Suppression” button. You should consider adding in alert suppression on specific fields of an alert – in order to suppress a single alert for each match in the logfile. If you don't – should the monitored logfile ever get flooded with lines containing “failed” from the application writing the log – SCOM will generate one alert for each line written to the log. This has the potential to flood the SCOM database/Console with alerts. By setting alert suppression here – we will create one alert, and increment the repeat count for each subsequent line/alert. I am going to suppress on LoggingComputer and Parameter 1 for this example:
Click OK several times to accept and save these changes to the rule.
Now – we created this rule as disabled – so we need to enable it via an override. I will find the rule in the console – and override the rule “For a specific object of class: Windows Server Operating System”
Now – pick one of these machines to be the “watcher” for the logfile in the remote share.
**Note – the default agent action account will make the connection to the share and read the file. In my case – the default agent action account is “Local System” so this will be the domain computer account of the “Watcher” agent which connects to the remote share and reads the file. This account will need access to the share, folder, and files monitored. Keep that in mind.
Set the override to “Enabled = True” and click OK.
At this point, our Watcher machine will download the management pack again with the newly created override, and apply the new config. Once that is complete – it will begin monitoring this file. You can create a log file in the share path, and then write a new line with the word “failed” in it. You need a carriage return after writing the line for SCOM to pick up on the change.
You should see a new alert pop in the console, based on matching the criteria. Subsequent log file matches will only increment the repeat count. Customize the alert suppression as it makes sense for you.
Then – create additional rules just like this – for different UNC paths.
As you deploy the latest OpsMgr R2 core MP updates version 6.1.7599.0 which I blogged about HERE, you will probably notice a new script error popping up in your environment:
Alert Description:
The process started at 10:36:57 AM failed to create System.PropertyBagData. Errors found in output:
C:\Program Files\System Center Operations Manager 2007\Health Service State\Monitoring Host Temporary Files 2\4885\SCOMpercentageCPUTimeCounter.vbs(125, 5) SWbemRefresher: Invalid class
Command executed: "C:\Windows\system32\cscript.exe" /nologo "SCOMpercentageCPUTimeCounter.vbs" servername.domain.com false 3 Working Directory: C:\Program Files\System Center Operations Manager 2007\Health Service State\Monitoring Host Temporary Files 2\4885\
One or more workflows were affected by this. Workflow name: Microsoft.SystemCenter.HealthService.SCOMpercentageCPUTimeMonitor Instance name: servername.domain.com Instance ID: {50E57AC1-08CC-6E7E-149A-1E8690881BBD} Management group: MGNAME
This is caused by a new script based monitor (Agent Processor Utilization) and collection rule (Collect agent processor utilization) that was added to this core MP… which measures the agent CPU impact, including the Healthservice, MonitoringHost, and all ancillary processes spawned by the SCOM agent. This monitor and rule targets the Health Service class:
Both the monitor and rule share the same script datasource – so make sure if you override ANYTHING on one, you make the SAME override on the other…. otherwise you will break cookdown for the datasource.
The problem is that a fair percentage of Windows Servers will, for some reason, randomly have a problem with WMI health, or (more likely) will not have the WMI performance counters for Perfproc.dll enabled, which this script needs in order to measure the CPU.
This is not a big deal… but until you fix this, the monitor won't work, and it will throw these script errors on a regular basis.
The good thing is – this issue is fully documented in the guide that ships with these new MPs. You DID read the guide first – didn't you? :-) The guide lists many possible steps to go through in order. Below, I will discuss the most common resolution I am seeing in the field.
So – gather a list of all servers throwing this specific script error for “Invalid class” on “SCOMpercentageCPUTimeCounter.vbs”, then make plans to fix them. This will likely require a reboot – so keep that in mind. Another alternative is to fix them now, and then let them get rebooted on their next patching cycle if that works better for you.
What is typically seen is a missing WMI class, due to the WMI perf counter being disabled. Right now this appears pretty random. I have three VMs, all built from the same media on the same day, and one out of those three had this issue. I recently worked with a customer who had 4 machines out of 16 missing this perf counter. Here is an example of the WMI class we are looking for: (Run wbemtest, connect to root\cimv2, then click Enum Classes, select recursive, OK)
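You can also check for the class without wbemtest. A quick sketch (assuming you can reach the target over WMI, and that PowerShell is available on your workstation) is to query the perf class remotely; an “Invalid class” style error indicates the PerfProc counters are missing:

```powershell
# Query the PerfProc WMI class remotely; if PerfProc is disabled,
# this fails with an "Invalid class" error instead of returning instances.
Get-WmiObject -Namespace "root\cimv2" -Class Win32_PerfRawData_PerfProc_Process -ComputerName SERVERNAME |
    Select-Object -First 3 Name, PercentProcessorTime
```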
Luckily – the fix is VERY simple. There is a tool you can download and install on your workstation – and remotely connect to each machine and fix them:
http://www.microsoft.com/downloads/details.aspx?familyid=7ff99683-b7ec-4da6-92ab-793193604ba4&displaylang=en
Using this tool – connect to \\servername and click refresh. Scroll down in the list until you find “PerfProc perfproc.dll”. What you will likely find is that this class is disabled.
Simply check the box to enable it… and then reboot the machine at your convenience.
This will persist the class in WMI and the script error should go away.
Other errors from this script:
Now – if you are getting some OTHER error – not “Invalid Class”… this is likely an environmental problem with your server. I would walk through all the steps called out in the guide for this issue. If those don't work – then try some of these:
If you don't get an XML property bag returned, then something is broken. Check whether an error is returned; if nothing comes back at all, make sure you gave the script the correct expected parameters as above, then start debugging the script.
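To test the script manually on a problem server, run it with the same parameters SCOM uses. These are taken straight from the “Command executed” line in the alert above (the temp folder number will differ on your server); a healthy run writes a property bag XML blob to standard output:

```powershell
# Run the script by hand, from the Monitoring Host temp folder that holds it.
# The "4885" folder name varies per server - copy the path from your own alert.
cd "C:\Program Files\System Center Operations Manager 2007\Health Service State\Monitoring Host Temporary Files 2\4885"
cscript.exe /nologo "SCOMpercentageCPUTimeCounter.vbs" servername.domain.com false 3
```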
Another example I am seeing is the following:
The process started at 6:56:10 PM failed to create System.PropertyBagData. Errors found in output:
C:\Program Files\System Center Operations Manager 2007\Health Service State\Monitoring Host Temporary Files 2\1220\SCOMpercentageCPUTimeCounter.vbs(124, 5) Microsoft VBScript runtime error: ActiveX component can't create object: 'WbemScripting.SWbemRefresher'
This happens on Windows 2000 servers. This note came from a reader comment: It appears from http://msdn.microsoft.com/en-us/library/aa393838(VS.85).aspx that this Script API is only supported on Windows XP/Windows Server 2003 and up – so it is likely that this script will not work on Windows 2000 Servers. If that is the case, you can disable this monitor AND rule for all Windows 2000 Computers, by overriding this monitor AND rule – for a group – and choosing the “Windows 2000 Server Computer Group”. That should make these errors go away for old legacy systems you might still be monitoring.
The time has come to move my Warehouse Database and OpsMgr Reporting Server role to a new server in my lab. Today, both roles are installed on a single server (named OMDW). This server is running Windows Server 2008 SP2 x86, with the SQL 2008 SP1 DB engine and SQL Reporting (32-bit to match the OS). This machine is OLD and only has 2GB of memory, so it is time to move it to a 64-bit capable machine with 8GB of RAM. The old server was really limited by the available memory, even for testing in a small lab. As I do a lot of demos in this lab, I need reports to be a bit snappier.
The server it will be moving to is running Server 2008 R2 (64bit only) and SQL 2008 SP1 (x64). Since Operations Manager 2007 R2 does not yet support SQL 2008R2 at the time of this writing – we will stick with the same SQL version.
We will be using the OpsMgr documentation – from the Administrator's Guide:
http://technet.microsoft.com/en-us/library/cc540402.aspx
So – I map out my plan.
Move the Data Warehouse DB:
Using the TechNet documentation, I look at the high level plan:
Sounds easy enough. (gulp)
Now – I follow the guidance in the guide to check to make sure the move is a success. Lots of issues can break this – missing a step, misconfiguring SQL rights, firewalls, etc. When I checked mine – it was actually failing. Reports would run – but lots of failed events on the RMS and management servers. Turns out I accidentally missed a step – editing the DW DB table for the new name. Once I put that in and bounced all the services again – all was well and working fine.
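For reference, the step I missed was the SQL update in the OperationsManager database that points it at the new warehouse server. A heavily hedged sketch follows: the server/instance names are placeholders, and the exact column name in the MT_DataWarehouse table can carry a GUID suffix depending on build, so inspect the table and follow the TechNet article rather than copying this verbatim:

```powershell
# SKETCH ONLY - verify the actual column name in dbo.MT_DataWarehouse first;
# it may have a GUID suffix. Server, instance, and new DW names are placeholders.
sqlcmd -S OPSDBSERVER\INSTANCE -d OperationsManager -Q "UPDATE dbo.MT_DataWarehouse SET MainDatabaseServerName = 'NEWDWSERVER'"
```

After making the change, bounce the OpsMgr services on the RMS and management servers so they pick up the new name.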
Now – on to moving the OpsMgr Reporting role!
Hey – even fewer steps than moving the database!
***A special note – if you have authored/uploaded CUSTOM REPORTS that are not deployed/included within a management pack – these will be LOST when you follow these steps. Make sure you export any custom reports to RDL file format FIRST, so you can bring those back into your new reporting server.
Now – I follow the guide and verify that reporting is working as designed.
Mine (of course) was failing – I got the following error when trying to run a report:
Date: 8/24/2010 5:49:27 PM Application: System Center Operations Manager 2007 R2 Application Version: 6.1.7221.0 Severity: Error Message: Loading reporting hierarchy failed.
System.Net.WebException: Unable to connect to the remote server ---> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.10.10.12:80 at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress) at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Int32 timeout, Exception& exception) --- End of inner exception stack trace --- at System.Net.HttpWebRequest.GetRequestStream(TransportContext& context) at System.Net.HttpWebRequest.GetRequestStream() at System.Web.Services.Protocols.SoapHttpClientProtocol.Invoke(String methodName, Object[] parameters) at Microsoft.EnterpriseManagement.Mom.Internal.UI.Reporting.ReportingService.ReportingService2005.ListChildren(String Item, Boolean Recursive) at Microsoft.EnterpriseManagement.Mom.Internal.UI.Reporting.ManagementGroupReportFolder.GetSubfolders(Boolean includeHidden) at Microsoft.EnterpriseManagement.Mom.Internal.UI.Reporting.WunderBar.ReportingPage.LoadReportingSubtree(TreeNode node, ManagementGroupReportFolder folder) at Microsoft.EnterpriseManagement.Mom.Internal.UI.Reporting.WunderBar.ReportingPage.LoadReportingTree(ManagementGroupReportFolder folder) at Microsoft.EnterpriseManagement.Mom.Internal.UI.Reporting.WunderBar.ReportingPage.LoadReportingTreeJob(Object sender, ConsoleJobEventArgs args) System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.10.10.12:80 at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress) at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, 
Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Int32 timeout, Exception& exception)
The key part of this output is the SocketException: the console could not reach the reporting server on port 80. I forgot to open a rule in my Windows Firewall on the reporting server to allow access to port 80 for web reporting. DOH!
Now – over the next hour – I should see all my reports from all my MP’s trickle back into the reporting server and console.
Relatively pain free.
The following document will cover a basic install of System Center Virtual Machine Manager 2012 at a generic customer. This is to be used as a template only, for a customer to implement as their own pilot or POC deployment guide. It is intended to be general in nature and will require the customer to modify it to suit their specific data and processes.
SCVMM can be scaled to match the customer requirements. This document will cover a single server model, where all server roles are installed on a single VM/server.
This is not an architecture guide or intended to be a design guide in any way.
High Level Deployment Process:
1. In AD, create the following accounts and groups, according to your naming convention:
2. Add the “scvmmsvc” and “scvmmadmin” account to the “SCVMMAdmins” global group.
3. Add the domain user accounts for yourself and your team to the SCVMMAdmins group.
4. Install Windows Server 2008 R2 SP1 to all server role servers.
5. Install Prerequisites and SQL 2008 R2.
6. Install the SCVMM Server, Console, and Self Service Portal.
7. Deploy SCVMM Agent to Hyper-V hosts.
Prerequisites:
1. Install Windows Server 2008 R2 SP1 to the SCVMM server.
2. Ensure server has a minimum of 2GB of RAM.
3. Add .Net 3.5.1 and IIS role. IIS is being added to support the self service portal.
From http://technet.microsoft.com/en-us/library/bb691354.aspx open powershell (as an administrator) and run the following:
Import-Module ServerManager
<then>
Add-WindowsFeature NET-Framework-Core,Web-Static-Content,Web-Default-Doc,Web-Dir-Browsing,Web-Http-Errors,Web-Http-Logging,Web-Request-Monitor,Web-Filtering,Web-Stat-Compression,Web-Mgmt-Console,Web-Metabase,Web-Asp-Net,Web-Windows-Auth -Restart
4. Install .NET 4.0 to all servers
5. Install all available Windows Updates.
6. Join all servers to domain.
7. Add the “DOMAIN\SCVMMAdmins” domain global group and the “DOMAIN\scvmmsvc” domain account explicitly to the Local Administrators group on each SCVMM server.
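Step 7 can be scripted from an elevated prompt on each SCVMM server (substitute your own domain and the group/account names from your naming convention):

```powershell
# Add the SCVMM admins group and service account to local Administrators
net localgroup Administrators "DOMAIN\SCVMMAdmins" /add
net localgroup Administrators "DOMAIN\scvmmsvc" /add
```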
8. Install the Windows Automated Installation Kit (AIK) 2.0. http://go.microsoft.com/fwlink/?LinkID=194654
9. Install SQL 2008 R2 DB engine.
Step by step deployment guide:
1. Install SCVMM 2012:
2. Deploy an agent to an existing Hyper-V Host.
There is sometimes confusion over upgrades: whether you can simply upgrade to the next version, or whether there is a CU (Cumulative Update) or UR (Update Rollup) minimum level that is required BEFORE a major version upgrade.
The MAJOR VERSION upgrade path is below, which Marnix talked about HERE.
This article covers IN PLACE upgrades only, not side by side migrations.
SCOM 2007R2 > SCOM 2012 RTM > SCOM 2012 SP1 > SCOM 2012 R2
No shortcuts… you have to upgrade to each major version before continuing to the next. So if you are way behind, considering a side-by-side migration can be MUCH less work. Depends on where you are at.
Ok, back to the Update Rollup requirements. Technically speaking, the upgrade from SCOM 2007 R2 > SCOM 2012 RTM has a documented CU level. Per http://technet.microsoft.com/en-us/library/hh476934.aspx which states the following:
Upgrading to System Center 2012 – Operations Manager is supported from Operations Manager 2007 R2 CU4, or from the latest available CU.
Which basically means if you are at CU4 or later on SCOM 2007R2, you can upgrade to SCOM 2012 RTM. There are no other statements in TechNet which state that any of the SCOM 2012 SP1 or R2 upgrades have a minimum UR requirement.
Next, the upgrade from 2012 RTM to 2012 SP1 has this page: http://technet.microsoft.com/en-us/library/jj628203.aspx This “recommends” updating to SCOM 2012 RTM UR2 (or the latest UR available.) I cannot say this is a requirement, it is likely what was tested at the time.
Same for SCOM 2012 SP1 to SCOM 2012 R2. http://technet.microsoft.com/en-us/library/dn521010.aspx This page states “We recommend that you update all of the System Center 2012 SP1 components with the most current update rollups.” The most current UR at the time of that publishing was SCOM 2012 SP1 UR4.
Therefore, it appears the “required” upgrade path looks like:
SCOM 2007R2 CU4+ > SCOM 2012 RTM > SCOM 2012 SP1 > SCOM 2012 R2
Our “Recommended” rolling upgrade path looks like the following:
SCOM 2007R2 CU4+ > SCOM 2012 RTM UR2+ > SCOM 2012 SP1 UR4+ > SCOM 2012 R2
If I were performing a rolling upgrade, this is most likely how I'd recommend doing it. If you are planning a VERY SLOW migration from one version to another, due to lots of additional work that is necessary, such as upgrading OS's or SQL versions along the way, then you might consider going ahead and applying whatever the most recent Update Rollup is for SCOM 2012. These are documented here:
http://support.microsoft.com/kb/2906925
One word of caution. The latest word from the product group just came out, on supported interop scenarios:
http://blogs.technet.com/b/momteam/archive/2014/01/17/system-center-2012-operations-manager-supported-configurations-interop-etc.aspx
They specifically called out one little point:
*Latest CU or UR applies in all cases
This most likely means this is what they tested at the time of that posting. I can't say this is a “requirement”. If you are doing a rolling upgrade, applying the latest UR to SCOM 2012 RTM before upgrading to SCOM 2012 SP1, and then applying the latest UR to SCOM 2012 SP1 before upgrading to SCOM 2012 R2, would be a lot of extra effort, and technically it is not required to make the upgrade. The best decision would be to figure out how long you plan to stay at each stage, how large your management group is, and how much effort it would take to deploy the latest UR in each case. Also, test this in your environment before rolling out the upgrade to production.
The upgrade from 2007 R2 CU4+ to SCOM 2012 RTM is the biggest jump, because you must update ALL your agents to SCOM 2012 before finalizing the upgrade. After that, you can technically leave your agents alone, since a SCOM 2012 RTM agent can report to both SCOM 2012 SP1 and R2 management servers. Then just update your agents at the end, to SCOM 2012 R2 (plus whatever the latest UR is at that time).
Resource links:
SCOM 2007R2 > SCOM 2012 RTM Upgrade Guide TechNet
SCOM 2012 RTM > SCOM 2012 SP1 Upgrade Guide TechNet
SCOM 2012 SP1 > SCOM 2012 R2 Upgrade Guide TechNet
There are a TON of really good blogs out there with upgrade experiences, tips and tricks, so I can't possibly list them all. I did want to point out a really cool link from a colleague, Wei H Lim, who wrote an updated version of the “Upgrade Helper” MP, but for moving from SCOM 2012 SP1 > SCOM 2012 R2. He does some amazing work, so if you don't follow his blog, definitely add him to your list.
OpsMgr- Sample Upgrade Helper MP for 2012 SP1 to 2012 R2 (SUHMP2012R2)
System Center Orchestrator 2012 SP1 is extremely easy to setup and deploy. There are only a handful of prerequisites, and most can be handled by the setup installer routine.
The TechNet documentation does an excellent job of detailing the system requirements and deployment process:
http://technet.microsoft.com/en-us/library/hh420337.aspx
The following document will cover a basic install of System Center Orchestrator 2012 at a generic customer. This is to be used as a template only, for a customer to implement as their own pilot or POC deployment guide. It is intended to be general in nature and will require the customer to modify it to suit their specific data and processes.
SCORCH can be scaled to match the customer requirements. This document will cover a typical two server model, where all server roles are installed on a single VM, and utilize a remote database server or SQL cluster.
Definitions:
SCORCH – System Center Orchestrator
Server Names\Roles:
SCORCH – Orchestrator 2012 role server (Management Server, Runbook Server, Orchestrator Web Service Server, Runbook Designer client application)
DB1 – SQL 2012 Database Engine server
Windows Server 2012 will be installed as the base OS for all platforms. All servers will be a member of the AD domain.
SQL 2012 will be the base standard for all database services. SCORCH only requires a SQL DB engine (locally or remote) in order to host SCORCH databases.
a. DOMAIN\scorchsvc – SCORCH Management, Runbook, and Monitor account
b. DOMAIN\ScorchUsers – SCORCH users security global group
c. DOMAIN\sqlsvc – SQL service account
2. Add the domain user accounts for yourself and your team to the ScorchUsers group.
3. Install Windows Server 2012 to all server role members.
4. Install Prerequisites.
5. Install the SCORCH Server.
1. Install Windows Server 2012 on all servers.
2. Join all servers to domain.
3. Ensure SCORCH server has a minimum of 1GB of RAM.
4. On the SCORCH server, .Net 3.5SP1 is required. Setup will not be able to add this feature on Windows Server 2012. Open an elevated PowerShell session (run as an Administrator) and execute the following:
Add-WindowsFeature NET-Framework-Core
6. On the SCORCH server, .NET 4.0 is required. This is included in the WS2012 OS (.NET 4.5).
7. Install all available Windows Updates as a best practice.
8. Add the “DOMAIN\scorchsvc” domain account explicitly to the Local Administrators group on the SCORCH server.
9. Add the “DOMAIN\ScorchUsers” global group explicitly to the Local Administrators group on the SCORCH server.
10. On the SQL database server, install SQL 2012.
1. Install SCORCH 2012:
2. Open the consoles.
Post install procedures:
1. Let's register and then deploy Integration Packs, which enable Orchestrator to connect to many outside systems.
Download the toolkit, add-ons, and IP’s for SCORCH 2012 SP1.
Additionally – you can download more IP’s at:
http://technet.microsoft.com/en-us/library/hh295851.aspx
Such as the VMware VSphere IP, or the IBM Netcool IP.
Additionally – check out Charles Joy’s blog on popular codeplex IP’s which have been updated for Orchestrator:
http://blogs.technet.com/b/charlesjoy/
Event Type: Warning
Event Source: OpsMgr SDK Service
Event Category: None
Event ID: 26371
Date: 12/13/2007
Time: 2:58:24 PM
User: N/A
Computer: RMSCOMPUTER
Description:
The System Center Operations Manager SDK service failed to register an SPN. A domain admin needs to add MSOMSdkSvc/rmscomputer and MSOMSdkSvc/rmscomputer.domain.com to the servicePrincipalName of DOMAIN\sdkaccount
This seems to appear in the RC1-SP1 build of OpsMgr.
Every time the SDK service starts, it tries to update the SPNs on the AD account that the SDK service runs under. It fails because, by default, a user cannot update its own SPNs. Therefore we see this error logged.
If the SDK account is a domain admin – it does not fail – because a domain admin would have the necessary rights. Obviously – we don’t want the SDK account being a domain admin…. That isn’t required nor is it a best practice.
Therefore – to resolve this error, we need to allow the SDK service account rights to update the SPN. The easiest way, is to go to the user account object for the SDK account in AD – and grant SELF to have full control.
A better, more granular way – is to only grant SELF the right of modifying the SPN:
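One way to grant just that right from the command line is dsacls. This is a sketch, not the only method: adjust the distinguished name to wherever your SDK account actually lives in AD (the OU below is a placeholder):

```powershell
# Grant SELF read/write on the servicePrincipalName attribute only
# (RP = read property, WP = write property). DN is a placeholder - use your own.
dsacls "CN=sdkaccount,OU=ServiceAccounts,DC=DOMAIN,DC=COM" /G "SELF:RPWP;servicePrincipalName"
```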
To check SPNs:
The following command will show all the HealthService SPNs in the domain:
Ldifde -f c:\ldifde.txt -t 3268 -d DC=DOMAIN,DC=COM -r "(serviceprincipalname=MSOMHSvc/*)" -l serviceprincipalname -p subtree
To view SPNs for a specific server:
"setspn -L servername"
Update Rollup 2 (UR2) for OpsMgr 2012 SP1 has shipped. This post will be a simple walk-through of applying it. This hotfix is included on my Hotfix page for SCOM: http://blogs.technet.com/b/kevinholman/archive/2009/01/27/which-hotfixes-should-i-apply.aspx
Description and download location:
http://support.microsoft.com/kb/2802159
Description of fixes in this release:
Unix and Linux fixes:
This Update Rollup is also required if you want to use the new System Center Advisor Connector: http://blogs.technet.com/b/momteam/archive/2013/04/09/system-center-advisor-connector-for-operations-manager-preview.aspx
That’s a LOT. Looks like some very important ones as well…. so let's get this one tested in our labs!
Download the update:
You can get this update “partially” applied by using Windows Update. However, since there are manual steps involved, and a specific recommended order of operations, I don’t really recommend using Windows Update in general. It is certainly an option, however.
To download all of the updates, you will need to click the link in the KB above, which will launch the catalog for the individual downloads.
You’ll notice some of these updates are a LOT bigger than the previous ones in UR1.
I also notice there is now an update for the “Console” which is new from UR1. The original release of UR2 was missing the update for the Gateway, which is now included and available to make UR2 truly “cumulative”.
Add these to your “basket” then “view basket” and choose a download location.
Build a plan:
Following the KB – the installation plan looks something like this:
***Note: One of the things you will notice – is that there is no update available for reporting servers. We will skip the reporting role.
My new list looks like:
Since I am monitoring Linux systems, I’ll need to add steps for that from the KB:
(The Unix/Linux MP location isn't available, and the previous location hasn’t been updated yet. So this part is still under investigation as well. I will update this section when I clear this part up)
Seems simple enough, lets get started.
Install the update rollup package
On the catalog site, I add all the updates to my basket, and click View Basket, and Download.
Next I copy these files to a share that all my SCOM servers have access to. These are actually .CAB files, so I will need to extract the MSPs from these CAB files.
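The built-in expand.exe handles the extraction. The file names below are illustrative, not the actual download names:

```powershell
# Extract all files from a downloaded .cab into a destination folder
# (cab name and paths are placeholders - use your actual downloaded files)
expand.exe -F:* "C:\Downloads\SCOM2012SP1-UR2-Server.cab" "C:\Downloads\Extracted"
```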
Once I have the MSP files, I am ready to start applying the update to each server by role.
***Note: You MUST log on to each server role as a Local Administrator, SCOM Admin, AND your account must also have System Administrator (SA) role to the database instances that host your OpsMgr databases.
My first server is a management server, and the web console, and has the OpsMgr console installed, so I copy those update files locally, and execute them per the KB, from an elevated command prompt:
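Running an MSP from an elevated prompt amounts to something like the following; the file name is illustrative, so use the actual MSP names from your extract:

```powershell
# Apply the server role update MSP (file name is a placeholder)
msiexec.exe /update "C:\Downloads\Extracted\KB2802159-AMD64-Server.msp"
```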
This launches a quick UI which applies the update. It will bounce the SCOM services as well. The update does not provide any feedback on success or failure; you can check the Application event log for the MsiInstaller events for that.
You can also spot check a couple DLL files for the file version attribute.
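PowerShell makes the file version spot check quick. The path below assumes a default SCOM 2012 install location; adjust to your environment:

```powershell
# List file versions of the management server DLLs to confirm the update applied
# (default install path assumed - adjust if you installed elsewhere)
Get-ChildItem "C:\Program Files\System Center 2012\Operations Manager\Server\*.dll" |
    ForEach-Object { "{0}  {1}" -f $_.Name, $_.VersionInfo.FileVersion }
```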
Next up – run the Web Console update:
This runs much faster. A quick file spot check:
Lastly – install the console update:
Well, this one required a reboot. The KB article instructed: “If you do not want to restart the computer after you apply the console update, close the console before you apply the update for the console role.” However, my console WAS closed, so be prepared that these files might be locked and require a reboot.
After the reboot – a quick file spot check:
I now move on to my additional management servers, applying the server update, then the console update. My additional management servers did not require a reboot after the console update.
Next, I update the gateways.
The update launches a UI and quickly finishes.
I do a spot-check to ensure the right files were dropped. First I will check the Agent update files in C:\Program Files\System Center Operations Manager\Gateway\AgentManagement\
Then I will spot check the DLL’s:
Manually import the management packs?
We have two updated MPs to import (MAYBE!).
These MP bundles are only used for specific scenarios, such as Global Service Monitoring, or DevOps scenarios where you have integrated APM with TFS, etc. If you are not currently using these MPs, there is no need to import or update them. The IntelliTrace MP will actually fail to import if you are not using these, because of a dependency. I'd skip this MP import unless you already have these MPs present in your environment.
Apply the agent update
Approve the pending updates in the Administration console for pushed agents. Manually apply the update for manually installed agents.
100% success rate.
Be sure to check the “Agents By Version” view to find any agents that did not get patched:
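From the Operations Manager Shell, something like this gives the same breakdown (assuming the OperationsManager module is loaded, which the Shell does for you):

```powershell
# Group agents by version to spot any that missed the update
Get-SCOMAgent | Group-Object Version | Sort-Object Count -Descending |
    Select-Object Count, Name
```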
***Note! The agents behind a gateway are NOT placed into pending actions for an update. These agents will need to be “repaired” via the administration console, or use Windows Update. On my agents behind a GW – a repair worked perfectly.
Update Unix/Linux MPs
Next up – I download and extract the updated Linux MPs for SCOM 2012 SP1 UR2
(The link in the KB article doesn’t work at the time of this writing – here is the correct link)
http://www.microsoft.com/en-us/download/details.aspx?id=29696
7.4.3507 is SCOM 2012 SP1.
7.4.4112.0 is SCOM 2012 SP1 with UR1.
7.4.4119.0 is SCOM 2012 SP1 with UR2.
Download the MSI and run it. It will extract the MP’s to C:\Program Files (x86)\System Center Management Packs\System Center 2012 MPs for UNIX and Linux (7.4.4199.0)
Import the files in the 2012 SP1 folder, and the following:
Microsoft.Unix.ConsoleLibrary.mp
Microsoft.Unix.Process.Library.mpb
Microsoft.Unix.ShellCommand.Library.mpb
Also add any platform specific MP’s for versions on Unix or Linux in your monitoring environment.
You will likely observe high CPU utilization of your management servers during these MP imports. Give it time to complete the process of the import and MPB deployments.
Next up – you would upgrade your agents on the Unix/Linux monitored agents. You can now do this straight from the console:
You can input credentials or use existing RunAs accounts if those have enough rights to perform this action.
Lastly – refer to the KB article for the UR1 update: if you are a heavy user of Linux process monitoring using our template, additional steps are required to address the fixes. You must open, edit, and re-save any process templates that you had previously created in order to apply the fixes to each.
Now at this point, we would check the OpsMgr event logs on our management servers, check for any new or strange alerts coming in, and ensure that there are no issues after the update.
Known issues:
See the existing list of known issues documented in the KB article.
Additional:
1. Agents behind a Gateway will not be placed into pending management for an update. If you are using Windows Update/WSUS/SCCM to update your agents, then no steps are necessary, as they will receive the agent update automatically.
2. OM12 SP1 UR#2 Web Console Error: System.Reflection.ReflectionTypeLoadException: [ReflectionTypeLoad_LoadFailed]. Savision has released updated versions of the Live Maps Summary Widget management packs that resolve this issue. Importing the 2 management packs into your environment should fix the problem. The updated versions can be downloaded here:
http://www.savision.com/resources/news/fix-om12-sp1-ur2-web-console
Here is an interesting little concept of how to use OpsMgr.
Because I have a lab, that is exposed to the internet over port 3389, I get a LOT of hacking attempts on this lab. Mostly the source is from bots running on other compromised systems. These bots just do brute force attacks against the typical Admin accounts and passwords via RDP. In this article, I am going to show how OpsMgr can not only alert on this condition, but also respond by configuring the Windows Firewall to block these attacks.
I will start by analyzing the Server 2008 event that occurs when someone tries to attack using my “Administrator” account:
Log Name: Security
Source: Microsoft-Windows-Security-Auditing
Date: 7/14/2009 12:44:05 PM
Event ID: 4625
Task Category: Account Lockout
Level: Information
Keywords: Audit Failure
User: N/A
Computer: terminalserver.domain.com
Description: An account failed to log on.
Subject: Security ID: SYSTEM Account Name: TERMINALSERVER$ Account Domain: DOMAIN Logon ID: 0x3e7
Logon Type: 10
Account For Which Logon Failed: Security ID: NULL SID Account Name: administrator Account Domain: TERMINALSERVER
Failure Information: Failure Reason: Account locked out. Status: 0xc0000234 Sub Status: 0x0
Process Information: Caller Process ID: 0x14f0 Caller Process Name: C:\Windows\System32\winlogon.exe
Network Information: Workstation Name: TERMINALSERVER Source Network Address: 10.10.10.1 Source Port: 1261
Detailed Authentication Information: Logon Process: User32 Authentication Package: Negotiate Transited Services: - Package Name (NTLM only): - Key Length: 0
So… for starters, I want to alert on this condition… when ANYONE is trying multiple times… to RDP into the server, with a disabled account, non-existent account, or valid account, but bad password. Therefore – I will create a monitor: Windows Events > Repeated Event Detection > Timer Reset.
The idea here is to only respond when multiple bad passwords are entered in a short time period…. representing an attack. (I don't want to lock out or block access from my normal users who sometimes mis-type their password on a couple attempts.)
So I create the monitor, target “Windows Server Operating System”, set it to “Security” for the Parent Monitor, and UNCHECK the box enabling it. (I will later override this monitor and ONLY enable it for my entry terminal server.)
I create my event expression for the security event log, event 4625, and I only want the Logon Type of 10, which is from RDP:
Next – I will set up my monitor, to Trigger on Count (of events), Sliding. Compare count will be set to 5 (events) within a 3 minute interval. Therefore, as soon as 5 events are captured, in ANY sliding 3 minute “window”, the monitor will change state.
Next… since my goal is really to execute a script/command/response (a state change is not really what I'm after), I will set the timer reset to reset the state back to healthy after 2 minutes. This will free the workflow up to block any other source IP’s which might attack soon after.
I don't want to impact availability data, which assumes critical state = unavailable…. so I will use a Warning State:
Now – I will enable a unique alert for this condition. I want a critical, high priority alert in this case, and I will set this NOT to close the alert when we auto-resolve the state on the timer. I also will customize the alert description, to give me a richer alert based on the event details and my custom response. I talk more about these event parameters HERE. I will be adding:
$Data/Context/Context/DataItem/Params/Param[6]$ typed a bad password accessing directly from computer: $Data/Context/Context/DataItem/Params/Param[14]$ from IP: $Data/Context/Context/DataItem/Params/Param[20]$ The Windows Firewall will be modified to block this IP address in response to this monitor state.
Next – I will go back and find my monitor, and add a Recovery for the Warning State:
I will choose to Run Command. Give it a name “Modify Windows Firewall”
Next – for the command – I am going to run Netsh.exe which can configure the Windows Firewall running on the terminal server. Here is the command:
C:\Windows\System32\netsh.exe
advfirewall firewall set rule name="Block RDP" new remoteip=$Data/StateChange/DataItem/Context/DataItem/Context/DataItem/Params/Param[20]$
$Data/StateChange/DataItem/Context/DataItem/Context/DataItem/Params/Param[20]$ is based on an Event Parameter of the Server 2008 event, which I will pass to the command, so it will gather the IP address of the attacker and pass that to the command which configures the firewall rule. Getting this variable was the most complicated part for me….. Marius talked about how to derive this variable HERE. Just understand that the variables you use in an alert description are not the same as those used in a diagnostic or recovery.
Cool:
My Netsh.exe command modifies an existing custom rule in the Windows Firewall, so I need to make sure I create that and name it “Block RDP”.
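Here is a sketch of that one-time setup. The rule name must match the one referenced in the recovery exactly, and the initial remoteip is just a placeholder that the recovery’s “set rule … new remoteip=” will overwrite:

```powershell
# One-time setup, run from an elevated prompt on the terminal server.
# Creates the inbound block rule that the recovery will later modify.
# 192.0.2.1 is a documentation placeholder IP, not a real attacker.
netsh advfirewall firewall add rule name="Block RDP" dir=in action=block protocol=TCP localport=3389 remoteip=192.0.2.1
```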
Now – I will override this monitor and enable it for my published terminal server, and then test it… by attempting to log into my terminal server via RDP 5 times in a short period, using a disabled account. Each attempt will log the event in the security event log, and eventually trip the repeated event detection monitor.
Alert generates:
Monitor changes state:
Recovery runs:
Windows Firewall rule gets modified:
Attack is stopped.
Pretty cool, eh?
System Center Orchestrator 2012 is extremely easy to setup and deploy. There are only a handful of prerequisites, and most can be handled by the setup installer routine.
SCORCH can be scaled to match the customer requirements. This document will cover a typical two server model, where all server roles are installed on a single VM, and utilize a remote database server or cluster.
a. DOMAIN\scorchsvc – SCORCH Mgmt, Runbook, and Monitor Account
b. DOMAIN\ScorchUsers – SCORCH users security global group
3. Install Windows Server 2008 R2 SP1 to all server role members.
4. Add the DOMAIN\scorchsvc account to the local administrators group on the SCORCH server.
5. Add the DOMAIN\ScorchUsers global group to the local administrators group on the SCORCH server.
6. Install the SCORCH Server.
1. Install Windows Server 2008R2 SP1
2. Ensure server has a minimum of 1GB of RAM.
3. .Net 3.5SP1 is required. Setup will add this feature if not installed.
4. IIS7 (IIS Role) is required. Setup will add this role if not installed.
5. .Net 4.0 is required. This must be installed manually on Server 2008 R2 SP1. Download and install this prereq.
6. Install all available Windows Updates as a best practice.
7. Join all servers to domain.
Go to http://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=28725 and download the toolkit, add-ons, and IP’s.
I recently blogged about the newly released Base OS MP: HERE
One of the things you will notice RIGHT off the bat… is that a huge percentage of your logical disks will go into a warning state, if you don't already have some sort of scheduled defragmentation set up. This will be true for virtual machines and physical machines…. anything over 10 percent file fragmentation (or the OS recommended setting) will get hit:
You will also get many warning alerts on this monitor…. the first time the condition is detected and the state changes for this monitor. This monitor checks status every Saturday, at 3:00AM by default, for all logical disks discovered.
If you don't care about this monitoring in SCOM – disable this monitor using overrides.
If you do care about seeing the state change – but don't want the alerts – turn the “Generates Alert” property to False, using overrides.
You can adjust the threshold from 10% to some other number…. but make sure you take note – this monitor will ignore the “File Percent Fragmentation” property by default, and always use the OS recommended setting. If you want to control this – you also need to set “Use OS Recommendation” to FALSE.
Here is an example of hard coding the frag percentage to 20% from the OS default:
“Use OS Recommendation” property description:
Lastly – one thing of interest…. If you want SCOM to “fix” the fragmentation issue…. it can. There is a recovery on this very monitor that can run a VBScript that will run a defrag job against your logical disks. It is disabled by default.
Keep in mind – if you turn on this defrag on your physical boxes – it won’t be a big deal… it will simply fix the fragmentation issue. However – this will also run on ALL your VM’s. If this is triggered all at the same time – Saturday at 3:00AM by default – it can kill the disk I/O on the disk subsystem hosting your VM/VHD files. Keep this in mind if you decide to enable it…. This recovery will only run when the state change is detected… as a recovery to the condition, so any disks that are already in a warning state will not run this recovery should you enable it. This defrag has a timeout of 1 hour…. so it should kill the defrag if it cannot complete within an hour.
Another cool thing to do – is to use the recovery action as a single run-time task. You can do this right from health explorer, to fix the disks on your own schedule:
Just click the link, and run the task:
Minimize this…. and just let it run – you can come back in 1 hour – and see if it completed, or timed out.
You can also monitor for task status in the Task Status list in the console:
On the agent – you will see the following events logged in the OpsMgr event log:
Log Name: Operations Manager
Source: Health Service Script
Date: 9/28/2009 10:50:04 AM
Event ID: 4002
Task Category: None
Level: Information
Keywords: Classic
User: N/A
Computer: OMDW.opsmgr.net
Description: Microsoft.Windows.Server.LogicalDisk.Defrag.vbs : Perform Defragmentation (disk: C:; computer: OMDW.opsmgr.net).
And when completed:
Log Name: Operations Manager
Source: Health Service Script
Date: 9/28/2009 11:03:44 AM
Event ID: 4002
Task Category: None
Level: Information
Keywords: Classic
User: N/A
Computer: OMDW.opsmgr.net
Description: Microsoft.Windows.Server.LogicalDisk.Defrag.vbs : Defragmentation completed (disk: C:; computer: OMDW.opsmgr.net): FilePercentFragmentation = 0.
Randomly, you might see a single MonitoringHost.exe process on an agent, consuming 100% CPU. (Or 50%, or 25% depending on how many cores you have). This process will stay at this level, and will not recover. If you restart the OpsMgr HealthService, the problem goes away, and might not return for days or even weeks.
This particular symptom, might be due to an XML spinlock issue… this is a core Windows OS issue, and there is a hotfix available, which I have on my HOTFIX LINK
The KB is 968967 :
“The CPU usage of an application or a service that uses MSXML 6.0 to handle XML requests reaches 100% in Windows Server 2008, Windows Vista, Windows XP Service Pack 3, or other systems that have MSXML 6.0 installed”
I have seen that most customers are affected by this issue from time to time. I have seen it very commonly in my lab, on Server 2008 Domain controllers, and my Server 2008 Hyper-V hosts…
A note on patching Server 2008:
When you go to download this hotfix for a Server 2008 machine – it is not obvious which hotfix to get. Here is the list of all available fixes:
For patching Server 2008 – you need to download the “Windows Vista” hotfix – in either x86 or x64, depending on your OS version:
Monitoring for this condition:
You can easily write a threshold monitor targeting agent or HealthService, to track the monitoringhost process \ %processor time threshold, and set it to alert when it has multiple consecutive samples above a defined threshold.
Here is an example of creating this monitor:
Authoring Pane > Monitors > New Unit Monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Consecutive Samples over Threshold.
Give it a custom name that follows your documented custom Monitor naming standard, target “Health Service”, and put this under Performance rollup.
Hit the “Select” button (in SP1 – select “Browse”) In the perf counter picker – choose a server with an installed agent, choose the Object “Process” the counter “%Processor Time” and the Instance “MonitoringHost”, and click OK.
Since there are multiple MonitoringHost processes… we will add a Wildcard to the Instance name in the monitor…. this will monitor ANY MonitoringHost process for high CPU. Set the Interval to every 1 minute.
For the number of consecutive samples, and threshold… that is up to you. For me – I will say that if I detect a single MonitoringHost process using more than 50% CPU, over all 5 consecutive samples (5 minutes) then I consider that bad:
At this point…. you can simply alert on the condition, or even try to add a recovery script that will bounce the health service. Generally, bouncing the HealthService when one of its processes is using all the CPU is not always 100% reliable… especially from a “NET STOP & NET START” type command. I have found it more reliable to just kill the MonitoringHost process in this condition, and allow it to respawn…. but your mileage may vary.
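As a hedged sketch (illustrative only, not a tested recovery script), killing the busiest MonitoringHost instance from PowerShell could look like this:

```powershell
# Sketch: kill the MonitoringHost instance with the highest cumulative
# CPU time and let the HealthService respawn it. Cumulative CPU is only
# a rough proxy for "currently spinning" – refine the selection logic
# before using this in production.
Get-Process -Name MonitoringHost -ErrorAction SilentlyContinue |
    Sort-Object CPU -Descending |
    Select-Object -First 1 |
    Stop-Process -Force
```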
http://blogs.technet.com/kevinholman/archive/2008/03/26/using-a-recovery-in-opsmgr-basic.aspx
Resource Pools in SCOM 2012 are an advancement over SCOM 2007: a resource pool can be used to host instances that have targeted workflows, and make them highly available. This allows the “All Management Servers Resource Pool” to host the instances that the RMS used to host in SCOM 2007, and lets all management servers in the AMSRP automatically load balance the old RMS workflows across the pool.
This is also used for things like the Notifications Resource Pool, which hosts two instances (or Top Level Managed Entities): the pool object itself, and the “Alert Notification Subscription Server”, which has many monitoring workflows targeting it to monitor the notification process health.
Well, we can also write workflows and target resource pools. We might do this if we want a workflow to run on the management servers, but be highly available.
In this example, I will take a VERY simple script that does nothing but log an event, and target the All Management Servers Resource Pool.
First, here is my PowerShell script:
$api = new-object -comObject 'MOM.ScriptAPI'
$api.LogScriptEvent("momscriptevent.ps1",9999,0,"this is a test event")
This script simply loads the MOM.ScriptAPI which is necessary to perform specific SCOM actions in script, such as logging events to the SCOM event log, creating property bags, submitting discovery data, etc.
Then, it logs an informational event for the script in the SCOM event log wherever it is running.
Next up – write my rule to run the script.
We cannot use the SCOM 2007R2 Authoring Console to write this rule, as we need to target the Resource Pool object which SCOM 2007R2 does not understand, nor can it reference. If you are most familiar with authoring in that tool, and you really want to use that SCOM 2007R2 Authoring Console, you can do that, and just target something else, like “Windows Server Operating System” and then change the class later in an XML editor.
Here is my manifest section. Note – I need to reference the SCOM 2012 versions of these MP’s since this MP will not work on SCOM 2007:
<Manifest>
  <Identity>
    <ID>Target.ResourcePool.Example</ID>
    <Version>1.0.0.1</Version>
  </Identity>
  <Name>Target.ResourcePool.Example</Name>
  <References>
    <Reference Alias="SC">
      <ID>Microsoft.SystemCenter.Library</ID>
      <Version>7.0.8427.0</Version>
      <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
    </Reference>
    <Reference Alias="Windows">
      <ID>Microsoft.Windows.Library</ID>
      <Version>7.5.8500.0</Version>
      <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
    </Reference>
    <Reference Alias="Health">
      <ID>System.Health.Library</ID>
      <Version>7.0.8427.0</Version>
      <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
    </Reference>
    <Reference Alias="System">
      <ID>System.Library</ID>
      <Version>7.5.8500.0</Version>
      <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
    </Reference>
  </References>
</Manifest>
Next, my simple rule. Notice – I target the AMSRP class, I add a simple scheduler module to run this workflow every 30 seconds, and I have a simple write action based on the Microsoft.Windows.PowerShellWriteAction module.
<Monitoring>
  <Rules>
    <Rule ID="Target.ResourcePool.Example.RunSampleScriptRule" Enabled="true" Target="SC!Microsoft.SystemCenter.AllManagementServersPool" ConfirmDelivery="true" Remotable="true" Priority="Normal" DiscardLevel="100">
      <Category>Custom</Category>
      <DataSources>
        <DataSource ID="SchedDS" TypeID="System!System.SimpleScheduler">
          <IntervalSeconds>30</IntervalSeconds>
          <SyncTime></SyncTime>
        </DataSource>
      </DataSources>
      <WriteActions>
        <WriteAction ID="PoshWA" TypeID="Windows!Microsoft.Windows.PowerShellWriteAction">
          <ScriptName>momscriptevent.ps1</ScriptName>
          <ScriptBody><![CDATA[
            $api = new-object -comObject 'MOM.ScriptAPI'
            $api.LogScriptEvent("momscriptevent.ps1",9999,0,"this is a test event")
          ]]></ScriptBody>
          <TimeoutSeconds>30</TimeoutSeconds>
        </WriteAction>
      </WriteActions>
    </Rule>
  </Rules>
</Monitoring>
That’s it! I will post my full XML as a sample attached to this article.
Now, when I import this MP, ONE of my management servers should start running this workflow. It will be whichever MS is hosting the AMSRP class at that time. This could change as loads are reshuffled, or as management servers are taken down for maintenance.
I have three management servers, SCOM01, SCOM02, and SCOM03. I can see this workflow is running happily on SCOM02:
I will stop the health service on SCOM02, or shut the OS down.
The last event I got from the test script was at 9:09:56 AM.
What happens now, is the other management servers are waiting for a heartbeat failure threshold to take a vote, and evict SCOM02 from the pool. The SCOM database is also a “default observer” and plays a role in the voting process.
At 9:12:36 AM, I start to see the pool manager events coming in, showing that the other management servers are redistributing the workflows. My 9999 event is now being created on SCOM03, with the first event showing up at 9:12:55 AM, or about 3 minutes after SCOM02 went down.
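While testing a failover like this, you can also check pool membership from PowerShell. A sketch using the OM 2012 cmdlets (assumes a connected Operations Manager Shell):

```powershell
# Sketch: list which management servers are members of the
# All Management Servers Resource Pool. Requires a connected
# Operations Manager Shell.
Import-Module OperationsManager
Get-SCOMResourcePool -DisplayName 'All Management Servers Resource Pool' |
    Select-Object -ExpandProperty Members |
    Select-Object DisplayName
```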
My sample XML is provided below.
A new Base OS MP Version 6.0.7026.0 has shipped. This management pack includes updated MP’s for Windows 2003 through Windows 2012 operating systems. This updated MP will import into OpsMgr 2007 or 2012 management groups.
http://www.microsoft.com/en-us/download/details.aspx?id=9296
Ok – so what's new in this MP?
The April 2013 update (version 6.0.7026.0) of the Windows Server Operating System Management Pack contains the following changes:
These fixes address the majority of known issues discussed in my last article on the Base OS MP:
http://blogs.technet.com/b/kevinholman/archive/2012/09/27/opsmgr-mp-update-new-base-os-mp-6-0-6989-0-adds-support-for-monitoring-windows-server-2012-os-and-fixes-some-previous-issues.aspx
A note on Processor utilization monitoring and collection:
Distinct rules and monitors were created for Windows Server 2008, and 2008 R2. Server 2008 will monitor and collect “Processor\% Processor Time” while Server 2008 R2 will monitor and collect “Processor Information\% Processor Time”. Overrides were included in the MP to disable the “2008” rules and monitors for the 2008 R2 instances. If for some reason you prefer to collect and monitor from “Processor” instead of “Processor Information” (for instance, if this breaks some of your existing reports), it is very simple to just override those rules and monitors back to enabled. An unsealed override will always trump a sealed override.
Known Issues in this MP:
1. The knowledge for the 2008 and 2008 R2 Total CPU Utilization Percentage monitors is incorrect – the monitors were updated to a default value of 3 samples, but the knowledge still reflects 5 samples. This is still an issue (no biggie). The 2012 monitors use 5 samples by default, with correct knowledge.
2. There are now collection rules for Cluster disks and CSV for free space (MB), free space (%), and total size (MB). If you want performance reports on other perfmon objects that are available in perfmon but not included in our MP, such as disk latency, idle time, etc., you will need to create these yourself. Since this can be complicated to get right – I wrote an article on how to do this correctly, and offer a sample MP for download: http://blogs.technet.com/b/kevinholman/archive/2012/09/27/opsmgr-authoring-performance-collection-rules-for-cluster-disks-the-right-way.aspx
3. The new monitor for Max Concurrent API has some issues and will generate a false alert in some cases. If you have servers where this is happening – disable this monitor and it will be addressed in the next release of the MP.
So.... Say I am an Exchange Administrator in a global company.... in the good old USA.
My company has recently implemented OpsMgr 2007 to monitor our Exchange servers. I am going to configure my notification subscriptions so I can get an email anytime one of my Exchange servers has an issue.
Try #1: I start by creating a notification subscription, and I don't scope it by groups or classes (all groups, all classes). I think this sounds fine. However, I instantly find I am flooded with email notifications from every single alert coming into the console. This is NOT good!
Try #2: Therefore – I decide I really need to see only Exchange alerts. I scope the notification *classes* down to just Exchange classes. This will ensure I only receive notifications from Exchange target classes. Good? Nope.... I soon find that when an alert comes in from the base OS, heartbeat, or hardware, I won't get those. I need to add those classes back. But if I add the heartbeat (Health Service Watcher) class – I will now get heartbeat failures for ALL machines… not just my exchange servers. No good.
Try #3: So – we need to scope the subscription using groups. We create a group with all our Exchange Server Windows Computer objects in it. We can manually add these in (Explicit) or we can use a dynamic rule based on criteria - I chose NetBIOS name, and used a naming standard of EX* (all my exchange servers start with "ex"). I used an "OR" statement since the wildcard is case sensitive.
Now I create a subscription - and scope it to this group - and choose ALL classes.... thinking that this way, we should get ALL notifications, including base OS, exchange, and heartbeat alerts… right?
Nope. Because of the object oriented monitoring model – we will only receive alerts from a rule/monitor with a target class that has a child relationship to the Windows Computer class. This is the only class type in the group we created. So – using the model in #3, we will get notifications from pretty much any class needed – except heartbeats. These come from the Health Service Watcher class, and have no relation to the Windows Computer class.
Try #4: I am thinking, we must add the class type to our group – and any instances of that class we are interested in. Since most object classes are a child of Windows Computer, there should not be many of these that we will have to do.
In the group – add the Health Service Watcher display name instances, in the same way we add the Windows Computer NetBIOS names:
The AND/OR verbiage is misleading…. This was opened as a bug then closed – because it is “as designed”.
Essentially – the OR-group at the top will include ANY of the AND-groups below it…. BOTH the Windows Computer objects AND the Health Service Watcher objects are included: (you can right click any group and choose to show members)
I tested all kinds of Exchange alerts, and heartbeat failures – and this works. It is possible there will be other alerts we won't get in this subscription.... IF the rule or monitor that created the alert was using a target class that was unique, and not a child of "Windows Computer".
I don’t think this will be a huge hassle moving forward… because MOST alerting is done on a target which is a child of Windows computer. If we find one that is not – we just need to go back and add that class’s instances to the groups we create for notifications.
Want alert by alert notifications? Where you can subscribe to a single alert, rule by rule, monitor by monitor? Check out:
http://code4ward.net/cs2/blogs/code4ward/archive/2007/09/19/set-notificationforalert.aspx
This is something a LOT of people make mistakes on – so I wanted to write a post on the correct way to do this properly, using a very common target as an example.
When we write a monitor for something like “Processor\% Processor Time\_Total” and target “Windows Server Operating System”…. everything is very simple. “Windows Server Operating System” is a single instance target…. meaning there is only ONE “Operating System” instance per agent. “Processor\% Processor Time\_Total” is also a single instance counter…. using ONLY the “_Total” instance for our measurement. Therefore – your performance unit monitors for this example work just like you’d think.
However – Logical Disk is very different. On a given agent – there will often be MULTIPLE instances of “Logical Disk” per agent, such as C:, D:, E:, F:, etc… We must write our monitors to take this into account.
For this reason – we cannot monitor a Logical Disk perf counter, and use “Windows Server Operating System” as the target. The only way this would work, is if we SPECIFICALLY chose the instance in perfmon. I will explain:
Bad example #1:
I want to monitor for the perf counter Logical Disk\% Free Space\<All Instances> so that I can get an alert when any logical disk is below 50% in free space.
I create a new monitor > unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold.
I target a generic class, such as “Windows Server Operating System”.
I choose the perf counter I want – and select all instances:
And save my monitor.
The problem with this workflow – is that we targeted a multi-instance perf counter, at a single instance target. This workflow will load on all Windows Server Operating Systems, and parse through all discovered instances. If an agent only has ONE instance of “Logical Disk” (C:) then this monitor will work perfectly…. if the C: drive does not have enough free space – no issues. HOWEVER… if an agent has MULTIPLE instances of logical disks, C:, D:, E:, AND those disks have different threshold results… the monitor will “flip-flop” as it examines each instance of the counter. For example, if C: is running out of space, but D: is not… the workflow will examine C:, turn red, generate an alert, then immediately examine D:, and turn back to green, closing the alert.
This is SERIOUS. This will FLOOD your environment with statechanges, and alerts, every minute, from EVERY Operating System.
A quick review of Health Explorer will show what is happening:
This monitor went “unhealthy” and issued an alert at 10:20:58AM for the C: instance:
Then went “healthy” in the same SECOND from the _Total Instance:
Then flipped back to unhealthy, at the same time – for the D: instance.
I think you can see how bad this is. I find this condition all the time, even in “mature” SCOM implementations… it just happens when someone creates a simple perf threshold monitor but doesn't understand the class model, or multi-instance perf counters. In an environment with only 500 monitored agents – I can generate over 100,000 state changes – and 50,000 alerts, in an HOUR!!!!
Ok – lesson learned – DON'T target a single-instance class using a multi-instance perf counter. So – what should I have used? Well, in this case – I should use something like “Windows 2008 Logical Disk”. But we can still screw that up! :-)
Bad example #2:
I want to monitor for the perf counter Logical Disk\% Free Space\<All Instances> so that I can get an alert when any logical disk is below 20% in free space.
I create a new monitor > Unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold.
I have learned from my mistake in Bad Example #1, so I target a more specific class, such as “Windows Server 2008 Logical Disk”.
Ack! The SAME problem! Why????
The problem is – now, instead of each Operating System instance loading this monitor, and then parsing and measuring each instance, now EACH INSTANCE of logical disk is doing the SAME THING. This is actually WORSE than before…. because the number of monitors loaded is MUCH higher, and will flood me with even more state changes and alerts than before.
Now if I look at Health Explorer – I will likely see MULTIPLE disks have gone red, and are “flip-flopping” and throwing alerts like never before.
When you dig into Health Explorer – you will see that they are being turned Unhealthy – and it isn't even their own drive letter! I will examine the F: drive monitor:
I can see it was turned unhealthy because of the free space threshold hit on the D: drive!
and then flipped back to healthy due to the available space on the C: instance:
This is very, very bad. So – what are we supposed to do???
We need to target the specific class (Windows 2008 Logical Disk) AND then use a Wildcard parameter, to match the INSTANCE name of the perf counter to the INSTANCE name of the “Logical Disk” object. Make sense? Such as – match up the “C:” perf counter instance – to the “C:” Device ID of the Logical Disk discovered in SCOM. This is actually easier than it sounds:
Good example:
I choose the perf counter I want – and instead of selecting all instances, I learn from my mistake in Bad Example #2. This time I will UNCHECK the “All Instances” box, and use the “fly-out” on the right of the “Instance:” box:
This fly-out will present wildcard options, which are discovered properties of the Windows Server 2008 Logical Disk class. You can see all of these if you viewed that class in discovered inventory. What we need to do now – is use discovered inventory to find a property, that matches the perfmon instance name. In perfmon – we see the instance names are “C:” or “D:”
In Discovered Inventory – looking at the Windows Server 2008 Logical Disk, I can see that “Device ID” is probably a good property to match on:
So – I choose “Device ID” from the fly-out, which inserts this parameter wildcard, so that the monitor on EACH DISK will ONLY examine the perf data from the INSTANCE in perfmon that matches the disk drive letter.
The wildcard parameter is actually something like this:
$Target/Property[Type="MicrosoftWindowsLibrary6172210!Microsoft.Windows.LogicalDevice"]/DeviceID$
This simply is a reference to the MP that defined the “Device ID” property on the class.
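If you were to export the management pack and look at the monitor's configuration, the wildcard lands in the InstanceName element of the perf counter configuration. A rough sketch of what that XML looks like is below (the counter names, threshold, and the “Windows!” alias are illustrative – the alias in your MP will be an auto-generated name like the MicrosoftWindowsLibrary6172210 reference shown above):

```xml
<!-- Illustrative sketch only - aliases, counter names, and threshold will vary in your MP -->
<Configuration>
  <ComputerName>$Target/Host/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
  <ObjectName>LogicalDisk</ObjectName>
  <CounterName>% Free Space</CounterName>
  <!-- This is the key line: each discovered disk only watches its own perfmon instance -->
  <InstanceName>$Target/Property[Type="Windows!Microsoft.Windows.LogicalDevice"]/DeviceID$</InstanceName>
  <Frequency>300</Frequency>
  <Threshold>10</Threshold>
</Configuration>
```

Because the InstanceName resolves per-target, the monitor hosted by the “C:” disk object only ever evaluates the “C:” perfmon instance.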
Now – no more flip-flopping, no more statechangeevent floods, no more alert storms opening and closing several times per second.
You can use this same process for any multi-instance perf object. I have a (slightly less verbose) example using SQL server HERE.
To determine if you have already messed up…. you can look at “Top 20 Alerts in an Operational Database, by Alert Count” and “Historical list of state changes by Monitor, by Day”, which are available on my SQL Query List. If these show a high alert count or monitor flip-flop, those workflows should be investigated.
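If you don't have those queries handy, a sketch of a state-change count query against the OpsDB looks something like this (the view and table names here come from commonly published OpsDB queries and may vary slightly between OpsMgr versions – verify against your own database):

```sql
-- Top 20 monitors by state change count over the last 7 days (run against the OpsDB)
SELECT TOP 20 m.DisplayName AS MonitorName,
       COUNT(sce.StateId) AS StateChanges
FROM StateChangeEvent sce WITH (NOLOCK)
JOIN State s WITH (NOLOCK) ON sce.StateId = s.StateId
JOIN MonitorView m WITH (NOLOCK) ON s.MonitorId = m.Id
WHERE sce.TimeGenerated > DATEADD(dd, -7, GETUTCDATE())
GROUP BY m.DisplayName
ORDER BY StateChanges DESC
```

A badly targeted multi-instance monitor like the bad examples above will typically sit right at the top of this list.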
Recently I discussed some of the changes in the Base OS MP version 6.0.6958.0
OpsMgr- MP Update- New Base OS MP 6.0.6958.0 adds Cluster Shared Volume monitoring, BPA, new rep
One of the changes in this newer version of the MP is the addition of a new datasource module, which runs a script to output the Network Adapter Utilization. The name of the datasource is “Microsoft.Windows.Server.2008.NetworkAdapter.BandwidthUsed.ModuleType”. This datasource module uses the timed script property bag provider, along with a generic mapper condition detection. The script name is: “Microsoft.Windows.Server.NetwokAdapter.BandwidthUsed.ModuleType.vbs”
There are 3 rules, and 3 monitors for each OS (2003 and 2008), which utilize this datasource:
Only the “Total” rules and monitors are enabled by default, the Read/Write workflows are disabled out of the box by design.
The good:
This new functionality is cool because it allows us to monitor the total utilization based on the network bandwidth as a percentage of the “total pipe”, report on this, and view the data in the console:
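For context, the percentage the script reports boils down to simple arithmetic: the observed throughput converted to bits per second, divided by the adapter's rated link speed. A minimal sketch of that calculation follows – the function name and sample numbers are mine for illustration, not taken from the MP script:

```python
def percent_bandwidth_used(bytes_total_per_sec: float, link_speed_bps: float) -> float:
    """Approximate the calculation behind 'Percent Bandwidth Used Total':
    convert observed throughput (bytes/sec) to bits/sec and express it
    as a percentage of the adapter's rated link speed."""
    if link_speed_bps <= 0:
        raise ValueError("link speed must be positive")
    return (bytes_total_per_sec * 8 / link_speed_bps) * 100

# Example: 12.5 MB/s sustained on a 1 Gbps adapter
print(round(percent_bandwidth_used(12_500_000, 1_000_000_000), 1))  # -> 10.0
```

The arithmetic itself is trivial – the cost comes from how the throughput and link speed are gathered (WMI queries from a script, per adapter), which is where the issues below come in.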
The issue:
Since there is no direct perfmon data to collect this, the information must be collected via script. I wrote about how to write this yourself HERE.
There are 4 known issues with this script in the current Base OS MP, which can cause problems in some environments:
1. When the script executes – it consumes a high amount of CPU (WMIPrvse.exe process) for a few seconds.
2. The script does not support cookdown, so it runs a cscript.exe process and an instance of the script for EACH and every network adapter in your system (physical or virtual). This makes the CPU consumption even higher, especially for systems with a large number of network adapters (such as Hyper-V servers).
3. The script does not handle teamed network adapters very well; teams are manufacturer/driver dependent and are often missing the WMI classes the script expects, so you will see “invalid class” errors on each script execution.
4. On some Windows 2003 servers, people have reported this script eventually causes a fault in netman.dll, and this can subsequently cause some additional critical services to fault/stop.
Event Type: Error
Event Source: Application Error
Event Category: (100)
Event ID: 1000
Date: 16/10/2011
Time: 4:41:09 AM
User: N/A
Computer: WSMSG7104C02
Description: Faulting application svchost.exe, version 5.2.3790.3959, faulting module netman.dll, version 5.2.3790.3959, fault address 0x0000000000008d4f.
From a CPU perspective – below is an example Hyper-V server with multiple NICs. I set the rule and monitor which use this script to run every 30 seconds for demonstration purposes (they run every 5 minutes by default).
You can see WMI (and the total CPU) spiking every 30 seconds.
After disabling all the rules and monitors which utilize this data source, we see the following from the same server:
Based on these issues, I’d probably recommend disabling these rules AND monitors for Windows 2003 and Windows 2008. The impact they create seems to outweigh the usefulness of the data they provide.
To disable these monitors and rules:
Open the Authoring pane of the console.
Highlight “Monitors” in the left pane.
In the top line – click “Scope” until you see the “Scope Management Pack Object” pop up:
In the Look For box – type “Network”:
Tick the boxes next to “Windows Server 2003 Network Adapter” and “Windows Server 2008 Network Adapter” and click OK.
Now you will see a scoped view of only the monitors that target the Windows Server network adapter classes. Expand Windows Server 2003 Network Adapter > Entity Health > Performance:
You can see that Read and Write monitors are already disabled out of the box. You need to add a new override to disable the “Total” monitor. Set enabled = false and save it to your Base OS Override MP for Windows 2003.
Now, repeat this for the Server 2008 monitor for “Percent Bandwidth Used Total”.
After disabling the two monitors that run this script, we also need to disable the rules that share it. Highlight Rules in the left pane.
Again – the read/write rules are disabled out of the box, so you only need to create two overrides: one for the Server 2003 “Percent Bandwidth Used Total” rule, and one for the rule that targets Server 2008: