Kevin Holman's System Center Blog

Posts in this blog are provided "AS IS" with no warranties, and confer no rights. Use of included script samples is subject to the terms specified in the Terms of Use.


Posts
  • Tuning tip – turning off some over-collection of events

    We often think of tuning OpsMgr by way of tuning “Alert Noise”…. by disabling rules that generate alerts that we don't care about, or modifying thresholds on monitors to make the alert more actionable for our specific environment.

    However – one area of OpsMgr that often goes overlooked, is event overcollection.  This has a cost… because these events are collected and create LAN/WAN traffic, agent overhead, OpsDB size bloat, and especially, DataWarehouse size bloat.  I have worked with customers who had a data warehouse that was over one third event data….. and they had ZERO requirement for this nor did they want it.  They were paying for disk storage, and backup expense, plus added time and resources on the framework, all for data they cared nothing about.

    MOST of these events are enabled out of the box, and are default OpsMgr collection rules from the “System Center Core Monitoring” MP.  These events are items like “config requested”, “config delivered”, “new config active”.  They might be interesting, but there is no advanced analysis included to use these to detect a problem.  In small environments, they are not usually a big deal.  But in large agent count environments, these events can account for a LOT of data, and provide little value unless you are doing something advanced in analyzing them.  I have yet to see a customer who did that.

     

    At a high level – here is how I like to review these events:

    1. Run the “Most Common Events” query against your OpsDB and review the output.
    2. Create a “My Workspace” view for each event that has a HIGH event count.
    3. Examine the event details for value to YOU.
    4. View the rule that collected the event.
      a. Does the rule also alert or do anything special, or does it simply collect the event?
      b. Do you think the event is required for any special reporting you do?
    5. Create an Override, in an Override MP for the rule source management pack, to disable the rule.
    6. Continue to the next event in the query output, and evaluate it.

     

    So, what I like to do – is to run the “Most Common Events” query against the OpsDB, and examine the top events, and consider disabling these event collection rules:

    Most common events by event number and event publisher name:

    SELECT top 20 Number as EventID, COUNT(*) AS TotalEvents, Publishername as EventSource
    FROM EventAllView eav with (nolock)
    GROUP BY Number, Publishername
    ORDER BY TotalEvents DESC
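
    Since the Data Warehouse usually suffers the most bloat, a similar query can be run there.  This is a sketch assuming the standard OpsMgr 2007 R2 warehouse schema (the Event.vEvent view and its EventDisplayNumber column) – verify the view names against your own warehouse before relying on it:

    /* Most common events in the Data Warehouse - run against the OperationsManagerDW database */
    SELECT top 20 EventDisplayNumber AS EventID, COUNT(*) AS TotalEvents
    FROM Event.vEvent ev with (nolock)
    GROUP BY EventDisplayNumber
    ORDER BY TotalEvents DESC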

    The trick is – to run this query periodically – and to examine the most common events for YOUR environment.  The easiest way to view these events – to determine their value – is to create a new Events view in My Workspace, for each event – and then look at the event data, and the rule that collected it:  (I will use a common event 21024 as an example:)

     

    [image: Events view in My Workspace showing the collected 21024 events]

     

    [image: event details for event 21024]

     

    What we can see – is that this is a very typical event, and there is likely no real value for collecting and storing this event in the OpsDB or Warehouse.

    Next – I will examine the rule.  I will look at the Data Source section, and the Response section.  The purpose here is to get a good idea of where this collection rule is looking, what events it is collecting, and if there is also an alert in the response section.  If there is an alert in the response section – I assume this is important, and will generally leave these rules enabled.

    If the rule simply collected the event (no alerting), is not used in any reports that I know about (rare condition) and I have determined the event provides little to no value to me, I disable it.  You will find you can disable most of the top consumers in the database.

     

    Here is why I consider it totally cool to disable these uninteresting event collection rules:

    • If they are really important – there will be a different, alert-generating rule to fire an alert.
    • They fill the databases, agent queues, agent load, and network traffic with unimportant information.
    • While troubleshooting a real issue – we would examine the agent event log – we wouldn’t search through the database for collected events.
    • Reporting on events is really slow – because we cannot aggregate them, so views and reports don’t work well with events.
    • If we find we do need one later – simply remove the override.

     

    Here is an example of this one:

    [image: properties of the rule that collects this event]

     

    So – I create an override in my “Overrides – System Center Core” MP, and disable this rule “for all objects of class”.
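
    For reference – the override the console wizard creates lands in the unsealed override MP as XML similar to the following.  This is a hand-written sketch: the ID, the reference aliases, and the rule and context names are placeholders, not the exact values the console would emit for this rule:

    <Overrides>
      <!-- Disables an event collection rule for all objects of its target class.
           "SCLibrary" and the Rule/Context names below are placeholder values. -->
      <RulePropertyOverride ID="DisableEventCollection.Override" Context="SCLibrary!Microsoft.SystemCenter.HealthService" Enforced="false" Rule="SCLibrary!Placeholder.EventCollection.Rule" Property="Enabled">
        <Value>false</Value>
      </RulePropertyOverride>
    </Overrides>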

     

    Here are some very common event ID’s that I will generally end up disabling their corresponding event collection rules:

     

    1206
    1210
    1215
    1216
    10102
    10401
    10403
    10409
    10457
    10720
    11771
    21024
    21025
    21402
    21403
    21404
    21405
    29102
    29103

     

    I don't recommend everyone disable all of these rules… I recommend you periodically view your top 10 or 20 events… and then review them for value.  Just knocking out the top 10 events will often free up 90% of the space they were consuming.

    The above events are the ones I run into in most of my customers… and I generally turn these off, as we get no value from them.  You might find you have some other events as your top consumers.  I recommend you review them in the same manner as above – methodically.  Then revisit this every month or two to see if anything changed.
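
    A quick way to see whether your tuning made a difference is to trend event insertion per day, before and after the overrides take effect.  This sketch assumes EventAllView exposes a TimeAdded column, as it does in OpsMgr 2007 R2:

    /* Events inserted into the OpsDB per day - run before and after tuning to confirm the drop */
    SELECT CONVERT(VARCHAR(10), TimeAdded, 102) AS DayAdded, COUNT(*) AS EventsPerDay
    FROM EventAllView with (nolock)
    GROUP BY CONVERT(VARCHAR(10), TimeAdded, 102)
    ORDER BY DayAdded DESC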

    I’d also love to hear if you have other events that you see as top consumers that aren’t in my list above… SOME events are created from script (conversion MP’s) and unfortunately you cannot do much about those, because you would have to disable the script to fix them.  I’d be happy to give feedback on those, or add any new ones to my list.

  • Creating Groups of Health Service Watcher Objects based on other Groups

     

    It has been a well-known requirement for most customers, to be able to create Groups of Windows Computers that also contain corresponding Health Service Watcher objects.  This was needed for Alert Notification subscriptions so that different teams could receive alert notifications filtered by groups, but also include alerts from the Watcher, such as Heartbeat failure and Computer Unreachable.  There are several articles on this but I will reference a very popular one, on Tim’s site: 

    http://www.scom2k7.com/dynamic-computer-groups-that-send-heartbeat-alerts/

    Essentially, we needed to add an extra membership rule, to the XML, that would also add any Health Service Watcher objects that have a relationship to the Windows Computer objects already in the group.  We did this with the following XML:

    <MembershipRule>
      <MonitoringClass>$MPElement[Name="SC!Microsoft.SystemCenter.HealthServiceWatcher"]$</MonitoringClass>
      <RelationshipClass>$MPElement[Name="MicrosoftSystemCenterInstanceGroupLibrary!Microsoft.SystemCenter.InstanceGroupContainsEntities"]$</RelationshipClass>
      <Expression>
        <Contains>
          <MonitoringClass>$MPElement[Name="SC!Microsoft.SystemCenter.HealthService"]$</MonitoringClass>
          <Expression>
            <Contained>
              <MonitoringClass>$MPElement[Name="Windows!Microsoft.Windows.Computer"]$</MonitoringClass>
              <Expression>
                <Contained>
                  <MonitoringClass>$Target/Id$</MonitoringClass>
                </Contained>
              </Expression>
            </Contained>
          </Expression>
        </Contains>
      </Expression>
    </MembershipRule>

    However, what if we ONLY want a group of Health Service Watcher objects, and NOT the Windows Computers – BUT we wish to base the HSW membership list on another group of Windows Computers?  This is useful if we want to create availability reports for a group of Windows Computers, but need to base the report on the availability of a specific up/down monitor, and not anything related to Windows Computer objects.

    Here is a code example of exactly that:

    In this sample – we will create a simple group of Windows Computers whose names start with “DB”.  Then – we will create another group containing only the HSW objects corresponding to that computer group.

    <ManagementPack ContentReadable="true" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <Manifest>
        <Identity>
          <ID>grouptest</ID>
          <Version>1.0.0.8</Version>
        </Identity>
        <Name>grouptest</Name>
        <References>
          <Reference Alias="MSCIGL">
            <ID>Microsoft.SystemCenter.InstanceGroup.Library</ID>
            <Version>6.1.7221.0</Version>
            <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
          </Reference>
          <Reference Alias="SC">
            <ID>Microsoft.SystemCenter.Library</ID>
            <Version>6.1.7221.0</Version>
            <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
          </Reference>
          <Reference Alias="Windows">
            <ID>Microsoft.Windows.Library</ID>
            <Version>6.1.7221.0</Version>
            <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
          </Reference>
          <Reference Alias="Health">
            <ID>System.Health.Library</ID>
            <Version>6.1.7221.0</Version>
            <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
          </Reference>
          <Reference Alias="System">
            <ID>System.Library</ID>
            <Version>6.1.7221.0</Version>
            <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
          </Reference>
        </References>
      </Manifest>
      <TypeDefinitions>
        <EntityTypes>
          <ClassTypes>
            <ClassType ID="grouptest.compgroup" Accessibility="Internal" Abstract="false" Base="SC!Microsoft.SystemCenter.ComputerGroup" Hosted="false" Singleton="true" />
            <ClassType ID="grouptest.SQLWatchers" Accessibility="Internal" Abstract="false" Base="MSCIGL!Microsoft.SystemCenter.InstanceGroup" Hosted="false" Singleton="true" />
          </ClassTypes>
        </EntityTypes>
      </TypeDefinitions>
      <Monitoring>
        <Discoveries>
          <Discovery ID="grouptest.DiscoverSQLServersComputerGroup" Enabled="true" Target="grouptest.compgroup" ConfirmDelivery="true" Remotable="true" Priority="Normal">
            <Category>Discovery</Category>
            <DiscoveryTypes>
              <DiscoveryRelationship TypeID="SC!Microsoft.SystemCenter.ComputerGroupContainsComputer" />
            </DiscoveryTypes>
            <DataSource ID="GP" TypeID="SC!Microsoft.SystemCenter.GroupPopulator">
              <RuleId>$MPElement$</RuleId>
              <GroupInstanceId>$MPElement[Name="grouptest.compgroup"]$</GroupInstanceId>
              <MembershipRules>
                <MembershipRule>
                  <MonitoringClass>$MPElement[Name="Windows!Microsoft.Windows.Computer"]$</MonitoringClass>
                  <RelationshipClass>$MPElement[Name="SC!Microsoft.SystemCenter.ComputerGroupContainsComputer"]$</RelationshipClass>
                  <Expression>
                    <RegExExpression>
                      <ValueExpression>
                        <Property>$MPElement[Name="Windows!Microsoft.Windows.Computer"]/PrincipalName$</Property>
                      </ValueExpression>
                      <Operator>MatchesWildcard</Operator>
                      <Pattern>DB*</Pattern>
                    </RegExExpression>
                  </Expression>
                </MembershipRule>
              </MembershipRules>
            </DataSource>
          </Discovery>
          <Discovery ID="grouptest.DiscoverSQLWatchers" Enabled="true" Target="grouptest.SQLWatchers" ConfirmDelivery="true" Remotable="true" Priority="Normal">
            <Category>Discovery</Category>
            <DiscoveryTypes>
              <DiscoveryRelationship TypeID="MSCIGL!Microsoft.SystemCenter.InstanceGroupContainsEntities" />
            </DiscoveryTypes>
            <DataSource ID="GP" TypeID="SC!Microsoft.SystemCenter.GroupPopulator">
              <RuleId>$MPElement$</RuleId>
              <GroupInstanceId>$MPElement[Name="grouptest.SQLWatchers"]$</GroupInstanceId>
              <MembershipRules>
                <MembershipRule>
                  <MonitoringClass>$MPElement[Name="SC!Microsoft.SystemCenter.HealthServiceWatcher"]$</MonitoringClass>
                  <RelationshipClass>$MPElement[Name="MSCIGL!Microsoft.SystemCenter.InstanceGroupContainsEntities"]$</RelationshipClass>
                  <Expression>
                    <Contains>
                      <MonitoringClass>$MPElement[Name="SC!Microsoft.SystemCenter.HealthService"]$</MonitoringClass>
                      <Expression>
                        <Contained>
                          <MonitoringClass>$MPElement[Name="grouptest.compgroup"]$</MonitoringClass>
                        </Contained>
                      </Expression>
                    </Contains>
                  </Expression>
                </MembershipRule>
              </MembershipRules>
            </DataSource>
          </Discovery>
        </Discoveries>
      </Monitoring>
      <LanguagePacks>
        <LanguagePack ID="ENU" IsDefault="true">
          <DisplayStrings>
            <DisplayString ElementID="grouptest">
              <Name>Group Test</Name>
              <Description />
            </DisplayString>
            <DisplayString ElementID="grouptest.compgroup">
              <Name>SQL Servers Computer Group</Name>
            </DisplayString>
            <DisplayString ElementID="grouptest.DiscoverSQLServersComputerGroup">
              <Name>Discovery for SQL Servers Computer Group</Name>
            </DisplayString>
            <DisplayString ElementID="grouptest.DiscoverSQLWatchers">
              <Name>Discovery for SQL Health Service Watchers Group</Name>
              <Description />
            </DisplayString>
            <DisplayString ElementID="grouptest.SQLWatchers">
              <Name>SQL Health Service Watchers Group</Name>
            </DisplayString>
          </DisplayStrings>
        </LanguagePack>
      </LanguagePacks>
    </ManagementPack>

     

    The key to this is the specific reference of the other group – shown here:

    <MembershipRules>
      <MembershipRule>
        <MonitoringClass>$MPElement[Name="SC!Microsoft.SystemCenter.HealthServiceWatcher"]$</MonitoringClass>
        <RelationshipClass>$MPElement[Name="MSCIGL!Microsoft.SystemCenter.InstanceGroupContainsEntities"]$</RelationshipClass>
        <Expression>
          <Contains>
            <MonitoringClass>$MPElement[Name="SC!Microsoft.SystemCenter.HealthService"]$</MonitoringClass>
            <Expression>
              <Contained>
                <MonitoringClass>$MPElement[Name="grouptest.compgroup"]$</MonitoringClass>
              </Contained>
            </Expression>
          </Contains>
        </Expression>
      </MembershipRule>
    </MembershipRules>
  • What is config churn?

    There have been a couple good articles briefly covering this topic…. you might have read them.  I will reference some below.  Config churn is basically, when your RMS is in an almost never-ending loop of generating config.  This can be caused by “less than optimized” management packs, pushing agents all the time, or injecting major changes into a management group, such as overrides or custom rules and monitors, or importing updated management packs.  By examining this topic in depth – we will re-state some already known best practices with maintaining a healthy management group, and get some deeper knowledge as to why they are best practices in the first place.

     

    Any time you push agents, or create rules and monitors, or overrides for widespread classes….. you can create a config update on the RMS that must be sent down to ALL agents in the management group.  For small management groups (under 500 agents) this is generally not a big deal and processes rather quickly.  For large management groups over 1000 agents, this can cause high resource utilization of the RMS and SQL Database, in terms of CPU, Memory, and Disk I/O.  This can impact data insertion, and console performance during these times.  For these reasons, we like to keep those activities down to a minimum during working hours, and schedule these major changes in an off-hours maintenance window.

    What about “less than optimized” management packs?  What does that mean?  Well, this means management packs that you might be using, that have poorly written discoveries.

    We have long known that a worst practice in Management Pack development is to have a discovery that discovers instances of a class, with properties on those instances that are likely to change frequently.  Here is a write-up from OpsManJam on the topic:  LINK

     

    Ok… wait… Whaaaaat?

    Let me put that in English:

    Say we have a discovery for a Logical Disk.  This will discover any logical disk, like C:, D:, E:, Q:, etc….  When we write the discovery for a logical disk, we can add properties to that discovery.  These are attributes of the discovered instances.  So – in this case – let’s say we decided to add “Size” of the disk as a property, and “Free Space” as a property.  And for the discovery frequency – we will run this discovery every hour, looking for new disks.

    “Size” is an excellent property for the Logical Disk class.  We like to know the size of the disks…. we can use this property to group them if needed.  “Size” of a logical disk is not something that we would expect to change very often.

    “Free Space” is a horrible property for the Logical Disk class.  Free space is something that will likely change, just a small amount even, between each run of the discovery.  Free space is a property that is likely to change frequently, therefore – it should NOT be used in a discovery.

     

    Make sense?

    Ok – so… what's the big deal?

    Well, the agent will run almost all discoveries that it knows about when the health service starts up (like when you bounce the service, or after a reboot).  It will always send this discovery data to the management server.  (this is another reason why agents restarting all the time is very bad)  Then, it will run them based on the “Interval” frequency specified on the discovery.  Sometimes this is as frequent as once per hour, sometimes as long as once per day.  When the discovery runs, the agent will inspect the discovery data that it gets, and compare it to the last discovery data it sent to the management server.  If nothing changed – the agent drops the discovery data and does nothing.  IF anything changed in the values of the discovery data – it will re-submit the new data to the management server, which will submit this data to the database.  The RMS will detect the change, and will have to recalculate (regenerate) configuration.  You will see this on the RMS as a 21025 event:

    Log Name:      Operations Manager
    Source:        OpsMgr Connector
    Date:          9/27/2009 11:51:49 PM
    Event ID:      21025
    Task Category: None
    Level:         Information
    Keywords:      Classic
    User:          N/A
    Computer:      OMRMS.opsmgr.net
    Description:
    OpsMgr has received new configuration for management group PROD1 from the Configuration Service.  The new state cookie is "D7 9B A4 BE 00 90 CF 13 35 B5 9B 5F 3B 14 FF 78 D6 13 9A 2D "

    The 21025 event isn’t really “bad”… it simply means the config service did its job.  It re-generated its configuration file from the database data, and wrote it to:  \Program Files\System Center Operations Manager 2007\Health Service State\Connector Configuration Cache\<MGNAME>\OpsMgrConnector.Config.xml  The problem arises when this config file gets large (like in large agent count environments) and when the “Config Instance Space” is large (number of discovered objects in total).  Recalculating this config can have a significant impact on the disk where the file exists on the RMS, use lots of memory and CPU on the RMS for the config service, and use significant disk I/O on the SQL database.

    If the RMS is in a perpetual cycle of recalculating config, and sending these config updates to all agents…. the performance of the management group is impacted.

     

    Daniele Grandini of Quae Nocent Docent is pretty much the “godfather” of good information on the 21025 event.  Read his 3 part series on config churn here:

    http://nocentdocent.wordpress.com/2009/07/09/troubleshooting-21025-events-wrap-up/

     

     

    So – what can I do if I think I have too much config churn?

     

    The biggest problem causing the most frequent config updates is management packs with noisy discoveries.  However, let’s wrap up all the issues that can cause it, and what you can do:

     

    1. New agents.  Discover/install/approve new agents in bulk and off-hours.
    2. Overrides.  Set overrides during off-hours, or create override MP’s in a lab, then synch to production management groups during scheduled off-hours times.
    3. Custom rules and monitors.  Create these during off-hours, or create using the authoring console, test in a lab, then import to production during off-hours.
    4. Newly discovered instances.  For instance – someone adds a new disk, or SQL database, or DNS zone, to an existing agent.  Not much we can do about this, except the expectation that this would be done during off-hours.
    5. Group membership changes.
    6. Management packs with noisy discovery properties.  See below.

     

    Ok – the remainder of this article will touch on #6.

     

    How can I tell which discoveries are noisy?

     

    Daniele Grandini has put together a good query on this, from his link:  http://nocentdocent.wordpress.com/2009/05/23/how-to-get-noisy-discovery-rules/

     

    I will repost these (slightly modified) below:

     

    /* Top Noisy Rules in the last 24 hours */

    select ManagedEntityTypeSystemName, DiscoverySystemName, count(*) As 'Changes'
    from
    (select distinct
    MP.ManagementPackSystemName,
    MET.ManagedEntityTypeSystemName,
    PropertySystemName,
    D.DiscoverySystemName, D.DiscoveryDefaultName,
    MET1.ManagedEntityTypeSystemName As 'TargetTypeSystemName', MET1.ManagedEntityTypeDefaultName 'TargetTypeDefaultName',
    ME.Path, ME.Name,
    C.OldValue, C.NewValue, C.ChangeDateTime
    from dbo.vManagedEntityPropertyChange C
    inner join dbo.vManagedEntity ME on ME.ManagedEntityRowId=C.ManagedEntityRowId
    inner join dbo.vManagedEntityTypeProperty METP on METP.PropertyGuid=C.PropertyGuid
    inner join dbo.vManagedEntityType MET on MET.ManagedEntityTypeRowId=ME.ManagedEntityTypeRowId
    inner join dbo.vManagementPack MP on MP.ManagementPackRowId=MET.ManagementPackRowId
    inner join dbo.vManagementPackVersion MPV on MPV.ManagementPackRowId=MP.ManagementPackRowId
    left join dbo.vDiscoveryManagementPackVersion DMP on DMP.ManagementPackVersionRowId=MPV.ManagementPackVersionRowId
    AND CAST(DefinitionXml.query('data(/Discovery/DiscoveryTypes/DiscoveryClass/@TypeID)') AS nvarchar(max)) like '%'+MET.ManagedEntityTypeSystemName+'%'
    left join dbo.vManagedEntityType MET1 on MET1.ManagedEntityTypeRowId=DMP.TargetManagedEntityTypeRowId
    left join dbo.vDiscovery D on D.DiscoveryRowId=DMP.DiscoveryRowId
    where ChangeDateTime > dateadd(hh,-24,getutcdate())
    ) As #T
    group by ManagedEntityTypeSystemName, DiscoverySystemName
    order by count(*) DESC

    and

    /* Modified properties in the last 24 hours */

    select distinct
    MP.ManagementPackSystemName,
    MET.ManagedEntityTypeSystemName,
    PropertySystemName,
    D.DiscoverySystemName, D.DiscoveryDefaultName,
    MET1.ManagedEntityTypeSystemName As 'TargetTypeSystemName', MET1.ManagedEntityTypeDefaultName 'TargetTypeDefaultName',
    ME.Path, ME.Name,
    C.OldValue, C.NewValue, C.ChangeDateTime
    from dbo.vManagedEntityPropertyChange C
    inner join dbo.vManagedEntity ME on ME.ManagedEntityRowId=C.ManagedEntityRowId
    inner join dbo.vManagedEntityTypeProperty METP on METP.PropertyGuid=C.PropertyGuid
    inner join dbo.vManagedEntityType MET on MET.ManagedEntityTypeRowId=ME.ManagedEntityTypeRowId
    inner join dbo.vManagementPack MP on MP.ManagementPackRowId=MET.ManagementPackRowId
    inner join dbo.vManagementPackVersion MPV on MPV.ManagementPackRowId=MP.ManagementPackRowId
    left join dbo.vDiscoveryManagementPackVersion DMP on DMP.ManagementPackVersionRowId=MPV.ManagementPackVersionRowId
        AND CAST(DefinitionXml.query('data(/Discovery/DiscoveryTypes/DiscoveryClass/@TypeID)') AS nvarchar(max)) like '%'+MET.ManagedEntityTypeSystemName+'%'
    left join dbo.vManagedEntityType MET1 on MET1.ManagedEntityTypeRowId=DMP.TargetManagedEntityTypeRowId
    left join dbo.vDiscovery D on D.DiscoveryRowId=DMP.DiscoveryRowId
    where ChangeDateTime > dateadd(hh,-24,getutcdate())
    ORDER BY MP.ManagementPackSystemName, MET.ManagedEntityTypeSystemName

     

    Wow – that returned a LOT of discoveries running all the time!  What can I do?

     

    • Don't import too many MP’s!  The FIRST line of defense – is NOT to import ANY management packs into a management group that you don't absolutely need RIGHT THEN.  Management packs are constantly updated, and by the time you have an actual SLA in a technology area – there will likely be a newer, better MP available for it.  The biggest mistake many customers make is to import any available MP for a technology that they have internally.  They end up with a FLOOD of alerts, big fat databases, slow consoles, and lots of weird errors.  MP’s should be transitioned slowly, one at a time – tuning and resolving as you go.

     

    • Disable the noisy discoveries.  Probably not a great solution, unless they discover objects that you really don't care about – but there are other objects in the MP that you DO want to monitor.  However – what I like to do is to look at the discovery – and see what class it discovers.  Then, using MPViewer – look at the rules and monitors that target that class.  I might find I really don't need these rules and monitors in my business, so disabling the discovery is the simplest option to solve the performance problem.

     

    • Increase the interval of the discovery frequency.  This means… essentially – change any “bad” discoveries to run only once per day (86400 seconds), once every 7 days (604800 seconds), or more (up to 4 weeks = 2419200 seconds).
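
    In override MP XML, that frequency change looks roughly like the following.  This is a sketch: the ID, reference aliases, discovery name, and context class are placeholders for whichever discovery you are overriding, and the parameter name must match what that discovery’s data source actually exposes (typically IntervalSeconds):

    <Overrides>
      <!-- Stretches a noisy discovery out to once per day (86400 seconds).
           "VendorMP" and the Discovery/Context names below are placeholder values. -->
      <DiscoveryConfigurationOverride ID="NoisyDiscovery.IntervalSeconds.Override" Context="Windows!Microsoft.Windows.Server.OperatingSystem" Enforced="false" Discovery="VendorMP!Vendor.Noisy.Discovery" Parameter="IntervalSeconds">
        <Value>86400</Value>
      </DiscoveryConfigurationOverride>
    </Overrides>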

     

    • Add a “synch time” override to the discovery – if possible.  This option is not available unless the MP author of the discovery exposed it.  What this will do is cause all the agents to ONLY run the discovery at a distinct and specified time every day (say…. 1AM).  This might cause too much discovery data to flood in at one time… but since it will all come in at the same time – it won’t cause constant config churn all throughout the day.  I have never done this because I don't know how bad the impact of all the discovery data coming in at the same time is…. so this is more of an idea I had.

     

    • Re-write the discovery.  If this is a custom MP – rewrite the discovery/MP, and remove that property which changes too often, or fix it.  If this is a sealed MP – talk to the vendor and get them to fix their MP.  Or – consider disabling it – and re-writing the discovery yourself – and fix it until the vendor is able to release an update.

     

    • Make sure your hardware and software is optimized for scalability.  On your RMS – it is good to place your config file on fast disks, especially in large environments.  I have worked with very large customers who were experiencing config churn, but had zero ill effects, because their RMS disk I/O was on a 4 spindle RAID10 with 15K spindles, CPU and memory were really good, and their SQL database disk I/O for the OpsDB was excellent with plenty of breathing room.  I have also worked with smaller agent counts, where config churn has a serious impact…. mostly due to the RMS config file being placed on the same RAID spindle set as the OS and pagefile, using only 2 older 10,000 RPM disks in a RAID1 mirror.  The SQL disk I/O was also just borderline for their agent count.  In these environments – I see config churn having a bigger impact.

     

    • Re-run the queries periodically – especially after importing/upgrading to a new management pack in your management group.  This “instance space change” report should be part of your testing and evaluation of a new MP when brought into your lab…. if you have a large agent count environment.

     

     

    Some very common discoveries I have seen – that have properties that change very frequently – are listed below.  I often recommend these be overridden to run once per day (86400 seconds) or once per week (604800 seconds) if the problem is serious, or still exists when running once per day (large agent counts).

     

    The top noisy MP’s with bad discoveries I find in customer environments – are almost ALWAYS some order of the following:

    • IIS MP
    • SQL MP (old versions only – see note below)
    • DNS MP (old versions only – see note below)
    • ADMP
    • DPM

     

    Discovery Display Name | Discovery Target Class | Discovered Type | Property that is changing too much | Default frequency | Modified frequency
    --- | --- | --- | --- | --- | ---
    Windows Internet Information Services Base Classes Discovery Rule | IIS 2000 Server Role | IIS NNTP Virtual Server | MaxMessageSize | 3600 | 86400
    Windows Internet Information Services Base Classes Discovery Rule | IIS 2003 Server Role | IIS FTP Site |  | 3600 | 86400
    Windows Internet Information Services Web Sites x-x Discovery Rule (4 of these) | IIS 2003 Web Server | IIS Web Site | LoggingEnabled | 3600 | 86400
    DNS 2003 Component Discovery | DNS 2003 Server | DNS 2003 Zone | SerialNumber | 21600 | 604800
    DNS 2008 Component Discovery | DNS 2008 Server | DNS 2008 Zone | SerialNumber | 21600 | 604800
    DNS 2003 Component Discovery | DNS 2003 Server | DNS Domain | PrimaryServer | 21600 | 604800
    DNS 2003 Component Discovery | DNS 2008 Server | DNS Domain | PrimaryServer | 21600 | 604800
    AD Remote Topology Discovery | Active Directory Domain Controller Server 2000 Computer Role | Active Directory Connection Object | LastSuccessfulSyncTime | 86400 | 604800
    AD Remote Topology Discovery | Active Directory Domain Controller Server 2003 Computer Role | Active Directory Connection Object | LastSuccessfulSyncTime | 86400 | 604800
    AD Remote Topology Discovery | Active Directory Domain Controller Server 2008 Computer Role | Active Directory Connection Object | LastSuccessfulSyncTime | 86400 | 604800
    Discover SQL 2000 Databases** | SQL 2000 DB Engine | SQL 2000 DB | DatabaseSize, DatabaseSizeNumeric, LogSize, LogSizeNumeric | 1800 | 86400
    Discover Databases for a Database Engine** | SQL 2005 DB Engine | SQL 2005 DB | DatabaseSize, DatabaseSizeNumeric, LogSize, LogSizeNumeric | 7200 | 86400
    Discover Databases for a Database Engine** | SQL 2008 DB Engine | SQL 2008 DB | DatabaseSize, DatabaseSizeNumeric, LogSize, LogSizeNumeric | 7200 | 86400
    Discover File Groups and Files** | SQL 2005 DB Engine | SQL 2005 DB File, SQL 2005 DB File Group | Size | 7200 | 86400
    Discover File Groups and Files** | SQL 2008 DB Engine | SQL 2008 DB File, SQL 2008 DB File Group | Size | 7200 | 86400
    Discover Network Adapters (Only Enabled) | Windows Server 2003 Operating System | Windows Server 2003 Network Adapter | Name, Description, IPAddress | 86820 | 604800
    Discover Network Adapters (Only Enabled) | Windows Server 2008 Operating System | Windows Server 2008 Network Adapter | Name, Description, IPAddress | 86820 | 604800

     

     

    The above is just a sample – you should examine the output of the queries above and see what is impacting your management group the most.

     

    **Note on the SQL MP – There is a new SQL MP which resolved config churn for that MP, version 6.1.314.36 and later.  This new MP no longer churns on the database class properties for DB and log size.  I strongly recommend upgrading to this version of the SQL MP.  See:  http://blogs.technet.com/b/kevinholman/archive/2010/08/16/sql-mp-version-6-1-314-36-released-adds-support-for-sql-2008r2-and-many-other-changes.aspx

     

    **Note on the DNS MP – There is a new DNS MP which resolved config churn for that MP, version 6.0.7000.0 and later. This new MP no longer churns on the PrimaryServer and SerialNumber class properties. I strongly recommend upgrading to this version of the DNS MP. See: http://blogs.technet.com/b/kevinholman/archive/2011/02/24/dns-mp-update-ships-support-for-dns-on-windows-server-2008-r2-and-many-fixes.aspx

     

     

    Some deeper-level information on this topic:

     

     

    What is the maximum value I can set a discovery frequency to?  Supposedly – the MAX value in seconds is 2419200, which is 4 weeks.  Normally – discoveries should not have to be stretched out so long – only if they are creating a problem.  Setting this number to 4 weeks essentially negates the discovery….  which is no big deal if it is a discovery that is running for something already discovered.  However – for something like SQL databases – that means it might take 4 weeks to start monitoring a new database.  That is not good.  There is a workaround, however, for being able to use the extended frequencies and still discover items – when you restart the HealthService of an agent – it will immediately run all discoveries that apply to it that don't have a synch time set.  This means – as a workaround to the workaround here – you can simply restart the agent if you add a new database, or IIS website, and need sooner monitoring than the max frequency time.

     

     

    RMS Churn:  When a discovery property change comes in for an instance that is hosted by an agent, the RMS creates new config and sends it to that agent.  This is a normal process – but we want to keep this from happening too frequently.  It isn't terribly expensive unless the number of instances hosted by the agent is very high.  (As in – a typical agent might have 40 instances, but a SQL server with 1000 databases has 1040 instances.)

    Next up – if the discovery property change occurs and the instance that sent up the change is a member of a group, this is worse – because it now causes a config recalc for the agent, AND a config recalc for the RMS.  This is because the RMS has to evaluate group population membership, since it hosts the group instances and one or more of those instances changed – which might affect group membership.  For instance – if the SQL Database size property changes – this is no big deal.  UNLESS you have created groups of SQL databases somewhere in the management group – and this changed database is a member of one or more groups.  This will cause the RMS to update its own config.

    Lastly – when a discovery property comes in for an instance of a class that is hosted by the RMS – this causes the RMS to completely recalculate its own config as well, and update its local health service config file.  This is very expensive…. and these instances should be given top consideration in fixing their discoveries, or extending them to reduce the issue.  The most common ones I see are the DNS Domain, DNS Zone, and AD Connection objects, which I have highlighted in red above.  Changes to these instances are VERY expensive – because these are logical instances not hosted by any SINGLE agent – they get hosted by the RMS.  When they change – it forces the RMS to regenerate its own config.  This will be evident by a LARGE number of 21025 events showing up in the RMS OpsMgr event log.  Generally – we only want to see this file updated when necessary – two to three times per hour is ideal.  However – if you are running the DNS Management pack or ADMP, you are likely seeing this event every few MINUTES.  These DNS discoveries should be evaluated and overridden.

     

    Other items hosted by the RMS are groups.  When group membership changes – this impacts RMS performance.  This is due to the fact that the RMS hosts the group instances, and the relationships to what each group contains.  When group membership changes – the RMS generates new config.  This will also show up as a 21025 Event in the RMS OpsMgr event log.  So if you have tackled the discoveries from MP’s changing frequently – the next thing to look at is groups.  If you have a large management group, and you think this might be impacting you – one of the things you can do is to slow down the group populator module.  By default – this runs every 30 seconds. 

    We have a registry setting to make group calculation run less often, to lower the performance hit on the database.  When you make this setting less frequent, group calculation will poll the database less often, with the understanding that the latency of group membership discovery will increase:

    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\GroupCalcPollingIntervalMilliseconds

    The default is 30,000 milliseconds (30 seconds).  You can create this new DWORD value to control the setting.
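As a sketch, this .reg fragment (applied on the RMS) sets the polling interval to 15 minutes – the 900,000 ms value is an example for illustration, not a recommendation:

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0]
; 0x000DBBA0 = 900000 milliseconds = 15 minutes (example value)
"GroupCalcPollingIntervalMilliseconds"=dword:000dbba0
```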

     

    If you want to see all the instance types hosted by the RMS, run this query against the Operations Database:

     

    DECLARE @RelationshipTypeId_Manages UNIQUEIDENTIFIER
    SELECT @RelationshipTypeId_Manages = dbo.fn_RelationshipTypeId_Manages()
    SELECT bme.FullName, dt.TopLevelEntityName, dt.BaseEntityName, dt.TypedEntityName
    FROM BaseManagedEntity bme
    RIGHT JOIN (
        SELECT
            HBME.BaseManagedEntityId AS HS_BMEID,
            TBME.FullName AS TopLevelEntityName,
            BME.FullName AS BaseEntityName,
            TYPE.TypeName AS TypedEntityName
        FROM BaseManagedEntity BME WITH(NOLOCK)
        INNER JOIN TypedManagedEntity TME WITH(NOLOCK)
            ON BME.BaseManagedEntityId = TME.BaseManagedEntityId
            AND BME.IsDeleted = 0 AND TME.IsDeleted = 0
        INNER JOIN BaseManagedEntity TBME WITH(NOLOCK)
            ON BME.TopLevelHostEntityId = TBME.BaseManagedEntityId
            AND TBME.IsDeleted = 0
        INNER JOIN ManagedType TYPE WITH(NOLOCK)
            ON TME.ManagedTypeID = TYPE.ManagedTypeID
        LEFT JOIN Relationship R WITH(NOLOCK)
            ON R.TargetEntityId = TBME.BaseManagedEntityId
            AND R.RelationshipTypeId = @RelationshipTypeId_Manages
            AND R.IsDeleted = 0
        LEFT JOIN BaseManagedEntity HBME WITH(NOLOCK)
            ON R.SourceEntityId = HBME.BaseManagedEntityId
    ) AS dt ON dt.HS_BMEID = bme.BaseManagedEntityId
    WHERE bme.FullName LIKE '%RMSNAME%'
    ORDER BY dt.TypedEntityName

     

    Change “RMSNAME” above to your RMS name.  Most of the results will be groups – but you might be surprised to see everything that is hosted by the RMS.

  • Simulate monitoring of network devices with Jalasoft

    Jalasoft recently updated their network device simulator, which is useful for testing/demo of OpsMgr network monitoring capabilities.

    You can download the simulator here:

    http://www.jalasoft.com/xian/snmpsimulator

     

    This article will walk through the setup, configuration, and initial monitoring.

    You will need a computer or VM (Windows 2003 or above, including Win7 or Win8 apparently).  Then, you will need to add multiple IP addresses, one IP address for each device you want to monitor:

    image

     

    In the example above – 10.10.10.20 is the primary IP for my server.  Network devices will be simulated on 10.10.10.21 through 10.10.10.25
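Adding the secondary addresses can be done in the adapter's TCP/IP advanced properties, or from an elevated command prompt.  A sketch, assuming the adapter is named "Local Area Connection" (adjust the name, addresses, and mask to your environment):

```
netsh interface ipv4 add address "Local Area Connection" 10.10.10.21 255.255.255.0
netsh interface ipv4 add address "Local Area Connection" 10.10.10.22 255.255.255.0
netsh interface ipv4 add address "Local Area Connection" 10.10.10.23 255.255.255.0
```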

     

     

    Run Setup.exe and install the defaults, the Agent Service and Simulator Console.

    Provide a service account in order to run the simulator as a service (a new and much needed feature!)

    Select the IP address that is the primary IP for the server.

    When install is complete – open the Device Simulator console.

    Connect to the agent on your primary IP.

    image

     

    Click the + to add a new device.

    image

     

    Let’s add a Cisco Router:

    image

     

    On the first secondary IP:

     

    image

     

    And leave defaults for SNMP  (V2 and “public”)

     

    image

     

    Now let’s add additional devices, such as switches, firewalls, etc…

    image

     

    When done – click the Green arrow to save the config.

     

    Next up – we need to give each device a DNS A record so that SCOM can discover it.  In AD DNS, create new A records with associated PTR records, and give each device a name:

    image

    image

     

    Once you have added the DNS records in AD – we are ready to discover the devices in SCOM:

     

    Administration > Network Management > Discovery Rules.  Run the discovery wizard and discover network devices.

    Give the discovery rule a name, choose a management server to run the discovery, and select a resource pool to monitor the network devices.

    (Hint – you should always create a dedicated resource pool for monitoring network devices, even if you only have a single management server.  This allows you to scale these out to dedicated servers in the future without making any other changes)

    image

     

    Choose Explicit discovery.

    Create a Run As account for the “public” SNMP community string.  Select it:

     

    image

     

    Add in each device and select the appropriate community string Run As account:

    image

     

    image

     

    Then choose to run the discovery manually:

    image

    And click “Create”, and leave the box checked to “Run the network discovery rule”

     

    image

     

    In the console – you can see the discovery rule and the status:

    image

    In the event log of the management server that runs the discovery – you will soon see network discovery events:

     

    image

     

    image

    image

    image

     

     

    Once this is complete – you should see the network devices in the console views:

     

    image

     

    You can run Health Explorer and view the out-of-the-box monitoring:

    image

     

    Or look at the network node and summary dashboards to view summary and historical data

     

    image

    image

  • A list of all possible security events in the Windows Security Event Log

    This may be old news, but it is a handy reference for OpsMgr admins, when asked to monitor for specific events from security event logs:

     

    Windows Server 2003:  http://technet.microsoft.com/en-us/library/cc163121.aspx

    Windows Server 2008:  http://www.microsoft.com/download/en/details.aspx?id=17871

    Windows Server 2008 R2:  http://www.microsoft.com/download/en/details.aspx?id=21561

  • Migrating DHCP services to 2012 R2 and configuring scope failover

    A time may come when you need to migrate your existing DHCP services to new servers/hardware.  Windows Server 2012 ships with PowerShell cmdlets that make this a simple transition.

    You can read about the process here:  http://blogs.technet.com/b/teamdhcp/archive/2012/09/11/migrating-existing-dhcp-server-deployment-to-windows-server-2012-dhcp-failover.aspx

    I have two DHCP servers (Windows Server 2012) in a failover configuration, leveraging the new capabilities in failover DHCP for Server 2012, which you can read about here:  http://technet.microsoft.com/en-us/library/jj200226.aspx.  I will be migrating these to Windows Server 2012 R2 DHCP servers.

    I start by installing the DHCP server role on my new 2012 R2 DHCP servers.  Then, a quick configure using the Post install wizard from server manager to authorize the DHCP servers in AD.

    Next up – I need to export the DHCP server configuration in PowerShell, from the old server.  In this case, I will be migrating from two DHCP servers (DC1, DC2) to the new ones (DC01, DC02). 

    Create a folder C:\export on the new DHCP primary server (DC01).  Open an administrator PowerShell session.  Run the following command to export remotely from the old DHCP primary server:

    Export-DhcpServer -ComputerName DC1.opsmgr.net -Leases -File C:\export\dhcpexp.xml -Verbose

    Next up, we need to create a backup path for the DHCP server database on the new DHCP server, DC01.  Create a folder C:\dhcp\backup.  Then, we can import the old DHCP server configuration using the following command:

    Import-DhcpServer -ComputerName DC01.opsmgr.net -Leases -File C:\export\dhcpexp.xml -BackupPath C:\dhcp\backup\ -Verbose

    The last import we need to run, is to import the server configuration ONLY to the secondary, or failover DHCP server.  First, on DC02 (the new failover DHCP server) create a backup folder at C:\dhcp\backup.  Then, go back to DC01 where you have the local export files, and run the following command to import server config to DC02:

    Import-DhcpServer -ComputerName DC02.opsmgr.net -File C:\export\dhcpexp.xml -ServerConfigOnly -BackupPath C:\dhcp\backup\ -Verbose

    At this point, we have imported the server configuration to BOTH new DHCP servers, and we have imported all the lease and scope data to the new primary DHCP server.  To complete the configuration, we need to set up failover on the new pair.  This is covered here:  http://technet.microsoft.com/en-us/library/hh831385.aspx#failover_1

    On DC01, open the DHCP control applet, and right click IPv4 (all scopes) or specific scopes, and click “Configure Failover”

    image

    Step through the wizard, and choose “Load balance" mode.

    image

    Provide a shared secret for the DHCP servers to authenticate with each other for replication.

    You will now see the failover configuration data for each scope:

    image

    Open the DHCP applet on the secondary failover DHCP server, and you should see the replicated scope and lease information:

    image

    You can de-activate the scopes on the old DHCP server.  After testing and functional approval, you can remove DHCP services from the legacy DHCP server computers.
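The de-activation can also be scripted.  A sketch using the DhcpServer PowerShell module, assuming DC1.opsmgr.net is the legacy server (test before running – this stops the old server from answering leases):

```
# De-activate every IPv4 scope on the legacy DHCP server
Get-DhcpServerv4Scope -ComputerName DC1.opsmgr.net |
    Set-DhcpServerv4Scope -ComputerName DC1.opsmgr.net -State InActive
```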

  • OpsMgr: MP Update: New Base OS MP 6.0.6972.0 Adds new cluster disks, changes free space monitoring, other fixes

    There is a new Base OS MP version 6.0.6972.0 available here:  http://www.microsoft.com/en-us/download/details.aspx?id=9296

     

    Be very careful updating to this new version – there are multiple changes and potential issues you should plan for and test with, that might impact your existing environments.  I will discuss them below.

     

    I previously wrote about the last MP update HERE and HERE.  Then I wrote about some issues in the MP’s with Logical Disk monitoring HERE.  Additionally, there were some problems with the network monitoring utilization scripts HERE.  All of these items have been addressed in this latest MP update. (somewhat)

     

    First – lets cover the list of updates from the guide:

    Changes in This Update

    •    Updated the Cluster shared volume disk monitors so that alert severity corresponds to the monitor state.
    •    Fixed an issue where the performance by utilization report would fail to deploy with the message “too many arguments specified”.
    •    Updated the knowledge for the available MB monitor to refer to the Available MB counter.
    •    Added discovery and monitoring of clustered disks for Windows Server 2008 and above clusters.
    •    Added views for clustered disks.
    •    Aligned disk monitoring so that all disks (Logical Disks, Cluster Shared Volumes, Clustered disks) now have the same basic set of monitors.
    •    There are now separate monitors that measure available MB and %Free disk space for any disk (Logical Disk, Cluster Shared Volume, or Clustered disk).

    Note :  These monitors are disabled by default for Logical Disks, so you will need to enable them if you want to use them in place of the default Logical Disk monitor for free space.

    •    Updated display names for all disks to be consistent, regardless of the disk type.
    •    The monitors generate alerts when they are in an error state.  A warning state does not create an alert.
    •    The monitors have a roll-up monitor that also reflects disk state. This monitor does not alert by default. If you want to alert on both warning and error states, you can have the unit monitors alert on warning state and the roll-up monitor alert on error state.
    •    Fixed an issue where network adapter monitoring caused high CPU utilization on servers with multiple NICs.
    •    Updated the Total CPU Utilization Percentage monitor to run every 5 minutes and alert if it is three consecutive samples above the threshold.
    •    Updated the properties of the Operating System instances so that the path includes the server name it applies to so that this name will show up in alerts.
    •    Disabled the network bandwidth utilization monitors for Windows Server 2003.
    •    Updated the Cluster Shared Volume monitoring scripts so they do not log informational events.
    •    Quorum disks are now discovered by default.
    •    Mount point discovery is now disabled by default.

    Notes:  This version of the Management Pack consolidates disk monitoring for all types of disks as mentioned above. However, for Logical Disks, the previous Logical Disk Free Space monitor, which uses a combination of Available MB and %Free space, is still enabled.  If you prefer to use the new monitors (Disk Free Space (MB) Low, Disk Free Space (%) Low), you must disable the Logical Disk Free Space monitor before enabling the new monitors.
    The default thresholds for the Available MB monitor are not changed, the warning threshold (which will not alert) is 500MB and the error threshold (which will alert) is 300MB. This will cause alerts to be generated for small disk volumes. Before enabling the new monitors, it is recommended to create a group of these small disks (using the disk size properties as criteria for the group), and overriding the threshold for available MB.

    Ok, sounds good.  But what does all that mean to me?

     

    I will summarize the fundamental changes below:

     

    1.  Disk discovery and monitoring has changed.  We now will UNDISCOVER any “Logical Disks” that are hosted by a Windows Server 2008 R2 cluster, and REDISCOVER those as a new entity, of the “Cluster Disk” class.  This discovery only pertains to Windows Server 2008 R2 and later, it does not affect Server 2008 and older clusters.

     

    There are now THREE types of disks we will discover and monitor:

    • Logical Disks
    • Cluster Disks
    • Cluster Shared Volumes

    Logical Disks include disks that are not part of (hosted by) a cluster: both disks with a drive letter, and disks without a drive letter (which are discovered as mount points).

    Cluster Disks include any disk that is hosted by a Microsoft Cluster as a shared resource, but not a specific Cluster Shared Volume.

    Cluster Shared Volumes are a specific type of cluster disks, that is leveraged by Hyper-V clusters for placement of virtual machines.

    For most customers, the impact will be if you have placed any instance or group specific overrides for your cluster disks, these will no longer apply, as these disks are going to be re-discovered as a new entity of a new class, “Cluster Disk”.  This new class will have entirely different monitoring targeting it, described below.

    However, this is a GOOD thing!  In the past, if you had a disk that was part of a cluster, it was undiscovered and rediscovered on each NODE when a failover occurred.  If you did overrides for the disk while it was on one node, your changes would no longer apply when it failed over to another node, because it was literally discovered as a different disk! (basemanagedentity)  This is now resolved – the disk will retain the same BaseManagedEntityId (its unique GUID under the covers in SCOM) as it moves from node to node.  It is also now “hosted” by the cluster, and not the Operating System class.

    I put together a state dashboard that demonstrates these different disk types:

     

    image

     

    There are also distinct views for these that ship inside the management pack:

    image

     

    Another point to make here – is that the Mount Point discovery, which has been enabled in all previous Base OS MP’s, is now DISABLED.  This means you will no longer discover mount points by default.  You can enable this via override if you want mount point discovery, or selectively enable it only for specific servers that you know host a mount point that you wish to monitor.

    Our mount point discovery is a bit misleading.  We don’t actually only discover mount points, we actually use the mount point discovery to discover ANY disk that does not have a drive letter assigned.  For instance, you may have noticed on your Server 2008 R2 machines, that you discovered a 100MB logical disk. 

     

    image

     

    These 100MB disks are System Reserved for Bitlocker use, to hold the boot loader.  Once you upgrade to the new MP version – new mounted disks (non-clustered disks with no drive letter) will no longer be discovered, as this discovery is disabled by default.  This will NOT remove the previously discovered disks, however.  Neither will running Remove-DisabledMonitoringObject.

    The reason that Remove-DisabledMonitoringObject does NOT remove these discovered disks, is because it will only remove objects if there is an explicit *override* for a discovery, disabling it.  If we change the default configuration of a discovery to disabled, the cmdlet has no impact.  So if you wanted to remove these from your management group, you simply need to add an explicit override disabling the mount point discovery, and THEN run the cmdlet.

    Keep in mind – doing this will undiscover ALL your mounted disks, possibly including real mount points if you have those.  As there is ZERO value in discovering and monitoring these 100MB disks, I’d recommend disabling the mounted disk discovery with an explicit override, then create instance specific or group specific overrides for your servers that DO host a mounted disk.
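As a sketch of that sequence: the override itself is created in the console (an explicit override disabling the mount point discovery, saved to your override MP), and then the cleanup runs from the Operations Manager Command Shell:

```
# Run only AFTER the explicit override disabling the mount point
# discovery exists - the cmdlet ignores discoveries that are merely
# disabled by default in the MP.
Remove-DisabledMonitoringObject
```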

     

     

    2.  Logical Disk free space monitoring, along with Cluster Disk and Cluster Shared Volume monitoring has changed.  Here are the details:

    The default configuration of the “Logical Disk Free Space” monitor is largely UNCHANGED from MP version 6.0.6958.0, which I wrote about HERE.  This was done to create the lowest possible impact on you, the admin, who is using this monitor, and likely already has many overrides and has implemented this alert into any ticketing systems.  There were many complaints that this monitor (once it was modified to allow for consecutive samples) no longer generated alerts that contained free space and MB free in the alert description.  This is still the case in this version – the monitor was not modified.

    This monitor will also generate alerts for warning state AND critical state, which is NOT a good thing.  When a single monitor generates alerts on both warning and critical state, a *new* alert is *not* generated when the monitor changes from warning to critical.  We simply modify the existing alert from warning to critical (if it exists in an open state).  This modification will NOT generate a new notification subscription, nor will it route the alert to a connector subscription set with a filter for “critical” severity alerts, because it has already been inspected and watermarked.  For this reason I never recommend using three state monitors and alerting on a warning and a critical state.

    However, another complaint we often got was that customers didn’t understand how this monitor worked, in that we inspect BOTH % free threshold AND MB free threshold, and BOTH conditions need to be met before we will change the state of the monitor and generate an alert.  This is a very good design, because it helps cut out the majority of noise and remains flexible for disks of different sizes.  That said, many customers would say “I just want a simple monitor to alert on % free ONLY, or MB free ONLY…” which was easier for them to understand.  Therefore, we have added THREE new monitors for disk space monitoring of logical disks.

    These new monitors are disabled by default, to allow customers to choose if they want to implement them.  What we have done is to create two new Unit monitors, one for % free and one for MB free.  Then place both of these under an aggregate rollup monitor.

     

    image

     

    If enabled, the customer can pick if they want only %, or only MB free, or both, via overrides.  These new Unit monitors also provide a richer alert description as seen below:

    The disk F: on computer computer1.domain.com is running out of disk space. The value that exceeded the threshold is 28 free Mbytes.

    The disk F: on computer computer1.domain.com is running out of disk space. The value that exceeded the threshold is 4% free space.

    Additionally, if the customer DOES want alerts on warning state for these monitors, they can enable this, and additionally enable alerting on the Aggregate rollup monitor above, to issue critical alerts only.  This way, you can have unique alerting for a warning state, but if any monitor is critical, we can roll up health and generate a NEW alert for critical state, which can be used to send a notification or send to a ticketing system.

    As you can see, a lot of thought went into this new design, trying to make the new format fit as many customer requested scenarios as possible.  You essentially have three options now:

     

    • Continue to use the existing Logical Disk Free space monitor that is provided and enabled in the management pack.
    • Enable and start using the newly designed Logical Disk free space monitors, based on your specific requirements.
    • Use my addendum MP which uses a single free space monitor that is similar to the old Base OS management packs, described and available HERE

     

    For Cluster Disks, and Cluster Shared Volume disks – both of those are using the new format for free disk space monitoring:

     

    image

    image

     

    Based on this, I’d recommend considering and testing a move of your logical disk free space monitoring over to the new style as well, to have a consistent experience.  I welcome your feedback on this point.

     

    ***Note – if you enable the new Logical Disk free space monitors, the MB Free monitor will go into a critical state for any Logical disk that is under 2GB (non-system) or 500MB (system).  This means if you have any tiny disks, such as the 100MB bitlocker disks, this monitor will alert on all of those disks, potentially creating a large number of alerts.  I’d recommend undiscovering those 100MB disks (see #1 above) or create a dynamic group of disks in your override MP, based on “size is less than a specific numerical size”, and use this group to disable free space monitoring.

     

    3.  The previous “Cluster Shared Volume” MP, which was “Microsoft.Windows.Server.ClusterSharedVolumeMonitoring.mp”, has a new display name of “Windows Server Cluster Disks Monitoring”.  The new classes for Cluster Disks mentioned above are included in this MP, so if you didn’t import it previously because you weren’t using Hyper-V Cluster Shared Volumes, you need this MP now to discover and monitor clustered disks.

     

    4.  We have disabled the Network Utilization scripts by default on Server 2003, and fixed them for Server 2008 to make them consume fewer resources.  I wrote about this previously HERE.  This should now be addressed, so if you previously disabled these, but want that counter for alerting or perf collection, you can consider enabling it.  It should REMAIN disabled for Windows 2003, as there is an issue with Netman.dll which causes services to crash.

     

    5.  The “Total CPU Utilization Percentage” monitor was changed.  In previous management packs, it would inspect the value every 2 minutes, and if the AVERAGE of 5 samples for “CPU Queue Length” AND “% Processor Time” were over their default thresholds, we would generate an alert.  Now, we inspect the value every 5 minutes, and if the AVERAGE of 3 samples for both counters is over the thresholds, an alert is generated.  I am told this change was made on customer request, I have to assume to spread the evaluation over a longer time span…. not really sure.  Seems fairly insignificant.

     

     

    Known Issues/Things to remember:

     

    1.  Which MP’s to import:  This MP update contains the following files:

    image

    Don’t import management packs that you don’t need or use. 

    Don’t import the BPA management pack if you don’t want to see alerts for this new feature.

    Don’t import the Microsoft.Windows.Server.Reports.mp if your back-end SQL is still running SQL 2005, this MP is supported on SQL 2008 and newer only.  It will cause your reporting to break if you import this MP and your management group leverages SQL 2005 on the back-end.

    DO import the Microsoft.Windows.Server.ClusterSharedVolume.mp because this contains the discovery and monitoring for Cluster Disks, not just Cluster Shared Volumes.  If you don’t import this your monitoring of clustered disks will disappear.

     

    2.  The knowledge for the Total CPU Utilization Percentage is incorrect – the monitor was updated to a default value of 3 samples but the knowledge still reflects 5 samples.

     

    3.  There are no free space perf collection rules for “Cluster Disks”.  We have multiple performance collection rules for Logical Disks, and for Cluster Shared Volumes, however there are none for the new Cluster Disks class.  If you want performance reports on free space, disk latency, idle time, etc., you will need to create these.

     

    4.  Perf collection and disk monitoring for cluster disks and CSV’s only works when the resource group hosting the disks is on the same node that is hosting the cluster name (quorum) resource.  If the disk’s resource group is running on a different node than the cluster name itself, perf collection and monitoring will cease.

  • OpsMgr 2012 – Grooming deep dive in the OperationsManager database

    Grooming of the OpsDB in OpsMgr 2012 is very similar to OpsMgr 2007.  Grooming is called once per day at 12:00am, by the rule “Partitioning and Grooming”.  You can search for this rule in the Authoring space of the console, under Rules.  It is targeted at the “All Management Servers Resource Pool” and is part of the System Center Internal Library.

    image

    It calls the “p_PartitioningAndGrooming” stored procedure.  This SP calls two other SP's:  p_Partitioning and then p_Grooming

    p_Partitioning inspects the table PartitionAndGroomingSettings, and then calls the SP p_PartitionObject for each object in the PartitionAndGroomingSettings table where "IsPartitioned = 1"   (note - we partition event and perf data into 61 daily tables - just like MOM 2005/SCOM 2007)

    The PartitionAndGroomingSettings table:

    image

    The p_PartitionObject SP first identifies the next partition in the sequence, truncates it to make sure it is empty, and then updates the PartitionTables table in the database, to update the IsCurrent field to the next numeric table for events and perf.  It also sets the current time as the partition end time in the previous “is current” row, and sets the current time in the partition start time of the new “is current” row.  Then it calls the p_PartitionAlterInsertView sproc, to make new data start writing to the “new” current event and perf table.

    To review which tables you are writing to - execute the following query:   select * from partitiontables where IsCurrent = '1'

    A select * from partitiontables will show you all 61 event and perf tables, and when they were used.  You should see a PartitionStartTime updated every day, around midnight (time is stored in UTC in the database).  If partitioning is failing to run, then we won't see this date changing every day.  
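The two checks above can be combined into one query.  A sketch against the OperationsManager database (column names as found in my database; verify against your version):

```
-- Which partitions are receiving inserts now, plus the full rotation history.
-- PartitionStartTime should roll forward daily, around midnight UTC.
SELECT PartitionTableName, IsCurrent, PartitionStartTime, PartitionEndTime
FROM PartitionTables WITH (NOLOCK)
ORDER BY PartitionStartTime DESC
```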

    Ok - that's the first step of the p_PartitioningAndGrooming sproc - Partitioning.  Now - if that is all successful, we will start grooming!

    The p_Grooming is called after partitioning is successful.  One of the first things it does - is to update the InternalJobHistory table.  In this table - we keep a record of all partitioning and grooming jobs.  It is a good spot check to see what's going on with grooming.  To have a peek at this table - execute a select * from InternalJobHistory order by InternalJobHistoryId DESC

    image

    The p_Grooming sproc then calls p_GroomPartitionedObjects 

    p_GroomPartitionedObjects  will first examine the PartitionAndGroomingSettings and compare the “days to keep” column value, against the current date, to figure out how many partitions to keep vs groom.  It will then inspect the partitions (tables) to ensure they have data, and then truncate the partition, by calling p_PartitionTruncate.  A truncate command is just a VERY fast and efficient way to delete all data from a table without issuing a highly transactional DELETE command.  The p_GroomPartitionedObjects sproc will then update the PartitionAndGroomingSettings table with the current time, under the GroomingRunTime column, to reflect when grooming last ran. 

    Next - the p_Grooming sproc continues, by calling p_GroomNonPartitionedObjects. 

    p_GroomNonPartitionedObjects is a short, but complex sproc - in that is calls all the individual sprocs listed in the PartitionAndGroomingSettings table where IsPartitioned = 0.  The following stored procedures are present in my database as non-partitioned data:

    • p_AlertGrooming
    • p_StateChangeEventGrooming
    • p_MaintenanceModeHistoryGrooming
    • p_AvailabilityHistoryGrooming
    • p_JobStatusGrooming
    • p_MonitoringJobStatusGrooming
    • p_PerformanceSignatureGrooming
    • p_PendingSdkDataSourceGrooming
    • p_InternalJobHistoryGrooming
    • p_EntityChangeLogGroom
    • p_UserSettingsStoreGrooming
    • p_TriggerEntityChangeLogStagedGrooming
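    To see which grooming sprocs will run in your own database, along with their retention settings, you can query the same table the sproc reads (this is just the query implied above):

    ```sql
    -- Non-partitioned datasets and their grooming settings,
    -- exactly as p_GroomNonPartitionedObjects will see them.
    SELECT *
    FROM PartitionAndGroomingSettings
    WHERE IsPartitioned = 0
    ```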

    Now, for the above sprocs, each one could potentially return a success or failure.  They will also likely call additional sprocs for specific tasks.  You can see, the rabbit hole is deep.  :-)  This is just an example of the complexity involved in self-maintenance and grooming.  If you are experiencing a grooming failure of any kind, and the error messages involve any of the above stored procedures when you execute p_PartitioningAndGrooming manually, you should open a support case with Microsoft for troubleshooting and resolution.

    The theory is that each of the above procedures grooms a specific non-partitioned dataset.  Under NORMAL circumstances, each should be able to complete in a reasonable time frame.  The challenge becomes evident when something goes wrong: alert storms, state change event storms from monitors flip-flopping, lots of performance signature data from using self-tuning threshold monitors, or huge amounts of pending SDK datasource data from large Exchange 2010 environments or other MP’s that might leverage this.  Grooming non-partitioned data is slow, highly resource intensive, and highly transactional.  These are specific DELETE statements against tables directly, often combined with creating temp tables in TempDB.  Having a good presized, high performance TempDB can help, as will ensuring you have plenty of transaction log space for the database, and having the disk subsystem offer as many IOPS as possible.  http://technet.microsoft.com/en-us/library/ms175527(v=SQL.105).aspx

    Next - the p_Grooming sproc continues, by updating the InternalJobHistory table to give it a status of success (StatusCode of 1 = success, 2 = failed, 0 appears to mean never completed).

    If you ever have a problem with grooming - or need to get your OpsDB database size under control - simply reduce the data retention days in the console, under Administration, Settings, Database Grooming.  To start with - I recommend setting all of these to just 2 days, from the default of 7.  This keeps your OpsDB under control until you have time to tune all the noise from the MP's you import.  So just reduce this number, then open up query analyzer and execute: EXEC p_PartitioningAndGrooming.  When it is done, check the job status by executing: select * from InternalJobHistory order by InternalJobHistoryId DESC.  The last groom job should be present, and successful.  The OpsDB size should be smaller, with more free space.  And to validate, you can always run my large table query, found at:   Useful Operations Manager 2007 SQL queries
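    Putting the commands above together, a manual grooming pass (run against the OperationsManager database) looks like this:

    ```sql
    -- Run partitioning and grooming on demand,
    -- after reducing the retention days in the console.
    EXEC p_PartitioningAndGrooming

    -- Verify the result - the most recent job should show
    -- as successful (StatusCode 1).
    SELECT *
    FROM InternalJobHistory
    ORDER BY InternalJobHistoryId DESC
    ```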

  • Using a Generic Text Log rule to monitor an ASCII text file – even when the file is a UNC path

    There are several examples in blogs on how to create a generic text log rule to monitor for a local text file (Unicode, ASCII, or UTF8).

    This will be a step-by-step example of doing the same, however, using this to monitor a log file on a remote UNC path instead of a local drive.  This is useful when we want to monitor a file/files on a NAS, or on a share that is hosted by a computer without an agent.

    This is a bit unique… instead of applying this rule to ALL systems that might have a specific logfile present in a specific directory – we are going to target this rule to only ONE agent.  This agent will monitor the remote fileshare, similar to the concept of a “Watcher Node” for a synthetic transaction.  Therefore, we will be creating this rule disabled, and enabling it only for our “Watcher”.

     

    In the Ops console – select the Authoring pane > Rules. 

    Right click Rules, and select Create a new rule.  We will choose the Generic Text Log for this example:

     

    image

     

    Choose the appropriate MP to save this new custom rule to, and click Next.

    For this rule name – I will be using “Company Name – Monitor remote logfile rule”

    Set the Rule Category to “Alert”

    For the target – I like to use “Windows Server Operating System” for generic rules and monitors.

    UNCHECK the box for “Rule is enabled”

     

    image

     

    Click Next.

     

    The directory will be the UNC path.  Mine is “\\VS2\Software\Temp”

    The pattern will be the logfile(s) you want to monitor.  We can use a specific file, such as “logfile.log” or a wildcard, such as “*.log”.

    You should not check the “UTF8” box unless you know the logfile to be UTF8 encoded.

     

    image

     

    Click Next.

    On the event expression, click Insert for a new line.  Essentially – log file monitoring looks at each new line in a logfile as one object to read, and this is represented by “Params/Param[1]”.  This “Parameter 1” is the entire line in the logfile, and is the only value that is valid for this type of data source – so just type/paste that in the box for Parameter Name.

    Since we want to search the logfile line for a specific word, the Operator will be “Contains”.

    For the value – this can be the word you are looking for in the line, that you want to alert on.  For my example, I will use the word “failed”.

     

    image

     

    Click Next.

     

    On the alert screen – we can customize the alert name if desired, set the severity and priority, and build a better Alert Description.  If you are using SP1 – the default alert description is blank.  If you are using R2 – the default alert description is “Event Description: $Data/EventDescription$”  HOWEVER – this is an invalid event variable for this type of event (logfile)…. so we need to change that right away.  I keep a list of common alert description strings HERE

    For this – I will recommend the following alert description.  Feel free to customize to make good sense out of your alert:

    Logfile Directory : $Data/EventData/DataItem/LogFileDirectory$
    Logfile name: $Data/EventData/DataItem/LogFileName$
    String:  $Data/EventData/DataItem/Params/Param[1]$

    Click “Create” to create the rule.

    Find the rule you just created in the console – right click it and choose “Properties”.  On the Configuration tab, under responses (to the right of “Alert”) click Edit.

    Click the “Alert Suppression” button.  You should consider adding in alert suppression on specific fields of an alert – in order to suppress a single alert for each match in the logfile.  If you don't – should the monitored logfile ever get flooded with lines containing “failed” from the application writing the log – SCOM will generate one alert for each line written to the log.  This has the potential to flood the SCOM database/Console with alerts.  By setting alert suppression here – we will create one alert, and increment the repeat count for each subsequent line/alert.  I am going to suppress on LoggingComputer and Parameter 1 for this example:

     

    image

     

    Click OK several times to accept and save these changes to the rule.

     

    Now – we created this rule as disabled – so we need to enable it via an override.  I will find the rule in the console – and override the rule “For a specific object of class:  Windows Server Operating System”

     

    image

     

    Now – pick one of these machines to be the “watcher” for the logfile in the remote share. 

    **Note – the default agent action account will make the connection to the share and read the file.  In my case – the default agent action account is “Local System” so this will be the domain computer account of the “Watcher” agent which connects to the remote share and reads the file.  This account will need access to the share, folder, and files monitored.  Keep that in mind.

    Set the override to “Enabled = True” and click OK.

     

    At this point, our Watcher machine will download the management pack again with the newly created override, and apply the new config.  Once that is complete – it will begin monitoring this file.  You can create a log file in the share path, and then write a new line with the word “failed” in it.  You need a carriage return after writing the line for SCOM to pick up on the change.

    You should see a new alert pop in the console, based on matching the criteria.  Subsequent log file matches will only increment the repeat count.  Customize the alert suppression as it makes sense for you.

    Then – create additional rules just like this – for different UNC paths.

     

    image

  • A new script error - SCOMpercentageCPUTimeCounter.vbs - Invalid class

    As you deploy the latest OpsMgr R2 core MP updates version 6.1.7599.0 which I blogged about HERE, you will probably notice a new script error popping up in your environment:

    Alert Description:

    The process started at 10:36:57 AM failed to create System.PropertyBagData. Errors found in output:

    C:\Program Files\System Center Operations Manager 2007\Health Service State\Monitoring Host Temporary Files 2\4885\SCOMpercentageCPUTimeCounter.vbs(125, 5) SWbemRefresher: Invalid class

    Command executed: "C:\Windows\system32\cscript.exe" /nologo "SCOMpercentageCPUTimeCounter.vbs" servername.domain.com false 3
    Working Directory: C:\Program Files\System Center Operations Manager 2007\Health Service State\Monitoring Host Temporary Files 2\4885\

    One or more workflows were affected by this.
    Workflow name: Microsoft.SystemCenter.HealthService.SCOMpercentageCPUTimeMonitor
    Instance name: servername.domain.com
    Instance ID: {50E57AC1-08CC-6E7E-149A-1E8690881BBD}
    Management group: MGNAME

    This is caused by a new script based monitor (Agent Processor Utilization) and collection rule (Collect agent processor utilization) that was added to this core MP… which measures the agent CPU impact, including the Healthservice, MonitoringHost, and all ancillary processes spawned by the SCOM agent.  This monitor and rule targets the Health Service class:

    image

    image

    Both the monitor and rule share the same script datasource – so make sure if you override ANYTHING on one, you make the SAME override on the other…. otherwise you will break cookdown for the datasource.

     

    The problem is – there is a fair percentage of Windows Servers that, for some reason, randomly have a problem with WMI health, or (more likely) do not have the WMI performance counters enabled for Perfproc.dll, which this script needs in order to measure the CPU.

    This is not a big deal… but until you fix this – the monitor won't work…. and will throw these script errors on a regular basis.

    The good thing is – this issue is fully documented in the guide that ships with these new MP’s.  You DID read the guide first – didn't you?  :-)  The guide lists many possible steps to go through in order… I will discuss the most common resolution I am seeing in the field, below.

     

    So – gather a list of all servers throwing this specific script error for “Invalid class” on “SCOMpercentageCPUTimeCounter.vbs”, then make plans to fix them.  This will likely require a reboot – so keep that in mind.  Another alternative is to fix them now, and then let them get rebooted on their next patching cycle if that works better for you.

     

    What is typically seen – is that there is a missing WMI class, due to the WMI perf counter being disabled.  Right now this appears pretty random…. I have three VM’s all built from the same media on the same day – and one out of those three had this issue.  I recently worked with a customer who had 4 machines out of 16 missing this perf counter.  Here is an example of the WMI class we are looking for:  (Run wbemtest, connect to root\cimv2, then click Enum Classes, select recursive, OK)

    clip_image002

     

     

    Luckily – the fix is VERY simple.  There is a tool you can download and install on your workstation – and remotely connect to each machine and fix them:

    Windows 2000 Resource Kit Tool : Extensible Performance Counter List (exctrlst.exe)

    http://www.microsoft.com/downloads/details.aspx?familyid=7ff99683-b7ec-4da6-92ab-793193604ba4&displaylang=en

     

    Using this tool – connect to \\servername and click refresh.  Scroll down in the list until you find “PerfProc     perfproc.dll”.  What you will likely find is that this class is disabled. 

    image

     

    Simply check the box to enable it…   and then reboot the machine at your convenience. 

    This will persist the class in WMI and the script error should go away.

     

     

    Other errors from this script:

    Now – if you are getting some OTHER error – not “Invalid Class”… this is likely an environmental problem with your server.  I would walk through all the steps called out in the guide for this issue.  If those don't work – then try some of these:

    • Connecting to WMI using WBEMTEST, root\cimv2 and seeing if there is a WMI issue.
    • Ensure WMI service is running.
    • Ensure OS is healthy and not experiencing severe memory pressure.
    • Reboot server
    • Run script manually:  copy the script from the \Program Files\System Center Operations Manager 2007\Health Service State\Monitoring Host Temporary Files\<numeric>\ location in the alert – to a local temp directory… then run it manually at a command prompt.  For instance – I search the “Monitoring Host Temporary Files” directory for SCOMpercentageCPUTimeCounter.vbs to find a copy of the script, copy it to C:\temp\, and then run:  cscript.exe /nologo "SCOMpercentageCPUTimeCounter.vbs" servername.domain.com false 3.  Change the FQDN in this example to the actual FQDN of the server you are testing on.  This should return an XML propertybag that looks like:

    image

     

    If you don't get an XML propertybag returned – then something is broken.  Look to see if an error is returned; if you get nothing at all, make sure you gave the script the correct expected parameters as above, then start debugging the script.

    Another example I am seeing is the following:

    The process started at 6:56:10 PM failed to create System.PropertyBagData. Errors found in output:

    C:\Program Files\System Center Operations Manager 2007\Health Service State\Monitoring Host Temporary Files 2\1220\SCOMpercentageCPUTimeCounter.vbs(124, 5) Microsoft VBScript runtime error: ActiveX component can't create object: 'WbemScripting.SWbemRefresher'

    This happens on Windows 2000 servers.  This note came from a reader comment:  It appears from http://msdn.microsoft.com/en-us/library/aa393838(VS.85).aspx that this Script API is only supported on Windows XP/Windows Server 2003 and up – so it is likely that this script will not work on Windows 2000 Servers.  If that is the case, you can disable this monitor AND rule for all Windows 2000 Computers, by overriding this monitor AND rule – for a group – and choosing the “Windows 2000 Server Computer Group”.  That should make these errors go away for old legacy systems you might still be monitoring.

  • Moving the Data Warehouse Database and Reporting server to new hardware–my experience

    The time has come to move my Warehouse Database and OpsMgr Reporting Server role to a new server in my lab.  Today – both roles are installed on a single server (named OMDW).  This server is running Windows Server 2008 SP2 x86, and SQL 2008 SP1 DB engine and SQL Reporting (32bit to match the OS).  This machine is OLD, and only has 2GB of memory, so it is time to move it to a 64bit capable machine with 8GB of RAM.  The old server was really limited by the available memory, even for testing in a small lab.  As I do a lot of demo’s in this lab – I need reports to be a bit snappier.

    The server it will be moving to is running Server 2008 R2 (64bit only) and SQL 2008 SP1 (x64).  Since Operations Manager 2007 R2 does not yet support SQL 2008R2 at the time of this writing – we will stick with the same SQL version.

     

    We will be using the OpsMgr doco – from the Administrators Guide:

    http://technet.microsoft.com/en-us/library/cc540402.aspx

     

    So – I map out my plan. 

    1. I will move the warehouse database.
    2. I will test everything to ensure it is functional and working as hoped.
    3. I will move the OpsMgr Reporting role.
    4. I will test everything to ensure it is functional and working as hoped.

     

    Move the Data Warehouse DB:

    Using the TechNet documentation, I look at the high level plan:

    1. Stop Microsoft System Center Operations Manager 2007 services to prevent updates to the OperationsManagerDW database during the move.
    2. Back up the OperationsManagerDW database to preserve the data that Operations Manager has already collected from the management group.
    3. Uninstall the current Data Warehouse component, and delete the OperationsManagerDW database.
    4. Install the Reporting Data Warehouse component on the new Data Warehouse server.
    5. Restore the original OperationsManagerDW database.
    6. Configure Operations Manager to use the OperationsManagerDW database on the new Data Warehouse server.
    7. Restart Operations Manager services.

     

    Sounds easy enough.  (gulp)

     

    • I start with step 1 – stopping all RMS and MS core services.
    • I then take a fresh backup of the DW DB and master.  This is probably one of the most painful steps – as on a large warehouse – this can be a LONG time to wait while my whole management group is down.
    • I then uninstall the DW component from the old server (OMDW) per the guide.
    • I then (gasp) delete the existing OperationsManagerDW database.
    • I install the DW component on the new server (SQLDW1).
    • I delete the newly created and empty OperationsManagerDW database from SQLDW1.
    • I then need to restore the backup I just recently took of the warehouse DB to my new server.  The guide doesn’t give any guidance on these procedures – this is a SQL operation, and you would use standard SQL backup/restore procedures here – nothing OpsMgr specific.  I am not a SQL guy – but I figure this out fairly easily.
    • Next up is step 8 in the online guide – “On the new Data Warehouse server, use SQL Management Studio to create a login for the System Center Data Access Service account, the Data Warehouse Action Account, and the Data Reader Account.”  Now – that’s a little bogus documentation.  The first one is simple enough – that is the “SDK” account that we used when we installed OpsMgr.  The second one though – that isn't a real account.  When we installed Reporting – we were asked for two accounts – the “reader” and “write” accounts.  The above referenced Data Warehouse Action Account is really your “write” account.  If you aren't sure – there is a Run As profile for this, where you can see which credentials you used.
    • I then map my logins I created to the appropriate rights they should have per the guide.  Actually – since I created the logins with the same names – mine were already mapped!
    • I start the Data Access (SDK) service ONLY on the RMS
    • I modify the reporting server data warehouse main datasource in reporting.
    • I edit the registry on the current Reporting server (OMDW) and have to create a new registry value for DWDBInstance per the guide – since it did not exist on my server yet.  I fill it in with “SQLDW1\I01” since that is my servername\instancename
    • I edit my table in the OpsDB to point to the new Warehouse DB servername\instance
    • I edit my table in the DWDB to point to the new Warehouse DB servername\instance
    • I start up all my services.
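    As an example of the restore step above (standard SQL, nothing OpsMgr specific) – a sketch only, with a hypothetical backup path and target file locations:

    ```sql
    -- Restore the warehouse backup on the new SQL server.
    -- The disk path and logical file names below are examples only -
    -- run RESTORE FILELISTONLY first to confirm the logical names
    -- in your own backup.
    RESTORE DATABASE OperationsManagerDW
    FROM DISK = N'D:\Backup\OperationsManagerDW.bak'
    WITH MOVE N'MOM_DATA' TO N'E:\SQLData\OperationsManagerDW.mdf',
         MOVE N'MOM_LOG'  TO N'F:\SQLLogs\OperationsManagerDW.ldf',
         RECOVERY
    ```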

    Now – I follow the guidance in the guide to check to make sure the move is a success.  Lots of issues can break this – missing a step, misconfiguring SQL rights, firewalls, etc.  When I checked mine – it was actually failing.  Reports would run – but lots of failed events on the RMS and management servers.  Turns out I accidentally missed a step – editing the DW DB table for the new name.  Once I put that in and bounced all the services again – all was well and working fine.

     

    Now – on to moving the OpsMgr Reporting role!

     

    Using the TechNet documentation, I look at the high level plan:

    1. Back up the OperationsManagerDW database.
    2. Note the accounts that are being used for the Data Warehouse Action Account and for the Data Warehouse Report Deployment Account. You will need to use the same accounts later, when you reinstall the Operations Manager reporting server.
    3. Uninstall the current Operations Manager reporting server component.
    4. Restore the original OperationsManagerDW database.
    5. If you are reinstalling the Operations Manager reporting server component on the original server, run the ResetSRS.exe tool to clean up and prepare the reporting server for the reinstallation.
    6. Reinstall the Operations Manager reporting server component.

     

    Hey – even fewer steps than moving the database! 

    ***A special note – if you have authored/uploaded CUSTOM REPORTS that are not deployed/included within a management pack – these will be LOST when you follow these steps.  Make sure you export any custom reports to RDL file format FIRST, so you can bring those back into your new reporting server.

     

    • I back up my DataWarehouse database.  This step isn't just precautionary – it is REQUIRED.  When we uninstall the reporting server from the old server – it modifies the Warehouse DB in such a way that we cannot use it.  We must return the database to its original state – in preparation for the new installation of OpsMgr Reporting on the new server.
    • Once I confirm a successful backup, I uninstall OpsMgr R2 Reporting from my old reporting server.
    • Now I restore my backup of the OperationsManagerDW database I just took prior to the uninstall of OpsMgr reporting.  My initial attempts at a restore failed – because the database was in use.  I needed to kill the connections to this database which were stuck from the RMS and MS servers.
    • I am installing OpsMgr reporting on a new server, so I can skip step 4.
    • In steps 5-10, I confirm that my SQL reporting server is configured and ready to roll.  Ideally – this should have already been done BEFORE we took down reporting in the environment.  This really is a bug in the guide – you should do this FIRST – BEFORE even starting down this road.  If something was broken, we don’t want to be fixing it while reporting is down for all our users.
    • In step 11, I kick off the Reporting server role install.  Another bug in the guide found:  they tell us to configure the DataWarehouse component to “this component will not be available”.  That is incorrect.  That would ONLY be the case if we were moving the OpsMgr reporting server to a stand-alone SRS/Reporting-only server.  In my case – I am moving reporting to a server that contains the DataWarehouse component – so this should be left alone.  I then chose my SQL server name\instance, and typed in the DataWarehouse write and reader accounts.  SUCCESS!!!!
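    For the “database in use” restore failure mentioned above, one standard SQL technique for clearing stuck connections (my own addition – not from the OpsMgr guide) is:

    ```sql
    -- Kick all connections off the warehouse DB so the restore can proceed.
    ALTER DATABASE OperationsManagerDW SET SINGLE_USER WITH ROLLBACK IMMEDIATE

    -- ...run the RESTORE DATABASE here...

    -- Return the database to normal multi-user access afterward.
    ALTER DATABASE OperationsManagerDW SET MULTI_USER
    ```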

    Now – I follow the guide and verify that reporting is working as designed.

    Mine (of course) was failing – I got the following error when trying to run a report:

     

    Date: 8/24/2010 5:49:27 PM
    Application: System Center Operations Manager 2007 R2
    Application Version: 6.1.7221.0
    Severity: Error
    Message: Loading reporting hierarchy failed.

    System.Net.WebException: Unable to connect to the remote server ---> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.10.10.12:80
       at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
       at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Int32 timeout, Exception& exception)
       --- End of inner exception stack trace ---
       at System.Net.HttpWebRequest.GetRequestStream(TransportContext& context)
       at System.Net.HttpWebRequest.GetRequestStream()
       at System.Web.Services.Protocols.SoapHttpClientProtocol.Invoke(String methodName, Object[] parameters)
       at Microsoft.EnterpriseManagement.Mom.Internal.UI.Reporting.ReportingService.ReportingService2005.ListChildren(String Item, Boolean Recursive)
       at Microsoft.EnterpriseManagement.Mom.Internal.UI.Reporting.ManagementGroupReportFolder.GetSubfolders(Boolean includeHidden)
       at Microsoft.EnterpriseManagement.Mom.Internal.UI.Reporting.WunderBar.ReportingPage.LoadReportingSubtree(TreeNode node, ManagementGroupReportFolder folder)
       at Microsoft.EnterpriseManagement.Mom.Internal.UI.Reporting.WunderBar.ReportingPage.LoadReportingTree(ManagementGroupReportFolder folder)
       at Microsoft.EnterpriseManagement.Mom.Internal.UI.Reporting.WunderBar.ReportingPage.LoadReportingTreeJob(Object sender, ConsoleJobEventArgs args)
    System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.10.10.12:80
       at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
       at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Int32 timeout, Exception& exception)

     

    The key part of this error is the failed connection to port 80, shown in the stack trace above.  I forgot to open a rule in my Windows Firewall on the reporting server to allow access to port 80 for web reporting.  DOH!

    Now – over the next hour – I should see all my reports from all my MP’s trickle back into the reporting server and console.

     

    Relatively pain free.  :-)

  • SCVMM 2012: Quickstart deployment guide

    The following document will cover a basic install of System Center Virtual Machine Manager 2012 at a generic customer.  This is to be used as a template only, for a customer to implement as their own pilot or POC deployment guide.  It is intended to be general in nature and will require the customer to modify it to suit their specific data and processes.

    SCVMM can be scaled to match the customer requirements. This document will cover a single server model, where all server roles are installed on a single VM/server.

    This is not an architecture guide or intended to be a design guide in any way.

    • Windows Server 2008 R2 SP1 Enterprise edition will be installed as the base OS for all platforms. All servers will be a member of the AD domain.
    • SQL 2008 R2 ENT edition with CU6 will be the base standard for all database and SQL reporting services.


    High Level Deployment Process:

     

    1.  In AD, create the following accounts and groups, according to your naming convention:

    • DOMAIN\scvmmsvc – SCVMM service account
    • DOMAIN\scvmmadmin – SCVMM Run As account for managing hosts
    • DOMAIN\sqlsvc – SQL 2008 service account
    • DOMAIN\SCVMMAdmins – SCVMM Administrators security group

    2.  Add the “scvmmsvc” and “scvmmadmin” account to the “SCVMMAdmins” global group.

    3.  Add the domain user accounts for yourself and your team to the SCVMMAdmins group.

    4.  Install Windows Server 2008 R2 SP1 to all server role servers.

    5.  Install Prerequisites and SQL 2008 R2.

    6.  Install the SCVMM Server, Console, and Self Service Portal.

    7.  Deploy SCVMM Agent to Hyper-V hosts.

     


    Prerequisites:

    1.  Install Windows Server 2008R2 SP1 to the SCVMM server.

    2.  Ensure server has a minimum of 2GB of RAM.

    3.  Add .Net 3.5.1 and IIS role. IIS is being added to support the self service portal.

    From http://technet.microsoft.com/en-us/library/bb691354.aspx open powershell (as an administrator) and run the following:

    Import-Module ServerManager

    Add-WindowsFeature NET-Framework-Core,Web-Static-Content,Web-Default-Doc,Web-Dir-Browsing,Web-Http-Errors,Web-Http-Logging,Web-Request-Monitor,Web-Filtering,Web-Stat-Compression,Web-Mgmt-Console,Web-Metabase,Web-Asp-Net,Web-Windows-Auth -Restart

    4.  Install .NET 4.0 to all servers

    5.  Install all available Windows Updates.

    6.  Join all servers to domain.

    7.  Add the “DOMAIN\SCVMMAdmins” domain global group and the “DOMAIN\scvmmsvc” domain account explicitly to the Local Administrators group on each SCVMM server.

    8.  Install the Windows Automated Installation Kit (AIK) 2.0. http://go.microsoft.com/fwlink/?LinkID=194654

    9.  Install SQL 2008 R2 DB engine.

    • Setup is fairly straightforward. This document will not go into details and best practices for SQL configuration. Consult your DBA team to ensure your SQL deployment is configured for best practices according to your corporate standards.
    • Run setup, choose Installation > New Installation…
    • When prompted for feature selection, install ALL of the following:
      • Database Engine Services
    • Optionally – consider adding the following to ease administration:
      • Management Tools – Basic and Complete (for running queries and configuring SQL services)
    • On the Instance configuration, choose a default instance, or a named instance. Default instances are fine for testing and labs. Production clustered instances of SQL will generally be a named instance. For the purposes of the POC, choose default instance to keep things simple.
    • On the Server configuration screen, set SQL Server Agent to Automatic. Click “Use the same account for all SQL Server Services”, and input the SQL service account and password we created earlier.
    • On the Collation Tab – make sure SQL_Latin1_General_CP1_CI_AS is selected, as that is the ONLY collation supported.
    • On the Account provisioning tab – add your personal domain user account or a group you already have set up for SQL admins. Alternatively, you can use the SCVMMAdmins global group here. This will grant more rights than is required to all SCVMM Admin accounts, but is fine for testing purposes of the POC.
    • On the Data Directories tab – set your drive letters correctly for your SQL databases, logs, TempDB, and backup.
    • Setup will complete.
    • Apply SQL 2008 R2 CU6
    • The update is very straightforward. Accept the defaults and update all features. When complete, reboot the SQL server.
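    After setup and the CU, you can verify the collation requirement and patch level from the steps above with a quick query (SERVERPROPERTY is standard SQL Server):

    ```sql
    -- Confirm the instance collation (must be SQL_Latin1_General_CP1_CI_AS)
    -- and the build number, which should reflect CU6 being applied.
    SELECT SERVERPROPERTY('Collation')      AS InstanceCollation,
           SERVERPROPERTY('ProductVersion') AS BuildVersion
    ```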


    Step by step deployment guide:

    1. Install SCVMM 2012:

    • Log on using your domain user account that is a member of the SCVMMAdmins group.
    • Run Setup.exe
    • Click Install
    • Accept the license agreement and click Next.
    • Select:
      • VMM Server
      • VMM Administrator Console
      • VMM Self Service Portal
    • On the Product Registration – click Next.
    • Accept or change the default install path and click Next.
    • If you get any Prerequisite errors – resolve them. If you get any warnings, understand them and click Next to proceed.
    • On the Database Configuration screen, enter in the name of your SQL database server (the SCVMM server name if using the locally installed SQL instance) and leave port blank. You can leave “use the following credentials” blank if you are installing to the local SQL server. You can enter credentials here to connect to a remote SQL server if the user account you are running setup as does not have enough rights over the instance to create a database. Ensure “New Database” is checked and use the default name, or change it to suit your naming standards. Click Next when this screen is complete.
    • On the Account Configuration screen, enter the domain account for the SCVMM service account that we created earlier (DOMAIN\scvmmsvc). Leave the default to store encryption keys locally for this deployment. Click Next.
    • On the Port configuration screen, accept defaults and click Next.
    • On the Self-Service Portal configuration screen – type in the name of the local SCVMM server we are installing to, accept all other defaults, and click Next.
    • On the Library configuration screen, change the library path or accept the default location, and click Next.
    • Click Install.
    • Setup will install all three roles and complete.

    2. Deploy an agent to an existing Hyper-V Host.

    • Open the System Center Virtual Machine Manager 2012 console.
    • Connect to the SCVMM server.
    • In the lower left hand pane of the console – select “Fabric”.
    • In the folder list – Right click “All Hosts” and choose “Create Host Group”.
    • Name your custom host group something like “Demo”
    • Right click the Demo host group and choose “Add Hyper-V hosts and Clusters”
    • On the Resource Location screen – choose the first bullet for a trusted AD domain computer.
    • On the Credentials screen, click Browse.
    • Select “Create Run As Account”
    • On the General screen, enter a Name of “Hyper-V Host Administration Account”
    • Input a DOMAIN\username of an AD account that has admin access to your Hyper-V servers. This account will be used to administer the Host and VM guests. For the purposes of the POC, we will use the DOMAIN\scvmmadmin account.
    • After inputting the password, and accepting the new account, we will return to the Credentials screen with our existing RunAs account shown. Click Next.
    • Type in the computer names of your Hyper-V servers that you wish to Manage. Ensure that the DOMAIN\SCVMMAdmins global group is a member of the local admins group on all Hyper-V servers so that we can manage them. Click Next.
    • Select all the discovered Hyper-V servers, and click Next.
    • Assign the discovered hosts to the “Demo” host group.
    • Click Next, Finish.
    • A job will be created to deploy the SCVMM agent to the Hyper-V hosts.
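
    Ensuring the SCVMMAdmins group is in the local Administrators group on each Hyper-V host (as noted above) can be done from an elevated prompt on each host. A quick sketch, using the group name from this guide:

    ```powershell
    # Run elevated on each Hyper-V host.
    # Adds the SCVMM admins global group to the local Administrators group,
    # which is required for agent deployment and host management.
    net localgroup Administrators "DOMAIN\SCVMMAdmins" /add

    # Verify the membership took effect:
    net localgroup Administrators
    ```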
  • Do I need a specific Cumulative Update Release (UR) in order to upgrade to SCOM 2012 or 2012 SP1 or 2012 R2?

    There is sometimes confusion over upgrades – whether you can just upgrade to the next version, or if there is a CU (Cumulative Update) or UR (Update Rollup) minimum level that is required BEFORE a major version upgrade.

    The MAJOR VERSION upgrade path is below, which Marnix talked about HERE

    This article covers IN PLACE upgrades only, not side by side migrations.

    SCOM 2007R2 > SCOM 2012 RTM > SCOM 2012 SP1 > SCOM 2012 R2

    No shortcuts… you have to upgrade to each major version before continuing to the next.  So if you are way behind, considering a side-by-side migration can be MUCH less work.  Depends on where you are at.

    Ok, back to the Update Rollup requirements.  Technically speaking, the upgrade from SCOM 2007 R2 > SCOM 2012 RTM has a documented CU level.  Per http://technet.microsoft.com/en-us/library/hh476934.aspx which states the following:

    Upgrading to System Center 2012 – Operations Manager is supported from Operations Manager 2007 R2 CU4, or from the latest available CU.

    Which basically means if you are at CU4 or later on SCOM 2007R2, you can upgrade to SCOM 2012 RTM.  There are no other statements in TechNet which state that any of the SCOM 2012 SP1 or R2 upgrades have a minimum UR requirement. 

    Next, the upgrade from 2012 RTM to 2012 SP1 has this page:  http://technet.microsoft.com/en-us/library/jj628203.aspx  This “recommends” updating to SCOM 2012 RTM UR2 (or the latest UR available.)  I cannot say this is a requirement; it is likely just what was tested at the time.

    Same for SCOM 2012 SP1 to SCOM 2012 R2.  http://technet.microsoft.com/en-us/library/dn521010.aspx  This page states “We recommend that you update all of the System Center 2012 SP1 components with the most current update rollups.”  The most current UR at the time of that publishing was SCOM 2012 SP1 UR4.

    Therefore, it appears the “required” upgrade path looks like:

    SCOM 2007R2 CU4+ > SCOM 2012 RTM > SCOM 2012 SP1 > SCOM 2012 R2

    Our “Recommended” rolling upgrade path looks like the following:

    SCOM 2007R2 CU4+ > SCOM 2012 RTM UR2+ > SCOM 2012 SP1 UR4+ > SCOM 2012 R2

    If I were performing a rolling upgrade, this is most likely how I’d recommend doing it.  If you are planning a VERY SLOW migration from one version to another, due to lots of additional work that is necessary, such as upgrading OS’s or SQL versions along the way, then you might consider going ahead and applying whatever the most recent Update Rollup available for SCOM 2012 is.  These are documented here:

    http://support.microsoft.com/kb/2906925

    One word of caution.  The latest word from the product group just came out, on supported interop scenarios:

    http://blogs.technet.com/b/momteam/archive/2014/01/17/system-center-2012-operations-manager-supported-configurations-interop-etc.aspx

    They specifically called out one little point:

    *Latest CU or UR applies in all cases

    This most likely means that this is what they tested, at the time of that posting.  I can’t say this is a “requirement”.  If you are doing a rolling upgrade, applying the latest UR to SCOM 2012 RTM before upgrading to SCOM 2012 SP1, then applying the latest UR to SCOM 2012 SP1 before upgrading to SCOM 2012 R2, would be a lot of extra effort, and technically is not a requirement to make the upgrade.   The best decision would be to figure out how long you plan to stay at each stage, how large your management group is, and how much effort it would take to deploy the latest UR in each case.  Also, test this in your environment before rolling out the upgrade to production.

    The upgrade from 2007R2 CU4+ to SCOM 2012 RTM is the biggest jump, because you must update ALL your agents to SCOM 2012 before finalizing the upgrade.  After that, you can technically leave your agents alone, since a SCOM 2012 RTM agent can report to both SCOM 2012 SP1 and R2 management servers.  Then just update your agents at the end, to SCOM 2012 R2 (plus whatever the latest UR is at that time).

    Resource links:

    SCOM 2007R2 > SCOM 2012 RTM Upgrade Guide TechNet

    SCOM 2012 RTM > SCOM 2012 SP1 Upgrade Guide TechNet

    SCOM 2012 SP1 > SCOM 2012 R2 Upgrade Guide TechNet

    There are a TON of really good blogs out there with upgrade experiences, tips and tricks, so I can’t possibly list them all.  I did want to point out a really cool link from a colleague, Wei H Lim, who wrote an updated version of the “Upgrade Helper” MP, but for moving from SCOM 2012 SP1 > SCOM 2012 R2.  He does some amazing work, so if you don’t follow his blog, definitely add him to your lists.

    OpsMgr- Sample Upgrade Helper MP for 2012 SP1 to 2012 R2 (SUHMP2012R2)

  • Orchestrator 2012 SP1 - QuickStart deployment guide

     

    System Center Orchestrator 2012 SP1 is extremely easy to set up and deploy.  There are only a handful of prerequisites, and most can be handled by the setup installer routine.

    The TechNet documentation does an excellent job of detailing the system requirements and deployment process:

    http://technet.microsoft.com/en-us/library/hh420337.aspx

    The following document will cover a basic install of System Center Orchestrator 2012 at a generic customer.  This is to be used as a template only, for a customer to implement as their own pilot or POC deployment guide.  It is intended to be general in nature and will require the customer to modify it to suit their specific data and processes.

    SCORCH can be scaled to match the customer requirements. This document will cover a typical two server model, where all server roles are installed on a single VM, and utilize a remote database server or SQL cluster.

    This is not an architecture guide or intended to be a design guide in any way.

    Definitions:

    SCORCH          System Center Orchestrator

    Server Names\Roles:

    SCORCH          Orchestrator 2012 role server

    • Management Server
    • Runbook Server
    • Orchestrator Web Service Server
    • Runbook Designer client application

    DB1                  SQL 2012 Database Engine Server

     

    Windows Server 2012 will be installed as the base OS for all platforms.  All servers will be a member of the AD domain.

    SQL 2012 will be the base standard for all database services. SCORCH only requires a SQL DB engine (locally or remote) in order to host SCORCH databases.

    High Level Deployment Process:

    1.  In AD, create the following accounts and groups, according to your naming convention:

    a.  DOMAIN\scorchsvc                       SCORCH Mgmt, Runbook, and Monitor Account

    b.  DOMAIN\ScorchUsers                   SCORCH users security global group

    c.  DOMAIN\sqlsvc                              SQL Service Account

    2.  Add the domain user accounts for yourself and your team to the ScorchUsers group.
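
    If you prefer to script the account and group creation, the following PowerShell sketch does the same thing, assuming the ActiveDirectory module (RSAT) is available on your admin workstation and using the names from step 1. Adjust OU paths and account names to your own convention; “youraccount” is a placeholder:

    ```powershell
    # Requires the ActiveDirectory module (RSAT-AD-PowerShell feature).
    Import-Module ActiveDirectory

    # SCORCH management/runbook/monitor service account:
    New-ADUser -Name "scorchsvc" -SamAccountName "scorchsvc" `
        -AccountPassword (Read-Host -AsSecureString "scorchsvc password") `
        -PasswordNeverExpires $true -Enabled $true

    # SQL service account:
    New-ADUser -Name "sqlsvc" -SamAccountName "sqlsvc" `
        -AccountPassword (Read-Host -AsSecureString "sqlsvc password") `
        -PasswordNeverExpires $true -Enabled $true

    # SCORCH users security global group:
    New-ADGroup -Name "ScorchUsers" -GroupScope Global -GroupCategory Security

    # Add yourself and your team to the group ("youraccount" is a placeholder):
    Add-ADGroupMember -Identity "ScorchUsers" -Members "youraccount"
    ```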

    3.  Install Windows Server 2012 to all server role members.

    4.  Install Prerequisites.

    5.  Install the SCORCH Server.

    Prerequisites:

    1.  Install Windows Server 2012 on all servers.

    2.  Join all servers to domain.

    3.  Ensure SCORCH server has a minimum of 1GB of RAM.

    4.  On the SCORCH server, .NET 3.5 SP1 is required. Setup will not be able to add this feature on Windows Server 2012.  Open an elevated PowerShell session (run as Administrator) and execute the following:

    Add-WindowsFeature NET-Framework-Core

    5.  On the SCORCH server, IIS (IIS Role) is required. Setup will add this role if not installed. 

    6.  On the SCORCH server, .NET 4.0 is required. This is included in the WS2012 OS (.NET 4.5)

    7.  Install all available Windows Updates as a best practice.

    8.  Add the “DOMAIN\scorchsvc” domain account explicitly to the Local Administrators group on the SCORCH server.

    9.  Add the “DOMAIN\ScorchUsers” global group explicitly to the Local Administrators group on the SCORCH server.

    10.  On the SQL database server, install SQL 2012.

    • Setup is fairly straightforward. This document will not go into details and best practices for SQL configuration. Consult your DBA team to ensure your SQL deployment is configured for best practices according to your corporate standards.
    • Run setup, choose Installation > New SQL server stand-alone installation…..
    • When prompted for feature selection, install ALL of the following:
      • Database Engine Services
    • Additionally, the product documentation for SCORCH states to install the management tools – complete:
      • Management Tools – Basic and Complete (for running queries and configuring SQL services)
    • On the Instance configuration, choose a default instance, or a named instance. Default instances are fine for testing and labs. Production clustered instances of SQL will generally be a named instance. For the purposes of the POC, choose default instance to keep things simple.
    • On the Server configuration screen, set SQL Server Agent to Automatic.  I prefer to use a service account for SQL, so I will set the Agent and DB Engine to run under my DOMAIN\sqlsvc account and provide the password.  This is optional.
    • On the Collation Tab – you can use the default of SQL_Latin1_General_CP1_CI_AS or choose another supported collation.
    • On the Account provisioning tab – add your personal domain user account or a group you already have set up for SQL admins. Alternatively, you can use the ScorchUsers global group here. This will grant more rights than is required to all ScorchUser Admin accounts, but is fine for testing purposes of the POC.
    • On the Data Directories tab – set your drive letters correctly for your SQL databases, logs, TempDB, and backup.
    • Setup will complete.

    Step by step deployment guide:

    1.  Install SCORCH 2012:

    • Log on using your domain user account that is a member of the ScorchUsers group.
    • Run Setuporchestrator.exe
    • Click Install
    • Supply a name, org, and license key (if you have one) and click Next.  If you don’t input a license key, it will install the evaluation version.
    • Accept the license agreement and click Next.
    • Check all boxes on the getting started screen, for:
      • Management Server
      • Runbook Server
      • Orchestration Console and Web Service
      • Runbook Designer
    • On the Prerequisites screen, check the boxes to remediate any necessary prerequisites, and click Next when all prerequisites are installed.
    • Input the service account “scorchsvc” and input the password, domain, and click Test. Ensure this is a success and click Next.
    • Configure the database server. Type in the local computer name if you installed SQL on this SCORCH Server, or provide a remote SQL server (and instance if using a named instance) to which you have the “System Administrator” (SA) rights to in order to create the SCORCH database and assign permissions to it. Test the database connection and click Next.
    • Specify a new database, Orchestrator. Click Next.
    • Browse AD and select your domain global group for ScorchUsers. Click Next.
    • Accept defaults for the SCORCH Web service ports of 81 and 82, Click Next.
    • Accept default location for install and Click Next.
    • Select the appropriate options for Microsoft Update, Customer Experience and Error reporting. Click Next.
    • Click Install.
    • Setup will install all roles, create the Orchestrator database, and complete very quickly.

    2. Open the consoles.

    • Open the Deployment Manager, Orchestration Console, and Runbook designer. Ensure all consoles open successfully.

    Post install procedures:

    1.  Let’s register and then deploy the Integration Packs that enable Orchestrator to connect to outside systems.

    Download the toolkit, add-ons, and IP’s for SCORCH 2012 SP1.

    • Make a directory on the local SCORCH server such as “C:\IntegrationPacks”
    • Copy to this directory, the downloaded IP’s, such as the following:
      • SC2012SP1_Integration_Pack_for_Configuration_Manager.oip
      • SC2012SP1_Integration_Pack_for_Data_Protection_Manager.oip
      • SC2012SP1_Integration_Pack_for_Operations_Manager.oip
      • SC2012SP1_Integration_Pack_for_Service_Manager.oip
      • SC2012SP1_Integration_Pack_for_Virtual_Machine_Manager.oip
    • Open the Deployment Manager console
    • Expand “Orchestrator Management Server”
    • Right click “Integration Packs” and choose “Register IP with the Orchestrator Management Server”
    • Click Next, then “Add”.  Browse to “C:\IntegrationPacks” and select all of the OIP files you copied there.  You have to select one at a time and go back and click “Add” again to get them all.
    • Click Next, then Finish.  You have to accept the License Agreement for each IP. 
    • Now when you select “Integration Packs” you can see these IP’s in the list.
    • Right Click “Integration Packs” again, this time choose “Deploy IP to Runbook server or Runbook Designer”.
    • Click Next, select all the available IP’s and click Next.
    • Type in your Runbook server name, and click Add.
    • On the scheduling screen – accept the default (which will deploy immediately) and click Next.
    • Click Finish.  Note the logging of each step in the Log entries section of the console.
    • Verify deployment by expanding “Runbook Servers” in the console.  Verify that each IP was deployed.
    • Open the Runbook Designer console.
    • Note that you now have these new IP’s available in the designer for your workflows.

    Additionally – you can download more IP’s at:

    http://technet.microsoft.com/en-us/library/hh295851.aspx

    Such as the VMware VSphere IP, or the IBM Netcool IP.

    Additionally – check out Charles Joy’s blog on popular codeplex IP’s which have been updated for Orchestrator:

    http://blogs.technet.com/b/charlesjoy/

  • System Center Operations Manager SDK service failed to register an SPN


    Have you seen this event in your RMS OpsMgr event logs?

     

    Event Type:      Warning

    Event Source:   OpsMgr SDK Service

    Event Category:            None

    Event ID:          26371

    Date:                12/13/2007

    Time:                2:58:24 PM

    User:                N/A

    Computer:         RMSCOMPUTER

    Description:

    The System Center Operations Manager SDK service failed to register an SPN. A domain admin needs to add MSOMSdkSvc/rmscomputer and MSOMSdkSvc/rmscomputer.domain.com to the servicePrincipalName of DOMAIN\sdkaccount

     

    This seems to appear in the RC1-SP1 build of OpsMgr.

     

    Every time the SDK service starts, it tries to update the SPN’s on the AD account that the SDK service runs under.  It fails, because by default, a user cannot update its own SPNs.  Therefore we see this error logged.

     

    If the SDK account is a domain admin – it does not fail – because a domain admin would have the necessary rights.  Obviously – we don’t want the SDK account being a domain admin…. That isn’t required nor is it a best practice.

     

    Therefore – to resolve this error, we need to grant the SDK service account rights to update the SPN.  The easiest way is to go to the user account object for the SDK account in AD – and grant SELF full control.

     

    A better, more granular way – is to grant SELF only the right to modify the SPN:

     

    • Run ADSIEdit as a domain admin.
    • Find the SDK domain account, right click, properties.
    • Select the Security tab, click Advanced.
    • Click Add.  Type “SELF” in the object box.  Click OK.
    • Select the Properties Tab.
    • Scroll down and check the “Allow” box for “Read servicePrincipalName” and “Write servicePrincipalName”
    • Click OK.  Click OK.  Click OK.
    • Restart your SDK service – if AD has replicated from where you made the change – all should be resolved.
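
    Alternatively, a domain admin can register the SPNs manually with setspn, using the two names from the 26371 event.  This registers them once, whereas granting SELF write access (above) lets the service keep them current at every start.  A sketch, using the example names from the event:

    ```powershell
    # Run as a domain admin. -S (Server 2008 and later) checks for
    # duplicate SPNs before adding; older setspn versions use -A.
    setspn -S MSOMSdkSvc/rmscomputer DOMAIN\sdkaccount
    setspn -S MSOMSdkSvc/rmscomputer.domain.com DOMAIN\sdkaccount
    ```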

    To check SPN's:

    The following command will show all the HealthService SPN's in the domain:

        Ldifde -f c:\ldifde.txt -t 3268 -d DC=DOMAIN,DC=COM -r "(serviceprincipalname=MSOMHSvc/*)" -l serviceprincipalname -p subtree

    To view SPN's for a specific server:

        setspn -L servername

     

     

  • Applying Update Rollup 2 (UR2) to OpsMgr 2012 SP1

     

    image

     

    Update Rollup 2 (UR2) for OpsMgr 2012 SP1 has shipped.  This post will be a simple walk-through of applying it.  This hotfix is included on my Hotfix page for SCOM:  http://blogs.technet.com/b/kevinholman/archive/2009/01/27/which-hotfixes-should-i-apply.aspx

     

    Description and download location:

    http://support.microsoft.com/kb/2802159

    Description of fixes in this release: 

    1. The Web Console performance is very poor when a view is opened for the first time.
    2. The alert links do not open in the Web Console after Service Pack 1 is applied for Operations Manager.
    3. The Distributed Applications (DA) health state is incorrect in Diagram View.
    4. The Details Widget does not display data when it is viewed by using the SharePoint webpart.
    5. The renaming of the SCOM group in Group View will not work if the user language setting is not "English (United States)."
    6. An alert description that includes multibyte UTF-8 characters is not displayed correctly in the Alert Properties view.
    7. The Chinese (Taiwan) Web Console displays the following message even after the SilverlightClientConfiguration.exe program is run:  Web Console Configuration Required.
    8. The Application Performance Monitoring (APM) to IntelliTrace conversion is broken when alerts are generated from dynamic module events such as the Unity Container.
    9. Connectivity issues to System Center services are fixed.
    10. High CPU problems are experienced in Operations Manager UI.
    11. Query processor runs out of internal resources and cannot produce a query plan when you open Dashboard views.
    12. Path details are missing for "Objects by Performance."

    Unix and Linux fixes:

    1. The Solaris agent could run out of file descriptors when many multi-version file systems (MVFS) are mounted.
    2. Logical and physical disks are not discoverable on AIX-based computers when a disk device file is contained in a subdirectory.
    3. Rules and monitors that were created by using the UNIX/Linux Shell Command templates do not contain overridable ShellCommand and Timeout parameters.
    4. Process monitors that were created by the UNIX/Linux Process Monitoring template cannot save in an existing management pack that has conflicting references to library management packs.
    5. The Linux agent cannot install on a CentOS or Oracle Linux host by using FIPS version of OpenSSL 0.9.8.

    This Update Rollup is also required if you want to use the new System Center Advisor Connector:  http://blogs.technet.com/b/momteam/archive/2013/04/09/system-center-advisor-connector-for-operations-manager-preview.aspx

     

    That’s a LOT.  Looks like some very important ones as well…. so let’s get this one tested in our labs!

     

    Download the update:

    You can get this update “partially” applied by using Windows Update.  However, since there are manual steps involved, and a specific recommended order of operations, I don’t really recommend using Windows Update in general.  It is certainly an option, however.

    To download all of the updates, you will need to click the link in the KB above, which will launch the catalog for the individual downloads. 

     

    image

     

    You’ll notice some of these updates are a LOT bigger than the previous ones in UR1.

    I also notice there is now an update for the “Console” which is new from UR1.  The original release of UR2 was missing the update for the Gateway, which is now included and available to make UR2 truly “cumulative”.

    Add these to your “basket” then “view basket” and choose a download location.

     

    Build a plan:

    Following the KB – the installation plan looks something like this:

    1. Install the update rollup package on the following server infrastructure:
      • Management server or servers
      • Gateway servers
      • Reporting servers
      • Web console server role computers
      • Operations console role computers
    2. Manually import the management packs.
    3. Apply the agent update to manually installed agents, or push the installation from the Pending view in the Operations console.

    ***Note:  One of the things you will notice – is that there is no update available for reporting servers.  We will skip the reporting role. 

     

     

     

    My new list looks like:

    • Management servers
    • Gateway servers
    • Web console server role computers
    • Operations console Role Computers

    Since I am monitoring Linux systems, I’ll need to add steps for that from the KB:

    1. Download the updated management packs from the following Microsoft website:

      (The Unix/Linux MP location isn't available, and the previous location hasn’t been updated yet.  So this part is still under investigation as well.  I will update this section when I clear this part up)

    2. Install the management pack update package to extract the management pack files.
    3. Import the following:
      • The updated Microsoft.Unix.Library management pack (from the Microsoft.Unix.Library\2012 SP1 folder)
      • The Microsoft.Unix.Process.Library management pack bundle
      • The platform library management packs that are relevant to the Linux or UNIX platforms that you monitor in your environment

    Seems simple enough, lets get started.

     

    Install the update rollup package

     

    On the catalog site, I add all the updates to my basket, and click View Basket, and Download.

    Next I copy these files to a share that all my SCOM servers have access to.  These are actually .CAB files, so I will need to extract the MSP’s from these CAB files.
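
    The built-in expand.exe utility will pull the MSP out of each CAB.  A sketch, with example paths (the CAB file names will match whatever you downloaded from the catalog):

    ```powershell
    # Extract everything (-F:*) from a downloaded CAB into a folder.
    # File and folder names here are examples only.
    expand "C:\Updates\UR2-Server-AMD64.cab" -F:* "C:\Updates\Extracted"
    ```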

    Once I have the MSP files, I am ready to start applying the update to each server by role.

    ***Note:  You MUST log on to each server role as a Local Administrator, SCOM Admin, AND your account must also have System Administrator (SA) role to the database instances that host your OpsMgr databases.

    My first server is a management server, and the web console, and has the OpsMgr console installed, so I copy those update files locally, and execute them per the KB, from an elevated command prompt:

     

    image

    This launches a quick UI which applies the update.  It will bounce the SCOM services as well.  The update does not provide any feedback on success or failure.  You can check the Application log for the MsiInstaller events to confirm.

    You can also spot check a couple DLL files for the file version attribute. 
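
    Both checks can be done from PowerShell.  A sketch; the DLL path and name are examples, so point it at whichever updated file you want to verify:

    ```powershell
    # Recent Windows Installer results from the Application log:
    Get-EventLog -LogName Application -Source MsiInstaller -Newest 20 |
        Format-Table TimeGenerated, EntryType, Message -AutoSize

    # File version of an updated DLL (example path and file name):
    (Get-Item "C:\Program Files\System Center 2012\Operations Manager\Server\SomeUpdated.dll").VersionInfo.FileVersion
    ```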

    image

     

    Next up – run the Web Console update:

    image

    This runs much faster.   A quick file spot check:

    image

     

    Lastly – install the console update:

    image

     

    Well, this one required a reboot.  The KB article instructed “If you do not want to restart the computer after you apply the console update, close the console before you apply the update for the console role.”  However – my console was closed….. so be prepared that these files might be locked and require a reboot.

     

    image

    After the reboot – a quick file spot check:

    image

     

     

    I now move on to my additional management servers, applying the server update, then the console update.  My additional management servers did not require a reboot after the console update.

     

    Next, I update the gateways.   

    image

    The update launches a UI and quickly finishes.

    I do a spot-check to ensure the right files were dropped.  First I will check the Agent update files in C:\Program Files\System Center Operations Manager\Gateway\AgentManagement\

     

    image

    Then I will spot check the DLL’s:

    image


    Manually import the management packs?

     

    We have two updated MP’s to import  (MAYBE!).

    image

     

    These MP bundles are only used for specific scenarios, such as Global Service Monitoring, or DevOps scenarios where you have integrated APM with TFS, etc.  If you are not currently using these MP’s, there is no need to import or update them.  The Intellitrace MP will actually fail to import if you are not using these, because of a dependency.  I’d skip this MP import unless you already have these MP’s present in your environment.

    Apply the agent update

     

    Approve the pending updates in the Administration console for pushed agents.  Manually apply the update for manually installed agents.

    image

    100% success rate.

    Be sure to check the “Agents By Version” view to find any agents that did not get patched:

    image
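
    You can also pull the same version list from the Operations Manager Shell, which is handy for large agent counts.  A sketch; sorting by Version floats the stragglers together:

    ```powershell
    # List all agents and their versions to spot any that did not get patched.
    Get-SCOMAgent | Sort-Object Version | Format-Table DisplayName, Version -AutoSize
    ```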

     

    ***Note!  The agents behind a gateway are NOT placed into pending actions for an update.  These agents will need to be “repaired” via the administration console, or use Windows Update.  On my agents behind a GW – a repair worked perfectly.

     

     

    Update Unix/Linux MPs

     

    Next up – I download and extract the updated Linux MP’s for SCOM 2012 SP1 UR2

    (The link in the KB article doesn’t work at the time of this writing – here is the correct link)

    http://www.microsoft.com/en-us/download/details.aspx?id=29696

     

    7.4.3507 is SCOM 2012 SP1. 

    7.4.4112.0 is SCOM 2012 SP1 with UR1.

    7.4.4119.0 is SCOM 2012 SP1 with UR2.

    Download the MSI and run it.  It will extract the MP’s to C:\Program Files (x86)\System Center Management Packs\System Center 2012 MPs for UNIX and Linux (7.4.4119.0)

    Import the files in the 2012 SP1 folder, and the following:

    Microsoft.Unix.ConsoleLibrary.mp

    Microsoft.Unix.Process.Library.mpb

    Microsoft.Unix.ShellCommand.Library.mpb

    Also add any platform specific MP’s for versions on Unix or Linux in your monitoring environment.
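
    These imports can also be scripted from the Operations Manager Shell.  A sketch, assuming the extraction folder from the step above (adjust for the versioned folder name on your system), and add the platform-specific MP’s you need to the list:

    ```powershell
    # Import the updated Unix/Linux library MPs from the extracted folder.
    $mpdir = "C:\Program Files (x86)\System Center Management Packs\System Center 2012 MPs for UNIX and Linux"
    Import-SCOMManagementPack -Fullname "$mpdir\Microsoft.Unix.ConsoleLibrary.mp"
    Import-SCOMManagementPack -Fullname "$mpdir\Microsoft.Unix.Process.Library.mpb"
    Import-SCOMManagementPack -Fullname "$mpdir\Microsoft.Unix.ShellCommand.Library.mpb"
    ```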

    image

    You will likely observe high CPU utilization on your management servers during these MP imports.  Give it time to complete the import and MPB deployments.

     

     

    Next up – you would upgrade your agents on the Unix/Linux monitored agents.  You can now do this straight from the console:

     

    image

     

    image

    You can input credentials or use existing RunAs accounts if those have enough rights to perform this action.

    image

     

     

    Lastly – refer to the KB article for the UR1 update, because if you are a heavy user of Linux process monitoring using our template, additional steps are required to address the fixes.  You must open, edit, and re-save any process templates that you had previously created in order to apply the fixes to each.

     

    Now at this point, we would check the OpsMgr event logs on our management servers, check for any new or strange alerts coming in, and ensure that there are no issues after the update.

     

    image

     

    Known issues:

    See the existing list of known issues documented in the KB article.

    Additional:

    1.  Agents behind a Gateway will not be placed into pending management for an update.  If you are using Windows Update/WSUS/SCCM to update your agents, then no steps are necessary, as they will receive the agent update automatically.

    2.  OM12 SP1 UR#2 Web Console Error: System.Reflection.ReflectionTypeLoadException: [ReflectionTypeLoad_LoadFailed]  Savision has released updated versions of the Live Maps Summary Widget management packs that resolve this issue. The versions can be downloaded here. The download contains the following files:

    • Savision.LiveMaps.Presentation.SummaryWidget.Library.mpb (Version 1.2.1.0)
    • Savision.LiveMaps.Presentation.SummaryWidget.WebConsole.mpb (Version 1.2.1.3)

    The problem should be fixed by importing the 2 management packs into your environment.

    http://www.savision.com/resources/news/fix-om12-sp1-ur2-web-console

  • Using OpsMgr for intrusion detection and security hardening

    Here is an interesting little concept of how to use OpsMgr.

    Because I have a lab, that is exposed to the internet over port 3389, I get a LOT of hacking attempts on this lab.  Mostly the source is from bots running on other compromised systems.  These bots just do brute force attacks against the typical Admin accounts and passwords via RDP.  In this article, I am going to show how OpsMgr can not only alert on this condition, but also respond by configuring the Windows Firewall to block these attacks.

     

    I will start by analyzing the Server 2008 event that occurs when someone tries to attack using my “Administrator” account:

     

    Log Name:          Security
    Source:              Microsoft-Windows-Security-Auditing
    Date:                  7/14/2009 12:44:05 PM
    Event ID:            4625
    Task Category:   Account Lockout
    Level:                  Information
    Keywords:          Audit Failure
    User:                   N/A
    Computer:           terminalserver.domain.com

    Description:   An account failed to log on.

    Subject:
        Security ID:             SYSTEM
        Account Name:        TERMINALSERVER$
        Account Domain:     DOMAIN
        Logon ID:                 0x3e7

    Logon Type:            10

    Account For Which Logon Failed:
        Security ID:             NULL SID
        Account Name:        administrator
        Account Domain:     TERMINALSERVER

    Failure Information:
        Failure Reason:        Account locked out.
        Status:                      0xc0000234
        Sub Status:               0x0

    Process Information:
        Caller Process ID:          0x14f0
        Caller Process Name:    C:\Windows\System32\winlogon.exe

    Network Information:
        Workstation Name:    TERMINALSERVER
        Source Network Address:    10.10.10.1
        Source Port:        1261

    Detailed Authentication Information:
        Logon Process:           User32
        Authentication Package:    Negotiate
        Transited Services:    -
        Package Name (NTLM only):    -
        Key Length:        0

     

    So… for starters, I want to alert on this condition… when ANYONE is trying multiple times… to RDP into the server, with a disabled account, non-existent account, or valid account, but bad password.  Therefore – I will create a monitor:  Windows Events > Repeated Event Detection > Timer Reset.

    The idea here is to only respond when multiple bad passwords are entered in a short time period…. representing an attack.  (I don't want to lock out or block access from my normal users who sometimes mis-type their password on a couple attempts.)

    So I create the monitor, target “Windows Server Operating System”, set it to “Security” for the Parent Monitor, and UNCHECK the box enabling it.  (I will later override this monitor and ONLY enable it for my entry terminal server.)

    I create my event expression for the security event log, event 4625, and I only want the Logon Type of 10, which is from RDP:
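
    If you want to sanity-check that expression outside of OpsMgr first, here is a quick PowerShell sketch (an assumption on my part – run it elevated on the terminal server) that pulls the same events the expression will match:

    ```powershell
    # Sketch: list recent failed logons (4625) that came in over RDP (Logon Type 10).
    # Run elevated on the server being attacked.
    $xpath = "*[System[EventID=4625] and EventData[Data[@Name='LogonType']='10']]"
    Get-WinEvent -LogName Security -FilterXPath $xpath -MaxEvents 20 |
        Format-List TimeCreated, Message
    ```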

     

    image

     

     

    Next – I will set up my monitor, to Trigger on Count (of events), Sliding.  Compare count will be set to 5 (events) within a 3 minute interval.  Therefore, as soon as 5 events are captured, in ANY sliding 3 minute “window”, the monitor will change state.

     

    image

     

    Next… since my goal is really to execute a script/command/response (a state change is not really what I’m after), I will set the timer reset to reset the state back to healthy after 2 minutes.  This will free the workflow up to block any other source IPs which might attack soon after.

     

    image

     

    I don't want to impact availability data, which assumes critical state = unavailable…. so I will use a Warning State:

     

    image

     

    Now – I will enable a unique alert for this condition.  I want a critical, high priority alert in this case, and I will set this NOT to close the alert when we auto-resolve the state on the timer.  I will also customize the alert description, to give me a richer alert based on the event details and my custom response.  I talk more about these event parameters HERE.   I will be adding:

     

    $Data/Context/Context/DataItem/Params/Param[6]$ typed a bad password accessing directly from computer: $Data/Context/Context/DataItem/Params/Param[14]$ from IP: $Data/Context/Context/DataItem/Params/Param[20]$
    The Windows Firewall will be modified to block this IP address in response to this monitor state.

     

    image

     

     

    Next – I will go back and find my monitor, and add a Recovery for the Warning State:

     

    image 

     

    I will choose to Run Command, and give it the name “Modify Windows Firewall”.

     

    image

     

    Next – for the command – I am going to run Netsh.exe which can configure the Windows Firewall running on the terminal server.  Here is the command:

     

    C:\Windows\System32\netsh.exe

    advfirewall firewall set rule name="Block RDP" new remoteip=$Data/StateChange/DataItem/Context/DataItem/Context/DataItem/Params/Param[20]$

     

    $Data/StateChange/DataItem/Context/DataItem/Context/DataItem/Params/Param[20]$ is based on an event parameter of the Server 2008 event, which I will pass to the command.  It gathers the IP address of the attacker, and passes that to the command which configures the firewall rule.  Getting this variable was the most complicated part for me.  Marius talked about how to derive this variable HERE.  Just understand that the variables you use in an alert description are not the same as those used in a diagnostic or recovery.

     

    image

     

    Cool:

     

    image

     

     

    My Netsh.exe command modifies an existing custom rule in the Windows Firewall, so I need to make sure I create that and name it “Block RDP”.
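
    For reference, here is one way to create that prerequisite rule with Netsh – a sketch only: the port, protocol, and seed IP here are my assumptions to adjust for your environment, and the recovery overwrites remoteip each time it fires:

    ```
    netsh advfirewall firewall add rule name="Block RDP" dir=in action=block protocol=TCP localport=3389 remoteip=192.0.2.1
    ```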

    Now – I will override this monitor to enable it for my published terminal server, and then test it… by attempting to log into my terminal server via RDP 5 times in a short period, using a disabled account.  Each attempt will log an event in the security event log, and eventually trip the repeated event detection monitor.

     

    Alert generates:

    image

     

    Monitor changes state:

    image

     

    Recovery runs:

     

    image

     

    Windows Firewall rule gets modified:

     

    image

     

    Attack is stopped.

    Pretty cool, eh? 

  • Orchestrator 2012: a quickstart deployment guide

    System Center Orchestrator 2012 is extremely easy to setup and deploy.  There are only a handful of prerequisites, and most can be handled by the setup installer routine.

     

    The TechNet documentation does an excellent job of detailing the system requirements and deployment process:

    http://technet.microsoft.com/en-us/library/hh420337.aspx

     

    The following document will cover a basic install of System Center Orchestrator 2012 at a generic customer.  This is to be used as a template only, for a customer to implement as their own pilot or POC deployment guide.  It is intended to be general in nature and will require the customer to modify it to suit their specific data and processes.

    SCORCH can be scaled to match the customer requirements. This document will cover a typical two server model, where all server roles are installed on a single VM, and utilize a remote database server or cluster.

    This is not an architecture guide or intended to be a design guide in any way.

    Definitions:

    SCORCH          System Center Orchestrator

    Server Names\Roles:

    SCORCH          Orchestrator 2012 role server

    • Management Server
    • Runbook Server
    • Orchestrator Web Service Server
    • Runbook Designer client application
    • Windows Server 2008 R2 SP1 Enterprise edition will be installed as the base OS for all platforms.
    • All servers will be a member of the AD domain.
    • SQL 2008 R2 ENT edition with SP1 will be the base standard for all database services. SCORCH only requires a SQL DB engine (locally or remote) in order to host SCORCH databases.

     

    High Level Deployment Process:

     

    1.  In AD, create the following accounts and groups, according to your naming convention:

    a.  DOMAIN\scorchsvc                       SCORCH Mgmt, Runbook, and Monitor Account

    b.  DOMAIN\ScorchUsers                 SCORCH users security global group

    2.  Add the domain user accounts for yourself and your team to the ScorchUsers group.

    3.  Install Windows Server 2008 R2 SP1 to all server role members.

    4.  Add the DOMAIN\scorchsvc account to the local administrators group on the SCORCH server.

    5.  Add the DOMAIN\ScorchUsers global group to the local administrators group on the SCORCH server.

    6.  Install the SCORCH Server.
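
    Steps 1 and 2 above can be scripted if you prefer.  Here is a sketch using the ActiveDirectory PowerShell module – the names follow the example convention, and the member account name is hypothetical; adjust to your own:

    ```powershell
    Import-Module ActiveDirectory

    # Service account for the SCORCH Mgmt, Runbook, and Monitor roles
    New-ADUser -Name 'scorchsvc' -SamAccountName 'scorchsvc' `
        -AccountPassword (Read-Host -AsSecureString 'scorchsvc password') -Enabled $true

    # Global security group for SCORCH users, with your team's accounts as members
    New-ADGroup -Name 'ScorchUsers' -GroupScope Global -GroupCategory Security
    Add-ADGroupMember -Identity 'ScorchUsers' -Members 'youraccount'   # hypothetical member
    ```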

     

    Prerequisites:

    1.  Install Windows Server 2008R2 SP1

    2.  Ensure server has a minimum of 1GB of RAM.

    3.  .Net 3.5SP1 is required. Setup will add this feature if not installed.

    4.  IIS7 (IIS Role) is required. Setup will add this role if not installed.

    5.  .Net 4.0 is required. This must be installed manually on Server 2008 R2 SP1. Download and install this prereq.

    6.  Install all available Windows Updates as a best practice.

    7.  Join all servers to domain.

    8.  Add the “DOMAIN\scorchsvc” domain account explicitly to the Local Administrators group on the SCORCH server.

    9.  Add the “DOMAIN\ScorchUsers” global group explicitly to the Local Administrators group on the SCORCH server.

     

    Step by step deployment guide:

    1.  Install SCORCH 2012:

    • Log on using your domain user account that is a member of the ScorchUsers group.
    • Run Setuporchestrator.exe
    • Click Install
    • Supply a name, org, and license key (if you have one) and click Next.
    • Accept the license agreement and click Next.
    • Check all boxes on the getting started screen, for:
      • Management Server
      • Runbook Server
      • Orchestration console and web service
      • Runbook Designer
    • On the Prerequisites screen, check the boxes to remediate any necessary prerequisites, and click Next when all prerequisites are installed.
    • Input the service account “scorchsvc” and input the password, domain, and click Test. Ensure this is a success and click Next.
    • Configure the database server. Type in the local computer name if you installed SQL on this SCORCH Server, or provide a remote SQL server (and instance if using a named instance) to which you have the “System Administrator” (SA) rights to in order to create the SCORCH database and assign permissions to it. Test the database connection and click Next.
    • Specify a new database, Orchestrator. Click Next.
    • Browse AD and select your domain global group for ScorchUsers. Click Next.
    • Accept defaults for the SCORCH Web service ports of 81 and 82, Click Next.
    • Accept default location for install and Click Next.
    • Select the appropriate options for Customer Experience and Error reporting. Click Next.
    • Click Install.
    • Setup will install all roles, create the Orchestrator database, and complete very quickly.

    2. Open the consoles.

    • Start > Microsoft System Center 2012 > Orchestrator
    • Open the Deployment Manager, Orchestration Console, and Runbook designer. Ensure all consoles open successfully.

     

    Post install procedures:

     

    1.  Let’s register and then deploy the Integration Packs that enable Orchestrator to connect to many outside systems.

    Go to http://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=28725 and download the toolkit, add-ons, and IP’s.

    • Make a directory on the local SCORCH server such as “C:\Integration Packs”
    • Copy to this directory, the downloaded IP’s, such as the following:
      • SC2012_Configuration_Manager_Integration_Pack.oip
      • SC2012_Data_Protection_Manager_Integration_Pack.oip
      • SC2012_Operations_Manager_Integration_Pack.oip
      • SC2012_Service_Manager_Integration_Pack.oip
      • SC2012_Virtual_Machine_Manager_Integration_Pack.oip
      • Configuration_Manager_2007_Integration_Pack.oip
      • Data_Protection_Manager_2010_Integration_Pack.oip
      • Operations_Manager_2007_Integration_Pack.oip
      • Service_Manager_2010_Integration_Pack.oip
      • Virtual_Machine_Manager_2008_Integration_Pack.oip
    • Open the Deployment Manager console
    • Expand “Orchestrator Management Server”
    • Right click “Integration Packs” and choose “Register IP with the Orchestrator Management Server”
    • Click Next, then “Add”.  Browse to “C:\Integration Packs” and select all of the OIP files you copied here.  You have to select one at a time and go back and click “Add” again to get them all.
    • Click Next, then Finish.  You have to accept the License Agreement for each IP. 
    • Now when you select “Integration Packs” you can see these 10 IP’s in the list.
    • Right Click “Integration Packs” again, this time choose “Deploy IP to Runbook server or Runbook Designer”.
    • Click Next, select all the available IP’s and click Next.
    • Type in the name of your Runbook server role name, and click Add.
    • On the scheduling screen – accept the default (which will deploy immediately) and click Next.
    • Click Finish.  Note the logging of each step in the Log entries section of the console.
    • Verify deployment by expanding “Runbook Servers” in the console.  Verify that each IP was deployed.
    • Open the Runbook Designer console.
    • Note that you now have these new IP’s available in the console for your workflows.

     

    Additionally – you can download more IP’s at:

    http://technet.microsoft.com/en-us/library/hh295851.aspx

    Such as the VMware VSphere IP, or the IBM Netcool IP.

    Additionally – check out Charles Joy’s blog on popular codeplex IP’s which have been updated for Orchestrator:

    http://blogs.technet.com/b/charlesjoy/

  • New Base OS MP 6.0.6667.0 adds file fragmentation monitor to all Logical Disks

    I recently blogged about the new Base OS MP that was recently released:  HERE

     

    One of the things you will notice RIGHT off the bat… is that a huge percentage of your logical disks will go into a warning state, if you don't already have some sort of scheduled defragmentation set up.  This will be true for virtual machines and physical machines…. anything over 10 percent file fragmentation (or the OS recommended setting) will get hit:

     

    image

     

    image

     

    You will also get many warning alerts on this monitor…. the first time the condition is detected and the state changes for this monitor.  This monitor checks status every Saturday, at 3:00AM by default, for all logical disks discovered.

     

    image

     

     

    If you don't care about this monitoring in SCOM – disable this monitor using overrides.

    If you do care about seeing the state change – but don't want the alerts – turn the “Generates Alert” property to False, using overrides.

     

    You can adjust the threshold from 10% to some other number…. but make sure you take note – this monitor will ignore the “File Percent Fragmentation” property by default, and always use the OS recommended setting.  If you want to control this – you also need to set “Use OS Recommendation” to FALSE.

     

    Here is an example of hard coding the frag percentage to 20% from the OS default:

    image

     

    “Use OS Recommendation” property description:

    image

     

     

    Lastly – one thing of interest….  If you want SCOM to “fix” the fragmentation issue…. it can.  There is a recovery on this very monitor that can run a VBScript that will run a defrag job against your logical disks.  It is disabled by default. 
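
    If you just want to see what the recovery would find, you can run a fragmentation analysis yourself via WMI – a sketch, assuming local admin rights on the agent; this analyzes only and changes nothing:

    ```powershell
    # Sketch: run a defrag *analysis only* (no changes) on C: and report fragmentation.
    $vol = Get-WmiObject -Class Win32_Volume -Filter "DriveLetter = 'C:'"
    $analysis = $vol.DefragAnalysis()
    $analysis.DefragAnalysis.FilePercentFragmentation
    ```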

     

    image

     

    Keep in mind – if you turn on this defrag on your physical boxes, it won’t be a big deal… it will simply fix the fragmentation issue.  However – this will also run on ALL your VMs.  If this is triggered all at the same time – Saturday at 3:00AM by default – it can kill the disk I/O on the disk subsystem hosting your VM/VHD files.  Keep this in mind if you decide to enable this.  This recovery will only run when the state change is detected… as a recovery to the condition, so any disks that are already in a warning state will not run this recovery should you enable it.  This defrag has a timeout of 1 hour, so it should kill the defrag if it cannot complete within an hour.

     

    Another cool thing to do – is to use the recovery action as a single run-time task. You can do this right from health explorer, to fix the disks on your own schedule:

     

    image

     

    Just click the link, and run the task:

    image

     

    Minimize this…. and just let it run – you can come back in 1 hour – and see if it completed, or timed out.

     

    You can also monitor for task status in the Task Status list in the console:

     

    image

     

     

    On the agent – you will see the following events logged in the OpsMgr event log:

     

    Log Name:      Operations Manager
    Source:        Health Service Script
    Date:          9/28/2009 10:50:04 AM
    Event ID:      4002
    Task Category: None
    Level:         Information
    Keywords:      Classic
    User:          N/A
    Computer:      OMDW.opsmgr.net
    Description:
    Microsoft.Windows.Server.LogicalDisk.Defrag.vbs : Perform Defragmentation (disk: C:; computer: OMDW.opsmgr.net).

    And when completed:

     

    Log Name:      Operations Manager
    Source:        Health Service Script
    Date:          9/28/2009 11:03:44 AM
    Event ID:      4002
    Task Category: None
    Level:         Information
    Keywords:      Classic
    User:          N/A
    Computer:      OMDW.opsmgr.net
    Description:
    Microsoft.Windows.Server.LogicalDisk.Defrag.vbs : Defragmentation completed (disk: C:; computer: OMDW.opsmgr.net): FilePercentFragmentation = 0.

  • Do you randomly see a MonitoringHost.exe process consuming lots of CPU?

    Randomly, you might see a single MonitoringHost.exe process on an agent, consuming 100% CPU. (Or 50%, or 25% depending on how many cores you have).  This process will stay at this level, and will not recover.  If you restart the OpsMgr HealthService, the problem goes away, and might not return for days or even weeks.

     

    This particular symptom, might be due to an XML spinlock issue… this is a core Windows OS issue, and there is a hotfix available, which I have on my HOTFIX LINK

     

    The KB is 968967 :

    “The CPU usage of an application or a service that uses MSXML 6.0 to handle XML requests reaches 100% in Windows Server 2008, Windows Vista, Windows XP Service Pack 3, or other systems that have MSXML 6.0 installed”

    I have seen that most customers are affected by this issue from time to time.  I have seen it very commonly in my lab, on Server 2008 Domain controllers, and my Server 2008 Hyper-V hosts…

     

     

    A note on patching Server 2008:

     

    When you go to download this hotfix for a server 2008 machine – it is very misleading on which hotfix to even get.  Here is the list of all available fixes:

     

    image

     

    For patching Server 2008 – you need to download the “Windows Vista” hotfix – in either x86 or x64, depending on your OS version:

     

    image

     

     

     

    Monitoring for this condition:

    You can easily write a threshold monitor targeting agent or HealthService, to track the monitoringhost process \ %processor time threshold, and set it to alert when it has multiple consecutive samples above a defined threshold.
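
    To eyeball the counter this monitor will watch, you can sample it from PowerShell first – a quick sketch (note that raw per-process % Processor Time can exceed 100 on multi-core boxes):

    ```powershell
    # Sketch: sample % Processor Time for every MonitoringHost instance,
    # 5 samples taken 1 second apart.
    Get-Counter -Counter '\Process(MonitoringHost*)\% Processor Time' `
        -SampleInterval 1 -MaxSamples 5
    ```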

     

    Here is an example of creating this monitor:

    Authoring Pane > Monitors > New Unit Monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Consecutive Samples over Threshold.

     

    image

     

    Give it a custom name that follows your documented custom Monitor naming standard, target “Health Service”, and put this under Performance rollup.

     

    image

     

    Hit the “Select” button (in SP1 – select “Browse”)  In the perf counter picker – choose a server with an installed agent, choose the Object “Process” the counter “%Processor Time” and the Instance “MonitoringHost”, and click OK.

     

    image

     

    Since there are multiple MonitoringHost processes… we will add a Wildcard to the Instance name in the monitor…. this will monitor ANY MonitoringHost process for high CPU.  Set the Interval to every 1 minute.

     

    image

     

    For the number of consecutive samples, and threshold… that is up to you.  For me – I will say that if I detect a single MonitoringHost process using more than 50% CPU, over all 5 consecutive samples (5 minutes) then I consider that bad:

     

    image

    image

     

    image

     

    At this point…. you can simply alert on the condition, or even try to add a recovery script that will bounce the HealthService.  Generally, bouncing the HealthService when one of its processes is using all the CPU is not always 100% reliable… especially from a “NET STOP & NET START” type command.  I have found it more reliable to just kill the MonitoringHost process in this condition, and allow it to respawn…. but your mileage may vary.
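
    Here is a sketch of that “kill it and let it respawn” approach – run as admin, and note the CPU-seconds threshold is a made-up example value, not anything from the MP:

    ```powershell
    # Sketch: stop any MonitoringHost process that has consumed excessive CPU time;
    # the HealthService will respawn the host process on its own.
    Get-Process -Name MonitoringHost -ErrorAction SilentlyContinue |
        Where-Object { $_.CPU -gt 300 } |   # hypothetical threshold in CPU seconds
        Stop-Process -Force
    ```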

    http://blogs.technet.com/kevinholman/archive/2008/03/26/using-a-recovery-in-opsmgr-basic.aspx

  • Targeting workflows to Resource pools

     

    Resource Pools in SCOM 2012 are an advancement over SCOM 2007:  a resource pool can be used to host instances that have targeted workflows, and make them highly available.  This allowed the “All Management Servers Resource Pool” to host the instances that the RMS used to run in SCOM 2007, and allowed all management servers in the AMSRP to automatically load balance the old RMS workflows across all management servers.

    This is also used for things like the Notifications Resource Pool, which hosts two instances (or Top Level Managed Entities):  the pool object itself, and the “Alert Notification Subscription Server”, which has many monitoring workflows targeting it to monitor the notification process health.

     

    Well, we can also write workflows and target resource pools.  We might do this if we want a workflow to run on the management servers, but be highly available. 

    In this example, I will take a VERY simple script that does nothing but log an event, and target the All Management Servers Resource Pool.

    First, here is my PowerShell script:

    $api = new-object -comObject 'MOM.ScriptAPI'
    $api.LogScriptEvent("momscriptevent.ps1",9999,0,"this is a test event")

    This script simply loads the MOM.ScriptAPI, which is necessary to perform specific SCOM actions in script, such as logging events to the SCOM event log, creating property bags, submitting discovery data, etc.

    Then, it logs an informational event for the script in the SCOM event log wherever it is running.

    Next up – write my rule to run the script.

    We cannot use the SCOM 2007 R2 Authoring Console to write this rule, because we need to target the Resource Pool object, which SCOM 2007 R2 does not understand and cannot reference.  If you are most familiar with authoring in that tool and really want to use it, you can – just target something else, like “Windows Server Operating System”, and then change the target class later in an XML editor.

    Here is my manifest section.  Note – I need to reference the SCOM 2012 versions of these MP’s since this MP will not work on SCOM 2007:

    <Manifest>
      <Identity>
        <ID>Target.ResourcePool.Example</ID>
        <Version>1.0.0.1</Version>
      </Identity>
      <Name>Target.ResourcePool.Example</Name>
      <References>
        <Reference Alias="SC">
          <ID>Microsoft.SystemCenter.Library</ID>
          <Version>7.0.8427.0</Version>
          <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
        </Reference>
        <Reference Alias="Windows">
          <ID>Microsoft.Windows.Library</ID>
          <Version>7.5.8500.0</Version>
          <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
        </Reference>
        <Reference Alias="Health">
          <ID>System.Health.Library</ID>
          <Version>7.0.8427.0</Version>
          <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
        </Reference>
        <Reference Alias="System">
          <ID>System.Library</ID>
          <Version>7.5.8500.0</Version>
          <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
        </Reference>
      </References>
    </Manifest>

    Next, my simple rule.  Notice – I target the AMSRP class, I add a simple scheduler module to run this workflow every 30 seconds, and I have a simple write action based on the Microsoft.Windows.PowerShellWriteAction module.

    <Monitoring>
      <Rules>
        <Rule ID="Target.ResourcePool.Example.RunSampleScriptRule" Enabled="true" Target="SC!Microsoft.SystemCenter.AllManagementServersPool" ConfirmDelivery="true" Remotable="true" Priority="Normal" DiscardLevel="100">
          <Category>Custom</Category>
          <DataSources>
            <DataSource ID="SchedDS" TypeID="System!System.SimpleScheduler">
              <IntervalSeconds>30</IntervalSeconds>
              <SyncTime></SyncTime>
            </DataSource>
          </DataSources>
          <WriteActions>
            <WriteAction ID="PoshWA" TypeID="Windows!Microsoft.Windows.PowerShellWriteAction">
              <ScriptName>momscriptevent.ps1</ScriptName>
              <ScriptBody><![CDATA[
                $api = new-object -comObject 'MOM.ScriptAPI'
                $api.LogScriptEvent("momscriptevent.ps1",9999,0,"this is a test event")
              ]]></ScriptBody>
              <TimeoutSeconds>30</TimeoutSeconds>
            </WriteAction>
          </WriteActions>
        </Rule>
      </Rules>
    </Monitoring>

    That’s it!  I will post my full XML as a sample attached to this article.

    Now, when I import this MP, ONE of my management servers should start running this workflow.  It will be whichever MS is hosting the AMSRP class at that time.  This could change as loads are reshuffled, or as management servers are taken down for maintenance.

    I have three management servers, SCOM01, SCOM02, and SCOM03.  I can see this workflow is running happily on SCOM02:

    image

    I will stop the health service on SCOM02, or shut the OS down.

    The last event I got from the test script was at 9:09:56 AM.

    What happens now, is the other management servers are waiting for a heartbeat failure threshold to take a vote, and evict SCOM02 from the pool.  The SCOM database is also a “default observer” and plays a role in the voting process. 

    At 9:12:36 AM, I start to see the pool manager events coming in, showing that the other management servers are redistributing the workflows.  My 9999 event is now being created on SCOM03, with my first event showing up at 9:12:55 AM, or about 3 minutes after SCOM02 went down.

    image

     

    My sample XML is provided below.

  • OpsMgr: MP Update: New Base OS MP 6.0.7026.0

     

    A new Base OS MP Version 6.0.7026.0 has shipped.  This management pack includes updated MP’s for Windows 2003 through Windows 2012 operating systems.  This updated MP will import into OpsMgr 2007 or 2012 management groups.

     

    http://www.microsoft.com/en-us/download/details.aspx?id=9296

     

    image

     

     

    Ok – so what's new in this MP?

     

    The April 2013 update (version 6.0.7026.0) of the Windows Server Operating System Management Pack contains the following changes:

    • Fixed a bug in Microsoft.Windows.Server.2008.Monitoring.mp where the performance information for Processor was not getting collected.
    • Made monitoring of Cluster Shared Volume consistent with monitoring of Logical Disks by adding performance collection rules. (“Cluster Shared Volume - Free space / MB”,”Cluster Shared Volume - Total size / MB”,”Cluster Shared Volume - Free space / %”,”Cluster Disk - Total size / MB”,”Cluster Disk - Free space / MB”,”Cluster Disk - Free space / %”)
    • Fixed bug in Microsoft.Windows.Server.ClusterSharedVolumeMonitoring.mp where the Cluster disks running on Windows Server 2008 (non R2) were not discovered.
    • Fixed a bug where the 'Cluster Disk Free Space Percent' and 'Cluster Disk Free Space MB' monitors generate alerts with bad descriptions when the volume label of a cluster disk is empty.
    • Added a feature to raise an event when NTLM requests time out (customers are unable to use mailboxes and Outlook stops responding) due to the low default value for the Max Concurrent API registry key (HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters), which is a ceiling for the maximum NTLM or Kerberos PAC password validations a server can handle at a time. It uses the “Netlogon” performance counter to check for the issue.

     

    These fixes address the majority of known issues discussed in my last article on the Base OS MP:

    http://blogs.technet.com/b/kevinholman/archive/2012/09/27/opsmgr-mp-update-new-base-os-mp-6-0-6989-0-adds-support-for-monitoring-windows-server-2012-os-and-fixes-some-previous-issues.aspx

     

    A note on Processor utilization monitoring and collection:

    Distinct rules and monitors were created for Windows Server 2008 and 2008 R2.  Server 2008 will monitor and collect “Processor\% Processor Time”, while Server 2008 R2 will monitor and collect “Processor Information\% Processor Time”.  Overrides were included in the MP to disable the “2008” rules and monitors for the 2008 R2 instances.  If for some reason you prefer to collect and monitor from “Processor” instead of “Processor Information” – for instance if this breaks some of your existing reports – it is very simple to just override those rules and monitors back to enabled.  An unsealed override will always trump a sealed override.

     

     

     

    Known Issues in this MP:

    1.  The knowledge for the 2008 and 2008 R2 Total CPU Utilization Percentage monitors is incorrect – the monitor was updated to a default value of 3 samples, but the knowledge still reflects 5 samples.  This is still an issue (no biggie).  The 2012 monitors use 5 samples by default, with correct knowledge.

    2.  There are now collection rules for Cluster disks and CSV for free space (MB), free space (%), and total size (MB).  If you want performance reports on other perfmon objects that are available in perfmon but not included in our MP, such as disk latency, idle time, etc., you will need to create these.  Since this can be complicated to get right – I wrote an article on how to do this correctly, and offer a sample MP for download:  http://blogs.technet.com/b/kevinholman/archive/2012/09/27/opsmgr-authoring-performance-collection-rules-for-cluster-disks-the-right-way.aspx

    3.  The new monitor for Max Concurrent API has some issues and will generate a false alert in some cases.  If you have servers where this is happening – disable this monitor and it will be addressed in the next release of the MP.

  • Configuring Notifications - to include specific alerts from specific groups and classes

    So.... Say I am an Exchange Administrator in a global company.... in the good old USA.

    My company has recently implemented OpsMgr 2007 to monitor our Exchange servers.  I am going to configure my notification subscriptions so I can get an email anytime one of my Exchange servers has an issue.

    Try #1:  I start by creating a notification subscription, and I don't scope it by groups or classes (all groups, all classes).  I think this sounds fine.  However, I instantly find I am flooded with email notifications from every single alert coming into the console.  This is NOT good!

    Try #2:  Therefore – I decide I really need to see only Exchange alerts.  I scope the notification *classes* down to just Exchange classes.  This will ensure I only receive notifications from Exchange target classes.  Good?  Nope....  I soon find that when an alert comes in from the base OS, or heartbeat, or hardware, we won’t get those.  We need to add those classes back.  If we add the heartbeat (Health Service Watcher) class – we will now get heartbeat failures for ALL machines… not just restricted to exchange servers.  No good.

    Try #3:  So – we need to scope the subscription using groups.  We create a group with all our Exchange Server Windows Computer objects in it.  We can manually add these in (Explicit) or we can use a dynamic rule based on criteria - I chose NetBIOS name, and used a naming standard of EX* (all my exchange servers start with "ex").  I used an "OR" statement since the wildcard is case sensitive.

    image

    Now I create a subscriptions - and scope it to this group - and choose ALL classes....  thinking that this way, we should get ALL notifications, including base OS, exchange, and heartbeat alerts… right? 

    Nope.  Because of the object oriented monitoring model – we will only receive alerts from a rule/monitor with a target class that has a child relationship to the Windows Computer class.  This is the only class type in the group we created.  So – using the model in #3, we will get notifications from pretty much any class needed – except heartbeats.  These come from the Health Service Watcher class, and have no relation to the Windows Computer class.

    Try #4:  I am thinking, we must add the class type to our group – and any instances of that class we are interested in.  Since most object classes are a child of Windows Computer, there should not be many of these that we will have to do.

    In the group – add the Health Service Watcher display name instances, in the same way we add the Windows Computer NetBIOS names:

    clip_image002

    The AND/OR verbiage is misleading…. This was opened as a bug then closed – because it is “as designed”.

    Essentially – the OR group at the top will include ANY of the AND groups below it…. BOTH the Windows Computer objects AND the Health Service Watcher objects are included:  (you can right click any group and choose to show members)

    clip_image004
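    The criteria evaluates as an OR across the AND groups: an object is a member if it satisfies any one of them.  A rough Python sketch of that evaluation (the object shapes and predicates are hypothetical, purely to illustrate the logic):

    ```python
    # Each inner tuple of predicates is one "AND group"; the outer list is the OR group.
    def in_group(obj, and_groups):
        return any(all(pred(obj) for pred in group) for group in and_groups)

    # Hypothetical criteria mirroring the screenshot: Windows Computers named EX*,
    # OR Health Service Watchers whose display name starts with EX.
    and_groups = [
        (lambda o: o["class"] == "Windows Computer",
         lambda o: o["name"].upper().startswith("EX")),
        (lambda o: o["class"] == "Health Service Watcher",
         lambda o: o["name"].upper().startswith("EX")),
    ]
    ```

    An Exchange Windows Computer satisfies the first AND group, and its Health Service Watcher satisfies the second – so both land in the group, which is exactly what the notification subscription needs.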

    I tested all kinds of Exchange alerts, and heartbeat failures – and this works.  It is possible there will be other alerts we won't get in this subscription.... IF the rule or monitor that created the alert uses a target class that is unique, and not a child of "Windows Computer".

    I don’t think this will be a huge hassle moving forward… because MOST alerting is done on a target which is a child of Windows computer.  If we find one that is not – we just need to go back and add that class’s instances to the groups we create for notifications.

     

    Want alert by alert notifications?  Where you can subscribe to a single alert, rule by rule, monitor by monitor?  Check out:

    http://code4ward.net/cs2/blogs/code4ward/archive/2007/09/19/set-notificationforalert.aspx

     

  • Writing monitors to target Logical or Physical Disks

    This is something a LOT of people make mistakes on – so I wanted to write a post on the correct way to do this, using a very common target as an example.

    When we write a monitor for something like “Processor\% Processor Time\_Total” and target “Windows Server Operating System”…. everything is very simple.  “Windows Server Operating System” is a single instance target…. meaning there is only ONE “Operating System” instance per agent.  “Processor\% Processor Time\_Total” is also a single instance counter…. using ONLY the “_Total” instance for our measurement.  Therefore – your performance unit monitors for this example work just like you’d think.

    However – Logical Disk is very different.  On a given agent – there will often be MULTIPLE instances of “Logical Disk” per agent, such as C:, D:, E:, F:, etc…   We must write our monitors to take this into account. 

    For this reason – we cannot monitor a Logical Disk perf counter, and use “Windows Server Operating System” as the target.  The only way this would work, is if we SPECIFICALLY chose the instance in perfmon.  I will explain:

    Bad example #1:

    I want to monitor for the perf counter Logical Disk\% Free Space\<All Instances> so that I can get an alert when any logical disk is below 50% in free space.

    I create a new monitor > unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold. 

    image

    I target a generic class, such as “Windows Server Operating System”.

    I choose the perf counter I want – and select all instances:

    image

    And save my monitor.

    The problem with this workflow – is that we targeted a multi-instance perf counter at a single-instance target.  This workflow will load on all Windows Server Operating Systems, and parse through all discovered instances.  If an agent only has ONE instance of “Logical Disk” (C:), this monitor will work perfectly…. it will alert when the C: drive runs low on free space, with no issues.  HOWEVER… if an agent has MULTIPLE instances of logical disks (C:, D:, E:) AND those disks have different threshold results… the monitor will “flip-flop” as it examines each instance of the counter.  For example, if C: is running out of space but D: is not… the workflow will examine C:, turn red, and generate an alert, then immediately examine D:, turn back to green, and close the alert. 

    This is SERIOUS.  This will FLOOD your environment with statechanges, and alerts, every minute, from EVERY Operating System.
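    To see why the state flips, consider this rough simulation (a pure illustration, not SCOM's actual engine): one monitor instance evaluates every counter instance in turn, and each evaluation overwrites the single health state:

    ```python
    # One monitor on the single-instance OS target evaluates EVERY disk's counter
    # in turn, so the single health state flips with each sample it examines.
    samples = [("C:", 10.0), ("_Total", 55.0), ("D:", 30.0)]  # % free space (illustrative)
    THRESHOLD = 50.0

    states = []
    for instance, pct_free in samples:
        state = "unhealthy" if pct_free < THRESHOLD else "healthy"
        states.append((instance, state))  # each result overwrites the last
    ```

    The resulting sequence – unhealthy (C:), healthy (_Total), unhealthy (D:) – is exactly the flip-flop shown in Health Explorer below.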

    A quick review of Health Explorer will show what is happening:

    This monitor went “unhealthy” and issued an alert at 10:20:58AM for the C: instance:

    image

    Then went “healthy” in the same SECOND from the _Total Instance:

    image

    Then flipped back to unhealthy, at the same time – for the D: instance.

    image

     

    I think you can see how bad this is.  I find this condition all the time, even in “mature” SCOM implementations… it just happens when someone creates a simple perf threshold monitor but doesn't understand the class model, or multi-instance perf counters.  In an environment with only 500 monitored agents – this can generate over 100,000 state changes – and 50,000 alerts, in an HOUR!!!!

     

    Ok – lesson learned – DON'T target a single-instance class using a multi-instance perf counter.  So – what should I have used?  Well, in this case – I should use something like “Windows Server 2008 Logical Disk”.  But we can still screw that up!  :-)

    Bad example #2:

    I want to monitor for the perf counter Logical Disk\% Free Space\<All Instances> so that I can get an alert when any logical disk is below 20% in free space.

    I create a new monitor > Unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold.

    image

    I have learned from my mistake in Bad Example #1, so I target a more specific class, such as “Windows Server 2008 Logical Disk”.

    I choose the perf counter I want – and select all instances:

    image

    And save my monitor.

    Ack!  The SAME problem!  Why????

    The problem is – instead of each Operating System instance loading this monitor and parsing every counter instance, now EACH INSTANCE of logical disk is doing the SAME THING.  This is actually WORSE than before…. because the number of monitors loaded is MUCH higher, and it will flood me with even more state changes and alerts than before.
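    A back-of-the-envelope sketch shows why this is worse – assuming, for illustration, 500 agents with 4 logical disks each (plus the _Total counter instance):

    ```python
    agents, disks_per_agent = 500, 4
    counter_instances = disks_per_agent + 1  # C:, D:, E:, F:, plus _Total

    # Bad example #1: one monitor per OS, each parsing every counter instance.
    evaluations_example1 = agents * counter_instances

    # Bad example #2: one monitor per DISK, each STILL parsing every counter instance.
    evaluations_example2 = agents * disks_per_agent * counter_instances
    ```

    Per collection interval, example #2 performs disks-per-agent times as many evaluations as example #1 – and every mismatched evaluation is a potential state change and alert.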

    Now if I look at Health Explorer – I will likely see MULTIPLE disks have gone red, and are “flip-flopping” and throwing alerts like never before.

    image

     

    When you dig into Health Explorer – you will see that monitors are being turned Unhealthy – and it isn't even their own drive letter causing it!  Let's examine the F: drive monitor:

    I can see it was turned unhealthy because of the free space threshold hit on the D: drive!

    image

    and then flipped back to healthy due to the available space on the C: instance:

    image

    This is very, very bad.  So – what are we supposed to do???

     

    We need to target the specific class (Windows 2008 Logical Disk) AND then use a Wildcard parameter, to match the INSTANCE name of the perf counter to the INSTANCE name of the “Logical Disk” object.  Make sense?  Such as – match up the “C:” perf counter instance – to the “C:” Device ID of the Logical Disk discovered in SCOM.  This is actually easier than it sounds:

     

    Good example:

     

    I want to monitor for the perf counter Logical Disk\% Free Space\<All Instances> so that I can get an alert when any logical disk is below 20% in free space.

    I create a new monitor > Unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold.

    image

    I have learned from my mistake in Bad Example #1, so I target a more specific class, such as “Windows Server 2008 Logical Disk”.

    I choose the perf counter I want – and INSTEAD of selecting all instances, I learn from my mistake in Bad Example #2.  This time I will UNCHECK the “All Instances” box, and use the “fly-out” on the right of the “Instance:” box:

    image

     

    This fly-out will present wildcard options, which are discovered properties of the Windows Server 2008 Logical Disk class.  You can see all of these if you viewed that class in discovered inventory.  What we need to do now – is use discovered inventory to find a property, that matches the perfmon instance name.  In perfmon – we see the instance names are “C:” or “D:”

    image

    In Discovered Inventory – looking at the Windows Server 2008 Logical Disk, I can see that “Device ID” is probably a good property to match on:

    image

     

    So – I choose “Device ID” from the fly-out, which inserts this parameter wildcard, so that the monitor on EACH DISK will ONLY examine the perf data from the INSTANCE in perfmon that matches the disk drive letter.

    image

     

    The wildcard parameter is actually something like this:

    $Target/Property[Type="MicrosoftWindowsLibrary6172210!Microsoft.Windows.LogicalDevice"]/DeviceID$

    This is simply a reference to the MP that defined the “Device ID” property on the class.
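    The effect of the wildcard parameter can be sketched as a filter: each disk's monitor only sees the counter instance whose name equals its own Device ID (the data shapes below are illustrative, not SCOM internals):

    ```python
    samples = {"C:": 10.0, "D:": 80.0, "_Total": 55.0}  # % free space per perfmon instance
    THRESHOLD = 50.0

    def monitor_state(device_id: str, samples: dict) -> str:
        # With the instance wildcard, the monitor targeted at THIS disk only
        # evaluates the matching perfmon instance -- no cross-instance flip-flop.
        pct_free = samples[device_id]
        return "unhealthy" if pct_free < THRESHOLD else "healthy"
    ```

    The C: monitor goes unhealthy on C: data only, and the D: monitor stays healthy on D: data only – each monitor's state now reflects exactly one disk.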

     

    Now – no more flip-flopping, no more state change event floods, no more alert storms opening and closing several times per second.

     

     

    You can use this same process for any multi-instance perf object.  I have a (slightly less verbose) example using SQL server HERE.

     

    To determine if you have already messed up…. you can look at “Top 20 Alerts in an Operational Database, by Alert Count” and “Historical list of state changes by Monitor, by Day” which are available on my SQL Query List.  If these show large alert counts and monitor flip-flop, those workflows should be investigated.

  • OpsMgr: Network utilization scripts in BaseOS MP version 6.0.6958.0 may cause high CPU utilization and service crashes on Server 2003

    Recently I discussed some of the changes in the Base OS MP version 6.0.6958.0

    OpsMgr- MP Update- New Base OS MP 6.0.6958.0 adds Cluster Shared Volume monitoring, BPA, new rep

     

    One of the changes in this newer version of the MP is the addition of a new datasource module, which runs a script to output the Network Adapter Utilization.  The name of the datasource is “Microsoft.Windows.Server.2008.NetworkAdapter.BandwidthUsed.ModuleType”.   This datasource module uses the timed script property bag provider, along with a generic mapper condition detection.  The script name is:  “Microsoft.Windows.Server.NetwokAdapter.BandwidthUsed.ModuleType.vbs”
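    Conceptually, the script computes utilization as bytes-per-second converted to bits, divided by the adapter's current bandwidth.  A rough Python sketch of that math (the counter names follow the usual WMI performance/adapter conventions, but the exact queries are the script's own):

    ```python
    def percent_bandwidth_used(bytes_total_per_sec: float, current_bandwidth_bits: float) -> float:
        # CurrentBandwidth is reported in bits/sec, so convert bytes/sec to bits/sec.
        if current_bandwidth_bits <= 0:
            return 0.0  # avoid division by zero on adapters with no reported speed
        return bytes_total_per_sec * 8 * 100.0 / current_bandwidth_bits
    ```

    For example, 12.5 MB/sec on a 1 Gbit adapter works out to 10% of the “total pipe”.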

     

    There are 3 rules, and 3 monitors for each OS (2003 and 2008), which utilize this datasource:

    • Rules:
      • 2008
        • Microsoft.Windows.Server.2008.NetworkAdapter.PercentBandwidthUsedReads.Collection (Percent Bandwidth Used Read)
        • Microsoft.Windows.Server.2008.NetworkAdapter.PercentBandwidthUsedWrites.Collection (Percent Bandwidth Used Write)
        • Microsoft.Windows.Server.2008.NetworkAdapter.PercentBandwidthUsedTotal.Collection (Percent Bandwidth Used Total)
      • 2003
        • Microsoft.Windows.Server.2003.NetworkAdapter.PercentBandwidthUsedReads.Collection (Percent Bandwidth Used Read)
        • Microsoft.Windows.Server.2003.NetworkAdapter.PercentBandwidthUsedWrites.Collection (Percent Bandwidth Used Write)
        • Microsoft.Windows.Server.2003.NetworkAdapter.PercentBandwidthUsedTotal.Collection (Percent Bandwidth Used Total)
    • Monitors:
      • 2008
        • Microsoft.Windows.Server.2008.NetworkAdapter.PercentBandwidthUsedReads (Percent Bandwidth Used Read)
        • Microsoft.Windows.Server.2008.NetworkAdapter.PercentBandwidthUsedWrites (Percent Bandwidth Used Write)
        • Microsoft.Windows.Server.2008.NetworkAdapter.PercentBandwidthUsedTotal (Percent Bandwidth Used Total)
      • 2003
        • Microsoft.Windows.Server.2003.NetworkAdapter.PercentBandwidthUsedReads (Percent Bandwidth Used Read)
        • Microsoft.Windows.Server.2003.NetworkAdapter.PercentBandwidthUsedWrites (Percent Bandwidth Used Write)
        • Microsoft.Windows.Server.2003.NetworkAdapter.PercentBandwidthUsedTotal (Percent Bandwidth Used Total)

     

    Only the “Total” rules and monitors are enabled by default; the Read/Write workflows are disabled out of the box by design.

    The good:

     

    This new functionality is cool because it allows us to monitor the total utilization based on the network bandwidth as a percentage of the “total pipe”, report on this, and view the data in the console:

     

    image

     

     

    The issue:

     

    Since there is no direct perfmon data to collect this, the information must be collected via script.  I wrote about how to write this yourself HERE.

    There are 4 known issues with this script in the current Base OS MP, which can cause problems in some environments:

     

    1.  When the script executes – it consumes a high amount of CPU (WMIPrvse.exe process) for a few seconds.

    2.  The script does not support cookdown, so it runs a cscript.exe process and an instance of the script for EACH and every network adapter in your system (physical or virtual).  This makes the CPU consumption even higher, especially for systems with a large number of network adapters (such as Hyper-V servers).

    3.  The script does not support teamed network adapters very well, as they are manufacturer/driver dependent and are often missing the WMI classes expected by the script, so you will see “invalid class” errors on each script execution.

    4.  On some Windows 2003 servers, people have reported this script eventually causes a fault in netman.dll, and this can subsequently cause some additional critical services to fault/stop.

    Event Type:        Error
    Event Source:    Application Error
    Event Category:                (100)
    Event ID:              1000
    Date:                     16/10/2011
    Time:                     4:41:09 AM
    User:                     N/A
    Computer:          WSMSG7104C02
    Description:
    Faulting application svchost.exe, version 5.2.3790.3959, faulting module netman.dll, version 5.2.3790.3959, fault address 0x0000000000008d4f.
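    Issue 2 above is a cookdown failure: because the workflows cannot share one script execution, the agent spawns a separate cscript.exe process per network adapter.  A simplified Python model of the process count (purely illustrative – adapter names are made up):

    ```python
    adapters = ["NIC1", "NIC2", "TeamA", "vSwitch1"]

    # Without cookdown: each adapter's workflow spawns its own cscript.exe run.
    processes_without_cookdown = len(adapters)

    # With cookdown: one shared run would emit a property bag per adapter instead.
    property_bags = [{"adapter": a, "PercentBandwidthUsedTotal": 0.0} for a in adapters]
    processes_with_cookdown = 1
    ```

    On a Hyper-V host with many physical and virtual adapters, that per-adapter process count is what drives the CPU spikes shown below.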

     

     

     

    From a CPU perspective – below is an example Hyper-V server with multiple NICs.  I set the rule and monitor which use this script to run every 30 seconds for demonstration purposes (they run every 5 minutes by default).

    image

     

    You can see WMI (and the total CPU) spiking every 30 seconds.

    After disabling all the rules and monitors which utilize this data source, we see the following from the same server:

    image

     

     

    Based on these issues, I’d probably recommend disabling these rules AND monitors for Windows 2003 and Windows 2008.  Their impact seems to outweigh the usefulness of the data they provide.

     

     

    To disable these monitors and rules:

     

    Open the Authoring pane of the console.

    Highlight “Monitors” in the left pane.

     

    In the top line – click “Scope” until you see the “Scope Management Pack Object” pop up:

    image

     

    In the Look For box – type “Network”:

     

    image

     

    Tick the boxes next to “Windows Server 2003 Network Adapter” and “Windows Server 2008 Network Adapter” and click OK.

     

    image

     

    Now you will see a scoped view of only the monitors that target the Windows Server Network Adapter classes.  Expand Windows Server 2003 Network Adapter > Entity Health > Performance:

    image

     

    You can see that Read and Write monitors are already disabled out of the box.  You need to add a new override to disable the “Total” monitor.  Set enabled = false and save it to your Base OS Override MP for Windows 2003.

     

    Now, repeat this for the Server 2008 monitor for “Percent Bandwidth Used Total”.

     

    After disabling the two monitors that run this script – we also need to disable the rules that also share this script.  Highlight Rules in the left pane.

    Again – the read/write rules are disabled out of the box, so you only need to create an override for each “Total” rule: one for Server 2003 Percent Bandwidth Used Total, and then the same that targets Server 2008:

     

    image