• Understanding and modifying Data Warehouse retention and grooming

    You will likely find that the default retention in the OpsMgr data warehouse will need to be adjusted for your environment.  I often find customers are reluctant to adjust these – because they don't know what they want to keep.  So – they assume the defaults are good – and they just keep EVERYTHING. 

    This is a bad idea. 

    A data warehouse will often be one of the largest databases supported by a company.  Large databases cost money.  They cost money to support.  They are more difficult to maintain.  They cost more to back up in time, tape capacity, network impact, etc.  They take longer to restore in the case of a disaster.  The larger they get, the more they cost in hardware (disk space) to support them.  The larger they get, the longer reports can take to complete.

    For these reasons – you should give STRONG consideration to reducing your warehouse retention to your reporting REQUIREMENTS.  If you don't have any – MAKE SOME!

    Originally – when the product released – you had to directly edit SQL tables to adjust this.  Then – a command line tool was released to adjust these values – making the process easier and safer.  This post is just going to be a walk through of this process to better understand using this tool – and what each dataset actually means.

    Here is the link to the command line tool: 

    http://blogs.technet.com/momteam/archive/2008/05/14/data-warehouse-data-retention-policy-dwdatarp-exe.aspx

     

    Different data types are kept in the Data Warehouse in unique “Datasets”.  Each dataset represents a different data type (events, alerts, performance, etc.) and the aggregation type (raw, hourly, daily).

    Not every customer will have exactly the same data sets.  This is because some management packs will add their own dataset – if that MP has something very unique that it will collect – that does not fit into the default “buckets” that already exist.

     

    So – first – we need to understand the different datasets available – and what they mean.  All the datasets for an environment are kept in the “Dataset” table in the Warehouse database.

    select * from dataset
    order by DataSetDefaultName

    This will show us the available datasets.  Common datasets are:

    Alert data set
    Client Monitoring data set
    Event data set
    Microsoft.Windows.Client.Vista.Dataset.ClientPerf
    Microsoft.Windows.Client.Vista.Dataset.DiskFailure
    Microsoft.Windows.Client.Vista.Dataset.Memory
    Microsoft.Windows.Client.Vista.Dataset.ShellPerf
    Performance data set
    State data set

    Alert, Event, Performance, and State are the most common ones we look at.

     

    However – in the warehouse – we also keep different aggregations of some of the datasets – where it makes sense.  The most common datasets that we will aggregate are Performance data, State data, and Client Monitoring data (AEM).  The reason we have raw, hourly, and daily aggregations – is to be able to keep data for longer periods of time – but still have very good performance on running reports.

    In MOM 2005 – we used to stick ALL the raw performance data into a single table in the Warehouse.  After a year of data was reached – this meant the perf table would grow to a HUGE size – and running multiple queries against this table would be impossible to complete with acceptable performance.  It also meant grooming this table would take forever, and would be prone to timeouts and failures.

    In OpsMgr – now we aggregate this data into hourly and daily aggregations.  These aggregations allow us to “summarize” the performance, or state data, into MUCH smaller table sizes.  This means we can keep data for a MUCH longer period of time than ever before.  We also optimized this by splitting these into multiple tables.  When a table reaches a pre-determined size, or number of records – we will start a new table for inserting.  This allows grooming to be incredibly efficient – because now we can simply drop the old tables when all of the data in a table is older than the grooming retention setting.
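    To make the grooming idea concrete, here is a minimal Python sketch of size-based table rollover and drop-based grooming.  It is a toy model, not the warehouse's actual implementation: the class names, the row-count threshold, and the in-memory lists are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Partition:
    """One physical table holding a slice of rows (hypothetical model)."""
    rows: list = field(default_factory=list)  # row timestamps only

    @property
    def newest(self):
        return max(self.rows)


class PartitionedDataset:
    def __init__(self, max_rows_per_table):
        self.max_rows = max_rows_per_table
        self.partitions = [Partition()]

    def insert(self, timestamp):
        # Roll over to a fresh table once the current one is full.
        if len(self.partitions[-1].rows) >= self.max_rows:
            self.partitions.append(Partition())
        self.partitions[-1].rows.append(timestamp)

    def groom(self, now, max_age_days):
        # Drop whole tables whose newest row is past retention.
        # The table currently receiving inserts is always kept.
        cutoff = now - timedelta(days=max_age_days)
        current = self.partitions[-1]
        kept = [p for p in self.partitions[:-1] if p.rows and p.newest >= cutoff]
        dropped = len(self.partitions) - 1 - len(kept)
        self.partitions = kept + [current]
        return dropped


start = datetime(2024, 1, 1)
ds = PartitionedDataset(max_rows_per_table=24)
for hour in range(24 * 10):  # ten days of hourly rows
    ds.insert(start + timedelta(hours=hour))
dropped = ds.groom(now=start + timedelta(days=10), max_age_days=3)
```

    In this example, ten days of hourly rows land in ten tables, and grooming with a 3 day retention drops the seven oldest tables outright instead of deleting rows one at a time.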

     

    Ok – that’s the background on aggregations.  To see this information – we will need to look at the StandardDatasetAggregation table.

    select * from StandardDatasetAggregation

    That table contains all the datasets, and their aggregation settings.  To help make more sense of this -  I will join the dataset and the StandardDatasetAggregation tables in a single query – to only show you what you need to look at:

    SELECT DataSetDefaultName,
    AggregationTypeId,
    MaxDataAgeDays
    FROM StandardDatasetAggregation sda
    INNER JOIN dataset ds on ds.datasetid = sda.datasetid
    ORDER BY DataSetDefaultName

    This query will give us the common dataset name, the aggregation type, and the current maximum retention setting.

    For the AggregationTypeId:

    0 = Raw

    20 = Hourly

    30 = Daily

    Here is my output:

    DataSetDefaultName                                  AggregationTypeId  MaxDataAgeDays
    Alert data set                                      0                  400
    Client Monitoring data set                          0                  30
    Client Monitoring data set                          30                 400
    Event data set                                      0                  100
    Microsoft.Windows.Client.Vista.Dataset.ClientPerf   0                  7
    Microsoft.Windows.Client.Vista.Dataset.ClientPerf   30                 91
    Microsoft.Windows.Client.Vista.Dataset.DiskFailure  0                  7
    Microsoft.Windows.Client.Vista.Dataset.DiskFailure  30                 182
    Microsoft.Windows.Client.Vista.Dataset.Memory       0                  7
    Microsoft.Windows.Client.Vista.Dataset.Memory       30                 91
    Microsoft.Windows.Client.Vista.Dataset.ShellPerf    0                  7
    Microsoft.Windows.Client.Vista.Dataset.ShellPerf    30                 91
    Performance data set                                0                  10
    Performance data set                                20                 400
    Performance data set                                30                 400
    State data set                                      0                  180
    State data set                                      20                 400
    State data set                                      30                 400
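
    If you export those rows and want them grouped per dataset, a small Python sketch like the following can translate the AggregationTypeId values into names.  The function name and the tuple layout are my own assumptions; only the ID-to-name mapping (0, 20, 30) comes from the list above.

```python
from collections import defaultdict

# Mapping taken from the AggregationTypeId list in this post.
AGGREGATION_NAMES = {0: "Raw", 20: "Hourly", 30: "Daily"}


def retention_by_dataset(rows):
    """rows: iterable of (DataSetDefaultName, AggregationTypeId, MaxDataAgeDays)."""
    result = defaultdict(dict)
    for name, agg_id, days in rows:
        label = AGGREGATION_NAMES.get(agg_id, f"Unknown ({agg_id})")
        result[name][label] = days
    return dict(result)


# A few rows copied from the query output above.
rows = [
    ("Event data set", 0, 100),
    ("Performance data set", 0, 10),
    ("Performance data set", 20, 400),
    ("Performance data set", 30, 400),
]
summary = retention_by_dataset(rows)
for dataset, retention in summary.items():
    print(dataset, retention)
```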

     

    You will probably notice – that we only keep 10 days of RAW Performance by default.  Generally – you don't want to mess with this.  This is simply to keep a short amount of raw data – to build our hourly and daily aggregations from.  All built-in performance reports in SCOM run from Hourly or Daily aggregations by default.

     

    Now we are cooking!

    Fortunately – there is a command line tool published that will help make changes to these retention periods, and provide more information about how much data we have currently.  This tool is called DWDATARP.EXE.  It is available for download from the link above.

    This gives us a nice way to view the current settings.  Download this to your tools machine, your RMS, or directly on your warehouse machine.  Run it from a command line.

    Run just the tool with no parameters to get help:    

    C:\>dwdatarp.exe

    To get our current settings – run the tool with ONLY the -s (server\instance) and -d (database) parameters.  This will output the current settings.  However – it does not format well to the screen – so output it to a TXT file and open it:

    C:\>dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW > c:\dwoutput.txt

    Here is my output (I removed some of the vista/client garbage for brevity)

     

    Dataset name                Aggregation name     Max Age  Current Size, Kb
    Alert data set              Raw data             400      18,560 ( 1%)
    Client Monitoring data set  Raw data             30       0 ( 0%)
    Client Monitoring data set  Daily aggregations   400      16 ( 0%)
    Configuration dataset       Raw data             400      153,016 ( 4%)
    Event data set              Raw data             100      1,348,168 ( 37%)
    Performance data set        Raw data             10       467,552 ( 13%)
    Performance data set        Hourly aggregations  400      1,265,160 ( 35%)
    Performance data set        Daily aggregations   400      61,176 ( 2%)
    State data set              Raw data             180      13,024 ( 0%)
    State data set              Hourly aggregations  400      305,120 ( 8%)
    State data set              Daily aggregations   400      20,112 ( 1%)

     

    Right off the bat – I can see how little space the daily performance aggregations actually consume.  I can see how much space only 10 days of RAW perf data consumes.  I also see a surprising amount of event data consuming space in the database.  Typically – you will see that perf hourly will consume the most space in a warehouse.
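
    If you would rather not eyeball the text file, a short Python sketch like this can rank the datasets by size.  It assumes each data line looks like the ones shown in this post (dataset name, aggregation name, max age, size in Kb, percentage); the regular expression and function name are my own, and the real tool output may need a slightly different pattern.

```python
import re

# Assumed line layout: <dataset> <aggregation> <max age> <size, Kb> ( <pct>%)
ROW = re.compile(
    r"^(?P<dataset>.+?)\s+"
    r"(?P<aggregation>Raw data|Hourly aggregations|Daily aggregations)\s+"
    r"(?P<max_age>\d+)\s+"
    r"(?P<size_kb>[\d,]+)\s+"
    r"\(\s*(?P<percent>\d+)%\)\s*$"
)


def parse_report(text):
    """Return the recognized rows, largest dataset first."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # header, blank, or unrecognized line
        rows.append({
            "dataset": match["dataset"],
            "aggregation": match["aggregation"],
            "max_age_days": int(match["max_age"]),
            "size_kb": int(match["size_kb"].replace(",", "")),
        })
    return sorted(rows, key=lambda r: r["size_kb"], reverse=True)


# Sample lines copied from the output in this post.
sample = """Dataset name Aggregation name Max Age Current Size, Kb
Event data set Raw data 100 1,348,168 ( 37%)
Performance data set Raw data 10 467,552 ( 13%)
Performance data set Hourly aggregations 400 1,265,160 ( 35%)
Performance data set Daily aggregations 400 61,176 ( 2%)
"""
ranked = parse_report(sample)
for row in ranked:
    print(row["dataset"], row["aggregation"], row["size_kb"])
```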

     

    So – with this information in hand – I can do two things….

    • I can know what is using up most of the space in my warehouse.
    • I can know the Dataset name, and Aggregation name… to input to the command line tool to adjust it!

     

    Now – on to the retention adjustments.

     

    First thing – I will need to gather my Reporting service level agreement from management.  This is my requirement for how long I need to keep data for reports.  I also need to know “what kind” of reports they want to be able to run for this period.

    From this discussion with management – we determined:

    • We require detailed performance reports for 90 days (hourly aggregations)
    • We require less detailed performance reports (daily aggregations) for 1 year for trending and capacity planning.
    • We want to keep a record of all ALERTS for 6 months.
    • We don't use any event reports, so we can reduce this retention from 100 days to 30 days.
    • We don't use AEM (Client Monitoring Dataset) so we will leave this unchanged.
    • We don't report on state changes much (if any) so we will set all of these to 90 days.

    Now I will use the DWDATARP.EXE tool – to adjust these values based on my company reporting SLA:

    dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Performance data set" -a "Hourly aggregations" -m 90

    dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Performance data set" -a "Daily aggregations" -m 365

    dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Alert data set" -a "Raw data" -m 180

    dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Event data set" -a "Raw data" -m 30

    dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "State data set" -a "Raw data" -m 90

    dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "State data set" -a "Hourly aggregations" -m 90

    dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "State data set" -a "Daily aggregations" -m 90
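
    If you have many datasets to change, you can generate these command lines from a table of your SLA values instead of typing each one.  The sketch below is only a convenience wrapper that prints the same commands shown above; the variable and function names are hypothetical, and the server, database, and retention values are the ones from my example.

```python
SERVER = r"OMDW\i01"
DATABASE = "OperationsManagerDW"

# (dataset name, aggregation name, retention in days) from the reporting SLA.
SLA = [
    ("Performance data set", "Hourly aggregations", 90),
    ("Performance data set", "Daily aggregations", 365),
    ("Alert data set", "Raw data", 180),
    ("Event data set", "Raw data", 30),
    ("State data set", "Raw data", 90),
    ("State data set", "Hourly aggregations", 90),
    ("State data set", "Daily aggregations", 90),
]


def build_commands(server, database, sla):
    """Build one dwdatarp.exe command line per SLA entry."""
    return [
        f'dwdatarp.exe -s {server} -d {database} -ds "{ds}" -a "{agg}" -m {days}'
        for ds, agg, days in sla
    ]


commands = build_commands(SERVER, DATABASE, SLA)
for command in commands:
    print(command)
```

    Review the printed commands before running any of them; the script itself changes nothing.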

     

    Now my table reflects my reporting SLA – and my actual space needed in the warehouse will be much reduced in the long term:

     

    Dataset name                Aggregation name     Max Age  Current Size, Kb
    Alert data set              Raw data             180      18,560 ( 1%)
    Client Monitoring data set  Raw data             30       0 ( 0%)
    Client Monitoring data set  Daily aggregations   400      16 ( 0%)
    Configuration dataset       Raw data             400      152,944 ( 4%)
    Event data set              Raw data             30       1,348,552 ( 37%)
    Performance data set        Raw data             10       468,960 ( 13%)
    Performance data set        Hourly aggregations  90       1,265,992 ( 35%)
    Performance data set        Daily aggregations   365      61,176 ( 2%)
    State data set              Raw data             90       13,024 ( 0%)
    State data set              Hourly aggregations  90       305,120 ( 8%)
    State data set              Daily aggregations   90       20,112 ( 1%)

     

    Here are some general rules of thumb (might be different if your environment is unique)

    • Keep data in the warehouse only as long as your reporting requirements call for.
    • Do not modify the performance RAW dataset.
    • Most performance reports are run against Perf Hourly data for detailed performance throughout the day.  For reports that span long periods of time (weeks/months) you should generally use Daily aggregation.
    • Daily aggregations should generally be kept for the same retention as hourly – or longer.
    • Hourly datasets use up much more space than daily aggregations.
    • Most people don't use events in reports – and these can often be groomed much sooner than the default of 100 days.
    • Most people don't do a lot of state reporting beyond 30 days, and these can be groomed much sooner as well if desired.
    • Don't modify a setting if you don't use it.  There is no need.
    • The Configuration dataset generally should not be modified.  This keeps data about objects to report on, in the warehouse.  It should be set to at LEAST the longest of any perf, alert, event, or state datasets that you use for reporting.
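
    A few of these rules of thumb can be checked mechanically before you apply new values.  The following Python sketch is one way to do that; the function, the constant names, and the choice of which rules to encode are my own assumptions, and the 10 day figure is the default raw performance retention shown earlier in this post.

```python
REPORTING_DATASETS = {
    "Alert data set", "Event data set", "Performance data set", "State data set",
}
DEFAULT_PERF_RAW_DAYS = 10  # default shown in the output earlier in this post


def check_retention(settings):
    """Return warnings for a {(dataset, aggregation): days} mapping."""
    warnings = []

    # Daily aggregations should be kept at least as long as hourly.
    for dataset in sorted({ds for ds, _ in settings}):
        hourly = settings.get((dataset, "Hourly aggregations"))
        daily = settings.get((dataset, "Daily aggregations"))
        if hourly is not None and daily is not None and daily < hourly:
            warnings.append(f"{dataset}: daily ({daily}) is shorter than hourly ({hourly})")

    # The performance RAW dataset should be left alone.
    perf_raw = settings.get(("Performance data set", "Raw data"))
    if perf_raw is not None and perf_raw != DEFAULT_PERF_RAW_DAYS:
        warnings.append(f"Performance raw retention changed to {perf_raw} days")

    # Configuration should cover the longest reporting dataset.
    config = settings.get(("Configuration dataset", "Raw data"))
    longest = max(
        (days for (ds, _), days in settings.items() if ds in REPORTING_DATASETS),
        default=None,
    )
    if config is not None and longest is not None and config < longest:
        warnings.append(f"Configuration dataset ({config}) is shorter than {longest} days")

    return warnings


# The values I ended up with in this post.
sla_settings = {
    ("Alert data set", "Raw data"): 180,
    ("Configuration dataset", "Raw data"): 400,
    ("Event data set", "Raw data"): 30,
    ("Performance data set", "Raw data"): 10,
    ("Performance data set", "Hourly aggregations"): 90,
    ("Performance data set", "Daily aggregations"): 365,
    ("State data set", "Raw data"): 90,
    ("State data set", "Hourly aggregations"): 90,
    ("State data set", "Daily aggregations"): 90,
}
print(check_retention(sla_settings))
```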
  • Which hotfixes should I apply?

     

    This is updated as of 8-11-2014 

    In general - you should evaluate all hotfixes available, and only apply those applicable to your environment.  However, some of these below I have seen impact almost every environment, and should be heavily considered.

    This list is nothing official.... this is just a general list of the recommended hotfixes I end up proactively applying to most environments.... it is not a complete list of ALL hotfixes, and you may be affected by other issues.

     

    Common OpsMgr 2012 R2 updates:

    This list ABSOLUTELY assumes you are at OpsMgr 2012 R2 level as a base.

    Hotfix:        2965445 (UR3 for 2012 R2)
                   OpsMgr 2012 R2 Update Rollup 3
                   http://support.microsoft.com/kb/2965445
                   http://blogs.technet.com/b/kevinholman/archive/2014/08/07/ur3-for-scom-2012-r2-step-by-step.aspx
    Resolves:      Many updates. See KB article for all.
    Applies to:    MS, GW, Agents, Console, WebConsole, MP Import, SQL scripts
    Comments:      See the KB and blog article

     

    Common OpsMgr 2012 SP1 updates:

    This list ABSOLUTELY assumes you are at OpsMgr 2012 SP1 level as a base (7.0.9538.0).

    Hotfix:        2965420 (UR7 for 2012 SP1)
                   OpsMgr 2012 SP1 Update Rollup 7
                   http://support.microsoft.com/kb/2965420
    Resolves:      Many updates. See KB article for all.
    Applies to:    MS, GW, Agents, Console, WebConsole, MP Import, ACS, SQL script
    Comments:      This should be applied after installing SP1, as it has some critical fixes.  See the KB article – do not apply it immediately after installing SP1.

     

    Common OpsMgr 2007 R2 hotfixes:

    This list ABSOLUTELY assumes you are at OpsMgr R2-RTM level as a base (6.1.7221.0).   

    Hotfix:        MP Update
    Update files:  Microsoft.SystemCenter.2007.mp (6.1.7695.0)
                   Microsoft.SystemCenter.OperationsManager.2007.mp (6.1.7695.0)
                   Microsoft.SystemCenter.OperationsManager.AM.DR.2007.mp (6.1.7695.0)
                   Microsoft.SystemCenter.OperationsManager.Reports.2007.mp (6.1.7695.0)
                   ODR.mp (6.1.7695.0)
    Resolves:      New reports, knowledge, monitors, rules.  See MP Guide.
    Applies to:    MP import only
    Comments:      I recommend this update for ALL OpsMgr R2 environments.

    Hotfix:        2783850 (R2 CU7)
                   OpsMgr 2007 R2 CU7 Cumulative Update
                   http://support.microsoft.com/kb/2783850/en-us
                   http://www.microsoft.com/en-us/download/details.aspx?id=36379
    Update files:  Multiple.  See KB Article.  Note this is a DLL update, MP updates, and SQL scripts update.
    Resolves:      Many updates.  See KB
    Applies to:    RMS, MS, GW, Agents, AuditCollector, Console, WebConsole, MP Import, TSQL Script
    Comments:      This hotfix includes a SQL script, which you execute on the database in a query window.

    Hotfix:        971233
    Update files:  none
    Resolves:      The console shows customized subscriptions SMTP{GUID} after you upgrade to OpsMgr R2 from OpsMgr SP1
    Applies to:    Operations Database (TSQL only)
    Comments:      I recommend this hotfix only if you are impacted with this issue.

     

    Common related Windows Operating System Hotfixes:

    This list is not sorted by OS or anything special – just a collection of OS related hotfixes that SCOM might require, or might fix an issue with the OS that impacts OpsMgr.  These can apply to OpsMgr 2012, 2007R2, or 2007SP1 environments.  

    Hotfix Resolves
    Applies to: Comments
    2954185 Memory leak in WMIPRVSE.exe caused by monitoring DNS.  Agent role
    Server 2012 R2
    I recommend this hotfix on all Windows Server 2012 R2 DNS servers that are monitored using the DNS management pack, or any type of monitoring which leverages the DNS PowerShell providers.  The leak is caused by accessing the DNS PowerShell providers, which scripts will often call.
    2923126 Agents on Windows 2012 R2 Domain Controllers can stop responding or heart-beating Agent role
    Server 2012 R2
    Windows 8.1
    I recommend this hotfix on all Windows Server 2012 R2 servers that are experiencing this issue, or proactively on any WS2012R2 domain controller.
    2790831 Handle leak in WmiPrvSE.exe process on Windows 8 or Windows Server 2012 Agent role
    Management Server

    Server 2012
    Windows 8
    I recommend this hotfix to stop the handle leak that WILL occur without it.
    2692929 "0x80041001" error when the Win32_Environment WMI class is queried by multiple requestors in Windows 7 or in Windows Server 2008 R2 Agent role
    Server 2008R2
    I recommend this hotfix for Server 2008 R2 machines that also host a ConfigMgr 2007 role, and you are using the ConfigMgr 2007 MP.  This fixes an annoying issue.
    2618982
    Memory leak occurs when monitoring IIS.  The IIS MP discovery modules leverage the same function called out in the KB.  This causes a memory leak in Monitoringhost.exe over time on IIS servers, especially when using APM in OpsMgr 2012. Any OpsMgr Agent Managed or Server role running IIS on Windows 2008 R2 SP1 I recommend this hotfix to be applied to any Server 2008R2 SP1 OS, if it is agent managed and has IIS installed.
    2547244 The WMI service and the WMI providers stop responding when you use WMI performance classes to monitor performance on a computer that is running Windows 7 or Windows Server 2008 R2 Agent role
    Management Server
    Windows 2008 R2
    Windows 2008 R2 SP1
    Windows 7
    I recommend this hotfix if you are experiencing WMI failures, service crashes, or failures to collect data from WMI performance counters via script.  If impacted, it would be applied to any Server 2008R2 or Win7 machine, if it is agent managed or holds a SCOM server role.
    2470949 The RegQueryValueEx function returns a very large incorrect value for the "Avg. Disk sec/Transfer" performance counter in Windows Server 2008 R2 or in Windows 7 Agent role
    Management Server
    Windows 2008 R2
    Windows 2008 R2 SP1
    Windows 7
    I recommend this hotfix to be applied to any Server 2008R2 or Win7 machine, if it is agent managed or holds a SCOM server role.
    2495300

    (see notes)
    Invalid "Avg. Disk sec/Transfer" value returned by the RegQueryValueEx function in Windows Server 2008 or in Windows Vista Agent role
    Management server
    Windows 2008
    Vista
    I recommend this hotfix to be applied to any Server 2008 or Vista machine, if it is agent managed or holds a SCOM server role.

    Because this KB hotfix has an issue that breaks teamed network adapters, I recommend deploying http://support.microsoft.com/kb/2710558/en-us instead, which supersedes 2495300.
    981314 The "Win32_Service" WMI class leaks memory in Windows Server 2008 R2 and in Windows 7 Agent role
    Management Server
    Windows 2008 R2
    Windows 7
    I recommend this hotfix to be applied to any Server 2008R2 or Win7 machine, if it is agent managed or holds a SCOM server role.


    This hotfix is already included in Server 2008 R2 Service Pack 1
    981263 Management servers or assigned agents unexpectedly appear as unavailable in the Operations Manager console in Windows Server 2003 or Windows Server 2008
    (ESE jet database corruption)
    Agent Role
    Management Server
    Server 2003 SP2
    Server 2008 SP2

    I recommend this hotfix for all RMS, MS, and GW roles running Windows Server 2003 SP2, or Windows Server 2008 SP2.

    Apply to agent machines if you feel you are impacted by this issue.
    932370 The number of physical hyperthreading-enabled processors or the number of physical multicore processors is incorrectly reported in Windows Server 2003 Agent role 
    Server 2003
    Server 2003 SP1
    Server 2003 SP2
    I recommend this hotfix for all agent managed computers running Windows Server 2003, SP1 or SP2, x86 or x64 (If managed using SCOM 2012)
    933061 WMI Stability in Server 2003 Agent role 
    Server 2003
    Server 2003 SP1
    Server 2003 SP2
    I recommend this hotfix for all agent managed computers running Windows Server 2003, SP1 or SP2, x86 or x64
    955360 Cscript 5.7 update for Server 2003 Agent role 
    Server 2003
    Server 2003 SP1
    Server 2003 SP2
    I recommend this hotfix for all agent managed computers running Windows Server 2003, SP1 or SP2, x86 or x64
    968760 High handle count on the RMS

    A managed application has a high number of thread handles and of event handles in the Microsoft .NET Framework 2.0
    SCOM 2007 RMS I recommend this hotfix if you are experiencing high handle count on the RMS. 

    This hotfix requires SP2 for the OS and .NET 2.0 SP2.
    968967 The CPU usage of an application or a service that uses MSXML 6.0 to handle XML requests reaches 100% in Windows Server 2008, Windows Vista, Windows XP Service Pack 3, or other systems that have MSXML 6.0 installed
    (Spinlock)
    RMS
    MS
    GW
    Agent
    I recommend this hotfix if you are impacted with this issue, which is very common.

    You might find a MonitoringHost.exe process randomly stuck at 100% CPU.  If so – this hotfix might be applicable.
    951327 The System Center Operations Manager 2007 console may crash in Windows Server 2008 or in Windows Vista when you open the Health Explorer window Any Vista or Server 2008 computer with a SCOM console installed I recommend this hotfix only if you run the console on Server 2008 or Vista. 


    This hotfix is already included in Server 2008 SP2.
    952664 The Event Log service may stop responding because of a deadlock on a Windows Server 2008-based or Windows Vista-based computer RMS
    MS
    GW
    Agent
    I recommend this hotfix only if you host an OpsMgr server or agent role on Vista or Server 2008. 

    This hotfix is already included in Server 2008 SP2.
    953290 An application may crash when it uses legacy methods to query performance counter values in Windows Vista or in Windows Server 2008 RMS
    MS
    GW
    Agent
    I recommend this hotfix only if you host an OpsMgr server or agent role on Vista or Server 2008. 

    This hotfix is already included in Server 2008 SP2.
    958661 FIX: Small memory leaks may occur when you use RSCA to query runtime statistics in IIS 7.0 Any OpsMgr Agent/Server role with IIS 7.0 installed I recommend this hotfix in all cases where you are monitoring servers with IIS 7.0 installed, and use the IIS Management pack.

    This hotfix is already included in Server 2008 SP2.
    958807 Windows Server 2008 Failover Clustering WMI provider does not correctly handle invalid characters in the private property names causing WMI queries to fail Any Server 2008 agent managed cluster node I recommend this hotfix only if you are impacted with this issue, and use the current Cluster MP.

    This hotfix is already included in Server 2008 SP2.

    Some general guidance on hotfixes to make you more successful:

    ALWAYS - on Server 2008 OS and later, run the hotfix MSI from an elevated command prompt window. This will launch the install of the hotfix, and then launch the boot-strapper window in an elevated process – which is required. Do this regardless of the UAC configuration of the 2008 (and later) OS.

    ALWAYS - make sure you read the instructions to understand if the hotfix is a SQL update, installed to the RMS, MS, and/or Gateway, AND/OR applies to agents as well.

    ALWAYS - make sure you double-check the DLL version of the updated files to make sure the hotfix successfully applied after installing.

    ALWAYS - make sure you double-check the \AgentManagement directory of the management servers and gateways, to make sure if there is an agent update, the x86 and x64 MSP was copied over correctly.

    ALWAYS check the language version of the hotfix, and make sure it is the same language version as your SCOM base install. For instance – if you have an English base SCOM install – do not download a localized German version of a hotfix and apply it – or it can break the English SCOM base install.

    ALWAYS log on to your OpsMgr role servers using a domain user account that meets the following requirements:

    • SCOM administrator role
    • Member of the Local Administrators group on all SCOM role servers (RMS, MS, GW, Reporting)
    • SA privileges on the SQL server instances hosting the Operations DB and the Warehouse DB.

    These rights (especially the user account having SA priv on the DB instances) are often overlooked. These are the same rights required to install SCOM, and must be granted to apply major hotfixes and upgrades (like RTM>SP1, SP1>R2, etc.). Most of the time, the issue I run into is that the SCOM admin logs on with an account that holds the SCOM Administrator role on the SCOM servers, but the DBAs do not allow that account SA priv over the DB instances. This must be granted temporarily to the user account while performing the updates, and can then be removed, just like for the initial installation of SCOM as documented HERE. At NO time do your service accounts for MSAA or SDK need SA priv to the DB instances, unless you decide to log in as those accounts to perform an update (which I do not recommend).

  • OpsMgr: MP Update: New Base OS MP 6.0.6972.0 Adds new cluster disks, changes free space monitoring, other fixes

    There is a new Base OS MP version 6.0.6972.0 available here:  http://www.microsoft.com/en-us/download/details.aspx?id=9296

     

    Be very careful updating to this new version – there are multiple changes and potential issues you should plan for and test with, that might impact your existing environments.  I will discuss them below.

     

    I previously wrote about the last MP update HERE and HERE.  Then I wrote about some issues in the MP’s with Logical Disk monitoring HERE.  Additionally, there were some problems with the network monitoring utilization scripts HERE.  All of these items have been addressed in this latest MP update. (somewhat)

     

    First – let's cover the list of updates from the guide:

    Changes in This Update

    •    Updated the Cluster shared volume disk monitors so that alert severity corresponds to the monitor state.
    •    Fixed an issue where the performance by utilization report would fail to deploy with the message “too many arguments specified”.
    •    Updated the knowledge for the available MB monitor to refer to the Available MB counter.
    •    Added discovery and monitoring of clustered disks for Windows Server 2008 and above clusters.
    •    Added views for clustered disks.
    •    Aligned disk monitoring so that all disks (Logical Disks, Cluster Shared Volumes, Clustered disks) now have the same basic set of monitors.
    •    There are now separate monitors that measure available MB and %Free disk space for any disk (Logical Disk, Cluster Shared Volume, or Clustered disk).

    Note :  These monitors are disabled by default for Logical Disks, so you will need to enable them if you want to use them in place of the default Logical Disk monitor for free space.

    •    Updated display names for all disks to be consistent, regardless of the disk type.
    •    The monitors generate alerts when they are in an error state.  A warning state does not create an alert.
    •    The monitors have a roll-up monitor that also reflects disk state. This monitor does not alert by default. If you want to alert on both warning and error states, you can have the unit monitors alert on warning state and the roll-up monitor alert on error state.
    •    Fixed an issue where network adapter monitoring caused high CPU utilization on servers with multiple NICs.
    •    Updated the Total CPU Utilization Percentage monitor to run every 5 minutes and alert if it is three consecutive samples above the threshold.
    •    Updated the properties of the Operating System instances so that the path includes the server name it applies to so that this name will show up in alerts.
    •    Disabled the network bandwidth utilization monitors for Windows Server 2003.
    •    Updated the Cluster Shared Volume monitoring scripts so they do not log informational events.
    •    Quorum disks are now discovered by default.
    •    Mount point discovery is now disabled by default.

    Notes:  This version of the Management Pack consolidates disk monitoring for all types of disks as mentioned above. However, for Logical Disks, the previous Logical Disk Free Space monitor, which uses a combination of Available MB and %Free space, is still enabled.  If you prefer to use the new monitors (Disk Free Space (MB) Low and Disk Free Space (%) Low), you must disable the Logical Disk Free Space monitor before enabling the new monitors.
    The default thresholds for the Available MB monitor are not changed: the warning threshold (which will not alert) is 500MB and the error threshold (which will alert) is 300MB. This will cause alerts to be generated for small disk volumes. Before enabling the new monitors, it is recommended to create a group of these small disks (using the disk size properties as criteria for the group), and override the threshold for available MB for that group.
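
    To see why small volumes are a problem with those defaults, here is a minimal Python sketch of a two-threshold Available MB check.  The 500MB and 300MB values come from the note above; the function name, the strict less-than comparison, and the example override values (20MB and 10MB) are assumptions for illustration and not the monitor's actual logic.

```python
WARNING_MB = 500  # default warning threshold from the guide (does not alert)
ERROR_MB = 300    # default error threshold from the guide (alerts)


def available_mb_state(free_mb, warning_mb=WARNING_MB, error_mb=ERROR_MB):
    """Classify a disk by free megabytes (assumed strict less-than comparison)."""
    if free_mb < error_mb:
        return "error"
    if free_mb < warning_mb:
        return "warning"
    return "healthy"


# A 100MB volume with 70MB free is 70% free, yet it trips the default error threshold.
small_disk_default = available_mb_state(70)

# With a hypothetical override for a group of small disks, the same volume is healthy.
small_disk_override = available_mb_state(70, warning_mb=20, error_mb=10)

print(small_disk_default, small_disk_override)
```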

    Ok, sounds good.  But what does all that mean to me?

     

    I will summarize the fundamental changes below:

     

    1.  Disk discovery and monitoring has changed.  We now will UNDISCOVER any “Logical Disks” that are hosted by a Windows Server 2008 R2 cluster, and REDISCOVER those as a new entity, of the “Cluster Disk” class.  This discovery only pertains to Windows Server 2008 R2 and later; it does not affect Server 2008 and older clusters.

     

    There are now THREE types of disks we will discover and monitor:

    • Logical Disks
    • Cluster Disks
    • Cluster Shared Volumes

    Logical Disks include disks that are not part of/hosted by a cluster, and include disks with a drive letter, and any disks without a drive letter (which are discovered as mount points).

    Cluster Disks include any disk that is hosted by a Microsoft Cluster as a shared resource, but not a specific Cluster Shared Volume.

    Cluster Shared Volumes are a specific type of cluster disk that is leveraged by Hyper-V clusters for placement of virtual machines.

    For most customers, the impact will be if you have placed any instance or group specific overrides for your cluster disks, these will no longer apply, as these disks are going to be re-discovered as a new entity of a new class, “Cluster Disk”.  This new class will have entirely different monitoring targeting it, described below.

    However, this is a GOOD thing!  In the past, if you had a disk that was part of a cluster, it was undiscovered and rediscovered on each NODE when a failover occurred.  If you did overrides for the disk while it was on one node, your changes would no longer apply when it failed over to another node, because it was literally discovered as a different disk! (basemanagedentity)  This is now resolved – the disk will retain the same BaseManagedEntityId (its unique GUID under the covers in SCOM) as it moves from node to node.  It is also now “hosted” by the cluster, and not the Operating System class.

    I put together a state dashboard that demonstrates these different disk types:

     

    image

     

    There are also distinct views for these that ship inside the management pack:

    image

     

    Another point to make here – is that the Mount Point discovery, which has been enabled in all previous Base OS MP’s, is now DISABLED.  This means you will no longer discover mount points by default.  You can enable this via override if you want mount point discovery, or selectively enable it only for specific servers that you know host a mount point that you wish to monitor.

    Our mount point discovery is a bit misleading.  It does not discover only mount points; we use it to discover ANY disk that does not have a drive letter assigned.  For instance, you may have noticed on your Server 2008 R2 machines, that you discovered a 100MB logical disk. 

     

    image

     

    These 100MB disks are System Reserved for BitLocker use, to hold the boot loader.  Once you upgrade to the new MP version, new mounted disks (non-clustered disks with no drive letter) will no longer be discovered, as this discovery is disabled by default.  This will NOT remove the previously discovered disks, however, and neither will running Remove-DisabledMonitoringObject.  The cmdlet only removes objects if there is an explicit *override* disabling their discovery; when we change the default configuration of a discovery to disabled, the cmdlet has no impact.  So if you want to remove these disks from your management group, add an explicit override disabling the mount point discovery, and THEN run the cmdlet.  Keep in mind that doing this will undiscover ALL your mounted disks, possibly including real mount points if you have those.  As there is ZERO value in discovering and monitoring these 100MB disks, I’d recommend disabling the mounted disk discovery with an explicit override, then creating instance-specific or group-specific overrides to re-enable it for your servers that DO host a mounted disk.

     

     

    2.  Logical Disk free space monitoring, along with Cluster Disk and Cluster Shared Volume monitoring has changed.  Here are the details:

    The default configuration of the “Logical Disk Free Space” monitor is largely UNCHANGED from MP version 6.0.6958.0, which I wrote about HERE.  This was done to create the lowest possible impact on you, the admin, who is using this monitor, and who likely already has many overrides in place and has integrated this alert into ticketing systems.  There were many complaints that this monitor (once it was modified to allow for consecutive samples) no longer generated alerts that contained the free space and MB free values in the alert description.  This is still the case in this version, because the monitor was not modified.  This monitor also generates alerts for both warning AND critical states, which is NOT a good thing.  When a single monitor alerts on both warning and critical states, a *new* alert is *not* generated when the monitor changes from warning to critical.  We simply change the severity of the existing alert from warning to critical (if it exists in an open state).  This modification will NOT trigger a new notification from a subscription, nor will it route the alert to a connector subscription filtered on “critical” severity alerts, because the alert has already been inspected and watermarked.  For this reason I never recommend alerting on both the warning and critical states of a three-state monitor.
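
    To make the warning-to-critical behavior concrete, here is a minimal Python sketch of it.  This is a toy model, not SCOM code: the class names, the "notifications" list, and the filter function are all hypothetical, and the only rule it encodes is the one described above (an open alert is updated in place, and subscriptions only see newly created alerts).

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    monitor: str
    severity: str          # "Warning" or "Critical"
    is_open: bool = True

@dataclass
class AlertStore:
    alerts: list = field(default_factory=list)
    # Hypothetical record of what subscriptions were shown: (monitor, severity at creation)
    notifications: list = field(default_factory=list)

    def on_state_change(self, monitor, new_state):
        # If the monitor already has an open alert, only its severity is updated.
        for alert in self.alerts:
            if alert.monitor == monitor and alert.is_open:
                alert.severity = new_state
                return alert
        # Otherwise a new alert is created, and only then is a subscription evaluated.
        alert = Alert(monitor, new_state)
        self.alerts.append(alert)
        self.notifications.append((monitor, new_state))
        return alert

def critical_only(notifications):
    # Models a subscription filtered on critical severity.
    return [n for n in notifications if n[1] == "Critical"]

store = AlertStore()
store.on_state_change("Logical Disk Free Space", "Warning")
store.on_state_change("Logical Disk Free Space", "Critical")

print(len(store.alerts), store.alerts[0].severity)
print(critical_only(store.notifications))
```

    In this model the alert ends up critical, yet the critical-only subscription never receives anything, which is the problem with alerting on both states from one monitor.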

    However, another complaint we often got was that customers didn’t understand how this monitor worked, in that we inspect BOTH % free threshold AND MB free threshold, and BOTH conditions need to be met before we will change the state of the monitor and generate an alert.  This is a very good design, because it helps cut out the majority of noise and remains flexible for disks of different sizes.  That said, many customers would say “I just want a simple monitor to alert on % free ONLY, or MB free ONLY…” which was easier for them to understand.  Therefore, we have added THREE new monitors for disk space monitoring of logical disks.
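
    The “both conditions must be met” logic is easier to see in a few lines of Python.  This is only a sketch of the idea: the function name is mine and the threshold values are placeholders chosen for illustration, not the management pack defaults.

```python
def free_space_state(pct_free, mb_free, warn=(10.0, 2000), crit=(5.0, 1000)):
    """Return the health state for one disk.

    Each threshold pair is (% free, MB free). The values are hypothetical.
    A state is only entered when BOTH the % free and the MB free values
    are below their thresholds.
    """
    warn_pct, warn_mb = warn
    crit_pct, crit_mb = crit
    if pct_free < crit_pct and mb_free < crit_mb:
        return "Critical"
    if pct_free < warn_pct and mb_free < warn_mb:
        return "Warning"
    return "Healthy"

# Large disk: low percentage, but still plenty of absolute space.
print(free_space_state(pct_free=4.0, mb_free=80000))
# Small disk: low percentage AND little absolute space.
print(free_space_state(pct_free=4.0, mb_free=400))
# Small disk: few MB free, but that is still a healthy share of the disk.
print(free_space_state(pct_free=40.0, mb_free=800))
```

    With these placeholder thresholds the three calls return Healthy, Critical, and Healthy.  Only the disk that is low by both measures changes state, which is how the dual threshold cuts noise across disks of very different sizes.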

    These new monitors are disabled by default, to allow customers to choose if they want to implement them.  What we have done is create two new Unit monitors, one for % free and one for MB free, and place both of these under an aggregate rollup monitor.

     

    image

     

    If enabled, the customer can pick if they want only %, or only MB free, or both, via overrides.  These new Unit monitors also provide a richer alert description as seen below:

    The disk F: on computer computer1.domain.com is running out of disk space. The value that exceeded the threshold is 28 free Mbytes.

    The disk F: on computer computer1.domain.com is running out of disk space. The value that exceeded the threshold is 4% free space.

    Additionally, if the customer DOES want alerts on warning state for these monitors, they can enable this, and additionally enable alerting on the Aggregate rollup monitor above, to issue critical alerts only.  This way, you can have unique alerting for a warning state, but if any monitor is critical, we can roll up health and generate a NEW alert for critical state, which can be used to send a notification or send to a ticketing system.
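
    Here is a rough Python sketch of that alerting arrangement.  It assumes the aggregate monitor uses a worst-state rollup and that you have configured the unit monitors to alert on warning and the aggregate to alert on critical.  The names and the simplified model are mine, not anything taken from the management pack.

```python
SEVERITY = {"Healthy": 0, "Warning": 1, "Critical": 2}

def worst_of(states):
    # Worst-state rollup (assumed policy): the aggregate takes the worst child state.
    return max(states, key=SEVERITY.__getitem__)

def alerts_to_raise(pct_state, mb_state):
    """Return (source, severity) pairs for the hypothetical configuration above."""
    alerts = []
    # Unit monitors: configured here to alert on the warning state.
    for name, state in (("% Free unit monitor", pct_state), ("MB Free unit monitor", mb_state)):
        if state == "Warning":
            alerts.append((name, "Warning"))
    # Aggregate monitor: configured here to alert on critical only, as its own NEW alert.
    if worst_of([pct_state, mb_state]) == "Critical":
        alerts.append(("Aggregate rollup monitor", "Critical"))
    return alerts

print(alerts_to_raise("Warning", "Healthy"))
print(alerts_to_raise("Warning", "Critical"))
```

    The point of the design is visible in the second call: the critical condition produces a separate alert from the aggregate monitor, so a critical-only notification or ticketing subscription has something new to act on.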

    As you can see, a lot of thought went into this new design, trying to make the new format fit as many customer requested scenarios as possible.  You essentially have three options now:

     

    • Continue to use the existing Logical Disk Free space monitor that is provided and enabled in the management pack.
    • Enable and start using the newly designed Logical Disk free space monitors, based on your specific requirements.
    • Use my addendum MP which uses a single free space monitor that is similar to the old Base OS management packs, described and available HERE

     

    For Cluster Disks, and Cluster Shared Volume disks – both of those are using the new format for free disk space monitoring:

     

    image

    image

     

    Based on this, I’d recommend considering and testing a move of your logical disk free space monitoring over to the new style as well, to have a consistent experience.  I welcome your feedback on this point.

     

    ***Note – if you enable the new Logical Disk free space monitors, the MB Free monitor will go into a critical state for any Logical disk that is under 2GB (non-system) or 500MB (system).  This means if you have any tiny disks, such as the 100MB bitlocker disks, this monitor will alert on all of those disks, potentially creating a large number of alerts.  I’d recommend undiscovering those 100MB disks (see #1 above) or create a dynamic group of disks in your override MP, based on “size is less than a specific numerical size”, and use this group to disable free space monitoring.
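
    A quick way to see why the tiny disks are a problem is to run the numbers.  The sketch below uses the critical thresholds from the note above (treating 2GB as 2048 MB is my assumption) against a made-up disk inventory, and then applies the kind of “size is less than” filter a dynamic group would use.  The 500 MB group cutoff is just an example value.

```python
# Critical thresholds in MB, from the note above; 2GB taken as 2048 MB (assumption).
CRITICAL_MB = {"system": 500, "non_system": 2048}

def mb_free_is_critical(mb_free, is_system):
    limit = CRITICAL_MB["system" if is_system else "non_system"]
    return mb_free < limit

# Hypothetical inventory, for illustration only.
disks = [
    {"name": "C:", "size_mb": 81920, "mb_free": 30000, "is_system": True},
    {"name": "D:", "size_mb": 512000, "mb_free": 1500, "is_system": False},
    {"name": "System Reserved", "size_mb": 100, "mb_free": 70, "is_system": False},
]

# Disks the MB Free monitor would turn critical.
critical = [d["name"] for d in disks if mb_free_is_critical(d["mb_free"], d["is_system"])]

# Equivalent of a dynamic group of disks smaller than an example cutoff of 500 MB.
tiny_disk_group = [d["name"] for d in disks if d["size_mb"] < 500]

print(critical)
print(tiny_disk_group)
```

    A 100MB disk can never have more free space than either threshold, so it will always be critical under the MB Free monitor.  That is why excluding those disks by size, or undiscovering them, is the practical fix.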

     

    3.  The previous “Cluster Shared Volume” MP, which was “Microsoft.Windows.Server.ClusterSharedVolumeMonitoring.mp”, has a new displayname of “Windows Server Cluster Disks Monitoring”, and the new classes for Cluster disks mentioned above are included in this MP.  So if you didn’t import it previously because you weren't using Hyper-V Cluster Shared Volumes, you need this MP now to discover and monitor clustered disks.

     

    4.  We have disabled the Network Utilization scripts by default on Server 2003, and fixed them for Server 2008 so that they consume fewer resources.  I wrote about this previously HERE.  This should now be addressed, so if you previously disabled these, but want that counter for alerting or perf collection, you can consider enabling it.  It should REMAIN disabled for Windows 2003, as there is an issue with Netman.dll which causes services to crash.

     

    5.  The “Total CPU Utilization Percentage” monitor was changed.  In previous management packs, it would inspect the value every 2 minutes, and if the AVERAGE of 5 samples for “CPU Queue length” AND “% Processor Time” were over their default thresholds, we would generate an alert.  Now, we inspect the value every 5 minutes, and if the AVERAGE of 3 samples for both counters is over the thresholds, then an alert is generated.  I am told this change was made on customer request.  I have to assume it was to spread the evaluation over a longer time span, but I am not really sure.  It seems fairly insignificant.
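
    For reference, here is a small Python sketch of the averaging logic and of the time window arithmetic.  The function names are mine, and the threshold defaults in the code are placeholders rather than values taken from the management pack.

```python
from statistics import mean

def cpu_alert(queue_samples, pct_samples, num_samples=3,
              queue_threshold=15, pct_threshold=95):
    """Alert only when the average of the last N samples of BOTH counters
    is over its threshold. Threshold values here are placeholders."""
    if len(queue_samples) < num_samples or len(pct_samples) < num_samples:
        return False  # not enough samples collected yet
    avg_queue = mean(queue_samples[-num_samples:])
    avg_pct = mean(pct_samples[-num_samples:])
    return avg_queue > queue_threshold and avg_pct > pct_threshold

def window_minutes(interval_minutes, num_samples):
    # Rough time needed to collect the samples that feed one average.
    return interval_minutes * num_samples

print(window_minutes(2, 5))   # old behavior: 5 samples, 2 minutes apart
print(window_minutes(5, 3))   # new behavior: 3 samples, 5 minutes apart
print(cpu_alert([20, 18, 22], [97, 99, 96]))
print(cpu_alert([20, 18, 22], [97, 99, 80]))
```

    The old settings cover roughly 10 minutes and the new ones roughly 15, so a CPU spike has to persist a little longer before it alerts, while fewer samples are collected overall.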

     

     

    Known Issues/Things to remember:

     

    1.  Which MP’s to import:  This MP update contains the following files:

    image

    Don’t import management packs that you don’t need or use. 

    Don’t import the BPA management pack if you don’t want to see alerts for this new feature.

    Don’t import the Microsoft.Windows.Server.Reports.mp if your back-end SQL is still running SQL 2005.  This MP is supported on SQL 2008 and newer only, and it will break your reporting if you import it into a management group that uses SQL 2005 on the back-end.

    DO import the Microsoft.Windows.Server.ClusterSharedVolumeMonitoring.mp because this contains the discovery and monitoring for Cluster Disks, not just Cluster Shared Volumes.  If you don’t import this, your monitoring of clustered disks will disappear.

     

    2.  The knowledge for the Total CPU Utilization Percentage is incorrect – the monitor was updated to a default value of 3 samples but the knowledge still reflects 5 samples.

     

    3.  There are no free space perf collection rules for “Cluster Disks”.  We have multiple performance collection rules for Logical Disks and for Cluster Shared Volumes, but there are none for the new Cluster Disks class.  If you want performance reports on free space, disk latency, idle time, etc., you will need to create these rules yourself.

     

    4.  Perf collection and disk monitoring for cluster disks and CSV’s only work when the resource group hosting the disks is on the same node that is hosting the cluster name (quorum) resource.  If the disk’s resource group is running on a different node than the cluster name itself, perf collection and monitoring will cease.

  • Adding custom information to alert description (s) and notifications

    This is just a dump of some alert description variables I pulled from several other bloggers:

    Custom Properties for Alert Description and Notification:

    Alert Description Variables:

     

    For event Rules:

    EventDisplayNumber (Event ID):             $Data/EventDisplayNumber$
    EventDescription (Description):               $Data/EventDescription$
    Publisher Name (Event Source):              $Data/PublisherName$
    EventCategory:                                    $Data/EventCategory$
    LoggingComputer:                                $Data/LoggingComputer$
    EventLevel:                                          $Data/EventLevel$
    Channel:                                              $Data/Channel$
    UserName:                                           $Data/UserName$
    EventNumber:                                      $Data/EventNumber$
    Event Time:                                          $Data/@time$
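
    To show how these variables are typically used, here is a short Python sketch that mimics the substitution on a sample alert description.  This is only a simulation: SCOM resolves these paths against the event data item itself, whereas the render function, the sample template, and the sample event values below are all made up for illustration.

```python
import re

def render(template, values):
    """Replace each $path$ token with its value; unknown tokens are left untouched."""
    def substitute(match):
        return str(values.get(match.group(1), match.group(0)))
    return re.sub(r"\$([^$\s]+)\$", substitute, template)

# A sample alert description as you might type it into an event rule.
template = ("Event $Data/EventDisplayNumber$ from $Data/PublisherName$ "
            "on $Data/LoggingComputer$: $Data/EventDescription$")

# Hypothetical event values, keyed by the path between the dollar signs.
event = {
    "Data/EventDisplayNumber": 7031,
    "Data/PublisherName": "Service Control Manager",
    "Data/LoggingComputer": "computer1.domain.com",
    "Data/EventDescription": "The service terminated unexpectedly.",
}

print(render(template, event))
```

    For event monitors and repeating event monitors the same idea applies; only the path prefix inside the dollar signs changes, as listed below.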

     

    For event Monitors:

    EventDisplayNumber (Event ID):            $Data/Context/EventDisplayNumber$
    EventDescription (Description):              $Data/Context/EventDescription$
    Publisher Name (Event Source):             $Data/Context/PublisherName$
    EventCategory:                                    $Data/Context/EventCategory$
    LoggingComputer:                                $Data/Context/LoggingComputer$
    EventLevel:                                         $Data/Context/EventLevel$
    Channel:                                             $Data/Context/Channel$
    UserName:                                          $Data/Context/UserName$
    EventNumber:                                     $Data/Context/EventNumber$
    Event Time:                                         $Data/Context/@time$

     

    For Repeating Event Monitors:

    EventDisplayNumber (Event ID):              $Data/Context/Context/DataItem/EventDisplayNumber$
    EventDescription (Description):                $Data/Context/Context/DataItem/EventDescription$
    Publisher Name (Event Source):              $Data/Context/Context/DataItem/PublisherName$
    EventCategory:                                      $Data/Context/Context/DataItem/EventCategory$
    LoggingComputer:                                  $Data/Context/Context/DataItem/LoggingComputer$
    EventLevel:                                            $Data/Context/Context/DataItem/EventLevel$
    Channel:                                                $Data/Context/Context/DataItem/Channel$
    UserName:                                             $Data/Context/Context/DataItem/UserName$
    EventNumber:                                         $Data/Context/Context/DataItem/EventNumber$

      

    Performance Threshold Monitors:

    Object (Perf Object Name):                    $Data/Context/ObjectName$
    Counter (Perf Counter Name):                $Data/Context/CounterName$
    Instance (Perf Instance Name):              $Data/Context/InstanceName$
    *Value (Perf Counter Value):                  $Data/Context/Value$ 
    **Last Sampled Value                            $Data/Context/SampleValue$

    *Value will show the actual performance value for simple and avg monitors.  It will show number of samples for consecutive threshold monitors.
    **Last Sampled Value works to show the last value evaluated in a consecutive sample value monitor.

     

    Service Monitors:

    Service Name                         $Data/Context/Property[@Name='Name']$
    Service Dependencies             $Data/Context/Property[@Name='Dependencies']$
    Service Binary Path                $Data/Context/Property[@Name='BinaryPathName']$
    Service Display Name             $Data/Context/Property[@Name='DisplayName']$
    Service Description                 $Data/Context/Property[@Name='Description']$

     

    Logfile Monitors:

    Logfile Directory :                  $Data/Context/LogFileDirectory$
    Logfile name:                        $Data/Context/LogFileName$
    String:                                  $Data/Context/Params/Param[1]$

     

    Logfile rules:

    Logfile Directory:                   $Data/EventData/DataItem/LogFileDirectory$
    Logfile name:                        $Data/EventData/DataItem/LogFileName$
    String:                                  $Data/EventData/DataItem/Params/Param[1]$

     

    General:

    To show the name of the Windows Computer host:
    $Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$

     

    Notifications:

    $Data/Context/DataItem/AlertId$                                       The AlertID GUID
    $Data/Context/DataItem/AlertName$                                   The Alert Name
    $Data/Context/DataItem/Category$                                    The Alert category
    $Data/Context/DataItem/CreatedByMonitor$                       True/False
    $Data/Context/DataItem/Custom1$                                     CustomField1
    $Data/Context/DataItem/Custom2$                                    CustomField2
    $Data/Context/DataItem/Custom3$                                    CustomField3
    $Data/Context/DataItem/Custom4$                                    CustomField4
    $Data/Context/DataItem/Custom5$                                    CustomField5
    $Data/Context/DataItem/Custom6$                                     CustomField6
    $Data/Context/DataItem/Custom7$                                     CustomField7
    $Data/Context/DataItem/Custom8$                                     CustomField8
    $Data/Context/DataItem/Custom9$                                     CustomField9
    $Data/Context/DataItem/Custom10$                                  CustomField10
    $Data/Context/DataItem/DataItemCreateTime$                      UTC Date/Time of Dataitem created
    $Data/Context/DataItem/DataItemCreateTimeLocal$               LocalTime Date/Time of Dataitem created
    $Data/Context/DataItem/LastModified$                                 UTC Date/Time DataItem was modified
    $Data/Context/DataItem/LastModifiedLocal$                          Local Date/Time DataItem was modified
    $Data/Context/DataItem/ManagedEntity$                               ManagedEntity GUID
    $Data/Context/DataItem/ManagedEntityDisplayName$             ManagedEntity Display name
    $Data/Context/DataItem/ManagedEntityFullName$                   ManagedEntity Full name
    $Data/Context/DataItem/ManagedEntityPath$                          Managed Entity Path
    $Data/Context/DataItem/Priority$                                          The Alert Priority Number (High=1,Medium=2,Low=3)
    $Data/Context/DataItem/Owner$                                           The Alert Owner
    $Data/Context/DataItem/RepeatCount$                                  The Alert Repeat Count
    $Data/Context/DataItem/ResolutionState$                               Resolution state ID (0=New, 255= Closed)
    $Data/Context/DataItem/ResolutionStateLastModified$                 UTC Date/Time ResolutionState was last modified
    $Data/Context/DataItem/ResolutionStateLastModifiedLocal$          Local Date/Time ResolutionState was last modified
    $Data/Context/DataItem/ResolutionStateName$                       The Resolution State Name (New, Closed)
    $Data/Context/DataItem/ResolvedBy$                                     Person resolving the alert
    $Data/Context/DataItem/Severity$                                          The Alert Severity ID
    $Data/Context/DataItem/TicketId$                                           The TicketID
    $Data/Context/DataItem/TimeAdded$                                       UTC Time Added
    $Data/Context/DataItem/TimeAddedLocal$                               Local Time Added
    $Data/Context/DataItem/TimeRaised$                                      UTC Time Raised
    $Data/Context/DataItem/TimeRaisedLocal$                              Local Time Raised
    $Data/Context/DataItem/TimeResolved$                                  UTC Date/Time the Alert was resolved
    $Data/Context/DataItem/WorkflowId$                                      The Workflow ID (GUID)
    $Data/Recipients/To/Address/Address$                                    The name of the recipient

    The Web Console URL:
    $Target/Property[Type="Notification!Microsoft.SystemCenter.AlertNotificationSubscriptionServer"]/WebConsoleUrl$

    The principalname of the management server:
    $Target/Property[Type="Notification!Microsoft.SystemCenter.AlertNotificationSubscriptionServer"]/PrincipalName$

     

    Also see related post:

    http://blogs.technet.com/kevinholman/archive/2009/09/23/alert-notification-subscription-variables-and-linking-that-to-the-console-database-and-sdk.aspx

  • OpsMgr: MP Update: New Base OS MP 6.0.6958.0 ships.

    Recently I discussed that we released a new Base OS MP 6.0.6957.0, which added many new features to the Base OS MP’s.  We received feedback about issues with some of these new features, and we are shipping an updated version of the MP to resolve the majority of the reported issues.  See my previous post describing the new features here:

    http://blogs.technet.com/b/kevinholman/archive/2011/09/30/opsmgr-new-base-os-mp-6-0-6956-0-adds-cluster-shared-volume-monitoring-bpa-and-many-changes.aspx

     

    Get the new version 6.0.6958.0 from the download center:  http://www.microsoft.com/download/en/details.aspx?id=9296

     

    What’s new?

     

    • Disabled BPA Rules by default.

    The Best Practices Analyzer monitor is now shipped disabled out of the box.  Most customers do not fully adhere to the best practices for specific server roles, so this monitor would generate a significant amount of noise in most customer environments.  For that reason it has been changed to disabled by default.  You can enable it if you would like to compare your server roles against the built-in Server 2008 R2 BPA and receive alerts on this.

    • Added appropriate SQL Stored Procedures credentials

    The reports we shipped in the new Microsoft.Windows.Server.Reports.mp contained two stored procedures which previously required manual intervention to assign permissions.  This has been resolved.

    ***Note – this MP with these new reports was designed for SQL2008 reporting environments only.  It will fail to deploy on SQL 2005 SCOM infrastructures.  If you are using SQL 2005 for a backend for OpsMgr databases and reporting, either upgrade to SQL 2008 or later, or do not import this MP.  If you have already imported this MP, delete it.  It is not supported for SQL 2005.

    • Updated Knowledge for Logical Disks

    The knowledge for the logical disk free space monitors was updated to reflect the new default values.

    • Updated Overrides for Logical Disks

    In the previous release (6.0.6957.0) of this MP, some of your previous overrides would not apply.  This has been resolved in the current version of the MP.

    • Fixed %Idle time sorting in the utilization report.