• Silect MPAuthor Service Pack 2 released

     

    Silect MP Author is a simple tool for authoring SCOM Management Packs.  Silect shipped Service Pack 2 today:

    http://www.silect.com/mp-author

    Download here:  http://www.silect.com/content/mp-author-free-download-form

    Questions?  info@silect.com

  • UR3 for SCOM 2012 R2 – Step by Step

     

     

    KB Article for OpsMgr:  http://support.microsoft.com/kb/2965445

    KB Article for all System Center components:  http://support.microsoft.com/kb/2965090

    Download catalog site:  http://catalog.update.microsoft.com/v7/site/Search.aspx?q=2965445

     

    Key fixes:

     

    • Reliability fix:  A deadlock condition occurs when a database is connected after an outage. You may experience this issue when one or more HealthService services in the environment are listed as Unavailable after a database goes offline and then comes back online.  Management servers cannot reconnect to SQL after a SQL outage because of thread exhaustion. 
    • The Desktop console crashes after exception TargetInvocationException occurs when the TilesContainer is updated. You may experience this issue after you leave the console open on a Dashboard view for a long time.
    • The Password expiration monitor is fixed for logged events. To make troubleshooting easier, this fix adds more detail to Event IDs 7019 and 7020 when they occur.
    • The Health service restarts because of high memory usage in the MonitoringHost instance: a leak in MOMModules!CMOMClusterResource::InitializeInstance. This issue may appear as high memory usage if you examine monitoringhost.exe in Performance Monitor. Or, the Health service may restart every couple of days, depending on the load on the server.
    • The Health service crashes in Windows HTTP Services (WinHTTP) if the RunAs account is not read correctly.
    • Windows PowerShell stops working with System.Management.Automation.PSSnapInReader.ReadEnginePSSnapIns. You may see this issue as Event ID 22400 together with a description of "Failed to run the Powershell script."
    • The PropertyValue column in the contextual details widget is unreadable in smaller widget sizes because the PropertyName column uses too much space.
    • The update threshold for the monitor "Health Service Handle Count Threshold" is reset to 30,000. You may see this issue in environments where the Health Service Handle Count Threshold monitor is listed in the critical state.
    • An acknowledgement (ACK) is delayed by write collisions in the management server queue when lots of data is sent from 1,000 agents.
    • The execution of the Export-SCOMEffectiveMonitoringConfiguration cmdlet fails with the error "Subquery returned more than 1 value.”
    • The MOMScriptAPI.ReturnItems method can be slow because a process race condition may occur when many items are returned, and the method may take two seconds between items. Scripts may run slowly in the System Center Operations Manager environment.
    • When you are in the console and click Authoring, click Management Pack, click Objects, and then click Attributes to perform a Find operation, the Find operation seems unexpectedly slow. Additionally, the Momcache.mdb file grows very large.
    • A delta synchronization times out on SQL operations with Event ID 29181.
    • Operations Manager grooms out the alert history before an alert is closed.
    • The time-zone settings are not added to a subscription when non-English display languages are set. Additionally, time stamps on alert notifications are inaccurate for the time zone.
    • Web Browser widget requires the protocol (http or https) to be included in the URL.
    • You cannot access MonitoringHost's TemporaryStoragePath within the PowerShell Module.
    • The TopNEntitiesByPerfGet stored procedure may cause an Operations Manager dashboard performance issue. This issue may occur when a dashboard is run together with multiple widgets. Additionally, you may receive the following error message after a time-out occurs:

    [Error] :DataProviderCommandMethod.Invoke{dataprovidercommandmethod_cs370}( 000000000371AA78 )
    An unknown exception was caught during invocation and will be re-wrapped in a DataAccessException. System.TimeoutException: The operation has timed out.  at Microsoft.EnterpriseManagement.Monitoring.DataProviders.RetryCommandExecutionStrategy.Invoke(IDataProviderCommandMethodInvoker invoker) at Microsoft.EnterpriseManagement.Presentation.DataAccess.DataProviderCommandMethod.Invoke(CoreDataGateway gateWay, DataCommand command)

     
    Xplat updates:
    • Slow results are returned when you run the Get-SCXAgent cmdlet or view UNIX/Linux computers in the administration pane when there are many managed UNIX/Linux computers.
      Note To apply this hotfix, you must have version 7.5.1025.0 or later of the UNIX/Linux Process Monitoring, UNIX/Linux Log File Monitoring, and UNIX/Linux Shell Command Template management pack bundles.
    • Accessing the UNIX/Linux computers view in the administration pane can sometimes trigger the following exception message:

      Microsoft.SystemCenter.CrossPlatform.ClientLibrary.Common.SDKAbstraction.ManagedObjectNotFoundException

     

    Let's get started.

    From reading the KB article – the order of operations is:

    1. Install the update rollup package on the following server infrastructure:
      • Management servers
      • Gateway servers
      • Web console server role computers
      • Operations console role computers
    2. Apply SQL scripts.
    3. Manually import the management packs.
    4. Update Agents

    Now we need to add another step: if we are using Xplat monitoring, we also need to update the Linux/Unix MP's and agents.

           5.  Update Unix/Linux MP’s and Agents.
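    Before starting, it can help to record the current version of each management server and agent so you can verify the rollout afterward. A minimal sketch, assuming the OperationsManager PowerShell module is installed and you are connected to the management group:

```powershell
# Baseline check before patching (assumes the OperationsManager
# module and an active management group connection).
Import-Module OperationsManager

# Management servers and their build versions:
Get-SCOMManagementServer | Select-Object DisplayName, Version

# Agents sorted by version, so stragglers stand out after the update:
Get-SCOMAgent | Sort-Object Version | Select-Object DisplayName, Version
```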

     

     

    1.  Management Servers

    image

    Since there is no RMS anymore, it doesn't matter which management server I start with.  There is no need to begin with the server that holds the RMSe role.  I simply make sure I patch only one management server at a time, to allow for agent failover without overloading any single management server.

    I can apply this update manually via the MSP files, or I can use Windows Update.  I have 3 management servers, so I will demonstrate both.  I will do the first management server manually.  This management server holds 3 roles, and each must be patched:  Management Server, Web Console, and Console.

    The first thing I do after downloading the updates from the catalog is copy the cab files for my language to a single location:

    image

    Then extract the contents:

    image

    Once I have the MSP files, I am ready to start applying the update to each server by role.

    ***Note:  You MUST log on to each server role as a Local Administrator, SCOM Admin, AND your account must also have System Administrator (SA) role to the database instances that host your OpsMgr databases.

    My first server is a management server, and the web console, and has the OpsMgr console installed, so I copy those update files locally, and execute them per the KB, from an elevated command prompt:

    image

    This launches a quick UI which applies the update, bouncing the SCOM services in the process.  The update does not provide any feedback about success or failure.  You can check the Application log for the MsiInstaller events to confirm:

    Log Name:      Application
    Source:        MsiInstaller
    Date:          8/6/2014 3:00:46 PM
    Event ID:      1022
    Task Category: None
    Level:         Information
    Keywords:      Classic
    User:          OPSMGR\kevinhol
    Computer:      SCOM01.opsmgr.net
    Description:
    Product: System Center Operations Manager 2012 Server - Update 'System Center 2012 R2 Operations Manager UR3 Update Patch' installed successfully.
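    Rather than scrolling the Application log by hand, you can pull the MsiInstaller result events with PowerShell. A quick sketch (run elevated on the patched server):

```powershell
# List recent MsiInstaller events (Event ID 1022 reports an update
# applied to an installed product) from the Application log.
Get-WinEvent -FilterHashtable @{
    LogName      = 'Application'
    ProviderName = 'MsiInstaller'
    Id           = 1022
} -MaxEvents 10 | Select-Object TimeCreated, Message | Format-List
```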

    You can also spot check a couple of DLL files for the file version attribute.
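    A quick way to do that spot check from PowerShell (the path below assumes a default SCOM 2012 R2 install location):

```powershell
# Dump file versions for the server DLLs so you can confirm they
# match the UR3 build. Path assumes a default install location.
$serverDir = 'C:\Program Files\Microsoft System Center 2012 R2\Operations Manager\Server'
Get-ChildItem -Path $serverDir -Filter *.dll |
    Select-Object Name, @{ Name = 'FileVersion'; Expression = { $_.VersionInfo.FileVersion } }
```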

    image

    Next up – run the Web Console update:

    image

    This runs much faster.   A quick file spot check:

    image

    Lastly – install the console update (make sure your console is closed):

    image

    A quick file spot check:

    image

     

     

    Secondary Management Servers:

    image

    I now move on to my secondary management servers, applying the server update, then the console update. 

    On this next management server, I will use the example of Windows Update as opposed to manually installing the MSP files.  I check online, and make sure that I have configured Windows Update to give me updates for additional products:

    image29

    This shows me two applicable updates for this server:

    image

    I apply these updates (along with some additional Windows Server updates I was missing), and reboot each management server, until all management servers are updated.

     

    Updating Gateways:

    image

    I can use Windows Update or manual installation.

    image

    The update launches a UI and quickly finishes.

    Then I will spot check the DLL’s:

    image

    I can also spot-check the \AgentManagement folder, and make sure my agent update files are dropped here correctly:

    image
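    You can do the same \AgentManagement spot check from PowerShell. A sketch; the gateway install path below is an assumption, so adjust it for your environment:

```powershell
# Confirm the updated agent MSP files were dropped into the
# \AgentManagement folders (path is an assumption; adjust as needed).
$agentMgmt = 'C:\Program Files\Microsoft System Center 2012 R2\Operations Manager\Gateway\AgentManagement'
Get-ChildItem -Path $agentMgmt -Recurse -Filter *.msp |
    Select-Object FullName, LastWriteTime
```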

     

     

    2. Apply the SQL Scripts

    In the path on your management servers, where you installed/extracted the update, there are two SQL script files: 

    %SystemDrive%\Program Files\System Center 2012\Operations Manager\Server\SQL Script for Update Rollups

    image

    First – let’s run the script to update the OperationsManager database.  Open a SQL management studio query window, connect it to your Operations Manager database, and then open the script file.  Make sure it is pointing to your OperationsManager database, then execute the script.

    image

    Click the “Execute” button in SQL mgmt. studio.  The execution could take a considerable amount of time and you might see a spike in processor utilization on your SQL database server during this operation.
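    If you prefer the command line over Management Studio, the same script can be run with sqlcmd. A hedged sketch: the server, instance, and script filename below are examples only, so check the actual filename in your extracted update folder:

```powershell
# Run the operational database update script with sqlcmd, using
# Windows authentication (-E). Server and filename are examples.
sqlcmd -S "SQLSERVER\INSTANCE" -d OperationsManager -E `
    -i "C:\Program Files\System Center 2012\Operations Manager\Server\SQL Script for Update Rollups\update_rollup_mom_db.sql"
```

    The same approach works for the data warehouse script: point -d at OperationsManagerDW and -i at UR_Datawarehouse.sql.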

    You will see the following (or similar) output:

    image47

    or

    image

    IF YOU GET AN ERROR – STOP!  Do not continue.  Try re-running the script several times until it completes without errors.  In a large environment, you might have to run this several times, or even potentially shut down the services on your management servers, to break their connection to the databases, to get a successful run.

    Technical tidbit:  This script has been updated in UR3.  Even if you previously ran this script in UR1 or UR2, you must run this again.

     

    image

    Next, we have a script in UR3 to run against the warehouse DB.  Do not skip this step under any circumstances.    From:

    %SystemDrive%\Program Files\System Center 2012\Operations Manager\Server\SQL Script for Update Rollups

    Open a SQL management studio query window, connect it to your OperationsManagerDW database, and then open the script file UR_Datawarehouse.sql.  Make sure it is pointing to your OperationsManagerDW database, then execute the script.

    If you see a warning about line endings, choose Yes to continue.

    image

    Click the “Execute” button in SQL mgmt. studio.  The execution could take a considerable amount of time and you might see a spike in processor utilization on your SQL database server during this operation.

    You will see the following (or similar) output:

    image

     

     

    3. Manually import the management packs?

    image

    We have 6 updated MP’s to import  (MAYBE!).

    image

    The TFS MP bundles are only used for specific scenarios, such as DevOps scenarios where you have integrated APM with TFS, etc.  If you are not currently using these MP’s, there is no need to import or update them.  I’d skip this MP import unless you already have these MP’s present in your environment.

    The Advisor MP’s are only needed if you are using System Center Advisor services.

    However, the Image and Visualization libraries deal with Dashboard updates, and these need to be updated.

    I import all of these without issue.

    image

     

     

    4.  Update Agents

    image

    Agents should be placed into pending actions by this update (mine worked great) for any agent that was not manually installed (remotely manageable = yes):

     image

    If your agents are not placed into pending management, this is generally caused by not running the update from an elevated command prompt, or by having manually installed agents, which will not be placed into pending management.
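    Approving a large batch of pending agent updates is quicker from PowerShell. A sketch, assuming the OperationsManager module; the action-type filter value is an assumption, so inspect the objects first:

```powershell
# Review what is pending, then approve the agent update actions.
Get-SCOMPendingManagement | Select-Object AgentName, AgentPendingActionType

# Approve only the update/patch actions (filter value is an assumption):
Get-SCOMPendingManagement |
    Where-Object { $_.AgentPendingActionType -eq 'PatchAgent' } |
    Approve-SCOMPendingManagement
```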

    You can approve these – which will result in a success message once complete:

    image

     

    Soon you should start to see PatchList getting filled in from the Agents By Version view under Operations Manager monitoring folder in the console:
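    You can also query PatchList from PowerShell instead of waiting on the console view. A sketch; the bracketed property-path syntax is how the SDK exposes class properties in PowerShell, but treat the exact names here as assumptions:

```powershell
# List each health service and its PatchList, to see which agents
# report the UR3 patch. Property path below is an assumption.
$hsClass = Get-SCOMClass -Name 'Microsoft.SystemCenter.HealthService'
Get-SCOMClassInstance -Class $hsClass |
    Select-Object DisplayName,
        @{ Name = 'PatchList'; Expression = { $_.'[Microsoft.SystemCenter.HealthService].PatchList'.Value } }
```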

    image

     

     

    5.  Update Unix/Linux MPs and Agents

    image

    Next up – I download and extract the updated Linux MP's for SCOM 2012 R2 UR3:

    http://www.microsoft.com/en-us/download/details.aspx?id=29696

    7.5.1025.0 is current at this time for SCOM 2012 R2 UR2. 

    ****Note – take GREAT care when downloading – that you select the correct download for R2.  You must scroll down in the list and select the MSI for 2012 R2:

    image

    Download the MSI and run it.  It will extract the MP’s to C:\Program Files (x86)\System Center Management Packs\System Center 2012 R2 Management Packs for Unix and Linux\

    Update any MP’s you are already using.   These are mine for RHEL, SUSE, and the universal Linux libraries:

    image

    You will likely observe VERY high CPU utilization on your management servers and database server during and immediately following these MP imports.  Give it plenty of time to complete the import and MPB deployments.

    Next up – you would upgrade the agents on your Unix/Linux monitored computers.  You can now do this straight from the console:

    image

    image

    You can input credentials or use existing RunAs accounts if those have enough rights to perform this action.
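    The same UNIX/Linux agent upgrade can be scripted with the cross-platform cmdlets. A hedged sketch: Get-SCXAgent and Update-SCXAgent ship with the 2012 R2 Xplat cmdlet set, but the version property name below is an assumption:

```powershell
# Find UNIX/Linux agents below the current agent version and
# upgrade them using the existing Run As credentials.
$old = Get-SCXAgent | Where-Object { $_.AgentVersion -lt '7.5.1025.0' }
$old | ForEach-Object { Update-SCXAgent -Agent $_ }
```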

    image

     

     

    6.  Update the remaining deployed consoles

    image

    This is an important step.  I have consoles deployed around my infrastructure – on my Orchestrator server, SCVMM server, on my personal workstation, on all the other SCOM admins on my team, on a Terminal Server we use as a tools machine, etc.  These should all get the UR3 update.

     

     

     

    Review:

    Now at this point, we would check the OpsMgr event logs on our management servers, check for any new or strange alerts coming in, and ensure that there are no issues after the update.

    image

     

     

    Known issues:

    See the existing list of known issues documented in the KB article.

    1.  Many people are reporting that the SQL script is failing to complete when executed.  You should attempt to run this multiple times until it completes without error.  You might need to stop the Exchange correlation engine, stop the services on the management servers, or bounce the SQL server services in order to get a successful completion in a busy management group.  The errors reported appear as below:

    ------------------------------------------------------
    (1 row(s) affected)
    (1 row(s) affected)
    Msg 1205, Level 13, State 56, Line 1
    Transaction (Process ID 152) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
    Msg 3727, Level 16, State 0, Line 1
    Could not drop constraint. See previous errors.
    --------------------------------------------------------

  • Operations Manager 2012 R2 now supports SQL 2012 SP2

     

    I didn’t see any announcements on this – but several customers have been asking. 

    From the SQL Requirements for System Center 2012 R2, which looks like it was updated on July 9th:

    http://technet.microsoft.com/library/dn281933.aspx

     

    The support matrix lists each System Center 2012 R2 component against the supported SQL Server versions – SQL Server 2008 R2 SP1 and SP2 (Standard, Datacenter), SQL Server 2012 and 2012 SP1 (Enterprise, Standard 64-bit), and now SQL Server 2012 SP2 – covering App Controller, Data Protection Manager (DPM), the Operations Manager data warehouse, operational database, and reporting server, Orchestrator, Service Manager, Service Provider Foundation, and Virtual Machine Manager.
     

     

  • Monitor for file size with SCOM – Using script and WMI examples

     

    SCOM has many different ways to monitor for a file size.  Here are some simple examples using script and WMI monitor types.

    In this specific example – this will be a monitor to look for Windows Server Registry Bloat.  The monitor will inspect the registry hives for the registry file size, and alarm when the size is over a set threshold.

    In the console, under Authoring, create a new Unit Monitor.  Choose a Timed Script Two State Monitor and choose an appropriate management pack.

    image

     

    Provide a display name for the monitor, and choose "Windows Server Operating System" as that is the BEST generic targeting class.  I will place the monitor under "Availability" as that is most applicable to what I am trying to reflect:  if the registry file grows too large, the availability of the server might be impacted.

    image

    Set a schedule that makes sense for your monitor.  Remember that script-based monitors consume the most resources, especially depending on the complexity of the script, so don't try to run it too frequently.

    image

    Next, give your script the name it will be stored under in the MP XML, and paste in the body of your script.  Here is my script below.  It accepts two parameters:  the full path to the file we wish to monitor, and the size threshold.

    Option Explicit

    Dim oAPI, oBag, objFSO, objFile, varSize, oArgs, filepath, threshold

    Set oArgs = Wscript.Arguments
    filepath = oArgs(0)
    threshold = int(oArgs(1))

    Set oAPI = CreateObject("MOM.ScriptAPI")
    Set objFSO = CreateObject("Scripting.FileSystemObject")
    Set objFile = objFSO.GetFile(filepath)
    varSize = objFile.Size

    If varSize > threshold Then
      Set oBag = oAPI.CreatePropertyBag()
      Call oBag.AddValue("Status","Bad")
      Call oBag.AddValue("Size", varSize)
      Call oBag.AddValue("Threshold", threshold)
      Call oAPI.Return(oBag)
      Call oAPI.LogScriptEvent("regfilesize.vbs", 160, 0, "The registry file size of HKLM\SOFTWARE is greater than the threshold of " & threshold & " bytes. The current size is: " & varSize & " bytes")
    Else
      Set oBag = oAPI.CreatePropertyBag()
      Call oBag.AddValue("Status","Ok")
      Call oBag.AddValue("Size", varSize)
      Call oBag.AddValue("Threshold", threshold)
      Call oAPI.Return(oBag)
      Call oAPI.LogScriptEvent("regfilesize.vbs", 160, 0, "The registry file size of HKLM\SOFTWARE is less than the threshold of " & threshold & " bytes. The current size is: " & varSize & " bytes")
    End If

    Then select the “parameters” button, and provide the params:

    image

     

    Next – we must provide the “Unhealthy” expression.  We are returning a PropertyBag from the script as “Status” which will either be “Bad” or “Ok”.  The parameter name here is in the format:  Property[@Name='Status']

    image

    Repeat for Healthy expression:

    image

    Configure the health status you are looking to drive:

    image

     

    And alerting.  Note:  to make the alert more valuable, you can include data from the property bags returned by the script in the alert context.  See the examples below for Size and Threshold, along with the computer name:

    image

     

    Here is the finished result of the alert:

    image

     

    And Health Explorer output is also very useful:

    image

     

    If you need to tune the monitor for specific systems – the script arguments are automatically exposed in Overrides:

    image

     

    Additional reading and examples on using script based monitors:

    http://technet.microsoft.com/en-us/library/ff629453.aspx

    http://blogs.technet.com/b/kevinholman/archive/2014/03/06/create-a-script-based-monitor-for-the-existence-of-a-file-with-recovery-to-copy-file.aspx

    http://blogs.technet.com/b/kevinholman/archive/2014/02/11/opsmgr-simple-example-script-based-monitor-with-script-based-recovery.aspx

    http://blogs.technet.com/b/kevinholman/archive/2009/07/22/101-using-custom-scripts-to-write-events-to-the-opsmgr-event-log-with-momscriptapi-logscriptevent.aspx

    http://blogs.technet.com/b/kevinholman/archive/2011/03/02/how-to-collect-performance-data-from-a-script-example-network-adapter-utilization.aspx

    http://contoso.se/blog/?p=1367

     

    You can make this even more sexy, by creating a composite datasource for the script.  Then create a Monitortype to call the datasource, and then create Monitors to pass the necessary data.  Then you can also create a script based performance collection rule to use the same datasource.

     

     

    Ok, that’s pretty cool.  But – what about another way? 

     

    SCOM also has a built-in WMI-based monitor, which accepts WMI queries that you can map as performance-type data with thresholds.  I have previously written examples of this technique.

    Lets create another new Unit Monitor, WMI Performance Counters, Static Threshold, Simple Threshold:

    image

    Give it a name, choose Windows Server Operating System as that is the preferred generic target, and choose Availability.

    image

     

    We will connect to root\cimv2.  The query we will use is:

    select filesize from cim_datafile where name='c:\\windows\\system32\\config\\software'
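    It's worth testing the WMI query outside of SCOM first. A quick sketch from an elevated PowerShell prompt on a target server:

```powershell
# Run the same cim_datafile query the monitor will use, and confirm
# FileSize comes back. Always filter by name - cim_datafile queries
# without a WHERE clause can be extremely slow.
Get-WmiObject -Namespace 'root\cimv2' -Query `
    "select FileSize from cim_datafile where name='c:\\windows\\system32\\config\\software'" |
    Select-Object FileSize
```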

    image

     

    The Performance Mapper screen might be the most confusing.  We simply need to define how we'd like the data to appear when it is inserted into SCOM. 

    image

    I used "FileSize" for the counter, since that is what I am querying from WMI.  Then I need to make sure that Value matches the WMI property name I queried, in the format:  $Data/Property[@Name='FileSize']$

    Next I set my threshold value:

    image

    Configure health according to what you desire:

    image

    Configure alerting:

    image

    The subsequent alert:

    image

    And Health Explorer:

    image

     

    Now, we can also create a rule to collect this value, and build a report showing which servers have the biggest registry files:

    Create a new rule, collection, performance based, WMI:

    image

    Provide a name and target:

    image

    Provide the same query, and set a frequency that you need for reporting on changes.

    image

     

    Fill out the performance mapper just as we did above:

    image

     

    Now – create a performance view to examine the data:

    image

    image

     

    image

     

    And even a cool dashboard to show off all of it:

    image

     

    For additional reading on using WMI counters in SCOM:

    http://blogs.technet.com/b/kevinholman/archive/2008/07/02/collecting-and-monitoring-information-from-wmi-as-performance-data.aspx

    http://blogs.msdn.com/b/steverac/archive/2009/08/30/monitoring-file-size-with-custom-wmi-performance-counter.aspx

  • The case of the Dell (Detailed) MP – beware of large environments

     

    This article is not just a warning about the Dell (Detailed) MP, but the danger of importing ANY management pack into your environment without fully understanding the intended scope, scalability, and any known/common issues.

    I recently worked with a customer who had an interesting issue.  They had a very large agent based monitoring environment (greater than 10,000 agents).  While performing a supportability review, we noticed that Config generation was failing.  This was evidenced by the Config monitors showing red on the console, alerts generated, events logged in the Management Server SCOM event logs, and most notably by the fact that agents were not getting updated config in a timely fashion.

    Events were similar to:

    Log Name:      Operations Manager
    Source:        OpsMgr Management Configuration
    Event ID:      29181
    Computer:      managementserver.domain.com
    Description:
    OpsMgr Management Configuration Service failed to execute 'SnapshotSynchronization' engine work item due to the following exception

    Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessException: Data access operation failed
       at Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessOperation.ExecuteSynchronously(Int32 timeoutSeconds, WaitHandle stopWaitHandle)
       at Microsoft.EnterpriseManagement.ManagementConfiguration.SqlConfigurationStore.ConfigurationStore.ExecuteOperationSynchronously(IDataAccessConnectedOperation operation, String operationName)
       at Microsoft.EnterpriseManagement.ManagementConfiguration.SqlConfigurationStore.ConfigurationStore.EndSnapshot(String deltaWatermark)
       at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.SnapshotSynchronizationWorkItem.EndSnapshot(String deltaWatermark)
       at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.SnapshotSynchronizationWorkItem.ExecuteSharedWorkItem()
       at Microsoft.EnterpriseManagement.ManagementConfiguration.Interop.SharedWorkItem.ExecuteWorkItem()
       at Microsoft.EnterpriseManagement.ManagementConfiguration.Interop.ConfigServiceEngineWorkItem.Execute()
    -----------------------------------
    System.Data.SqlClient.SqlException (0x80131904): Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out
       at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
       at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
       at System.Data.SqlClient.SqlCommand.InternalEndExecuteReader(IAsyncResult asyncResult, String endMethod)
       at System.Data.SqlClient.SqlCommand.EndExecuteReaderInternal(IAsyncResult asyncResult)
       at System.Data.SqlClient.SqlCommand.EndExecuteReader(IAsyncResult asyncResult)
       at Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.ReaderSqlCommandOperation.SqlCommandCompleted(IAsyncResult asyncResult)
    ClientConnectionId:724196c1-d9ec-4f29-8807-b16cab05fcc6

     

    Our initial issue was caused by the management servers running Windows Server 2012 RTM with .NET 4.5.  There is a known issue here, and we needed to install .NET 4.5.1 to resolve these timeouts.  This got us past the initial Snapshot Config failures.

    Next – we saw that Delta Config started failing:

    Log Name:      Operations Manager
    Source:        OpsMgr Management Configuration
    Event ID:      29181
    Computer:      managementserver.domain.com
    Description:
    OpsMgr Management Configuration Service failed to execute 'DeltaSynchronization' engine work item due to the following exception

    Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessException: Data access operation failed
       at Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessOperation.ExecuteSynchronously(Int32 timeoutSeconds, WaitHandle stopWaitHandle)
       at Microsoft.EnterpriseManagement.ManagementConfiguration.CmdbOperations.CmdbDataProvider.GetConfigurationDelta(String watermark)
       at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.TracingConfigurationDataProvider.GetConfigurationDelta(String watermark)
       at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.DeltaSynchronizationWorkItem.TransferData(String watermark)
       at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.DeltaSynchronizationWorkItem.ExecuteSharedWorkItem()
       at Microsoft.EnterpriseManagement.ManagementConfiguration.Interop.SharedWorkItem.ExecuteWorkItem()
       at Microsoft.EnterpriseManagement.ManagementConfiguration.Interop.ConfigServiceEngineWorkItem.Execute()
    -----------------------------------
    System.Data.SqlClient.SqlException (0x80131904): Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding. ---> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out
       at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
       at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
       at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
       at System.Data.SqlClient.SqlDataReader.TryReadInternal(Boolean setTimeout, Boolean& more)
       at System.Data.SqlClient.SqlDataReader.Read()
       at Microsoft.EnterpriseManagement.ManagementConfiguration.CmdbOperations.EntityChangeDeltaReadOperation.ReadManagedEntitiesProperties(SqlDataReader reader)
       at Microsoft.EnterpriseManagement.ManagementConfiguration.CmdbOperations.EntityChangeDeltaReadOperation.ReadData(SqlDataReader reader)
       at Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.ReaderSqlCommandOperation.SqlCommandCompleted(IAsyncResult asyncResult)
    ClientConnectionId:9d9ec759-e9bf-4c1e-a958-581377c630b3

    We run a snapshot config every 24 hours by default.  We run a delta config every 30 seconds by default.  These are controlled via the ConfigService.config file located in the \Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\ directory.  Delta config timing out was odd.  There can be many reasons for this, so the next step was to take a SQL trace and see what expensive queries were running.

    If you want to see these in more clarity – the Config service logs these jobs to the CS.WorkItem table:

    SELECT * FROM cs.workitem
    ORDER BY WorkItemRowId DESC

    You can filter these by Delta Sync or the daily Snapshot sync as well:

    SELECT * FROM cs.workitem
    WHERE WorkItemName like '%delta%'
    ORDER BY WorkItemRowId DESC

    SELECT * FROM cs.workitem
    WHERE WorkItemName like '%snap%'
    ORDER BY WorkItemRowId DESC

    WorkItemStateId is the value of success or failure for the job.  It is normal to see some failures; for instance, when multiple management servers try to execute the same job, some of those attempts will fail, by design.

    1    Running
    10    Failed
    12    Abandoned
    15    Timed out
    20    Succeeded

    What we found was that one of the MP's – the Dell Hardware MP – was consuming a large amount of SQL Server CPU time just to query some standard Managed Type views in the database, with many of these queries lasting over 10 minutes.

    When we researched further, we found that the "Dell Windows Server (Detailed Edition)" management pack had been imported, and in the documentation there was no mention of scalability limitations.  However, we found that in a much older (4.x) version of the documentation, Dell specifically states that the Detailed MP is recommended only for small environments, where the monitored server count is less than 300 agents!!!!  We had already discovered and were monitoring over 5,000 Dell servers.

    This massive discovery data influx was also causing config churn – and binding delays showing up as 2115 events for discovery data:

    Log Name:      Operations Manager
    Source:        HealthService
    Event ID:      2115
    Computer:      managementserver.domain.com
    Description:
    A Bind Data Source in Management Group Production has posted items to the workflow, but has not received a response in 1510 seconds.  This indicates a performance or functional problem with the workflow.
    Workflow Id : Microsoft.SystemCenter.CollectDiscoveryData
    Instance    : managementserver.domain.com
    Instance Id : {B3FA7F2F-3D4A-236D-D3FD-119B3E01C3E3}

    So, just delete the MP, right?

    Well, let's talk about what must happen when we delete an MP.  When you right-click an MP in the console to delete it, we must first delete any discovered instances of any classes defined in that MP (such as an instance of “Dell Server BIOS”).  In order to delete an instance of a class, we must first also delete ALL monitoring data associated with that instance.  And I don't mean simply marking it as “deleted” in the database.  It must actually be deleted transactionally from the tables.  This means all alerts, all monitor-based state changes, all events, all performance data, etc.  This can be MASSIVE overhead.

    What we actually experienced was the console locking up.  We could track the SQL statements trying to delete the management pack and all the instance data, but these would eventually time out and never return anything to the console.  The operation would just go away, and all the while our MP still existed.

    So what can we do?

    Well, we do have a possible solution in the Remove-SCOMDisabledClassInstance PowerShell cmdlet.  This cmdlet allows us to delete the discovered instance data methodically and slowly.  What this cmdlet does is delete any discovered instances in the management group where that instance's discovery is explicitly disabled via override.

    So, we find all the discoveries in the Dell Detailed MP, and we create a new override MP to store a disable override for each discovery.  Then we run Remove-SCOMDisabledClassInstance.  This will run and run and run, seemingly forever, until it returns with no errors.  In many cases, even this cmdlet will time out or crash with an exception, which can be normal when deleting a massive amount of data.
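
    Here is a rough sketch of that process in PowerShell.  The display names are assumptions for this example (check Get-SCOMManagementPack for your real MP names), and you should verify the Disable-SCOMDiscovery parameter set against your SCOM version before running anything like this in production:

```powershell
# Sketch: disable every discovery defined in the Dell Detailed MP,
# storing the overrides in a dedicated, unsealed override MP.
Import-Module OperationsManager

# Display names below are assumptions - substitute your actual MP names
$dellMP     = Get-SCOMManagementPack -DisplayName "Dell Windows Server (Detailed Edition)"
$overrideMP = Get-SCOMManagementPack -DisplayName "Dell Detailed Disable Overrides"

# Create a disable override for each discovery in the Dell MP
Get-SCOMDiscovery -ManagementPack $dellMP | ForEach-Object {
    Disable-SCOMDiscovery -Discovery $_ -ManagementPack $overrideMP -Enforce
}

# Now start deleting the instances whose discoveries are disabled
Remove-SCOMDisabledClassInstance
```

    Keeping the overrides in their own unsealed MP makes it easy to remove them all later, once the cleanup is done.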

    One trick to help with this process is to set your state, performance, and event retention in the OpsDB to ONE day, and then run grooming.  This will greatly reduce the amount of data we must delete transactionally.
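
    For reference, a sketch of that retention trick against the standard OpsDB grooming tables.  Record your original DaysToKeep values first so you can restore them afterwards, and treat the ObjectName list here as an example rather than exhaustive:

```sql
-- Review current retention first and note the original values:
SELECT ObjectName, DaysToKeep FROM PartitionAndGroomingSettings

-- Temporarily shrink retention to 1 day for the heavy datasets
UPDATE PartitionAndGroomingSettings
SET DaysToKeep = 1
WHERE ObjectName IN ('PerformanceDataAllView', 'EventAllView', 'StateChangeEvent')

-- Kick off grooming immediately instead of waiting for the scheduled run
EXEC p_PartitioningAndGrooming
```

    Once the MP and its instances are gone, set DaysToKeep back to your original values and run the grooming procedure again.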

    Then, just keep running Remove-SCOMDisabledClassInstance.  In this specific case, because the amount of data was so large, it actually took over a day and probably over 100 executions before the instances were all removed.  You can track the instances being removed by creating a query that counts the records in the Managed Type tables you are deleting from.  Here is part of the one I crafted for this MP:

    select sum(TCount) As TotalCount
    from
    (
    select count (*) as Tcount
    from MT_Dell$WindowsServer$Server
    union all
    select count (*) as Tcount
    from MT_Dell$WindowsServer$BIOS
    union all
    select count (*) as Tcount
    from MT_Dell$WindowsServer$Detailed$MemoryUnit
    union all
    select count (*) as Tcount
    from MT_Dell$WindowsServer$Detailed$ProcUnit
    union all
    select count (*) as Tcount
    from MT_Dell$WindowsServer$Detailed$PSUnit
    union all
    select count (*) as Tcount
    from MT_Dell$WindowsServer$EnclosurePhysicalDisk
    union all
    select count (*) as Tcount
    from MT_Dell$WindowsServer$ControllerConnector
    ) as T

    As you run the Remove-SCOMDisabledClassInstance command, you will see these instance counts slowly eroding.  You just have to keep running it until it completes without a timeout or an exception.
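
    If you would rather not babysit the console, you can wrap the cmdlet in a simple retry loop.  This is only a sketch: it assumes a timeout surfaces as a terminating error, which may not hold for every failure mode, so keep an eye on the instance counts either way:

```powershell
# Sketch: re-run the cleanup until it completes without throwing.
# Timeouts and exceptions mid-delete are expected with this much data.
Import-Module OperationsManager

$done = $false
while (-not $done) {
    try {
        Remove-SCOMDisabledClassInstance -Confirm:$false
        $done = $true   # finished a pass without an exception
    }
    catch {
        Write-Host "Run failed ($($_.Exception.Message)); retrying in 30 seconds..."
        Start-Sleep -Seconds 30
    }
}
```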

    Once the instance count gets to zero, you can delete the MP.  We found this time the MP deleted in seconds!

    Now that this MP was gone, the expensive query was gone as well, and we saw the binding on discovery data return to a more reasonable occurrence count and time value.

     

    The lesson to learn here is: be careful when importing MPs.  A badly written MP, or an MP designed for small environments, might wreak havoc in larger ones.  Sometimes the recovery from this can be long and quite painful.  An MP that tests out fine in your dev SCOM environment might have issues that won't be seen until it moves into production.  You should always monitor for changes to a production SCOM deployment after a new MP is brought in, to ensure that you don't see a negative impact.  Check the management server event logs, management server CPU performance, database size, and disk/CPU performance to see if there is a big change from your established baselines.

    If you are designing a large agent deployment that nears our maximum scalability (currently 15,000 agents), great consideration must go into the management packs in scope.  If you require management packs that discover a large instance space per agent, and/or have a large number of workflows, you might find that you cannot achieve the maximum scale.