Kevin Holman's System Center Blog

Posts in this blog are provided "AS IS" with no warranties, and confer no rights. Use of included script samples is subject to the terms specified in the Terms of Use.


Posts
  • Collecting and Monitoring information from WMI as performance data

    Many times, we would like to collect information for reporting, or measure and alert on something.  Normally, we use Windows Performance Monitors to do this.  But, what do we use when a perfmon object/counter/instance doesn't exist?

     

    This post is an example of how to collect WMI information, and insert it into OpsMgr as performance data.  From there we can use it in reports and create threshold monitors.

     

    For starters... we need to find the location of the data in WMI.  We can use wbemtest to locate it and test our query.

     

    image

     

    Hit "Connect" and connect to root\cimv2.

    For this example - I am going to look at the Win32_OperatingSystem class.

    Using Enum Classes, Recursive, I find the class.  I notice the class has a property of "NumberOfProcesses".  That will do well for this example since the output will be an Integer.

    I form the query....   select numberofprocesses from win32_operatingsystem

     

    image

     

Ok.... we know the WMI query we want.... now let's dive into the console.

     

    We will start by creating a performance collection rule.... for this query output.  Authoring pane, Create a New Rule, Collection Rules, Performance Based, WMI Performance.

    Give the rule a name (in accordance with your documented custom rule naming standards), then change the Rule Category to "PerformanceCollection", and then choose a target.  I am using Windows Server for this example.

     

    image

     

Click Next, and on the WMI Namespace page, enter your namespace, query, and interval.  The interval in general should be no more frequent than every 15 minutes, unless you really need a large amount of raw data for reporting.  I am using every 10 seconds for this example only.... this is not generally recommended because of the large amount of perf data that would flood the database if we targeted all Agents, or Windows Servers.
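To put that interval warning in perspective, here is a quick back-of-the-envelope calculation (a Python sketch of my own for illustration, not anything in the product) of raw samples collected per agent per day:

```python
# Rough sizing of raw perf samples one WMI collection rule generates
# per agent per day, at different collection intervals.
# Actual database cost also depends on aggregation and grooming settings.

SECONDS_PER_DAY = 24 * 60 * 60

def samples_per_day(interval_seconds: int) -> int:
    """Number of raw samples one rule collects from one agent per day."""
    return SECONDS_PER_DAY // interval_seconds

demo = samples_per_day(10)             # the 10-second demo interval: 8640/day
recommended = samples_per_day(15 * 60) # the 15-minute guidance: 96/day
print(demo, recommended)
```

Multiply the 10-second figure by hundreds or thousands of agents and it is easy to see why the database would get flooded.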

     

    image

     

    The last screen, and most confusing.... is the Performance Mapper.  This is where we will give the rule the information it needs to populate the data into the database as ordinary performance data.

First - we need to make up custom names for Object, Counter, and Instance.  Just like data collected from perfmon, we need to supply these.... so I will make up a name for each that makes sense.  I will use "NULL" for Instance, as I don't have any instance for this type of data in my example.

For the Value field, this is where we will input a variable, which represents our query output.  In general, following this example, it will be $Data/Property[@Name='QueryObject']$  where you replace "QueryObject" with the name of the property that you queried from the WMI class.  So for my example, we will use:

     

    $Data/Property[@Name='NumberOfProcesses']$
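If it helps to see that substitution as code, here is a trivial sketch (my own illustration, not an OpsMgr API) of how the Value variable string is built from the queried property name:

```python
def perf_mapper_value(property_name: str) -> str:
    """Build the Performance Mapper 'Value' variable for a WMI property
    returned by the collection rule's query."""
    return "$Data/Property[@Name='" + property_name + "']$"

# For the NumberOfProcesses property queried above:
print(perf_mapper_value("NumberOfProcesses"))
# $Data/Property[@Name='NumberOfProcesses']$
```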

     

    image

     

     

    Click "Create" and we are done!  How do we know if it is working?

     

    Well, we can create a new Performance view and go look at the data it is collecting:

    Create a new Custom Performance View in My Workspace.  Scope it to Windows Server (or whatever you targeted your rule to).  Then check the box for "collected by specific rules" and choose your rule from the popup box.  As long as you chose "PerformanceCollection" as the rule category, it will show up here.

     

    image

     

And check out the Performance view - we have a nice snapshot and historical record of the number of processes from WMI.  Also note the custom performance Object, Counter, and Instance being entered from our rule:

     

    image

     

     

Ok - fun is over.  Let's use WMI to monitor for when an agent has more than 40 processes running!

     

    Create a Unit Monitor.  WMI Performance Counters, Static Thresholds, Single Thresholds, Simple Threshold.  We will fill out the Monitor wizard exactly as we did the Rule above.  However, on Interval, since this monitor will only be inspecting the data on an agent, and not collecting the performance data into the database, we can use a more frequent interval.  Checking every 1 minute is typically sufficient.  Fill out the Performance Mapper exactly as we did above.

     

    Now.... on the threshold value... I want to set a threshold of 40 processes in this example.

     

    image

     

    Over 40 = Critical, and under = Healthy.  Sounds good.
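The monitor's health logic boils down to a simple comparison - sketched here in Python purely for illustration (the actual evaluation is performed by the monitor's condition detection on the agent):

```python
def process_count_state(number_of_processes: int, threshold: int = 40) -> str:
    """Mimic the simple threshold monitor: over the threshold = Critical,
    at or under the threshold = Healthy."""
    return "Critical" if number_of_processes > threshold else "Healthy"

print(process_count_state(41))  # Critical
print(process_count_state(38))  # Healthy
```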

On the Configure Alerts pane, I am going to enable Alerts, and then add Alert Description Variables from http://blogs.technet.com/kevinholman/archive/2007/12/12/adding-custom-information-to-alert-descriptions-and-notifications.aspx

     

    image

     

     

Create it - and let's see if we get any alerts:

     

    Yep.  Works perfectly:

     

    image

     

     

    A quick check of Health Explorer shows it is working as designed:

     

    image

  • A little tidbit on Hot-fixes for OpsMgr

    When you apply a hot-fix to a RMS, or Management Server, or Gateway server... a couple things will happen.  First... it will update the server itself with whatever the hot-fix is supposed to fix... registry, DLL's, database updates, etc.  Next, if the update needs to flow down to all agents... it will place a MSP file in the \AgentManagement directory under the OpsMgr installation directory.

     

    image

     

     

    Then, it will put the agents that report to the hot-fixed management server, into pending actions for the update.  It will only place the agents reporting to that MS/RMS into pending... not all agents.  For this reason - you really should patch ALL your RMS, MS, and GW's first, before approving any agents.

    Then, when you "approve" an agent for the update... what it does is actually reinstall the agent, from its management server, then apply any update MSP's that are present, and that are not already installed.

     

    So - when you apply a hot-fix to a management group - before approving any agents, it is a good idea to check your \AgentManagement directories on all MS/GW roles, and make sure the \x86 and \AMD64  folders have consistent AND CORRECT patch files present.

     

    When you "approve" agents for the update... or perform a "repair", we recommend only doing 200 agents at a time, max.  Phase the updates out in batches.
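If you want to plan those phases out in a script, the split is trivial - a hypothetical helper, sketched in Python (the agent list itself would come from your own inventory):

```python
def batches(agents, batch_size=200):
    """Split an agent list into batches of at most batch_size,
    for phased approval/repair (200 max at a time recommended)."""
    return [agents[i:i + batch_size] for i in range(0, len(agents), batch_size)]

# 450 agents -> three phases of 200, 200, and 50:
phases = batches(["agent%03d" % n for n in range(450)])
print([len(p) for p in phases])  # [200, 200, 50]
```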

     

    Then, use the "Patch List" view described in my previous blog post, to ensure all agents got updated.  For agents that still need to be updated, simply run a "Repair" on those from the console, or patch them manually. 

     

Any new agents that get pushed will automatically get the current hot-fixes applied, as long as the hot-fix MSP's are present in the \AgentManagement directory.  However, manually installed agents must be hot-fixed manually.

     

Lastly... on the current batch of hot-fixes....  950853 and 951380 BOTH update the SAME file, mommodules.dll.  950853 (memory leak) updates this file to 6.0.6278.11, and 951380 (cluster discovery) updates the same file to 6.0.6278.20.  IF you are planning on applying both of these fixes... technically, you only need the latter, since it includes the previous fix.

     

    Update 10-15-2008

Now - if you are applying 954903.... this contains mommodules.dll 6.0.6278.36 which supersedes BOTH 951380 and 950853....  so if you need all three hotfixes - just apply 954903.  However - note in the picture below, if you apply two hotfixes that update the same file, the management server \AgentManagement directory still keeps the older one.... apparently the hotfix process does not understand that they update the same file, nor does it clean out the older 951380.  The problem with this - is that any major agent deployment will get impacted... because we will add to the install time, and impact the network more.  In this example - an agent push will be copying over the agent MSI (9MB) plus each hotfix in this directory....  While we don't have any direct guidance in this area - I would recommend removing the older hotfixes that no longer apply, or are superseded by other hotfixes already in this directory.
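That supersedence check is just a dotted file-version comparison.  Here is a small Python sketch (my own helper, not a Microsoft tool) you could use to decide which copy of a file like mommodules.dll wins:

```python
def newer_file_version(a: str, b: str) -> str:
    """Compare two dotted file versions numerically and return the
    higher one -- i.e. the version from the superseding hotfix."""
    pa = tuple(int(part) for part in a.split("."))
    pb = tuple(int(part) for part in b.split("."))
    return a if pa >= pb else b

# 951380's file supersedes 950853's, and 954903's supersedes both:
print(newer_file_version("6.0.6278.11", "6.0.6278.20"))  # 6.0.6278.20
print(newer_file_version("6.0.6278.36", "6.0.6278.20"))  # 6.0.6278.36
```

Note the numeric compare matters: a plain string compare would get versions like ".9" vs ".20" wrong.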

     

    image

  • Surface RT battery life after the upgrade to Windows 8.1

I have a Surface RT device (the original) and absolutely love it.  I use it every day.  However, one of the challenges I have been dealing with was after the upgrade to Windows 8.1 came out, I noticed the battery was always dead when I picked it up.  Before, I could go 3 to 5 days between charges, depending on how much I was using it.  Now, it would discharge in standby within 24 hours!

    To track this – you can create a Battery Report.  Open an elevated command prompt on the SurfaceRT device, and type in:

    powercfg /batteryreport

    image

    This will save an HTML file as seen above. 

This is a pretty cool report that will show some interesting statistics about your battery.  But it will also show your periods of use, and how much the battery drains during connected standby:

    image

    In the table above, I can see I entered standby at around 9pm on the 17th, and when I picked it back up around 4pm the next day, the battery was almost dead!  The chart in the report shows this as well, pretty cool:
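Just to quantify that drain - a trivial calculation (the numbers here are approximations read off my report, purely to illustrate the math):

```python
def standby_drain_rate(percent_drained: float, hours: float) -> float:
    """Average battery percentage lost per hour of connected standby."""
    return percent_drained / hours

# Roughly 95% drained between 9pm and 4pm the next day (~19 hours):
print(standby_drain_rate(95, 19))  # 5.0 (%/hour) -- flat in under a day
```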

    image

    What's the fix???

    If you look on this page:  http://www.microsoft.com/Surface/en-US/support/hardware-and-drivers/battery-and-power

    There is an interesting section at the bottom:

    Surface RT only: Battery issue when updating from Windows RT 8.1 Preview

    If you updated Surface RT from Windows RT 8.1 Preview to Windows RT 8.1, you may notice a decrease in battery life. During the update, the wireless adapter power policy isn’t migrated. Instead, the power policy is set to a default value that consumes more power both during use and in the connected standby state.

    To restore the wireless adapter power policy to the correct settings, open an administrator command prompt:

    Step 1:
    Swipe in from the right edge of the screen, and then tap Search.
    (If you're using a mouse, point to the lower-right corner of the screen, move the mouse pointer up, and then click Search.)

    Step 2:
    In the search box, enter command prompt.

    Step 3:
    Touch and hold (or right-click) Command Prompt to bring up the context menu. Tap or click Run as administrator.

    Step 4:
    On the User Account Control dialog box, tap or click Yes.

    Step 5:
    At the Administrator: Command Prompt, enter the following:

    powercfg -setdcvalueindex SCHEME_CURRENT 19cbb8fa-5279-450e-9fac-8a3d5fedd0c1 12bbebe6-58d6-4636-95bb-3217ef867c1a 3

    Step 6:
    Then enter
    powercfg -setactive scheme_current

    Voila…. this fixed mine immediately.  And yes, I did update from 8.1 preview to the full release of 8.1.  My surface can now sit in connected standby mode for an entire day and only consume about 10% of the battery life.  Smile

    image

  • Upgrading Domain Controllers to Windows Server 2012 R2

    Ok, not really an upgrade, but more of “replacement”.  Smile

    With the release of Windows Server 2012 R2 to MSDN which was recently announced HERE, it is time for me to upgrade my lab domain controllers to Windows Server 2012 R2.

    I started by first “upgrading” my Hyper-V hosts to Windows Server 2012 R2.  This would allow me to take full advantage of all the new benefits of 2012 R2 for Hyper-V.  That was pretty simple, just shut down the OS, unplug all my additional storage in the machine which contains all my VM’s, and boot from my USB key that contained WS2012R2.  Then, once I added the Hyper-V role back, I simply connect my storage back to the system, and import the previous VM’s I was running.

    My next step in upgrading my VM’s is targeting the domain controllers.  I have two DC’s, each running AD services, certificate services, DHCP, DNS, etc.  Since I don’t want to risk messing up the complex configuration of each service, I choose to deploy two NEW VM’s for additional DC’s, and I will migrate these additional roles to the new DC’s later.

    My first step is to deploy the two new VM’s.  First decision I need to make is whether to use Gen1 or Gen2 VM’s:

    image

Gen2 VM’s are a new feature of Hyper-V in Windows Server 2012 R2, and offer significant advantages over Gen1 VM’s, such as secure boot, discarding emulated devices like IDE and using SCSI disks even for the boot volumes, PXE capability on a standard NIC, etc.  Read more about Gen2 VM’s here: http://technet.microsoft.com/en-us/library/dn282285.aspx

    Installing Windows Server 2012 R2 is just like any other OS install.  When it stops on the Activation Key screen, I decided to leverage another new feature for Windows Server 2012 R2 – Automatic VM Activation.  You can use these new keys to activate servers when they are running on Windows Server 2012 R2 Hyper-V.  Read more about Automatic VM Activation here:  http://technet.microsoft.com/en-us/library/dn303421.aspx

    I rename the VM’s with the correct server names, and join them to my domain.

    The first step in promoting these new VM’s to Domain Controllers is to add that role, which you can perform from Server Manager. A walkthrough of the process is described here:  http://technet.microsoft.com/en-us/library/jj574134.aspx

    image

    image

    When the role is added – you will see a post-deployment task warning, to run the promotion:

    image

    The wizard will run AD forest prep, schema update, and domain prep for 2012 R2 when you promote the first DC on Windows Server 2012 R2. 

    When it is complete, you will see your new DC’s added to the domain controllers OU in Active Directory.

The next step in the process is to migrate the AD Operations Master (FSMO) roles.  The simplest way to move these roles is via PowerShell.  With the Server 2012 AD PowerShell modules, this can be done from anywhere.  Simply run the following command to view your current configuration, and then change it:

PS C:\> netdom query FSMO
Schema master            DC1.opsmgr.net
Domain naming master     DC1.opsmgr.net
PDC                      DC1.opsmgr.net
RID pool manager         DC1.opsmgr.net
Infrastructure master    DC1.opsmgr.net

Then use the Move-ADDirectoryServerOperationMasterRole cmdlet to move them.  You can do this with a simple one-liner!

    Move-ADDirectoryServerOperationMasterRole -identity "DC01" -OperationMasterRole 0,1,2,3,4

The identity is the server you want to transfer these roles to, and the 0-4 numerics represent each role to move.  Read more about this cmdlet here:  http://technet.microsoft.com/en-us/library/ee617229.aspx
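For reference, here is the 0-4 shorthand spelled out - a small Python sketch of the numeric-to-role mapping the -OperationMasterRole parameter accepts, per the cmdlet documentation:

```python
# Numeric values accepted by Move-ADDirectoryServerOperationMasterRole's
# -OperationMasterRole parameter, mapped to the role names:
FSMO_ROLES = {
    0: "PDCEmulator",
    1: "RIDMaster",
    2: "InfrastructureMaster",
    3: "SchemaMaster",
    4: "DomainNamingMaster",
}

def roles_from_ids(ids):
    """Translate the 0-4 shorthand into FSMO role names."""
    return [FSMO_ROLES[i] for i in ids]

# The one-liner above moves all five:
print(roles_from_ids([0, 1, 2, 3, 4]))
```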

     

    When complete, you can run a “netdom query FSMO” again and ensure that your master roles have been moved successfully.

Then, you simply need to migrate any other roles or services running on the DC’s, then demote them when complete.  To demote the domain controller on Server 2012, simply begin by removing the Active Directory Domain Services role, which will prompt you to demote first with a task link.  Once demoted, you can remove the server from the domain.

  • DPM 2012 R2 – QuickStart Deployment Guide

    The following article will cover a basic install of Data Protection Manager 2012 R2.   A dedicated DPM server, and shared SQL server will be deployed.    This is to be used as a template only, for a customer to implement as their own pilot or POC, or customized deployment guide. It is intended to be general in nature and will require the customer to modify it to suit their specific data and processes.

    This is not an architecture guide or intended to be a design guide in any way. This is provided "AS IS" with no warranties, and confers no rights. Use is subject to the terms specified in the Terms of Use.

    Server Names\Roles:

    • DB01               SQL Database Services, Reporting Services
    • SCDPM01       Management Server, Web Console server

    Windows Server 2012 R2 will be installed as the base OS for all platforms.  All servers will be a member of the AD domain.

    SQL 2012 with SP1  will be the base standard for all database and SQL reporting services. 

    High Level Deployment Process:

    1.  In AD, create the following accounts and groups, according to your naming convention:

    • DOMAIN\DPMAdmins        DPM Administrators group
    • DOMAIN\SQLSVC               SQL service account

    2.  Add the domain user accounts for yourself and your team to the “DPMAdmins” group.

    3.  Install Windows Server 2012 R2 to all server role servers.

    4.  Install Prerequisites and SQL 2012 with SP1.

    5.  Install the DPM Server

6.  Install the DPM Central Console

7.  Deploy Agents

8.  Configure the Central Console

    Prerequisites:

    1.  Install Windows Server 2012 R2 to all Servers

    2.  Join all servers to domain.

    3.  Install all available Windows Updates.

4.  Add the “DPMAdmins” domain global group to the Local Administrators group on each server

5.  On the DPM server, .NET 3.5 SP1 is required. Setup will not be able to add this feature on Windows Server 2012.  Open an elevated PowerShell session (run as an Administrator) and execute the following:

Add-WindowsFeature NET-Framework-Core

***Note – .NET 3.5 source files are removed from the WS2012 R2 operating system.  You might need to supply a source path to the installation media for Windows Server 2012 R2, such as:   Add-WindowsFeature NET-Framework-Core –source D:\sources\sxs

6.  On the SQL server, install the SQL Remote prep.  http://technet.microsoft.com/en-us/library/hh758058.aspx  Run the DPM Setup.exe, then from the screen choose “DPM Remote SQL Prep”.

7.  On the DPM server, install SQL Management Studio.  This is located on the media at \SCDPM\SQLSVR2012SP1\SQLManagementStudio_x64_ENU.exe.  Execute this and walk through the wizard, Installation, New SQL installation, and accept defaults.

8.  Install SQL 2012 with SP1 to the DB server role

    • Setup is fairly straightforward. This document will not go into details and best practices for SQL configuration. Consult your DBA team to ensure your SQL deployment is configured for best practices according to your corporate standards.
    • Run setup, choose Installation > New Installation…
    • When prompted for feature selection, install ALL of the following:
      • Database Engine Services
      • Full-Text and Semantic Extractions for Search
      • Reporting Services - Native
    • Optionally – consider adding the following to ease administration:
      • Management Tools – Basic and Complete (for running queries and configuring SQL services)
    • On the Instance configuration, choose a default instance, or a named instance. Default instances are fine for testing and labs. Production clustered instances of SQL will generally be a named instance. For the purposes of the POC, choose default instance to keep things simple.
    • On the Server configuration screen, set SQL Server Agent to Automatic.  You can accept the defaults for the service accounts, but I recommend using a Domain account for the service account.  Input the DOMAIN\sqlsvc account and password for Agent, Engine, and Reporting.
    • On the Collation Tab – you can use the default which is SQL_Latin1_General_CP1_CI_AS or choose another supported collation.
    • On the Account provisioning tab – add your personal domain user account or a group you already have set up for SQL admins. Alternatively, you can use the DPMAdmins global group here. This will grant more rights than is required to all DPMAdmin accounts, but is fine for testing purposes of the POC.
    • On the Data Directories tab – set your drive letters correctly for your SQL databases, logs, TempDB, and backup.
    • On the Reporting Services Configuration – choose to Install and Configure. This will install and configure SRS to be active on this server, and use the default DBengine present to house the reporting server databases. This is the simplest configuration. If you install Reporting Services on a stand-alone (no DBEngine) server, you will need to configure this manually.
    • Setup will complete.
    • You will need to disable Windows Firewall on the SQL server, or make the necessary modifications to the firewall to allow all SQL traffic.  See http://msdn.microsoft.com/en-us/library/ms175043.aspx

        Step by step deployment guide:

        1.  Install the DPM Server role on SCDPM01. You can also refer to: http://technet.microsoft.com/en-us/library/hh758153.aspx

• Log on using your personal domain user account that is a member of the DPMAdmins group.  This user must have rights to the DPM server and the SQL server, as well as SA rights to the SQL instance.
        • Run Setup.exe
        • In the Install list, click Data Protection Manager.
        • Accept the license and click OK.
        • On the Welcome page, click Next
• Choose to use a stand-alone SQL server, and input the server name.  Input credentials that have rights to this server and the SQL server and instance, and choose “Check and Install”.
        • Resolve any prerequisite issues.  Click Next.
        • Input the Product key, and click Next.
        • Choose an install path, click Next.
• Choose to use Windows Update or not, click Next.
        • Choose to join the CEIP or not, Next.
        • Click Install.
        • Setup Completes.  Click Close.

        2.  Install the Central Console.

        • Installing the Central Console assumes you have already deployed SCOM, as DPM will use SCOM for the centralized management of multiple DPM servers.
        • First – deploy a SCOM agent to the SCDPM server.
        • On your SCOM server, run Setup.exe from the DPM media.  You might need some prerequisite software to run the install.  Correct any issues.  I needed to install the Visual C++ Redistributable from the media at \SCDPM\Redist\vcredist\vcredist2008_x64.exe
        • Install the “DPM Central Console” from the setup screen.
        • Accept the license, OK.
        • Click Next on the Welcome screen
        • Choose server-side and client-side.
        • Fix any prerequisites and click Next.
        • Choose a path, Next
        • Choose to use Windows Update or not, click Install.
        • Click OK, Close.
        • Install the client components anywhere you run the SCOM console and need to administer DPM servers.
        • Import the SCOM management packs for DPM 2012 R2.  They are located on the media at \SCDPM\ManagementPacks
        • Wait enough time for discovery to occur, and ensure that your DPM servers are discovered in the DPM Servers State View:

        image

        3.  Add DPM storage.

        • Add a disk to your VM or physical DPM server for the purposes of containing the replicas and recovery points.  This disk should not have any volumes defined.
        • Open the DPM Console, Management, Disks. 
        • Click “Add” and add any disks available that you want in the backup storage pool.

        image

        4.  Install protection agents

        • In the Console, Management, Agents.  Click “Install”
        • Select Install Agents, and select computers in your domain from the search box or list.  I select some SQL servers, my Domain Controllers, and my Hyper-V Hosts.

        image

• Provide credentials that have local admin rights to install the agent on each computer you chose.
        • Choose No, don’t let DPM restart computers.
• Start the agent install.  The “Task” results view will show you progress.  The “Errors” tab will display details about any that failed.  One of mine failed due to a firewall issue.  See the product documentation about ports necessary for firewalls.

        5.  Create a Protection Group

        • Console > Protection.  Click “New”
        • Choose Servers
        • Select objects to protect on your servers.  DPM automatically detects specific roles, such as SQL, Hyper-V, Exchange, SharePoint.
        • Here I have selected my domain controllers:

        image

        • Give the Protection group a name.  Choose protection to Disk.  Click Next.
        • Set retention time, synchronization, and backup times.
        • Review the Disk Allocation and ensure you have enough storage available for the protection.
        • Start the protection of computers by kicking off the replica now.
        • For a system state/bare metal backup of domain controllers, you will need to ensure the Windows Server Backup feature is installed.

        6.  Protect SQL Server

• The most common SQL server backup routines call the VSS writer in SQL to perform an online backup of the entire database.  This flushes the transaction logs and ensures the database is consistent and restorable for that point in time.  Then, another process backs up the uncommitted transactions on a much more frequent basis.  DPM works in a very similar fashion.
        • Create a new protection group.  Choose “Servers”  Click “Next”
        • Select a SQL server that has a DPM agent, and expand it in the list.  Select a SQL Database(s).   Click Next.

        image

        • Give the protection group a name, and choose disk.  Click Next.
        • Choose a retention period that works with your backup strategy, choose the synchronization frequency (transaction log backups) and select a recovery point time for the express full backup.
        • Review the disk allocation.  Click Next.
        • Select to create the initial replica now.  Next.  Choose defaults for the consistency check.
        • Review the summary and create the protection group.

        7.  Protect Hyper-V Virtual Machines

        • Create a New Protection group.  Choose Servers
        • Expand a Hyper-V server or Cluster in the list.
        • Check the box next to virtual machines that you would like to protect.  When you see “Online” this means the backup will be performed with zero interruption to the VM.  Offline means the backup will pause the VM, take a checkpoint (snapshot) of the VM, and then backup that checkpoint.

        image

        • Give the protection group a name, and choose disk.  Click Next.
        • Choose a retention period that works with your backup strategy, choose the synchronization frequency (transaction log backups) and select a recovery point time for the express full backup.
        • Review the disk allocation.  Click Next.

        8.  Protect SharePoint

        • Ensure you have installed a protection agent on at least one Front End server in the farm, and all SQL servers that hosts databases for the SharePoint Farm.
• On the SharePoint Web Front End server, once you have installed the DPM protection agent, you must run ConfigureSharepoint.exe –EnableSharePointProtection from an elevated prompt.  Provide a SharePoint service account that has full access to SharePoint.  This will configure permissions and the VSS writer for DPM.

        image

        • Create a protection group.  Servers.  Expand your SharePoint Front End server, expand SharePoint, and select your Farm config database.

        image

        • Give your Protection group a name, such as “SharePoint Protection Group”.  Choose Disk protection
        • Select a retention range and a recovery point schedule.  The default is one recovery point per day.  You can select multiple recovery points as frequent as every 30 minutes.
• Configure disk allocation if needed, choose to create the replica now, and accept defaults to run consistency checks when inconsistent.  Create the protection group.
        • The search catalog for individual items is a job that runs once per day.  You will need to wait up to 24 hours after your first replica before this catalog will be available to search individual items in the DPM console. 
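A quick sanity check on the recovery point math (a trivial sketch of my own, not a DPM API):

```python
def recovery_points_per_day(frequency_minutes: int) -> int:
    """How many recovery points a schedule produces per day, given the
    recovery-point frequency (DPM allows as frequent as every 30 minutes)."""
    return (24 * 60) // frequency_minutes

print(recovery_points_per_day(30))       # 48 per day at the maximum frequency
print(recovery_points_per_day(24 * 60))  # 1 per day, the default
```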

        9.  Backup DPM with Windows Azure

        • This is covered at http://technet.microsoft.com/library/jj728752.aspx
        • You will want to create a new self-signed certificate using MakeCert.exe.  Details on making the cert are located here:  http://technet.microsoft.com/en-US/library/hh831761.aspx
        • In your Windows Azure account, create a New > Data Services > Recovery Services > Backup Vault
• Upload your .CER certificate to the vault, so registered servers with the same cert's private key can authenticate to this vault.
        • Download and install the Windows Azure Backup agent on the DPM server.
        • Open the DPM console AFTER the WAB agent is installed, select Management, Online.  In the ribbon, choose Register.
        • Browse for your locally installed certificate that you created with MakeCert and imported on the DPM server from a PFX file.
• Now you will automatically connect and browse Windows Azure vaults that correspond to this certificate.  Select the vault you recently created from the drop-down.
        • Choose a Proxy Server if necessary.
        • Set up throttling for your internet traffic.
        • Create a local folder on a volume that has enough space for a staging area for any recoveries.
        • Create an encryption passphrase, and copy this to a safe location.
        • Click Register.

        Validate your protection is working.  Look at protection groups, and view the monitoring jobs and alerts in the console.

        After enough time has passed, you will see new data in the Central (SCOM) Console.  Such as discovered disks, Protection groups, Protected servers, etc.

        image

        10.  Enable End User Self Service Recovery

• A Schema Extension is required in the Domain in order to use Self Service Recovery.  There is an issue with the Schema Extension tool that ships with DPM 2012 R2: it crashes when trying to run on my Windows Server 2012 R2 domain controllers.  The workaround is to go get the same tool from the SP1 installation, and use that.  The file is located at C:\Program Files\Microsoft System Center 2012 R2\DPM\DPM\End User Recovery\DPMADSchemaExtension.exe.  You need to deploy a DPM 2012 SP1 server, and get the file from there.  The schema extensions have not changed.  Copy this file to a domain controller and log in with an account that is a Schema Admin with rights to update the AD schema.  Execute the file. 

        **Note – if you have already updated your schema previously for DPM in the past, you don’t need to do this step again.

        image

• Enter in the DPM server name (NetBIOS name only, not FQDN)

        image

        • Next enter in the domain name of the DPM server

        image

        • Leave the third window blank, we will assume we are only using a single domain here.  Just click OK.

        image

        • The update will start when you click OK on the next screen, and will notify you when complete.

        image

        image

        • Now on your DPM server, close, and reopen the console.
        • In the ribbon at the top – click Options.
        • Select the End-user Recovery tab.
        • Now you have the option to enable End User Recovery:

        image

        • Enabling this will cause this popup:

        image

        image

• Input the Database Server name and instance name.  For a default instance, just use the server name.  You must use the FQDN:

        image

        • Configure SSR to recover to alternate locations or not.
        • Complete the role creation:

        image

        image

        • Once installed, run the tool, and connect to your DPM server
        • Select “New Recovery Job”
        • The wizard will allow you to see the instances and the databases that you have rights to recover:

        image

        • You can then select a Date and Time that you want to recover from, and specify location, etc.

        image

      • DBcreatewizard or just run good old SetupOM.exe - which should I use to install the Database component of OpsMgr?

        There has always been a bit of confusion on when to run the DBCreateWizard.exe tool, or when to just use SetupOM.exe to create the Operational DB or Data Warehouse DB.

        Historically.... in MOM 2005, we used the DBcreate Wizard in order to create the Onepoint database on Active/Active clusters..... or when SQL DBA teams refused to run an MSI-based setup on one of their SQL servers.  The DBcreate Wizard was a better option for them.... since it did not have to install any binaries on a SQL server.  In practice.... it was pretty rare to see this in widespread use.

         

        In OpsMgr 2007, we haven't really documented all the scenarios for when you should run the DBcreate Wizard.... and I will try to do that here. 

         

        The DB create wizard is located on the CD - in the \SupportTools folder.  It does require some additional files to run - these don't have to be "installed"; they just need to be copied over to the SQL DB server where you will run the wizard.  Follow:  http://support.microsoft.com/kb/938997/en-us

        ***  Note - the additional files required to run DBCreateWizard.exe are documented in the KB article above.  They were also provided on the SP1 Select CD.  However - the files provided on CD are for 32bit x86 only.  If you are using the DBCreateWizard on an x64 platform - you MUST copy the files listed in the KB article from an x64 server.... any x64 server with the console installed will have them.

        Note - there were some significant issues with the RTM version of this tool... in detecting the correct SQL instance on a multi-instance cluster, and leaving some table information blank (http://support.microsoft.com/kb/942865/en-us).  When deploying SP1 - Use the SP1 version of this tool.  If you MUST deploy the RTM version - I would recommend using SetupOM.exe for all installs.

         

        Ok.... first, you will notice in the OpsMgr Deployment guide, it instructs you to use the DBcreateWizard when installing the database on an Active/Passive cluster.  That's pretty much our first introduction to this tool.  While this isn't required (you can simply run SetupOM.exe on the Active node) it is recommended to use DBCreateWizard.  Essentially, our recommendation is that anytime you have a dedicated SQL server for the OpsDB role... with no other OpsMgr role present, you should use the DBcreateWizard to create the Operational database.  The reason for this, from an internal discussion I have been involved in.... is that using SetupOM.exe will create some additional registry entries on the database server... and will change how updates are applied to the server from an OpsMgr perspective.  Another scenario to leverage this tool is anytime your SQL DBA teams refuse to allow you to run an MSI-based setup on their SQL servers/clusters.

         

        Below, I will just walk through some of the scenarios where using this stand-alone tool really makes good sense.

         

        Scenarios:

        1.  All in one role/shared roles.  This is where a single server hosts SQL Server 2005 and the Operational Database role, along with the RMS role.  In this case.... you might as well just run SetupOM.exe and create the database while installing the management group.  You potentially could run the DBcreatewizard first.... but this would be an additional step and provides no value.

        2.  Split roles:  Dedicated SQL server (Server A) and dedicated RMS (Server B).   In this scenario - we recommend using DBcreatewizard.exe instead of just running SetupOM.exe on the SQL server.   However - you certainly can do either one.... both are fully supported.

        3.  Split roles - clustered DB:  Dedicated cluster for SQL (can be A/P or A/A or multi-instance or multi node.... doesn't matter)  In this scenario - we recommend using DBcreatewizard.exe instead of just running SetupOM.exe on the SQL server.  That said.... you can run SetupOM.exe on any node that owns the SQL instance you are creating the DB in.... we just favor using DBcreateWizard.

        4.  Draconian DBA's.  In general.... DBA's are used to creating an empty database for an application, then granting permissions to the DB only.... then washing their hands of it.  They don't like running setups... or even running tools on their SQL servers....  If they must have an application create a database as part of that application install - they MUCH prefer that all the DB creation be handled remotely.  Unfortunately.... MOM 2005 and OpsMgr 2007 do not support what DBA's would most like to see.  We must run our setup or tool on the database server/node in order to install that component.  I suppose we could install the OpsDB using the DBcreatewizard on a test lab SQL box, then detach it.... then hand the files to a SQL team and have them drop it into a production environment to make them happier.... but I haven't really done much testing there.  Anyway.... the DBcreateWizard is the best option when working with a rigid DBA team.  Just follow the KB article listed above... and have the SQL team run the tool to create the database.... then they can delete the tool from the server.  We will still require SA priv over the instance to complete the RMS setup.... but once that is done, they can remove these advanced rights, per my previous post:  http://blogs.technet.com/kevinholman/archive/2008/04/15/opsmgr-security-account-rights-mapping-what-accounts-need-what-privileges.aspx

        5.  Multiple Operational Databases in the same SQL instance.  It is possible, if you have multiple management groups, that you could place all the Operational DB's into a single SQL instance.  Now - these had better be small environments (test/dev), or on a beefy SQL server that can handle all that I/O.... but just for grins.... let's say you are doing it.  If you tried to run SetupOM.exe and install the database component multiple times.... it would detect it was already installed and ask you if you wish to repair or remove OpsMgr.  No good.  In comes the DBcreateWizard.  This tool is the supported method for creating multiple OpsDB's in a single SQL instance.

      • Creating custom dynamic computer groups based on registry keys on agents

        I have had a few requests now for this, so I thought I would take the time to write up the process.

         

        image

         

        Let's say I have three support levels of servers:

         

        Level 1 – servers critical to business operations (ex: customer facing web applications, SQL back-ends)

        Level 2 – important servers (ex: messaging, internal apps)

        Level 3 – non-essential servers (ex: non-critical or highly redundant internal apps)

         

        Let's say we want to create overrides for certain rules…  where we will page on anything in the Level 1 group, email notify on the Level 2 group, and simply alert for Level 3.  Possibly we want to create views, and only see alerts for Level 1 servers.  Perhaps we wish to scope users so they only see Level 1 and Level 2 servers in the console?

        Well – the first step is to place these servers into groups.

        Sure – we can do this manually, with explicit assignments to the group.  But that is resource intensive over time, and we might miss one down the road.  I’d prefer to dynamically create the groups of Windows Computers based on a name…. but this can be difficult sometimes – where we don't have a solid naming scheme, or other criteria to group by.

         

        I will demonstrate another way to accomplish this… by coming up with a business process to use a registry key on your managed servers, and collect this registry attribute with SCOM.  Then – use this Registry attribute for dynamic group memberships.

         

        Ultimately – there are three simple steps to this process:

        1.  Create registry keys on agents.

        2.  Extend a class with an attribute, to discover the registry keys and values.

        3.  Create dynamic groups based on the attribute values from the registry.

        It is just that simple.

         

         

        To get started – let's talk about our custom registry key.  For this example, I am going to create a new Key at HKLM\Software\ and call it “CompanyName”

        Next – in that key – I will create a new DWORD Value, named “SupportLevel”

        Lastly – I will assign a numeric value to “SupportLevel” on each server, either 1, 2, or 3.

        image

         

        In my environment…. my Hyper-V servers are critical.  They host all of my VM’s, including many business critical applications.  Therefore – they will get Level 1.

        My Exchange 2007 servers handle all my mail traffic and notifications, so I will set their registry value to Level 2.

        My Exchange 2003 servers have been retired – for MP testing only… so we will set those to Level 3.

         

        Here is a table that shows what I am planning:

        ServerName SupportLevel
        VS1 1
        VS2 1
        VS3 1
        EX1CLN1 2
        EX1CLN2 2
        EXCAS 2
        EX2CLN1 3
        EX2CLN2 3
        OWA 3

         

        So – I get all my registry values set on all computers.  This is a big job at first, but it is a one-time deal, and you can even script it if you are handy.
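        If you do want to script it, one approach (a sketch only – the key path and value name follow the example above, and the remote \\server syntax assumes reg.exe can reach each machine) is to generate the reg.exe commands from the table:

```python
# Generate the reg.exe commands that set the SupportLevel registry value
# on each server, following the example key HKLM\Software\CompanyName.
SUPPORT_LEVELS = {
    "VS1": 1, "VS2": 1, "VS3": 1,
    "EX1CLN1": 2, "EX1CLN2": 2, "EXCAS": 2,
    "EX2CLN1": 3, "EX2CLN2": 3, "OWA": 3,
}

def reg_command(server: str, level: int) -> str:
    # reg.exe can target a remote machine by prefixing the key with \\server\
    return (rf"reg add \\{server}\HKLM\Software\CompanyName "
            rf"/v SupportLevel /t REG_DWORD /d {level} /f")

for server, level in sorted(SUPPORT_LEVELS.items()):
    print(reg_command(server, level))
```

        Run from an account with admin rights on the targets, this sets all nine values in one pass instead of visiting each server in regedit.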

         

        Next… we need to discover these registry entries in SCOM, as attributes of a class.  Then we can use that attribute to group objects.  Since I want Windows Computer objects in my groups (Windows Computer is a good object for most overrides, scoping, notifications…etc..) we would like to have these attributes added to the Windows Computer class.

        However – there is a problem.  The Windows Computer object is in a sealed MP.  We cannot just add information to that class as we would like.  Therefore – OpsMgr allows us to “Extend” an existing class… and add our custom attributes to it.  This “Extended” class is basically a copy of the existing class… it will have all the built-in attributes of Windows Computer, and will also have our custom attribute properties.  It’s easier to see it than to talk about it.

         

        First – in the Ops console – authoring pane – go to Attributes.  Create a new attribute.  I am going to call this one “SupportLevel”

         

        image

         

        Next – choose “Registry” for the discovery type.

        Next – We need to pick the Target class.  We want Windows Computer.  Note – this will create a new class, named “Windows Computer_Extended” by default.  We can use this name, or you can rename this whatever you want.  It is your class.  I will leave it at the default.

         

        image

         

        Most important!  Management Pack location.

         

        This is CRITICAL.  Spend some time making sure you are creating these attributes in the correct location.  If you leave this MP as unsealed XML…. then any groups you create that use these attributes will have to be placed in this same MP.  Then – if you use these groups for Overrides – those overrides will be forced to go into this same MP.  There is a “cardinal rule” in SCOM… objects in one unsealed MP cannot reference another unsealed MP.  So – we cannot have a group in one unsealed MP, and then use that group for an override in another unsealed override MP. 

        So – we have two choices. 

        1.  Keep an unsealed MP… and live with the fact that the attribute, groups, and overrides will all have to be placed here. 

        2.  Create the attribute and the dynamic group in the MP, then seal it.  Then – you can use this group in ANY of your override MP’s… for Exchange, SQL, etc…

        I strongly recommend option #2 for this exercise… but you can make this decision for yourself.

         

        image

         

        Ok…. I will choose Option #2 (seal the MP), so I will create a new MP just for this extended class, and groups.

         

        On the next screen – we can put in our registry information:

        In this example – I am looking for a registry Value (1, 2, or 3), and my attribute type is “Int” for integer.

        For the frequency, set this to a reasonable frequency to discover your machines as they come on to your network.  Typically, once per day is sufficient (86400 seconds).  Remember – this will run against ALL your Windows Computers… so never set this more frequently than once per hour… that creates unnecessary overhead.

        image

         

        Ok – lets examine our work!

        Go to Monitoring, Discovered Inventory, and change target type to our new class “Windows Computer_Extended”

        If you do this quickly – you may find it is empty.  This is what is happening behind the scenes:  All Windows Computers are now downloading our newly created MP.  They are going to run the registry attribute discovery, and submit their discovery data to the management server.  The Management Server will insert this discovery data in the database.  Over time, you will start to see all your Windows Computers pop into this class membership.  You will notice a new attribute now, in addition to all the existing Windows Computer attributes.  This attribute is “SupportLevel” and will be 1, 2, 3, or empty… depending on what each agent finds in the registry.

        Now – I set my registry discovery to once per day…. so I will need to wait 24 hours before I can expect all my healthy agents to show up in this list.  To speed things up – I am going to bounce the HealthService on these example agents.  (Agents run all discoveries when a HealthService restarts, and then on their frequency schedule)

        Here is an example a few minutes after bouncing the HealthService on some agents:

         

        image

         

        Next on the list – create the groups.  I will create these in the same MP that the attributes exist in.

         

        I will call my first group “CompanyName – Support Level 1 Servers Group”.  I like to append the word “Group” to all groups I create as a best practice.  This helps us recognize that the class is actually a group when we see it in the list of classes in the UI.  I sure wish all MP authors would take this to heart, since every group is actually a singleton class.

         

        image

         

        On the dynamic members screen – I will find my “Windows Computer_Extended” class – and click Add.  What we now see – is that we have a new attribute to use, “Support Level”

         

        image

         

        I will set this group to “SupportLevel Equals 1” and click OK.

         

        image

         

        Now – I can right-click my new group – and choose “View Group Members”

         

        image

         

        image

         

        Yee-haw!  It works!  Now – I simply repeat this above step – creating groups for SupportLevel 2, and 3.
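        Conceptually, the three dynamic membership rules are just filters over the extended class instances.  A rough Python sketch of the logic (illustrative only – SCOM evaluates this inside the management group, and the names follow the example above):

```python
# Sketch of what the three dynamic membership rules compute: each group
# holds the Windows Computer objects whose discovered SupportLevel
# attribute equals the group's level.
discovered = {
    "VS1": 1, "VS2": 1, "VS3": 1,
    "EX1CLN1": 2, "EX1CLN2": 2, "EXCAS": 2,
    "EX2CLN1": 3, "EX2CLN2": 3, "OWA": 3,
}

def group_members(attributes: dict, level: int) -> set:
    """Computers whose SupportLevel attribute equals the given level."""
    return {name for name, lvl in attributes.items() if lvl == level}

groups = {lvl: group_members(discovered, lvl) for lvl in (1, 2, 3)}
print(groups[1])   # membership of the Support Level 1 Servers Group
```

        Computers with an empty SupportLevel (no registry value found) simply fall into none of the groups – exactly the behavior you see in the console.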

         

        image image

         

         

        Now – that is done.  This is the point where I recommend we stop… take a breather…. then seal the MP.  If you seal the MP – we will be able to use the groups for overrides in any other override MP.  If you choose not to seal the MP now… any overrides you use the groups for – will be forced into this same MP.  Please keep that in mind.

        Since I am harping on sealing the MP…. I am going to do a quick example of just that.  Jonathan Almquist has an excellent tutorial on sealing MP’s HERE and we will use his example.

        **Note – when running the sn.exe commands to create our key…. we only need to do this once… not every time we want to seal an MP.

        ***Critical note – you need to keep a backup of this key… because it will be required for making updates to this MP in the future, re-sealing, and keeping the ability to upgrade the existing MP in production.

        So, I create the folders, create the key using sn.exe, copy over the referenced MP’s from the RMS,  and now I am ready to seal.

        MPSeal.exe c:\mpseal\input\CompanyName.SupportLevel.MP.xml /I "c:\mpseal\mp" /Keyfile "c:\mpseal\key\PairKey.snk" /Company "CompanyName" /Outdir "c:\mpseal\output"

         

        Works great.

         

        image

         

        Now – I can delete my unsealed MP from the management group, and import my sealed MP.

         

        Phew.  All the heavy lifting is done.  Now… I have my groups… I can start setting up overrides using these groups, or scoping notifications. 

        On my Support Level 1 group – I will use this to set up my pager Notification subscriptions to only page based on specific classes, and this group.

        On my Support Level 2 group – I will use this to override important alerts to High Priority… because I am using High Priority as a filter for email notifications, per my previous blog post here:  http://blogs.technet.com/kevinholman/archive/2008/06/26/using-opsmgr-notifications-in-the-real-world-part-1.aspx

        On my Support Level 3 group – I will use this group for tweaking/disabling rules and monitors for the group… turning off discoveries so they don't discover lab servers, scoping views, etc.

         

        Maybe in my next post…. I will build on this MP… and show a really simple way to add the Health Service Watcher objects to these dynamic groups… for each Windows Computer object that is in the group – so we can use these groups for Heartbeat failure notifications.

      • Renaming your Default Management Pack

        I didn’t come up with this idea…. I got it from Cameron Fuller who got it from Rory McCaw’s session at MMS last year.  So credit goes to both of them for the idea and initially spreading it.  As I was talking to many of my colleagues, I found out this is not a commonly known practice.  So, maybe this is a good topic to write about to spread the information.

         

        The default management pack is the first MP that shows up in the list when creating an override.  However, it is a best practice to NEVER save anything to it.  Admins will often forget this best practice, and accidentally save items here.  Then – a cleanup of this MP is often required, as documented here:  Clean up the Default MP.  This problem has even prompted some customers to monitor when changes are made to their default MP.

         

        There is no supportable way to make this MP “read only” or sealed.  However – we CAN rename this MP to provide a visual warning that might help you or your customer remember not to save things here.  While we cannot rename the ID of an MP, we can rename the Display Name of any unsealed MP.

         

        In the console, Administration pane, Management Packs, find your default MP.  Bring up the properties – and you can rename this:

         

        image

         

        To something like this:

         

        image

         

         

        This will hopefully remind you or your customer not to save overrides to this MP.  Below is what you will see when creating an override:

         

        image

      • After moving your OperationsManager Database–you might find event 18054 errors in the SQL server application log

        I recently wrote about My Experience Moving the Operations Database to New Hardware

        Something I noticed today – is that the application event log on the SQL server was full of 18054 events, such as below:

        Log Name:      Application
        Source:        MSSQL$I01
        Date:          10/23/2010 5:40:14 PM
        Event ID:      18054
        Task Category: Server
        Level:         Error
        Keywords:      Classic
        User:          OPSMGR\msaa
        Computer:      SQLDB1.opsmgr.net
        Description:
        Error 777980007, severity 16, state 1 was raised, but no message with that error number was found in sys.messages. If error is larger than 50000, make sure the user-defined message is added using sp_addmessage.

        You might also notice some truncated events in the OpsMgr event log, on your RMS or management servers:

        Event Type:    Warning
        Event Source:    DataAccessLayer
        Event Category:    None
        Event ID:    33333
        Date:        10/23/2010
        Time:        5:40:13 PM
        User:        N/A
        Computer:    OMMS3
        Description:
        Data Access Layer rejected retry on SqlError:
        Request: p_DiscoverySourceUpsert -- (DiscoverySourceId=f0c57af0-927a-335f-1f74-3a3f1f5ca7cd), (DiscoverySourceType=0), (DiscoverySourceObjectId=74fb2fa8-94e5-264d-5f7e-57839f40de0f), (IsSnapshot=True), (TimeGenerated=10/23/2010 10:37:36 PM), (BoundManagedEntityId=3304d59d-5af5-ba80-5ba7-d13a07ed21d4), (IsDiscoveryPackageStale=), (RETURN_VALUE=1)
        Class: 16
        Number: 18054
        Message: Error 777980007, severity 16, state 1 was raised, but no message with that error number was found in sys.messages. If error is larger than 50000, make sure the user-defined message is added using sp_addmessage.

        Event Type:    Error
        Event Source:    Health Service Modules
        Event Category:    None
        Event ID:    10801
        Date:        10/23/2010
        Time:        5:40:13 PM
        User:        N/A
        Computer:    OMMS3
        Description:
        Discovery data couldn't be inserted to the database. This could have happened because  of one of the following reasons:

             - Discovery data is stale. The discovery data is generated by an MP recently deleted.
             - Database connectivity problems or database running out of space.
             - Discovery data received is not valid.

        The following details should help to further diagnose:

        DiscoveryId: 74fb2fa8-94e5-264d-5f7e-57839f40de0f
        HealthServiceId: bf43c6a9-8f4b-5d6d-5689-4e29d56fed88
         Error 777980007, severity 16, state 1 was raised, but no message with that error number was found in sys.messages. If error is larger than 50000, make sure the user-defined message is added using sp_addmessage..

         

        After a little research – apparently this is caused when following the guide to move the Operations Database to new hardware. 

        Marnix blogged about this issue http://thoughtsonopsmgr.blogspot.com/2009/02/moving-scom-database-to-another-server.html which references Matt Goedtel’s article http://blogs.technet.com/b/mgoedtel/archive/2007/08/06/update-to-moving-operationsmanager-database-steps.aspx

         

        Because in this process – we simply restore the Operations Database ONLY, we do not carry over some of the modifications to the MASTER database that are performed when you run the Database Installation during setup to create the original operations database.

        For some OpsMgr events, which stem from database activity, we get the event data from SQL.  If these messages do not exist in SQL – you see the above issue.

        What is bad about this – is that it will keep some event rules from actually alerting us to the condition!  For instance – the rule “Discovery Data Submission Failure” which will alert when there is a failure to insert discovery data – will not trigger now, because it is looking for specific information in parameter 3 of the event, which is part of the missing data:

         

        image

         

        To resolve this – we need to add back the missing information into the MASTER database. 

        • IF you have moved your OperationsManager database to new hardware

        AND:

        • IF you are seeing event 18054 events in the application log of the OpsDB SQL instance server.

        Then you are impacted.  To resolve this – you should run the attached SQL script against the Master database of the SQL instance that hosts your OperationsManager Database.  You should ONLY consider running this if you are 100% sure that you are impacted by this issue.

        See attached:  Fix_OpsMgrDB_ErrorMsgs.sql

      • DNS MP update ships – support for DNS on Windows Server 2008 R2 and many fixes

        The DNS Management pack has been updated.  The current version as of this article is 6.0.7000.0

         

        Get it from the download center:

        http://www.microsoft.com/downloads/en/details.aspx?FamilyID=633B718F-5FE8-47D5-A395-8203F8EC354F

         

         

        This is a GREAT update.  Here are some key changes in this version:

         

        • Added support for Windows Server® 2008 R2 DNS server.

        That’s pretty self-explanatory.  This version now fully supports the DNS service running on the Windows Server 2008 R2 OS.

        • Changes in how “PrimaryServer” and “SerialNumber” properties are updated

        This is HUGE!  The DNS MP was one of the primary causes of Config Churn which I wrote about here:  http://blogs.technet.com/b/kevinholman/archive/2009/10/05/what-is-config-churn.aspx   With this update – that churn is now resolved.  The properties of PrimaryServer and SerialNumber no longer change on a frequent basis.  This is a big improvement and the biggest reason to get this update in place ASAP.

        • Enhancements to Forwarder Availability Monitor

        Several changes were made here: the monitor was changed from Internal to Public, its severity from Error to Warning, its interval from 900 to 913 seconds, and NSLookup now uses the timeout parameter.

        • New views defined:
          • Forwarder state view
          • Zone state view

        These views were not present before – now you can spot-check the health of individual zones and forwarders quickly.

        • Scripts-Timeout changed from 30 to 300

        This enhancement gives scripts more time to complete on busy DNS servers or DNS servers with large numbers of components.  The change really wasn’t from 30 > 300 on all workflows – rather, all workflows, regardless of their previous timeout, have been set to 300 seconds or more.

        • All monitors changed from “Internal” to “Public”

        This allows you to be able to create and add recoveries and diagnostics on any monitor.  When flagged as “internal” they cannot be referenced in an unsealed MP.

        • All classes made “Public”

        This allows you to create custom scoped views for any class in the MP – referencing them in another custom MP of your choice.

        • New rules defined for all monitors with manual reset

        There are 4 new rules added which are disabled out of the box.  These can be used to quickly replace the included manual reset monitors (with matching names) if your organization cannot use manual reset monitors due to lack of console use (e.g. an enterprise connector as the primary ticketing and notification system).

        Microsoft.Windows.DNSServer.2008.EventCollection.RootHintsConfiguration.ConfigureRootHints

        Microsoft.Windows.DNSServer.2008.EventCollection.RootHintsConfiguration.ConfigureRootHints.Warning

        Microsoft.Windows.DNSServer.2008.EventCollection.RPCProtocolInitialization.RestartRPCService

        Microsoft.Windows.DNSServer.2008.EventCollection.WINSNetbiosInitialization.ConfigureWINSRSettings

         

        Some other changes were also made, in addition to what's in the guide.  Most were setting a handful of monitors from Error to Warning (State and Alert) and changing the frequency of many workflows from 900 seconds to 913 seconds… this was likely done to keep multiple workflows from running at the same time and creating false alerts due to server load when multiple workflows trigger on the same frequency.  Views were renamed to reflect the 2008 R2 support.

      • How to monitor a process on a multi-CPU agent using ScaleBy

        The business need:

        It is a very common request to monitor a process on a given set of servers, and collect that data for reporting, or monitor it for a given threshold.

        One thing you might notice when trying to monitor some performance counters, is that not all perf counters in perfmon behave the way you might assume.

        For instance, I want to monitor “how much CPU a process is using”.  Perhaps we wish to monitor our SQLServer.exe process on our SQL servers?

        This is easy – because Perfmon already has a Performance Object, Counter, and Instance for that.  In perfmon, we would use:

        Process > % Processor Time > Sqlserver.exe

        image

         

        Easy enough!

        So, we can quite easily create a performance threshold monitor, and a performance collection rule using this.  Let’s say we set the monitor to alert anytime the SQLserver.exe process is consuming more than 80% of the CPU sustained for 5 minutes.

         

        The issue:

         

        However, quite quickly we might notice erratic behavior from our monitor and rule.  The monitor is generating TONS of alerts from almost all our SQL servers, and then quickly closing them… essentially flip-flopping.  When we check the performance data we have collected, we see the process is using up to 800% CPU!!!  So – thinking something is wrong with OpsMgr – we inspect a busy SQL server in perfmon directly… but observe the exact same behavior:

        image

         

        As you can see – this process is using almost 400% CPU.  Why?  How is this possible?

         

        This is because the Process monitoring counters in Windows are not multi-CPU aware.  When a server has 4 CPU’s (like this one above does) a process can use more than one CPU at a time, provided it is spawning multiple threads.  This way, it can be using up to 100% of each CPU or Core (logical processor).  A process on a 4 processor server can consume up to 400% of that process counter.  So if a process is really only consuming 20% of the total CPU, that will show up as 80% on a 4-core machine.  Think about today’s hardware… many boxes have up to 16 cores these days, which would register as 320% processor utilization for something really only using 20% of the total CPU.

        As you can see – this causes a BIG problem for monitoring processes.  As an IT Pro – you need to know when a process is consuming more than (x) percent of the *total system resources*…. and every server will likely have a different number of processors.
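        The arithmetic is easy to sanity-check (a toy calculation, not SCOM code):

```python
# The Process\% Processor Time counter sums utilization across all logical
# processors, so its raw value ranges from 0 to (100 * cores) percent.
def raw_counter_value(total_cpu_percent: int, logical_processors: int) -> int:
    """Raw perfmon value for a process using this percent of *total* CPU."""
    return total_cpu_percent * logical_processors

print(raw_counter_value(20, 4))    # 20% of total on a 4-core box  -> 80
print(raw_counter_value(20, 16))   # 20% of total on a 16-core box -> 320
```

        So a fixed threshold of 80% means something completely different on every hardware configuration – which is exactly why the monitor flip-flops.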

         

        The solution:

         

        In OpsMgr R2 – a new XML based function was created to help resolve this challenge.  This is known as <ScaleBy>

        The <ScaleBy> function essentially gives you the ability to take the monitoring data collected by something (that is an integer), and divide by something else (integer).

        I can input a fixed value here, in integer form, or I can input a variable.  For the variable, I can actually pull data from discovered properties of monitoring classes.  This is GREAT in this instance, because we already discover the number of Processors a Windows Computer has.  We can use this discovered data, along with this <ScaleBy> function, to fix our monitors and collection rules that need a little massaging to the data we get from perfmon.

        Here are the Windows Computer class properties:

        image

         

        Let’s walk through an example using the authoring console.

        • Open the Authoring console.
        • Create a new empty management pack.
        • Go to Health Model, Monitors, right click and create a new monitor. 
        • Windows Performance > Static Thresholds > Consecutive Samples.
        • Give your workflow an ID, Display Name, and choose a good target class which will contain your process.  I will use Windows Server Operating System for example purposes, but you want to always try to choose a target class that will have your process counter in perfmon.
        • Select System.Health.PerformanceState as the parent Monitor:

         

        image

         

        • Browse a SQL server for the process object you will need – or type in the relevant data.  I will set my samples for the monitor to inspect every minute.  This data is not collected and inserted in the database for a monitor – this sample data is kept on the agent for inspection of a threshold match… so we can monitor the process with a MUCH higher sample rate than we would ever do a performance collection rule.

         

        image 

         

        • I set my monitor to change state when 5 consecutive samples have all been over 80% CPU:

         

        image

         

        • Click finish – then open the properties of the monitor you just created.  Go to the configuration tab.  Here are all the typical configurable items in a performance monitor workflow. 

         

        image

         

        • However – we need to add one more – the <ScaleBy> function.

We have to do this in XML, as no UI was added for this capability.  Click “Edit” on the configuration tab, which will pop up the XML of this configuration.

        We are going to add a single line after <Frequency> which will be this line:

        <ScaleBy>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/LogicalProcessors$</ScaleBy>

        What this does – is tell the workflow to take the numeric value received from perfmon, and then divide by the numeric value that is a property of the Windows Computer class for number of logical processors.  Then take THIS calculated output and use that for collection or threshold evaluation.

        Here is my finished XML snippet:

         

<Configuration>
  <ComputerName>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
  <CounterName>% Processor Time</CounterName>
  <ObjectName>Process</ObjectName>
  <InstanceName>sqlservr</InstanceName>
  <AllInstances>false</AllInstances>
  <Frequency>60</Frequency>
  <ScaleBy>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/LogicalProcessors$</ScaleBy>
  <Threshold>80</Threshold>
  <Direction>greater</Direction>
  <NumSamples>5</NumSamples>
</Configuration>
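To sanity-check how this finished configuration behaves, here is an illustrative Python sketch (my own, not OpsMgr code) of the evaluation: each raw sample is divided by the ScaleBy value, and the monitor changes state only when NumSamples consecutive scaled samples exceed the Threshold.  The scale_by of 4 below assumes a computer with 4 logical processors.

```python
def monitor_trips(samples, scale_by=4, threshold=80, num_samples=5):
    """Return True when the last `num_samples` scaled samples all exceed `threshold`."""
    scaled = [s / scale_by for s in samples]
    recent = scaled[-num_samples:]
    return len(recent) == num_samples and all(v > threshold for v in recent)

# Raw 340% on a 4-processor box scales to 85% of total CPU - over threshold:
print(monitor_trips([340] * 5))          # True
# One sample under the scaled threshold breaks the consecutive streak:
print(monitor_trips([340] * 4 + [300]))  # False (300 / 4 = 75, not > 80)
```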

         

        Now – the authoring console was not updated to fully understand this new function, so you might see an error for this.  Simply hit ignore.

        Your new monitor configuration now looks like this:

        image

        You can do the exact same operation on a performance collection rule as well to “normalize” this counter into something that makes more sense for reporting.

         

Some other uses of this might be for situations where a counter is in bytes… and you want it reported in megabytes.  You could hard-code a <ScaleBy> of 1000000 (one million).  That way, if you wanted to report on how many megabytes a process was consuming over time, instead of representing this as 349,000,000 on a chart (bytes) you can represent it as a simple 349 megabytes.  That XML would simply be:

        <ScaleBy>1000000</ScaleBy>
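The arithmetic here is trivial, but note that a ScaleBy of 1,000,000 yields decimal megabytes; if you wanted binary mebibytes you would use 1,048,576 instead.  A quick check:

```python
SCALE_BY = 1_000_000  # decimal megabytes, matching the XML above

raw_bytes = 349_000_000
print(raw_bytes / SCALE_BY)  # 349.0 - charted as "349 MB" instead of 349,000,000
```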

        Ok… I hope this made some sense…. this is a valuable method to normalize some perfmon data that might not be in what I call “human format”.  Keep in mind – you can ONLY use this XML functionality on an R2 management group, and it will only be understood by an R2 agent.

        You can quickly go back to your previously written process monitors, and add this single line of XML really easily, using your XML editor of choice.

         

One last thing I want to point out…  some of the previously delivered MPs that Microsoft shipped might be impacted by this issue.  For instance, in the current ADMP version 6.0.7065.0 there is a monitor “AD_CPU_Overload.Monitor” (AD Processor Overload (lsass) Monitor) which does not take into account the number of logical processors.  This is often one of the MOST noisy monitors in my customer environments, especially on a busy domain controller, simply because MOST DCs have more than one CPU, which skews this monitor’s ability to work.  The issue is that they could not add this <ScaleBy> functionality to the MP, because that would make the ADMP R2-only… which we don't want to do.

You have two workarounds for SP1 management groups:  monitor processes using a script that queries WMI for the number of CPUs and handles the math for this function (ugly), OR create groups of all Windows Computers based on their number of logical processors (easy) and then override these types of monitor thresholds with numbers relevant to their processor count.

        For R2 customers – I recommend disabling this monitor in the ADMP – and replacing it with a custom one that utilizes the <ScaleBy> functionality.

         

      • Deploying Unix/Linux Agents using OpsMgr 2012

        Microsoft started including Unix and Linux monitoring in OpsMgr directly in OpsMgr 2007 R2, which shipped in 2009.  Some significant updates have been made to this for OpsMgr 2012.  Primarily these updates are around:

• Highly available monitoring via Resource Pools
• Sudo elevation support, for using a low-privilege account with elevation rights for specific workflows
• SSH key authentication
• New wizards for discovery, agent upgrade, and agent uninstallation
• Additional PowerShell cmdlets
• Performance and scalability improvements
• New monitoring templates for common monitoring tasks

         

        This article will cover the discovery, agent deployment, and monitoring configuration of a Linux server in OpsMgr 2012.  I am going to run through this as a typical user would – and show some of the pitfalls if you don’t follow the exact order of configuration required.

         

        So what would anyone do first?  They’d naturally run a discovery, just like they do for Windows agents.  However – this will likely end up in frustration.  There are several steps that you need to configure FIRST, before deploying Unix/Linux agents.

         

        High Level Overview:

         

        The high level process is as follows:

        • Import Management Packs
        • Create a resource pool for monitoring Unix/Linux servers
        • Configure the Xplat certificates (export/import) for each management server in the pool.
        • Create and Configure Run As accounts for Unix/Linux.
        • Discover and deploy the agents

         

         

        Import Management Packs:

         

The core Unix/Linux libraries are already imported when you install OpsMgr 2012, but not the detailed MPs for each OS version.  These are on the installation media, in the \ManagementPacks directory.  Import the specific ones for the Unix or Linux operating systems that you plan to monitor.

         

         

        Create a resource pool for monitoring Unix/Linux servers

The FIRST step is to create a Unix/Linux monitoring resource pool.  This pool will be associated with management servers dedicated to monitoring Unix/Linux systems in larger environments, or may include existing management servers that also manage Windows agents or gateways in smaller environments.  Regardless, it is a best practice to create a new resource pool for this purpose, as it will ease administration and future scalability expansion.

        Under Administration, find Resource Pools in the console:

        image

         

        OpsMgr ships 3 resource pools by default:

        image

         

        Let’s create a new one by selecting “Create Resource Pool” from the task pane on the right, and call it “Unix Linux Monitoring Resource Pool”

         

        image

         

        Click Add and then click Search to display all management servers.  Select the Management servers that you want to perform Unix and Linux Monitoring.  If you only have 1 MS, this will be easy.  For high availability – you need at least two management servers in the pool.

         

        Add your management servers and create the pool.  In the actions pane – select “View Resource Pool Members” to verify membership.

         

        image

         

         

        Configure the Xplat certificates (export/import) for each management server in the pool

        This process is documented here:  http://technet.microsoft.com/en-us/library/hh287152.aspx

        Operations Manager uses certificates to authenticate access to the computers it is managing. When the Discovery Wizard deploys an agent, it retrieves the certificate from the agent, signs the certificate, deploys the certificate back to the agent, and then restarts the agent.

        To configure high availability, each management server in the resource pool must have all the root certificates that are used to sign the certificates that are deployed to the agents on the UNIX and Linux computers. Otherwise, if a management server becomes unavailable, the other management servers would not be able to trust the certificates that were signed by the server that failed.

We provide a tool to handle the certificates, named scxcertconfig.exe.  Essentially, you must log on to EACH management server that will be part of a Unix/Linux monitoring resource pool and export its SCX (cross-plat) certificate to a file share.  Then import each other's certificates so they are trusted.

If you only have a SINGLE management server, or only a single management server in your pool, you can skip this step and perform it later if you ever add management servers to the Unix/Linux monitoring resource pool.

         

        In this example – I have two management servers in my Unix/Linux resource pool, MS1 and MS2.  Open a command prompt on each MS, and export the cert:

        On MS1:

        C:\Program Files\System Center 2012\Operations Manager\Server>scxcertconfig.exe -export \\servername\sharename\MS1.cer

        On MS2:

        C:\Program Files\System Center 2012\Operations Manager\Server>scxcertconfig.exe -export \\servername\sharename\MS2.cer

        Once all certs are exported, you must IMPORT the other management server’s certificate:

        On MS1:

C:\Program Files\System Center 2012\Operations Manager\Server>scxcertconfig.exe -import \\servername\sharename\MS2.cer

        On MS2:

C:\Program Files\System Center 2012\Operations Manager\Server>scxcertconfig.exe -import \\servername\sharename\MS1.cer

        If you fail to perform the above steps – you will get errors when running the Linux agent deployment wizard later.
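The pattern generalizes to any pool size: every management server must import the exported certificate of every OTHER server in the pool.  A small illustrative Python sketch (server names hypothetical, not OpsMgr tooling) of who imports what:

```python
def import_plan(servers):
    """For each management server, list the exported certs it must import
    (everyone else's), so all pool members trust each other's signatures."""
    return {ms: [f"{other}.cer" for other in servers if other != ms]
            for ms in servers}

for ms, certs in sorted(import_plan(["MS1", "MS2", "MS3"]).items()):
    print(ms, "imports", certs)
# MS1 imports ['MS2.cer', 'MS3.cer'], MS2 imports ['MS1.cer', 'MS3.cer'], etc.
```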

         

         

        Create and Configure Run As accounts for Unix/Linux

         

        Next up we need to create our run-as accounts for Linux monitoring.   This is documented here:  http://technet.microsoft.com/en-us/library/hh212926.aspx

         

        We need to select “UNIX/Linux Accounts” under administration, then “Create Run As Account” from the task pane.  This kicks off a special wizard for creating these accounts.

        image

         

        image

         

Let's create the Monitoring account first.  Give the monitoring account a display name, and click Next.

         

        image

         

        On the next screen, type in the credentials that you want to use for monitoring the Linux system(s).

         

        image

         

        On the above screen – you have two choices.  You can provide a privileged account for handling monitoring, or you can use an existing account on the Linux system(s) that is not privileged.  Then – you can specify whether or not you want this account to be able to leverage sudo elevation.  Since I am providing a privileged account in this case – I will tell it to not use elevation.

        On the next screen, always choose more secure:

        image

         

        Now – since we chose More Secure – we must choose the distribution of the Run As account.  Find your “Linux Monitoring Account” under the UNIX/Linux Accounts screen, and open the properties.  On the Distribution Security screen, click Add, then select "Search by resource pool name” and click search.  Find your Unix/Linux monitoring resource pool, highlight it, and click Add, then OK.  This will distribute this account credential to all Management servers in our pool:

         

        image

         

We would repeat the above process as many times as necessary for the number of different accounts we need.  If all our Linux systems use the same credentials, then we need, at a minimum, ONE monitoring account that is privileged, and it can be associated with all three Run As profiles (covered in the next section).

        However, what would be more typical, if all our systems had the same credentials and passwords, is to use THREE Run As accounts:

• One for Unprivileged monitoring (do not use elevation)
• One for Privileged monitoring, using EITHER a privileged account (do not use elevation) OR an unprivileged account using sudo (use elevation)
• One for Agent Maintenance, using EITHER a privileged account (do not use elevation) OR an unprivileged account using sudo (use elevation)

        For the purposes of this demo, I am just going to create a SINGLE priv Run As account (root) that I will use for all three scenarios.

         

        Next up – we must configure the Run As profiles.  This is covered here:  http://technet.microsoft.com/en-us/library/hh212926.aspx

         

        There are three profiles for Unix/Linux accounts:

         

        image

         

The agent maintenance account is strictly for agent updates, uninstalls, and anything else that requires SSH.  This will always be associated with a privileged account that has access via SSH, created using the Run As account wizard above but selecting “Agent Maintenance Account” as the account type.  We won't go into details on that here.

        The other two Profiles are used for Monitoring workflows.  These are:

        Unix/Linux Privileged account

        Unix/Linux Action Account

The Privileged Account profile will always be associated with a Run As account like we created above that is privileged (root or similar), OR an unprivileged account that has been configured with elevation via sudo.  This is what any workflows that typically require elevated rights will execute as.

        The Action account is what all your basic monitoring workflows will run as.  This will generally be associated with a Run As account, like we created above, but would be used with a non-privileged user account on the Linux systems.

        ***A note on sudo elevated accounts:

        • sudo elevation must be passwordless.
• requiretty must be disabled for the user.
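As an illustration only, a sudoers fragment meeting both requirements might look like the following; the account name "scomuser" is hypothetical, and the TechNet wiki linked later in this post has tested per-OS samples:

```
# /etc/sudoers fragment - illustrative only; "scomuser" is a hypothetical account
Defaults:scomuser !requiretty
scomuser ALL=(root) NOPASSWD: ALL
```

In production you would restrict the NOPASSWD entry to the specific commands the agent needs rather than ALL.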

         

For my example, I am keeping it very simple.  I created a single Run As account of the Monitoring type, using the privileged root account and password credential.  I will associate this Run As account with BOTH the Privileged and Action account profiles.  This will make all my workflows (both normal monitoring and elevated monitoring) run under this credential.  This is not recommended as a “lowest priv” design, but it is used in this example just to keep things simple.  Once we validate it is working, we can go back, change this configuration, and experiment with low-priv and sudo-enabled elevation accounts, associating them independently.

        For more information on configuring sudo elevation for OpsMgr monitoring accounts, including some sample configurations for your sudoers files for each OS version:  http://social.technet.microsoft.com/wiki/contents/articles/7375.configuring-sudo-elevation-for-unix-and-linux-monitoring-with-system-center-2012-operations-manager.aspx

         

        I will start with the Unix/Linux Action Account profile.  Right click it – choose properties, and on the Run As Accounts screen, click Add, then select our “Linux Monitoring Account”.  Leave the default of “All Targeted Objects” and click OK, then save.

        Repeat this same process for the Unix/Linux Privileged Account profile.

        Repeat this same process for the Unix/Linux Agent Maintenance Account profile.

         

         

        Discover and deploy the agents

         

        Run the discovery wizard.

         

        image

         

        Click “Add”:

        image

         

Here you will type in the FQDN of the Linux/Unix agent, its SSH port, and then choose All Computers as the discovery type.  (There is another discovery type option for when you manually install the Unix/Linux agent, which is really just a simple provider, and then use a signed certificate to authenticate.)

         

        Now – hit “Set Credentials”.  If we do not want to provide a root account here, and wanted to use SSH key authentication, we support that on this screen now.  For this example – I will simply type in my root account in order to use SSH to discover and deploy the Linux agent.

         

        image

         

        Notice above that you can tell the wizard if the account is privileged or not.  Here is an explanation:

        • A privileged account is a user account that has root-level access, including access to security logs and read, write, and execute permissions for the directories in which the Operations Manager agent is installed.
        • An unprivileged account is a normal user account that does not have root-level access or special permissions. However, an unprivileged account allows monitoring of system processes and of performance data.

        If you have to discover only UNIX and Linux computers that already have an agent installed, rather than installing an agent, you can use an unprivileged user account on the UNIX or Linux computer. If you have to install an agent, you must use a privileged account. If you do not have a privileged account, you can elevate an unprivileged account to a privileged account provided that the su or sudo elevation program has been configured on the UNIX or Linux computer for the user account.

         

        So – if we had pre-installed the agent already – we could simply use an unprivileged account to authenticate and discover the system, bringing it into OpsMgr.

        Or – we could provide an unprivileged account that was allowed elevation via a pre-existing sudo configuration on the Linux server.

         

        image

         

        Click save.  On the next screen – select a resource pool.  We will choose the resource pool that we already created.

         

        image

         

        Click Discover, and the results will be displayed:

         

        image

         

        Check the box next to your discovered system – and deploy the agent.

         

        image

         

This will take some time to complete: the agent is checked for the correct FQDN and SSL certificate, the management servers are inspected to ensure they all have trusted SCX certificates (the ones we exported/imported above), the connection is made over SSH, the package is copied down and installed, and the final certificate signing occurs.  If all of these checks pass, we get a success!

         

        There are several things that can fail at this point.  See the troubleshooting section at the end of this article.

         

         

        Monitoring Linux servers:

         

Assuming we got all the way to this point with a successful discovery and agent installation, we need to verify that monitoring is working.  After an agent is deployed, the Run As accounts will be used to run discoveries and begin monitoring.  Once enough time has passed, check the Administration pane, under Unix/Linux Computers, and verify that the systems are not listed as “Unknown” but are discovered as a specific version of the OS:

         

        image

         

Next – go to the Monitoring pane and select the “Unix/Linux Computers” view at the top.  Check that your systems are present and have a green healthy check mark next to them:

         

        image

         

        Next – expand the Unix/Linux Computers folder in the left tree (near the bottom) and make sure we have discovered the individual objects, like Linux Server State, Linux Disk State, and Network Adapter state:

         

        image

         

        Run Health explorer on one of the discovered disks.  Remove the filter at the top to see all the monitors for the disk:

         

        image

         

        Close health explorer. 

        Select the Operating System Performance view.   Review the performance counters we collect out of the box for each monitored OS.

         

        image

         

        Out of the box – we discover and apply a default monitoring template to the following objects:

        • Operating System
        • Logical disk
        • Network Adapters

        Optionally, you can enable discoveries for:

        • Individual Logical Processors
        • Physical Disks

        I don’t recommend enabling additional discoveries unless you are sure that your monitoring requirements cannot be met without discovering these additional objects, as they will reduce the scalability of your environment.

         

        Out of the box – for an OS like RedHat Enterprise Linux 5 – here is a list of the monitors in place, and the object they target:

         

        image

         

There are also 50 rules enabled out of the box.  46 are performance collection rules for reporting, and 4 are event-based rules dealing with security.  Two are informational, letting you know whenever a direct login is made using root credentials via SSH and whenever su elevation occurs in a user session.  The other two deal with failed attempts at SSH or su.

         

        To get more out of your monitoring – you might have other services, processes, or log files that you need to monitor.  For that, we provide Authoring Templates with wizards to help you add additional monitoring, in the Authoring pane of the console under Management Pack templates:

         

        image

         

        In the reporting pane – we also offer a large number of reports you can leverage, or you can always create your own using our generic report templates, or custom ones designed in Visual Studio for SQL reporting services.

         

        image

         

As you can see, it is a fairly well-rounded solution, bringing Unix and Linux monitoring into the same single pane of glass as your other systems, from the hardware to the operating system to the network layer to the applications.

        Partners and 3rd party vendors also supply additional management packs which extend our Unix and Linux monitoring, to discover and provide detailed monitoring on non-Microsoft applications that run on these Unix and Linux systems.

         

         

        Troubleshooting:

         

        The majority of troubleshooting comes in the form of failed discovery/agent deployments.

         

        Microsoft has written a wiki on this topic, which covers the majority of these, and how to resolve:

        http://social.technet.microsoft.com/wiki/contents/articles/4966.aspx

         

• For instance – if the DNS name that you provided does not match the DNS hostname on the Linux server or its SSL certificate, or if you failed to export/import the SCX certificates for multiple management servers in the pool, you might see:

         

        image

         

        Agent verification failed. Error detail: The server certificate on the destination computer (rh5501.opsmgr.net:1270) has the following errors:
        The SSL certificate could not be checked for revocation. The server used to check for revocation might be unreachable.

        The SSL certificate is signed by an unknown certificate authority.
        It is possible that:
        1. The destination certificate is signed by another certificate authority not trusted by the management server.
        2. The destination has an invalid certificate, e.g., its common name (CN) does not match the fully qualified domain name (FQDN) used for the connection. The FQDN used for the connection is: rh5501.opsmgr.net.
        3. The servers in the resource pool have not been configured to trust certificates signed by other servers in the pool.


         

        The solution to these common issues is covered in the Wiki with links to the product documentation.

         

        • Perhaps – you failed to properly configure your Run As accounts and profiles.  You might see the following show as “Unknown” under administration:

         

        image

         

        Or you might see alerts in the console:

         

        Alert:  UNIX/Linux Run As profile association error event detected

        The account for the UNIX/Linux Action Run As profile associated with the workflow "Microsoft.Unix.AgentVersion.Discovery", running for instance "rh5501.opsmgr.net" with ID {9ADCED3D-B44B-3A82-769D-B0653BFE54F9} is not defined. The workflow has been unloaded. Please associate an account with the profile.

        This condition may have occurred because no UNIX/Linux Accounts have been configured for the Run As profile. The UNIX/Linux Run As profile used by this workflow must be configured to associate a Run As account with the target.

        Either you failed to configure the Run As accounts, or failed to distribute them, or you chose a low priv account that is not properly configured for sudo on the Linux system.  Go back and double-check your work there.

         

If you want to check whether the agent was deployed to a RedHat system, you can run the following command in a shell session:

        image

      • Hyper-V, Live Migration, and the upgrade to 10 gigabit Ethernet

         

        My lab consists of 2 Dell Precision T7500 workstations, each configured with 96GB of RAM.  These are each nodes in a Hyper-V 2012 cluster.  They mount cluster shared volumes via iSCSI, some are SSD, and some are SAS RAID based disks, from a 3rd Dell Precision Workstation.

One of the things I have experienced is that when I want to patch the hosts, I pause the node and drain the roles.  This kicks off a live migration of all the VMs on Node1 to Node2.  This can take a substantial amount of time, as these VMs are consuming around 80GB of memory. 

        image

         

        image

         

When performing a full live migration of these 18 VMs across a single 1GbE connection, the Ethernet link was 100% saturated, and it took exactly 13 minutes and 15 seconds.

         

I recently got a couple of 10 gigabit Ethernet cards for my lab environment.  I scored an awesome deal on eBay: 10 cards for $250, or $25 for each Dell/Broadcom 10GbE card!  The problem I have now is that the CHEAPEST 10GbE switch on the market is $850.  No way am I paying that for my lab.  The good news is that these cards, just like 1GbE cards, support direct-connect auto MDI/MDIX detection, so you can form an old-school “crossover” connection using just a standard patch cable.  I did order a CAT6A cable just to be safe.

Once I installed and configured the new 10GbE cards, I set them up in the cluster as a Live Migration network:

        image

        image

         

        image

         

         

         

        image

         

         

The same live migration over 10GbE took 65 SECONDS!

         

         

        In summary -

         

1GbE live migration, 18 VMs:  13m 15s.

10GbE live migration, 18 VMs:  65 seconds.

In my case, I can drastically decrease the live migration time, with minimal cost, by using a direct 10 gigabit Ethernet connection between two hosts in a cluster.   Aidan Finn, MVP, has a post with similar results:  http://www.aidanfinn.com/?p=12228
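Those numbers line up with a back-of-envelope calculation: roughly 80 GB of VM memory moved at line rate.  A quick sketch (my own arithmetic, ignoring protocol overhead and the fact that links rarely sustain 100%):

```python
def transfer_seconds(gigabytes, link_gbps):
    """Ideal time to move `gigabytes` of data over a `link_gbps` Gb/s link."""
    gigabits = gigabytes * 8  # bytes to bits
    return gigabits / link_gbps

print(transfer_seconds(80, 1))   # 640.0 s (~10.7 min) - same ballpark as 13m15s
print(transfer_seconds(80, 10))  # 64.0 s - same ballpark as the observed 65 s
```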

         

         

        Next up, I wanted to create a “converged” network, by carving up my 10GBe NIC into multiple virtual NIC’s, by connecting it to the Hyper-V virtual switch, and then create virtual adapters.  Aidan has a good write-up on the concept here:  http://www.aidanfinn.com/?p=12588

         

        Here is a graphic that shows the concept from his blog:

        image

         

        The supported network configuration guide for Hyper-V clusters is located here:

        http://technet.microsoft.com/en-us/library/ff428137(v=WS.10).aspx

         

Typically in the past, you would see 4 NICs: one each for management, cluster, live migration, and virtual machines.  The common alternative is to use a single 10GbE NIC (or two in a highly available team), then use virtual network adapters on a Hyper-V switch, and QoS to carve up weighting.  In my case, I have a dedicated NIC for management (the parent partition/OS) and a dedicated NIC for Hyper-V virtual machines.  I want to connect my 10GbE NIC to a Hyper-V virtual switch and then create two virtual network adapters, one for Live Migration and one for Cluster/CSV communication, both carved out of the single 10GbE NIC.

         

        We will be using the QoS guidelines posted at:  http://technet.microsoft.com/en-us/library/jj735302.aspx

        John Savill has also done a nice quick walkthrough of a similar configuration:  http://savilltech.com/blog/2013/06/13/new-video-on-networking-for-windows-server-2012-hyper-v-clusters/

         

When I start, my current network configuration looks like this:

        image

         

        We will be attaching the 10GbE network adapter to a new Hyper-V switch, and then creating two virtual network adapters, then applying QoS to each in order to ensure that both channels have their sufficient required bandwidth in the case of contention on the network.

         

        Open PowerShell.

        To get a list of the names of each NIC:

        Get-NetAdapter

        To create the new switch, with bandwidth weighting mode:

New-VMSwitch "ConvergedSwitch" -NetAdapterName "10GBE NIC" -MinimumBandwidthMode Weight -AllowManagementOS $false

        To see our new virtual switch:

        Get-VMSwitch

         

        You will also see this in Hyper-V manager:

         

        image

         

        Next up, Create a virtual NIC in the management operating system for Live Migration, and connect it to the new virtual switch:

Add-VMNetworkAdapter -ManagementOS -Name "LM" -SwitchName "ConvergedSwitch"

        Create a virtual NIC in the management operating system for Cluster/CSV communications, and connect it to the new virtual switch:

Add-VMNetworkAdapter -ManagementOS -Name "Cluster" -SwitchName "ConvergedSwitch"

        View the new virtual network adapters in powershell:

Get-VMNetworkAdapter -All

        View them in the OS:

        image

         

        Assign a minimum bandwidth weighting to give QoS to both virtual NICs, applying heavier weighting to Live Migration in the case of contention on the network:

        Set-VMNetworkAdapter -ManagementOS -Name "LM" -MinimumBandwidthWeight 90
        Set-VMNetworkAdapter -ManagementOS -Name "Cluster" -MinimumBandwidthWeight 10

        Set the weightings so that the total of all VMNetworkAdapters on the switch equals 100.  The configuration above will (roughly) allow ~90% for the LM network and ~10% for the Cluster network.
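To make the weighting math concrete, here is a small illustrative sketch (Python, not Hyper-V tooling – the function name and the 10 Gbps link speed are assumptions for this example) of how weights translate into guaranteed floors under contention:

```python
# Illustrative only: how MinimumBandwidthWeight translates into a guaranteed
# share under contention. The function and the 10 Gbps figure are assumptions
# for this example, not Hyper-V APIs.

def guaranteed_mbps(weights, link_speed_mbps):
    """Each vNIC's guaranteed floor is its weight divided by the
    total of all weights on the switch, times the link speed."""
    total = sum(weights.values())
    return {name: link_speed_mbps * w / total for name, w in weights.items()}

shares = guaranteed_mbps({"LM": 90, "Cluster": 10}, 10_000)
print(shares)  # {'LM': 9000.0, 'Cluster': 1000.0}
```

Note these are minimums, not caps: when the network is idle, either vNIC can still burst to full line rate.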

        To view the bandwidth settings of each virtual NIC:

        Get-VMNetworkAdapter -All | fl

         

        At this point, I need to assign IP address information to each virtual NIC, and then repeat this configuration on all nodes in my cluster.

         

        After this step is completed, and you confirm that you can ping each other’s interfaces, you can configure the networks in Failover Cluster Administrator.  Rename each network appropriately, and configure Live Migration and Cluster communication settings:

         

         

        image

         

        image

        In the above picture – I don’t allow cluster communication on the live migration network – but this is optional and you certainly can allow that if the primary cluster communication fails.

         

         

        image

        image

         

         

        Test Live Migration and ensure performance and communications are working properly.

         

        In Summary – here is all the PowerShell used:

        Get-NetAdapter
        New-VMSwitch "ConvergedSwitch" -NetAdapterName "10GBE NIC" -MinimumBandwidthMode Weight -AllowManagementOS $false
        Get-VMSwitch
        Add-VMNetworkAdapter -ManagementOS -Name "LM" -SwitchName "ConvergedSwitch"
        Add-VMNetworkAdapter -ManagementOS -Name "Cluster" -SwitchName "ConvergedSwitch"
        Get-VMNetworkAdapter -All | fl
        Set-VMNetworkAdapter -ManagementOS -Name "LM" -MinimumBandwidthWeight 90
        Set-VMNetworkAdapter -ManagementOS -Name "Cluster" -MinimumBandwidthWeight 10

         

        This configuration worked.  HOWEVER, it did expose a limitation.  I noticed that using vNICs I was only able to sustain about 3Gbps on live migrations, where I was achieving nearly 10Gbps before.  This is due to the fact that RSS is not exposed to virtual NICs on the host/management partition, which owns the live migration networks.  When using these virtual NICs to transfer a data stream from host to host, you will see a single CPU core pegged, as it manages all the traffic in this scenario.

         

        Here is the maximum traffic that could be sent using this configuration on my server:

         

         

        image

         

         

        Below you will see the single core that was pegged during the live migration:

         

        image

         

         

        If you are using a converged network design, this still might be acceptable, as some of the bandwidth will be needed for all the VMs on the host, some for management and client access traffic, and some for CSV and cluster communications.  However, if you want a design with high speed live migrations, you should plan on using physical NICs for Live Migration and for CSV (in the case of redirected IO).  These can use teaming for redundancy, but it is better to use SMB Multichannel in Server 2012 R2, as live migration will leverage SMB advanced features like Multichannel and RDMA (SMB Direct).

      • Getting and keeping the SCOM agent on a Domain Controller – how do YOU do it?

        I’d like to hear some community feedback on this….

         

        In OpsMgr – deploying a SCOM agent to a DC often presents companies with a bit of a challenge.  The reason is – in order to install software to a DC and manage it – we need rights on the DC to accomplish this.  These rights are needed, anytime we are going to deploy an agent, hotfix an agent, or run a repair on a broken agent to keep the agent healthy.

        When we push agents from the console, the default account used to perform the push is the Management Server Action Account.  If this account does not have Domain Admin rights – the push will fail to a DC, with an Access Denied.  We do allow the option to type in temporary (encrypted) credentials, which are used to deploy the agent, one time, and then are discarded.  See the image below:

        clip_image002

         

        Here is a list of the most common options I have observed in place at customer sites… and potential custom options that can be developed.  I’d be interested in any community feedback on options you are using that I don't cover or haven't seen before.

         

         

        1. Grant the Management Server Action account Domain Admin or Builtin\Administrators.

        a. Not recommended as a best practice, this gives rights to the MSAA that are not required for day to day activities.

        b. Con - SCOM Admins now control a domain admin account.

         

        2. Grant a SCOM Administrator a special domain account, for this purpose, that is a domain admin.

        a. This allows us to track the actions of that SCOM admin, when he/she uses that special privileged account.

        b. That SCOM admin will be able to do repairs, hotfixes, and deployments for DC’s.

        c.  Con – Domain Admin teams often won't delegate these rights, as they are tightly controlled.

         

        3. The SCOM admin team delegates console based agent management to a Domain Administrator for DC agent health.

        a.  The domain admin must become a SCOM Admin, and therefore could potentially hurt the SCOM environment.

        b.  Pro – the admins in charge of the DC’s now have full responsibility to keep the agents healthy.

        c.  Con – the Domain Admins might not understand components of SCOM, and create something that impacts the monitoring environment.

         

        4. The SCOM admin team must partner with the Domain Admin team, and have the Domain Administrator type in his credentials any time the SCOM administrator needs to deploy/hotfix/repair an agent on a domain controller.

        a. This is a bit more labor intensive… because the SCOM admin must wait for a domain admin to be available to work on DC agents, but tight security boundaries are maintained.

         

        5. All DC based agents will be manually installed/updated/repaired.

        a. This is very common, when the two teams do not trust each other.  The Domain Admin team is now required to manually deploy agents to domain controllers, and keep them up to date, and healthy.

         

        6. Use a software deployment tool already in place to deploy/update/repair agents.

        a. If a software deployment tool is already in place on DC’s, like SMS/SCCM, you can create packages to deploy, hotfix, and repair agents, similar to your patching of the OS today.

         

        7. Customized solution:  Create a Run-As account that is a domain admin, one time, for use in agent deployment/repair.

        a. This involves the domain admin typing in credentials ONCE, into a RUN-AS account, which is stored securely and encrypted in the SCOM database. 

        b. This run-as account can be associated with a run-as profile, which is used by a custom task, which will remotely deploy the agent to the domain controller.  This task will execute under the security context of the privileged run-as account.

        c. The benefit is that the domain admin gets to control the password for this account, the SCOM admin does not need to know the account credentials.

        d. The downside is that this run-as account could potentially be leveraged by some other workflow if a SCOM admin intentionally misused it… similar to solution #2 above.

        e.  This is just an idea I had – curious if anyone has already developed a solution like this?

      • Using OpsMgr for intrusion detection and security hardening

        Here is an interesting little concept of how to use OpsMgr.

        Because I have a lab, that is exposed to the internet over port 3389, I get a LOT of hacking attempts on this lab.  Mostly the source is from bots running on other compromised systems.  These bots just do brute force attacks against the typical Admin accounts and passwords via RDP.  In this article, I am going to show how OpsMgr can not only alert on this condition, but also respond by configuring the Windows Firewall to block these attacks.

         

        I will start by analyzing the Server 2008 event that occurs when someone tries to attack using my “Administrator” account:

         

        Log Name:          Security
        Source:              Microsoft-Windows-Security-Auditing
        Date:                  7/14/2009 12:44:05 PM
        Event ID:            4625
        Task Category:   Account Lockout
        Level:                  Information
        Keywords:          Audit Failure
        User:                   N/A
        Computer:           terminalserver.domain.com

        Description:   An account failed to log on.

        Subject:
            Security ID:             SYSTEM
            Account Name:        TERMINALSERVER$
            Account Domain:     DOMAIN
            Logon ID:                 0x3e7

        Logon Type:            10

        Account For Which Logon Failed:
            Security ID:             NULL SID
            Account Name:        administrator
            Account Domain:     TERMINALSERVER

        Failure Information:
            Failure Reason:        Account locked out.
            Status:                      0xc0000234
            Sub Status:               0x0

        Process Information:
            Caller Process ID:          0x14f0
            Caller Process Name:    C:\Windows\System32\winlogon.exe

        Network Information:
            Workstation Name:    TERMINALSERVER
            Source Network Address:    10.10.10.1
            Source Port:        1261

        Detailed Authentication Information:
            Logon Process:           User32
            Authentication Package:    Negotiate
            Transited Services:    -
            Package Name (NTLM only):    -
            Key Length:        0

         

        So… for starters, I want to alert on this condition… when ANYONE is trying multiple times to RDP into the server with a disabled account, a non-existent account, or a valid account but a bad password.  Therefore – I will create a monitor:  Windows Events > Repeated Event Detection > Timer Reset.

        The idea here is to only respond when multiple bad passwords are entered in a short time period…. representing an attack.  (I don't want to lock out or block access from my normal users who sometimes mis-type their password on a couple attempts.)

        So I create the monitor, target “Windows Server Operating System”, set it to “Security” for the Parent Monitor, and UNCHECK the box enabling it.  (I will later override this monitor and ONLY enable it for my entry terminal server.)

        I create my event expression for the security event log, event 4625, and I only want the Logon Type of 10, which is from RDP:

         

        image

         

         

        Next – I will set up my monitor, to Trigger on Count (of events), Sliding.  Compare count will be set to 5 (events) within a 3 minute interval.  Therefore, as soon as 5 events are captured, in ANY sliding 3 minute “window”, the monitor will change state.
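To illustrate the "Trigger on Count, Sliding" behavior described above, here is a minimal Python sketch of the logic (this is not how OpsMgr implements it internally; the class name is an assumption for this example):

```python
# Illustrative sketch: fire when N events fall inside any sliding window.
# Not OpsMgr internals - just the behavior of a sliding-count trigger.
from collections import deque

class SlidingCountTrigger:
    def __init__(self, count=5, window_seconds=180):
        self.count = count          # events required (5)
        self.window = window_seconds  # window size (3 minutes)
        self.times = deque()

    def record(self, timestamp):
        """Record one event; return True when the threshold is tripped."""
        self.times.append(timestamp)
        # Drop events that have slid out of the window.
        while self.times and timestamp - self.times[0] > self.window:
            self.times.popleft()
        return len(self.times) >= self.count

trigger = SlidingCountTrigger()
hits = [trigger.record(t) for t in (0, 30, 60, 90, 120)]
print(hits)  # the fifth event inside 3 minutes trips the monitor
```

Five bad passwords spread over more than three minutes never trip it, which is exactly why normal users who occasionally mistype are left alone.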

         

        image

         

        Next… since my goal is really to execute a script/command/response (not to drive a state change), I will set the timer reset to reset the state back to healthy after 2 minutes.  This frees the workflow up to block any other source IPs which might attack soon after.

         

        image

         

        I don't want to impact availability data, which assumes critical state = unavailable…. so I will use a Warning State:

         

        image

         

        Now – I will enable a unique alert for this condition.  I want a critical, high priority alert in this case, and I will set this NOT to close the alert when we auto-resolve the state on the timer.  I will also customize the alert description, to give me a richer alert based on the event details and my custom response.  I talk more about these event parameters HERE.   I will be adding:

         

        $Data/Context/Context/DataItem/Params/Param[6]$ typed a bad password accessing directly from computer: $Data/Context/Context/DataItem/Params/Param[14]$ from IP: $Data/Context/Context/DataItem/Params/Param[20]$
        The Windows Firewall will be modified to block this IP address in response to this monitor state.

         

        image

         

         

        Next – I will go back and find my monitor, and add a Recovery for the Warning State:

         

        image 

         

        I will choose to Run Command.  Give it a name “Modify Windows Firewall”

         

        image

         

        Next – for the command – I am going to run Netsh.exe which can configure the Windows Firewall running on the terminal server.  Here is the command:

         

        C:\Windows\System32\netsh.exe

        advfirewall firewall set rule name="Block RDP" new remoteip=$Data/StateChange/DataItem/Context/DataItem/Context/DataItem/Params/Param[20]$

         

        $Data/StateChange/DataItem/Context/DataItem/Context/DataItem/Params/Param[20]$ is based on an Event Parameter of the Server 2008 event, which I will pass to the command.  It gathers the IP address of the attacker and passes it to the command which configures the firewall rule.  Getting this variable was the most complicated part for me…  Marius talked about how to derive this variable HERE.  Just understand that the variables you use in an alert description are not the same as those used in a diagnostic or recovery.

         

        image

         

        Cool:

         

        image

         

         

        My Netsh.exe command modifies an existing custom rule in the Windows Firewall, so I need to make sure I create that and name it “Block RDP”.
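To show how the pieces above fit together, here is an illustrative Python sketch (not part of OpsMgr – the helper function is an assumption for this example) of how the recovery's argument string is assembled from the event parameter:

```python
# Hypothetical helper, for illustration only: the attacker's IP arrives as
# event parameter 20 of event 4625 and is spliced into the remoteip of the
# pre-created "Block RDP" firewall rule.

def build_netsh_args(source_ip, rule_name="Block RDP"):
    """Return the argument string the recovery passes to netsh.exe."""
    return f'advfirewall firewall set rule name="{rule_name}" new remoteip={source_ip}'

# The Source Network Address from the sample event in this post:
print(build_netsh_args("10.10.10.1"))
# advfirewall firewall set rule name="Block RDP" new remoteip=10.10.10.1
```

Note that `set rule … new remoteip=` replaces the rule's remote address list rather than appending to it, so each new attacker overwrites the previously blocked IP.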

        Now – I will override this monitor and enable it for my published terminal server, and then test it… by attempting to log into my terminal server via RDP 5 times in a short period, using a disabled account.  Each attempt will log the event in the security event log, and eventually trip the repeated event detection monitor.

         

        Alert generates:

        image

         

        Monitor changes state:

        image

         

        Recovery runs:

         

        image

         

        Windows Firewall rule gets modified:

         

        image

         

        Attack is stopped.

        Pretty cool, eh? 

      • Fixing troubled agents

        Sometimes agents either will not “talk” to the management server upon initial installation, and sometimes an agent can get unhealthy long after working fine.  Agent health is an ongoing task of any OpsMgr Admin’s life.

        This post is NOT an “end to end” manual of all the factors that influence agent health… but that is something I am working on for a later time.  There are so many factors in an agent’s ability to communicate and work as expected.  A few key areas that commonly affect this are:

        • DNS name resolution (Agent to MS, and MS to Agent)
        • DNS domain membership (disjointed)
        • DNS suffix search order
        • Kerberos connectivity
        • Kerberos SPN’s accessible
        • Firewalls blocking 5723
        • Firewalls blocking access to AD for authentication
        • Packet loss
        • Invalid or old registry entries
        • Missing registry entries
        • Corrupt registry
        • Default agent action accounts locked down/out (HSLockdown)
        • HealthService Certificate configuration issues.
        • Hotfixes required for OS Compatibility
        • Management Server rejecting the agent

         

        How do you detect agent issues from the console?  The problem might be that they are not showing up in the console at all!  Perhaps they might be a manual install that never shows up in Pending Actions?  Or a push deployment, that stays stuck in Pending actions and never shows up under “Agent Managed”.  Or even one that does show up under “Agent Managed” but never shows as being monitored… returning agent version data, etc.

         

        One of the BEST things you can do when faced with an agent health issue… is to look on the agent, in the OperationsManager event log.  This is a fairly verbose log that will almost always give you a good hint as to the trouble with the agent.  That is ALWAYS one of my first steps in troubleshooting.

         

        Another way of examining Agent health – is by the built in views in OpsMgr.  In the console – there is a view – Located at the following:

         

        image

         

         

        This view is important – because it gives us a perspective of the agent from two different points:

        1.  The perspective of the agent monitors running on the agent, measuring its own “health”.

        2.  The perspective of the “Health Service Watcher”, which is the agent being monitored from a Management Server.

         

        If any of these are red or yellow – that is an excellent place to start.  This should be an area that your level 1 support for Operations Manager checks DAILY.  We should never have a high number of agents that are not green here.  If they aren't – this is indicative of an unhealthy environment, or the admin team not adhering to best practices (such as keeping up with hotfixes, using maintenance mode correctly, etc.).

        Use Health Explorer on these views – to drill down into exactly what is causing the Agent, or Health Service Watcher state to be unhealthy.

         

        Now…. the following are some general steps to take to “fix” broken agents.  These are not in definitive order.  The order of steps really comes down to what you find when looking at the logs after taking these steps.

         

        • Start the HealthService on the agent.  You might find the HealthService is just not running.  This should not be common or systemic.  Consider enabling the recovery for this condition to restart the HealthService on Heartbeat failure.  However – if this is systemic – it is indicative of something causing your HealthService to restart too frequently, or administrators stopping SCOM.  Look in the OpsMgr event log for verification.

         

        • Bounce the HealthService on the agent.  Sometimes this is all that is needed to resolve an agent issue.  Look in the OpsMgr event log after a HealthService restart, to make sure it is clean with no errors.

         

        • Clear the HealthService queue and config (manually).  This is done by stopping the HealthService, deleting the “\Program Files\System Center Operations Manager 2007\Health Service State” folder, and then starting the HealthService again.  This removes the agent config file and the agent queue files.  The agent starts up with no configuration, so it will resort to the registry to determine what management server to talk to.  From the registry, it will find out if it is AD integrated, or, if not, which fixed management server to talk to.  This is stored under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups\PROD1\Parent Health Services\, in the \<#>\NetworkName string value.  The agent will contact the management server – request config, receive config, download the appropriate management packs, apply them, run the discoveries, send up discovery data, and repeat the cycle for a little while.  This is very much what happens on a new agent during initial deployment.

         

        • Clear the HealthService queue and config (from the console).  When looking at the above view (or any state view or discovered inventory view which targets the HealthService or Agent class) there is a task in the actions pane - “Flush Health Service State and Cache”.  This performs a very similar action to the manual steps above… as a console task.  It will only work on an agent that is somewhat responsive… if it does not work, you need to perform the steps manually, as the agent is truly broken and cannot communicate with the management server.  This task will never complete and will not return success – because the task breaks off from itself as the queue is flushed.

         

        • “Repair” the agent from the console.  This is done from the Administration pane – Agent Managed.  You should not run a repair on any AD-integrated agent – as this will break the AD integration and assign it to the management server that ran the repair action.  A “repair” technically just reinstalls the agent in a push fashion, just like an initial agent deployment.  It will also apply/reapply any agent related hotfixes in the management server’s \Program Files\System Center Operations Manager 2007\AgentManagement\ directories.

         

        • Reinstall the agent (manually).  This would be for manual installs or when push/repair is not possible.  This section is where the combination of options gets a little tricky.  When you are at this point… where you have given up, I find just going all the way with a brute force reinstall is the best way.  This means performing the following steps:
          • Uninstall the agent via add/remove programs.
          • Run the Operations Manager Cleanup Tool CleanMom.exe or CleanMOM64.exe.  This is designed to make sure that the service, files, and all registry entries are removed.
          • Ensure that the agent’s folder is removed at:  \Program Files\System Center Operations Manager 2007\
          • Ensure that the following registry keys are deleted:
            • HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager
            • HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HealthService
          • Reboot the agent machine (if possible)
          • Delete the agent from Agent Managed in the OpsMgr console.  This will allow a new HealthService ID to be detected and is sometimes a required step to get an agent to work properly, although not always required.
          • Now that the agent is gone cleanly from both the OpsMgr console and the agent Operating System… manually reinstall the agent.  Keep it simple – install it using a named management server/management group, and use Local System for the agent action account (these remove any common issues with a low-priv domain account, and with AD integration if used).  If it works correctly – you can always reinstall again using low priv or AD integration.
          • Remember to import certificates at this point if you are using those on the individual agent.
          • As always – look in the OperationsManager event log…. this will tell you if it connected, and is working, or if there is a connectivity issue.

         

        To summarize… there are many things that can cause an agent issue, and many methods to troubleshoot.  At a very general level, my typical steps are:

        1. Review OpsMgr event log on agent
        2. Bounce HealthService
        3. Bounce HealthService clearing \Health Service State folder.
        4. Complete brute force reinstall of the agent.

        If an external issue is causing the problem (DNS, Kerberos, Firewall), then these steps likely will not help you… but those causes should be evident from the OpsMgr event log.

         

        Also – make sure you see my other posts on agent health and troubleshooting during deployment:

        Console based Agent Deployment Troubleshooting table

        Agent discovery and push troubleshooting in OpsMgr 2007

        Getting lots of Script Failed To Run alerts- WMI Probe Failed Execution- Backward Compatibility

        Agent Pending Actions can get out of synch between the Console, and the database

        Which hotfixes should I apply-

      • Using a recovery in OpsMgr - Basic

        This is a simple overview of using a recovery for a custom Monitor in OpsMgr

        Lets say we create a simple service monitor in OpsMgr... for this example - I will use the Print Spooler service:

        Create a new monitor, unit monitor, and choose windows services - Basic Service Monitor:

        image

        Choose an appropriate management pack to save it to... such as a Base OS custom rule MP you create.

        Give it a name - such as "Check Windows Spooler Service" and choose a valid target, such as "Windows Server"

        image

        Browse the service name - and pick the Print Spooler (Spooler):

        image

        Accept defaults for health, and let it create an alert, or not - depending on your requirements.

        Once the monitor is created.... open it up in the Authoring tab of the Ops console.  Choose the "Diagnostic and Recovery" tab.

        Under "Configure Recovery Tasks" add a recovery for Critical Health State.  Choose "Run Command" and click Next.

        Give the recovery a name.... such as "Restart service" and click Next.

        For the command line settings... we need to provide a path to the file we want to run.  For a simple service restart - we can use the "NET" command, as in "NET START (servicename)"  For the path - just specify the original executable - do not add any command line switches.... such as:  "%windir%\system32\net.exe"

        Under "Parameters" - this is where we will add the command line switches.... such as "start spooler" in this case:

        image

        Click "Create"  Click OK.

        Now - pick a managed agent - and stop the Spooler service.  This will create a state change for the monitor.  If you told the monitor to alert - it will also create an alert at this time.  As soon as the state change occurs, our recovery will run.... which should restart the service.

        Check the system event log to view the activity.  I got the following two events:

        Event Type:    Information
        Event Source:    Service Control Manager
        Event Category:    None
        Event ID:    7036
        Date:        3/26/2008
        Time:        1:24:44 AM
        User:        N/A
        Computer:    OMTERM
        Description:
        The Print Spooler service entered the stopped state.

        Event Type:    Information
        Event Source:    Service Control Manager
        Event Category:    None
        Event ID:    7036
        Date:        3/26/2008
        Time:        1:25:04 AM
        User:        N/A
        Computer:    OMTERM
        Description:
        The Print Spooler service entered the running state.

        So the service was down for about 20 seconds.... for the monitor to detect the unhealthy state, and then to run a recovery to restart the service.

        Open health explorer for the computer object for the test machine, and find the "Print Spooler Service Check" monitor.  It should show up as healthy... if the recovery worked.  Select this monitor, and then click the "State Change Events" tab.  We should see the service is running currently as the last logged state change.  Find the "Service is Not running" state change just below the current one.... and in the details pane - we should be able to see the recovery output where the recovery task ran automatically, and logged the output:

        image

        So what if we want a more advanced recovery?  Perhaps we have a service that just doesn't always start reliably on the first try.  Perhaps we want to try to start the service three times over a 3-minute period, and THEN create the alert?   This can be done… but it will have to be done using a custom script that provides this logic and then creates the alert, or creates an event from which a rule will alert.
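The retry-then-alert flow could be sketched like this (an illustrative Python sketch – in practice this would live in the custom script the recovery runs; `start_service` and `raise_alert` are hypothetical stand-ins for your own commands, e.g. a NET START call and an event-log write):

```python
# Illustrative retry-then-alert logic for a recovery script. start_service
# and raise_alert are hypothetical callables supplied by the caller.
import time

def recover_service(start_service, raise_alert, attempts=3, wait_seconds=60):
    """Try to start the service up to `attempts` times, waiting between
    tries; return True on success, otherwise alert and return False."""
    for attempt in range(1, attempts + 1):
        if start_service():
            return True
        if attempt < attempts:
            time.sleep(wait_seconds)
    raise_alert()  # only alert after every attempt has failed
    return False
```

The key point is that the alert fires only after the last failed attempt, so transient startup hiccups never page anyone.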

      • System Center Operations Manager SDK service failed to register an SPN

        System Center Operations Manager SDK service failed to register an SPN

         

         

        Have you seen this event in your RMS OpsMgr event logs?

         

        Event Type:      Warning

        Event Source:   OpsMgr SDK Service

        Event Category:            None

        Event ID:          26371

        Date:                12/13/2007

        Time:                2:58:24 PM

        User:                N/A

        Computer:         RMSCOMPUTER

        Description:

        The System Center Operations Manager SDK service failed to register an SPN. A domain admin needs to add MSOMSdkSvc/rmscomputer and MSOMSdkSvc/rmscomputer.domain.com to the servicePrincipalName of DOMAIN\sdkaccount

         

        This seems to appear in the RC1-SP1 build of OpsMgr.

         

        Every time the SDK service starts, it tries to update the SPN’s on the AD account that the SDK service runs under.  It fails, because by default, a user cannot update its own SPNs.  Therefore we see this error logged.
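The two SPNs named in the event are simply the service class prefixed to the short and fully qualified host names.  A trivial sketch (illustrative only – the function is an assumption for this example, not a product API):

```python
# Illustrative: build the two SPN forms the SDK service tries to register,
# matching the names shown in event 26371.

def sdk_spns(hostname, dns_domain, service_class="MSOMSdkSvc"):
    """Return the short and FQDN SPN forms for the SDK service account."""
    return [f"{service_class}/{hostname}",
            f"{service_class}/{hostname}.{dns_domain}"]

print(sdk_spns("rmscomputer", "domain.com"))
# ['MSOMSdkSvc/rmscomputer', 'MSOMSdkSvc/rmscomputer.domain.com']
```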

         

        If the SDK account is a domain admin – it does not fail – because a domain admin would have the necessary rights.  Obviously – we don’t want the SDK account being a domain admin…. That isn’t required nor is it a best practice.

         

        Therefore – to resolve this error, we need to allow the SDK service account rights to update the SPN.  The easiest way, is to go to the user account object for the SDK account in AD – and grant SELF to have full control.

         

        A better, more granular way – is to only grant SELF the right of modifying the SPN:

         

        • Run ADSIEdit as a domain admin.
        • Find the SDK domain account, right click, properties.
        • Select the Security tab, click Advanced.
        • Click Add.  Type “SELF” in the object box.  Click OK.
        • Select the Properties Tab.
        • Scroll down and check the “Allow” box for “Read servicePrincipalName” and “Write servicePrincipalName”
        • Click OK.  Click OK.  Click OK.
        • Restart your SDK service – if AD has replicated from where you made the change – all should be resolved.

         To check SPNs:

        The following command will show all the HealthService SPNs in the domain:

            Ldifde -f c:\ldifde.txt -t 3268 -d DC=DOMAIN,DC=COM -r "(serviceprincipalname=MSOMHSvc/*)" -l serviceprincipalname -p subtree

        To view the SPNs for a specific server:

            setspn -L servername

         

         

      • How grooming and auto-resolution work in the OpsMgr 2007 Operational database

        How Grooming and Auto-Resolution works in the OpsMgr 2007 Operations DB

         

         

        Warning – don’t read this if you are bored easily. 

         

         

        In a simplified view to groom alerts…..

         

        Grooming of the ops DB is called once per day at 12:00am… by the rule “Partitioning and Grooming”.  You can search for this rule in the Authoring space of the console, under Rules.  It is targeted at the “Root Management Server” and is part of the System Center Internal Library.

         

        It calls the “p_PartitioningAndGrooming” stored procedure, which calls p_Grooming, which calls p_GroomNonPartitionedObjects (Alerts are not partitioned) which inspects the PartitionAndGroomingSettings table… and executes each stored procedure.  The Alerts stored procedure in that table is referenced as p_AlertGrooming which has the following sql statement:

         

            SELECT AlertId INTO #AlertsToGroom

            FROM dbo.Alert

            WHERE TimeResolved IS NOT NULL

            AND TimeResolved < @GroomingThresholdUTC

            AND ResolutionState = 255

         

So…. the criteria for what is groomed are pretty simple:  the alert must be in a resolution state of “Closed” (255) and older than the 7-day default setting (or your custom setting referenced in the table above).
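If you want to see the current retention settings and when each grooming procedure last ran, you can query that settings table directly.  This is a read-only sketch against the OperationsManager database; the column names below are as I recall them from OpsMgr 2007 and may vary slightly by version, so verify them in your environment (a plain SELECT * works too):

```sql
-- Show each groomed object, its grooming stored procedure,
-- the retention threshold in days, and the last grooming run time.
-- Read-only; run against the OperationsManager database.
SELECT ObjectName,      -- table being groomed (e.g. Alert)
       IsPartitioned,   -- 1 = partitioned data (events/perf), 0 = not (alerts)
       GroomingSproc,   -- stored procedure that does the work (e.g. p_AlertGrooming)
       DaysToKeep,      -- retention threshold in days
       GroomingRunTime  -- last time grooming ran
FROM dbo.PartitionAndGroomingSettings WITH (NOLOCK)
```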

         

We won’t groom any alerts that are in New (0), or in any custom resolution states (custom ID #).  Those will have to be set to “Closed” (255) first…. either by auto-resolution when a monitor returns to healthy, by direct user interaction, by the built-in time-based auto-resolution mechanism, or by your own custom script.

         

        Ok – that covers grooming.

         

        However – I can see that brings up the question – how does auto-resolution work?

         

         

         

         

The auto-resolve setting in the global settings UI specifically states “alerts in the new resolution state”.  I don’t think that is completely correct:

         

Auto-resolution is driven by the rule “Alert Auto Resolve Execute All”, which runs p_AlertAutoResolveExecuteAll once per day at 4:00am.  This calls p_AlertAutoResolve twice…. once with a parameter of “0” and once with a parameter of “1”.

         

        Here is the sql statement:

         

        IF (@AutoResolveType = 0)

            BEGIN

                SELECT @AlertResolvePeriodInDays = [SettingValue]

                FROM dbo.[GlobalSettings]

                WHERE [ManagedTypePropertyId] = dbo.fn_ManagedTypePropertyId_MicrosoftSystemCenterManagementGroup_HealthyAlertAutoResolvePeriod()

         

                SET @AutoResolveThreshold = DATEADD(dd, -@AlertResolvePeriodInDays, getutcdate())

                SET @RootMonitorId = dbo.fn_ManagedTypeId_SystemHealthEntityState()

           

                -- We will resolve all alerts that have green state and are un-resolved

                -- and haven't been modified for N number of days.

                INSERT INTO @AlertsToBeResolved

                SELECT A.[AlertId]

                FROM dbo.[Alert] A

                JOIN dbo.[State] S

                    ON A.[BaseManagedEntityId] = S.[BaseManagedEntityId] AND S.[MonitorId] = @RootMonitorId

                WHERE A.[LastModified] < @AutoResolveThreshold

                AND A.[ResolutionState] <> 255

                AND S.[HealthState] = 1

         

        <snip>

         

            ELSE IF (@AutoResolveType = 1)

            BEGIN

                SELECT @AlertResolvePeriodInDays = [SettingValue]

                FROM dbo.[GlobalSettings]

                WHERE [ManagedTypePropertyId] = dbo.fn_ManagedTypePropertyId_MicrosoftSystemCenterManagementGroup_AlertAutoResolvePeriod()

         

                SET @AutoResolveThreshold = DATEADD(dd, -@AlertResolvePeriodInDays, getutcdate())

         

                -- We will resolve all alerts that are un-resolved

                -- and haven't been modified for N number of days.

                INSERT INTO @AlertsToBeResolved

                SELECT A.[AlertId]

                FROM dbo.[Alert] A

                WHERE A.[LastModified] < @AutoResolveThreshold

                AND ResolutionState <> 255

         

         

So we are basically checking that ResolutionState <> 255….. not specifically “New” (0) as the wording in the interface would lead you to believe.  There are simply two types of auto-resolution:  resolve all alerts where the object has returned to a healthy state in “N” days….. and resolve all alerts no matter what, as long as they haven’t been modified in “N” days.
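Before you change the auto-resolve periods, you can preview which open alerts the type-1 pass (resolve regardless of health state) would sweep up.  This is a read-only approximation of the logic from p_AlertAutoResolve quoted above, limited to columns that appear in that procedure; set @Days to your configured auto-resolve period:

```sql
-- Preview the "resolve all alerts not modified in N days" pass.
-- Read-only; does not change any resolution states.
DECLARE @Days INT
SET @Days = 30   -- substitute your configured auto-resolve period
SELECT A.AlertId, A.ResolutionState, A.LastModified
FROM dbo.Alert A WITH (NOLOCK)
WHERE A.LastModified < DATEADD(dd, -@Days, GETUTCDATE())
AND A.ResolutionState <> 255
ORDER BY A.LastModified
```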

      • Configuring Notifications - to include specific alerts from specific groups and classes

        So.... Say I am an Exchange Administrator in a global company.... in the good old USA.

        My company has recently implemented OpsMgr 2007 to monitor our Exchange servers.  I am going to configure my notification subscriptions so I can get an email anytime one of my Exchange servers has an issue.

Try #1:  I start by creating a notification subscription, and I don't scope it by groups or classes (all groups, all classes).  This sounds fine at first.  However, I instantly find I am flooded with email notifications from every single alert coming into the console.  This is NOT good!

Try #2:  Therefore – I decide I really need to see only Exchange alerts.  I scope the notification *classes* down to just Exchange classes.  This will ensure I only receive notifications from Exchange target classes.  Good?  Nope....  I soon find that when an alert comes in from the base OS, heartbeat, or hardware, I won’t get it.  We need to add those classes back.  But if we add the heartbeat (Health Service Watcher) class – we will now get heartbeat failures for ALL machines… not just the Exchange servers.  No good.

        Try #3:  So – we need to scope the subscription using groups.  We create a group with all our Exchange Server Windows Computer objects in it.  We can manually add these in (Explicit) or we can use a dynamic rule based on criteria - I chose NetBIOS name, and used a naming standard of EX* (all my exchange servers start with "ex").  I used an "OR" statement since the wildcard is case sensitive.

        image

Now I create a subscription – and scope it to this group – and choose ALL classes....  thinking that this way, we should get ALL notifications, including base OS, Exchange, and heartbeat alerts… right?

        Nope.  Because of the object oriented monitoring model – we will only receive alerts from a rule/monitor with a target class that has a child relationship to the Windows Computer class.  This is the only class type in the group we created.  So – using the model in #3, we will get notifications from pretty much any class needed – except heartbeats.  These come from the Health Service Watcher class, and have no relation to the Windows Computer class.

        Try #4:  I am thinking, we must add the class type to our group – and any instances of that class we are interested in.  Since most object classes are a child of Windows Computer, there should not be many of these that we will have to do.

        In the group – add the Health Service Watcher display name instances, in the same way we add the Windows Computer NetBIOS names:

        clip_image002

        The AND/OR verbiage is misleading…. This was opened as a bug then closed – because it is “as designed”.

Essentially – the OR group at the top will include ANY of the AND groups below it….  BOTH the Windows Computer objects AND the Health Service Watcher objects are included (you can right-click any group and choose to show its members):

        clip_image004

I tested all kinds of Exchange alerts and heartbeat failures – and this works.  It is possible there will be other alerts we won't get in this subscription.... IF the rule or monitor that created the alert targets a class that is unique, and not a child of "Windows Computer".

        I don’t think this will be a huge hassle moving forward… because MOST alerting is done on a target which is a child of Windows computer.  If we find one that is not – we just need to go back and add that class’s instances to the groups we create for notifications.

         

        Want alert by alert notifications?  Where you can subscribe to a single alert, rule by rule, monitor by monitor?  Check out:

        http://code4ward.net/cs2/blogs/code4ward/archive/2007/09/19/set-notificationforalert.aspx

         

      • A report to show all agents missing a specific hotfix

        This is a continuation of my previous post on determining which agents are missing a hot-fix:

        How do I know which hotfixes have been applied to which agents-

I wrote up a report that allows you to paste a KB article number into the report as a parameter; it will then show all agents that are potentially missing that hotfix.  This will help you easily find agents which need to be patched but got missed for some reason.
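The core of the report is a query along these lines, comparing the KB number against the health service's patch list.  This is a hedged sketch against the Operational database: in OpsMgr 2007 the property is stored in a PatchList column on the MT_HealthService table, but in later versions the column name may carry a GUID suffix, so verify the names in your environment before relying on it:

```sql
-- Show agents whose health service patch list does NOT contain the KB number.
-- Read-only; run against the OperationsManager (Operational) database.
SELECT bme.DisplayName, mths.PatchList
FROM dbo.MT_HealthService mths
INNER JOIN dbo.BaseManagedEntity bme
    ON mths.BaseManagedEntityId = bme.BaseManagedEntityId
WHERE mths.PatchList NOT LIKE '%951380%'   -- paste your KB number between the % signs
ORDER BY bme.DisplayName
```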

        You can run this report if you create the SQL reporting data source as specified in my previous post:

        Creating a new data source for reporting against the Operational Database

        Once imported - it will show up in the console.  Open the report, and paste in any KB article number for a OpsMgr hotfix you have applied.  The number MUST begin and end with "%".... such as %951380% as shown:

         

        image

        The report is attached below:

      • Update Rollup 5 for System Center 2012 SP1 released

         

        System Center 2012 SP1 has shipped Update Rollup 5.

        http://support.microsoft.com/kb/2904730/en-us

There are updates available for OpsMgr 2012 SP1, VMM 2012 SP1, Orchestrator 2012 SP1, and DPM 2012 SP1 in this release.

        See the KB article for full details of each, with links to the individual updates and downloads.

      • UR1 for SCOM 2012 R2 – Step by Step

         

        KB Article:   http://support.microsoft.com/kb/2904678/en-us

        Download catalog site:  http://catalog.update.microsoft.com/v7/site/Search.aspx?q=2904678

         

        Key fixes:

        Issue 1 - An error occurs when you run the p_DataPurging stored procedure. This error occurs when the query processor runs out of internal resources and cannot produce a query plan.
        Issue 2 - Data warehouse BULK INSERT commands use an unchangeable, default 30-second time-out value that may cause query time-outs.
        Issue 3 - Many 26319 errors are generated when you use the Operator role. This issue causes performance problems.
        Issue 4 - The diagram component does not publish location information in the component state.
        Issue 5 - Renaming a group works correctly on the console. However, the old name of the group appears when you try to override a monitor or scope a view based on group.
        Issue 6 - SCOM synchronization is not supported in the localized versions of Team Foundation Server.
        Issue 7 - An SDK process deadlock causes the Exchange correlation engine to fail.
        Issue 8 - The "Microsoft System Center Advisor monitoring server" reserved group is visible in a computer or group search.
Issue 9 - Multiple Advisor Connectors are discovered for the same physical computer when the computer hosts a cluster.
        Issue 10 - A Dashboard exception occurs if the criteria that are used for a query include an invalid character or keyword.

        Xplat updates:

        Issue 1 - On a Solaris-based computer, an error message that resembles the following is logged in the Operations Manager log. This issue occurs if a Solaris-based computer that has many monitored resources runs out of file descriptors and does not monitor the resources. Monitored resources may include file systems, physical disks, and network adapters.
        Note The Operations Manager log is located at /var/opt/microsoft/scx/log/scx.log.     errno = 24 (Too many open files)    This issue occurs because the default user limit on Solaris is too low to allocate a sufficient number of file descriptors. After the rollup update is installed, the updated agent overrides the default user limit by using a user limit for the agent process of 1,024.

        Issue 2 - If Linux Container (cgroup) entries in the /etc/mtab path on a monitored Linux-based computer begin with the "cgroup" string, a warning that resembles the following is logged in the agent log.  Note When this issue occurs, some physical disks may not be discovered as expected.  Warning [scx.core.common.pal.system.disk.diskdepend:418:29352:139684846989056] Did not find key 'cgroup' in proc_disk_stats map, device name was 'cgroup'.

        Issue 3 - Physical disk configurations that cannot be monitored, or failures in physical disk monitoring, cause failures in system monitoring on UNIX and Linux computers. When this issue occurs, logical disk instances are not discovered by Operations Manager for a monitored UNIX-based or Linux-based computer.

        Issue 4 - A monitored Solaris zone that is configured to use dynamic CPU allocation with dynamic resource pools may log errors in the agent logs as CPUs are removed from the zone and do not identify the CPUs currently in the system. In rare cases, the agent on a Solaris zone with dynamic CPU allocation may hang during routine monitoring.  Note This issue applies to any monitored Solaris zones that are configured to use dynamic resource pools and a "dedicated-cpu" configuration that involves a range of CPUs.

Issue 5 - An error that resembles the following is generated on Solaris 9-based computers when the /opt/microsoft/scx/bin/tools/setup.sh script does not set the library path correctly. When this issue occurs, the omicli tool cannot run.  ld.so.1: omicli: fatal: libssl.so.0.9.7: open failed: No such file or directory

        Issue 6 - If the agent does not retrieve process arguments from the getargs subroutine on an AIX-based computer, the monitored daemons may be reported incorrectly as offline. An error message that resembles the following is logged in the agent log:   Calling getargs() returned an error

        Issue 7 - The agent on AIX-based computers considers all file cache to be available memory and does not treat minperm cache as used memory. After this update rollup is installed, available memory on AIX-based computer is calculated as: free memory + (cache – minperm).

        Issue 8 - The Universal Linux agent is not installed on Linux computers that have OpenSSL versions greater than 1.0.0 if the library file libssl.so.1.0.0 does not exist. An error message that resembles the following is logged:  /opt/microsoft/scx/bin/tools/.scxsslconfig: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory

         

        I have seen *several* customers having issues with the OpsDB grooming/purging process, so that looks like a good one to get implemented, especially if this was affecting you.

         

Let's get started.

        From reading the KB article – the order of operations is:

         

        1. Install the update rollup package on the following server infrastructure:
          • Management servers
          • Gateway servers
          • Web console server role computers
          • Operations console role computers
        2. Apply SQL scripts (see installation information).
        3. Manually import the management packs.

There are no agent updates in this UR1.  Agents will still be placed into Pending Management, but since there is nothing to apply, you must reject the agents in pending.

Now, we need to add another step – if we are using Xplat monitoring – we need to update the Linux/Unix MP’s and agents.

        4.  Update Unix/Linux MP’s and Agents.

         

        1.  Management Servers

Since there is no RMS anymore, it doesn’t matter which management server I start with.  There is no need to begin with whichever server holds the RMSe role.  I simply make sure I only patch one management server at a time, to allow for agent failover without overloading any single management server.

        I can apply this update manually via the MSP files, or I can use Windows Update.  I have 3 management servers, so I will demonstrate both.  I will do the first management server manually.  This management server holds 3 roles, and each must be patched:  Management Server, Web Console, and Console.

        The first thing I do when I download the updates from the catalog, is copy the cab files for my language to a single location:

        image

        Then extract the contents:

        image

        Once I have the MSP files, I am ready to start applying the update to each server by role.

        ***Note:  You MUST log on to each server role as a Local Administrator, SCOM Admin, AND your account must also have System Administrator (SA) role to the database instances that host your OpsMgr databases.

        My first server is a management server, and the web console, and has the OpsMgr console installed, so I copy those update files locally, and execute them per the KB, from an elevated command prompt:

        image

        This launches a quick UI which applies the update.  It will bounce the SCOM services as well.  The update does not provide any feedback that it had success or failure.  You can check the application log for the MsiInstaller events for that.

        You can also spot check a couple DLL files for the file version attribute. 

        image

         

        Next up – run the Web Console update:

        image

        This runs much faster.   A quick file spot check:

        image

        Lastly – install the console update (make sure your console is closed):

        image

        A quick file spot check:

        image

         

        Secondary Management Servers:

        I now move on to my secondary management servers, applying the server update, then the console update. 

        On this next management server, I will use Windows Update.  I check online, and make sure that I have configured Windows Update to give me updates for additional products:

        image

        This shows me two applicable updates for this server:

        image

I apply these updates (along with some additional Windows Server updates I was missing), and reboot each management server, until all management servers are updated.

         

        Updating Gateways:

        I can use Windows Update or manual installation.

        image

        The update launches a UI and quickly finishes.

        Then I will spot check the DLL’s:

        image

         

        2. Apply the SQL Script

        In the path on your management servers, where you installed/extracted the update, there is a SQL script file: 

        %SystemDrive%\Program Files\System Center 2012\Operations Manager\Server\SQL Script for Update Rollups

        Open a SQL management studio query window, connect it to your Operations Manager database, and then open the script file.  Make sure it is pointing to your OperationsManager database, then execute the script.

        ****Note – at the time of this writing – the KB article says to run this against the DataWarehouse – the KB article is in error

        image

        Click the “Execute” button in SQL mgmt. studio.  The execution could take a considerable amount of time and you might see a spike in processor utilization on your SQL database server during this operation.

        You will see the following (or similar) output:

        image

         

        3. Manually import the management packs?

        We have four updated MP’s to import  (MAYBE!).

        image

        The TFS MP bundles are only used for specific scenarios, such as DevOps scenarios where you have integrated APM with TFS, etc.  If you are not currently using these MP’s, there is no need to import or update them.  I’d skip this MP import unless you already have these MP’s present in your environment.

        The Advisor MP’s are only needed if you are using System Center Advisor services.

        However, the Image and Visualization libraries deal with Dashboard updates, and these need to be updated.

        I import all of these without issue.

         

         

        Reject the agent update

        Agents are placed into pending actions by this update.  HOWEVER – there are no updates for the agents in the Update Rollup.  You must REJECT the agents in pending, using the console or PowerShell.

         

        4.  Update Unix/Linux MPs and Agents

Next up – I download and extract the updated Linux MP’s for SCOM 2012 R2 UR1:

        http://www.microsoft.com/en-us/download/details.aspx?id=29696

        7.5.101 is current at this time for SCOM 2012 R2. 

        ****Note – take GREAT care when downloading – that you select the correct download for R2.  You must scroll down in the list and select the MSI for 2012 R2:

        image

         

        Download the MSI and run it.  It will extract the MP’s to C:\Program Files (x86)\System Center Management Packs\System Center 2012 R2 Management Packs for Unix and Linux\

        Update any MP’s you are already using.

        image

        You will likely observe VERY high CPU utilization of your management servers and database server during and immediately following these MP imports.  Give it plenty of time to complete the process of the import and MPB deployments.

        Next up – you would upgrade your agents on the Unix/Linux monitored agents.  You can now do this straight from the console:

        image

        image

        You can input credentials or use existing RunAs accounts if those have enough rights to perform this action.

        image

        I have an environmental issue that caused my Ubuntu server to fail. 

         

        5.  Update the remaining deployed consoles

        This is an important step.  I have consoles deployed around my infrastructure – on my Orchestrator server, on my personal workstation, on all the other SCOM admins on my team, on a Terminal Server we use as a tools machine, etc.  These should all get the UR1 update.

         

         

        Review:

        Now at this point, we would check the OpsMgr event logs on our management servers, check for any new or strange alerts coming in, and ensure that there are no issues after the update.

        image

        Known issues:

        See the existing list of known issues documented in the KB article.

        1.  Many people are reporting that the SQL script is failing to complete when executed.  You should attempt to run this multiple times until it completes without error.  You might need to stop the Exchange correlation engine, stop the services on the management servers, or bounce the SQL server services in order to get a successful completion in a busy management group.  The errors reported appear as below:

        ------------------------------------------------------

        (1 row(s) affected)

        (1 row(s) affected)

        Msg 1205, Level 13, State 56, Line 1

        Transaction (Process ID 152) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

        Msg 3727, Level 16, State 0, Line 1

        Could not drop constraint. See previous errors.

        --------------------------------------------------------

      • New MP Authoring tool released – “MP Author”

         

        A new MP authoring tool was announced today.  Read the release at http://blogs.technet.com/b/momteam/archive/2014/01/13/mp-blog-the-right-tool-for-the-right-job.aspx

This is a FREE tool which Silect is releasing.  It essentially replaces the functionality of the previous “Visio Management Pack Designer”.  It is targeted at the IT Pro, who needs to create custom management packs and author new classes, discoveries, rules, monitors, etc… but is not a developer.  This new tool will make simple work of creating a new class to monitor a specific application, quickly discovering it, and adding several types of monitoring.

        You can register and download here:  http://bridgeways.com/products/mp-author

One of the big benefits of this over the Visio tool is that it can open existing MP’s and make changes to them, where the Visio tool was a “one way” solution.  This new tool also expands on the types of workflows that are possible to create over the Visio tool.  If you were using the Visio MP designer, I’d recommend migrating to this new solution immediately.  If you considered but didn’t like the Visio designer – try this one out.

        This is the initial release, I imagine we will see additional capabilities as time progresses.  Keep in mind – this is meant for SIMPLE management packs, not a full development suite.  The Visual Studio authoring extensions are the right place for a more full featured management pack development environment.

        Here are some simple examples of using MP Author:

        Open MP Author.  Click “New” to create a new MP.

        image

        Most fields come pre-populated, but are simple to change. 

        image

        Provide a location for your new MP:

        image

        The MP Author automatically creates the necessary references, and you can add more if you need to reference classes in other MP’s:

        image

        image

        Now we can choose what we want to create from common templates.

        image

The MOST common choice should be “empty management pack”.  Even a “single server application” creates a class for our app, but it also creates an additional distributed application for each, and this is not commonly needed.  I’d prefer that “single server application” only create a single, simple class based on Microsoft.Windows.Local.Application, but that is open for discussion.  When we choose to create an empty MP, we still have full use of wizards to help create our MP.

        I choose Empty MP and click Next, Finish.

        Now – what I want to do is to create a class (or “Target”) in this MP to represent an application that I need to discover and monitor.  For this example, I will use the WINS server role.

        Go to “Targets” and choose “New Registry Target”

        image

        Connect to an existing WINS server to browse the registry of that machine.

        I will base the discovery of my class on the Registry value for the WINS service – in this case it is located at:

        HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\WINS\Start

        image

        When Start = 2 (automatic) I consider that a WINS server.  Click OK, Next.

        Provide an ID and displayname for the Class, or accept defaults:

        image

        Provide an ID and displayname for the Discovery, or accept defaults:

        image

        Validate or modify the expression for class membership:

        image

        Set schedule for once every day, and Finish.

        SAVE YOUR MP AT THIS POINT.  We’d hate to lose all our work.  Smile

        Now we can quickly add in some event rules, service monitor, performance monitors, etc….  When happy, right-click the top level folder for your MP in the left tree view, and choose Import Management pack:

        image

         

We provide a management server name and credentials.  Mine popped up and said it could not validate my creds, which was odd.  The next screen shows which referenced MP’s must also be imported.  This appears a little odd to me, because these MP’s are already imported in my environment anyway.  This operation crashed on my PC, so there might be some issues to work out yet on this process.  No bother, I’d rather manually import anyway.  Unfortunately I didn’t save my work FIRST, so off I go to recreate what we just did.  Smile

        image

        I manually import my MP, and I can view my discovered servers using Discovered Inventory for my new class:

        image

        Could not be any easier to create classes for granular targeting of applications, and creating common authored workflows to rapidly provide monitoring.

      • Auditing on Alerts from the Data Warehouse

        Do you want auditing information on how many alerts are being closed or modified by your OpsMgr users?

        You can use the following queries to get this information from the data warehouse, and I have attached some reports below as well:

        To get all raw alert data from the data warehouse to build reports from:

        select * from Alert.vAlertResolutionState ars
        inner join Alert.vAlertDetail adt on ars.alertguid = adt.alertguid
        inner join Alert.vAlert alt on ars.alertguid = alt.alertguid

        To view data on all alerts modified by a specific user:

        select ars.alertguid, alertname, alertdescription, statesetbyuserid, resolutionstate, statesetdatetime, severity, priority, managedentityrowID, repeatcount
        from Alert.vAlertResolutionState ars
        inner join Alert.vAlert alt on ars.alertguid = alt.alertguid
        where statesetbyuserid like '%username%'
        order by statesetdatetime

        To view a count of all alerts closed by all users:

        select statesetbyuserid, count(*) as 'Number of Alerts'
        from Alert.vAlertResolutionState ars
        where resolutionstate = '255'
        group by statesetbyuserid
        order by 'Number of Alerts' DESC

In the reports I have attached, you can pick a date and a time window, and run these same basic queries.
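The attached reports essentially wrap the queries above with a time window on StateSetDateTime.  Here is a hedged sketch of that variation against the data warehouse (the hard-coded dates are placeholders; the reports use parameters instead):

```sql
-- Count of alerts closed per user within a chosen window (data warehouse).
-- Read-only; the dates below are example values only.
DECLARE @Start DATETIME
DECLARE @End DATETIME
SET @Start = '2014-01-01'
SET @End   = '2014-02-01'
SELECT StateSetByUserId, COUNT(*) AS [Number of Alerts]
FROM Alert.vAlertResolutionState ars
WHERE ResolutionState = '255'
AND StateSetDateTime BETWEEN @Start AND @End
GROUP BY StateSetByUserId
ORDER BY [Number of Alerts] DESC
```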

        image

        image

        Files attached below: