Many times, we would like to collect information for reporting, or measure and alert on something. Normally, we use Windows performance counters (perfmon) to do this. But what do we use when a perfmon object/counter/instance doesn't exist?
This post is an example of how to collect WMI information, and insert it into OpsMgr as performance data. From there we can use it in reports and create threshold monitors.
For starters... we need to find the location of the data in WMI. We can use wbemtest to locate it and test our query.
Hit "Connect" and connect to root\cimv2.
For this example - I am going to look at the Win32_OperatingSystem class.
Using Enum Classes, Recursive, I find the class. I notice the class has a property of "NumberOfProcesses". That will do well for this example since the output will be an Integer.
I form the query.... select NumberOfProcesses from Win32_OperatingSystem
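Before building the rule, it can be worth validating the query from PowerShell on a target agent. This is just a sketch using the Get-WmiObject cmdlet with the same namespace and query text:

```powershell
# Run the same query wbemtest validated, against the local machine
Get-WmiObject -Namespace "root\cimv2" `
    -Query "select NumberOfProcesses from Win32_OperatingSystem"
```

If this returns a NumberOfProcesses value, the rule's WMI configuration should work with the identical namespace and query.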
Ok.... we know our WMI query we want.... now lets dive into the console.
We will start by creating a performance collection rule.... for this query output. Authoring pane, Create a New Rule, Collection Rules, Performance Based, WMI Performance.
Give the rule a name (in accordance with your documented custom rule naming standards), then change the Rule Category to "PerformanceCollection", and then choose a target. I am using Windows Server for this example.
Click Next, and on the WMI Namespace page, enter your namespace, query, and interval. The interval in general should be no more than every 15 minutes, unless you really need a large amount of raw data for reporting. I am using every 10 seconds for this example only.... this is not generally recommended, because of the large amount of perf data that would flood the database if we targeted all Agents or Windows Servers.
The last screen, and most confusing.... is the Performance Mapper. This is where we will give the rule the information it needs to populate the data into the database as ordinary performance data.
First - we need to make up custom names for Object, Counter, and Instance. Just like data collected from perfmon, we need to supply this.... so I will make up a name for each that makes sense. I will use "NULL" for Instance, as I don't have any instance for this type of data in my example.
For the Value field, this is where we will input a variable, which represents our query output. In general, following this example, it will be $Data/Property[@Name='QueryObject']$ where you replace "QueryObject" with the name of the property you queried from the WMI class. So for my example, we will use:
$Data/Property[@Name='NumberOfProcesses']$
Click "Create" and we are done! How do we know if it is working?
Well, we can create a new Performance view and go look at the data it is collecting:
Create a new Custom Performance View in My Workspace. Scope it to Windows Server (or whatever you targeted your rule to). Then check the box for "collected by specific rules" and choose your rule from the popup box. As long as you chose "PerformanceCollection" as the rule category, it will show up here.
And check out the Performance view - we have a nice snapshot and historical record of number of processes from WMI. Also note the custom performance Object, Counter, and Instance being entered from our rule:
Ok - fun is over. Let's use WMI to monitor for when an agent has more than 40 processes running!
Create a Unit Monitor. WMI Performance Counters, Static Thresholds, Single Thresholds, Simple Threshold. We will fill out the Monitor wizard exactly as we did the Rule above. However, on Interval, since this monitor will only be inspecting the data on an agent, and not collecting the performance data into the database, we can use a more frequent interval. Checking every 1 minute is typically sufficient. Fill out the Performance Mapper exactly as we did above.
Now.... on the threshold value... I want to set a threshold of 40 processes in this example.
Over 40 = Critical, and under = Healthy. Sounds good.
On the Configure Alerts pane, I am going to enable Alerts, and then add Alert Description Variables from http://blogs.technet.com/kevinholman/archive/2007/12/12/adding-custom-information-to-alert-descriptions-and-notifications.aspx
Create it - and lets see if we get any alerts:
Yep. Works perfectly:
A quick check of Health Explorer shows it is working as designed:
When you apply a hot-fix to a RMS, or Management Server, or Gateway server... a couple things will happen. First... it will update the server itself with whatever the hot-fix is supposed to fix... registry, DLL's, database updates, etc. Next, if the update needs to flow down to all agents... it will place a MSP file in the \AgentManagement directory under the OpsMgr installation directory.
Then, it will put the agents that report to the hot-fixed management server, into pending actions for the update. It will only place the agents reporting to that MS/RMS into pending... not all agents. For this reason - you really should patch ALL your RMS, MS, and GW's first, before approving any agents.
Then, when you "approve" an agent for the update... what it does is actually reinstall the agent, from its management server, then apply any update MSP's that are present, and that are not already installed.
So - when you apply a hot-fix to a management group - before approving any agents, it is a good idea to check your \AgentManagement directories on all MS/GW roles, and make sure the \x86 and \AMD64 folders have consistent AND CORRECT patch files present.
When you "approve" agents for the update... or perform a "repair", we recommend only doing 200 agents at a time, max. Phase the updates out in batches.
Then, use the "Patch List" view described in my previous blog post, to ensure all agents got updated. For agents that still need to be updated, simply run a "Repair" on those from the console, or patch them manually.
Any new agents that get pushed will automatically get the current hot-fixes applied, as long as the hot-fix MSP's are present in the \AgentManagement directory. However, manually installed agents must be hot-fixed manually.
Lastly... on the current batch of hot-fixes.... 950853 and 951380 BOTH update the SAME file.... mommodules.dll. 950853 (memory leak) updates this file to 6.0.6278.11, and 951380 (cluster discovery) updates the same file to 6.0.6278.20. IF you are planning on applying both of these fixes... technically, you only need the latter, since it includes the previous fix.
Update 10-15-2008
Now - if you are applying 954903.... this contains mommodules.dll 6.0.6278.36 which supersedes BOTH 951380 and 950853.... so if you need all three hotfixes - just apply 954903. However - note in the picture below, if you apply two hotfixes that update the same file, the management server \AgentManagement directory still keeps the older one.... apparently the hotfix process does not understand that they update the same file, nor does it clean out the older 951380. The problem with this - is that any major agent deployment will get impacted... because we will add to the install time, and increase the network impact. In this example - an agent push will be copying over the agent MSI (9MB) plus each hotfix in this directory.... while we don't have any direct guidance on this area - I would recommend removing the older hotfixes that no longer apply, or are superseded by other hotfixes already in this directory.
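If you want to verify which version of the module actually landed on a given server, you can check the file version directly. A quick PowerShell sketch; the path below assumes a default OpsMgr 2007 install location, so adjust it for your environment:

```powershell
# Check the installed MOMModules.dll file version (install path is an assumption)
(Get-Item "C:\Program Files\System Center Operations Manager 2007\MOMModules.dll").VersionInfo.FileVersion
```

Comparing this against the versions above tells you which hotfix is actually in effect on that server.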
I have a Surface RT device (the original) and absolutely love it. I use it every day. However, one of the challenges I had been dealing with, after the upgrade for Windows 8.1 came out, was that the battery was always dead when I picked it up. Before, I could go 3 to 5 days between charges, depending on how much I was using it. Now, it would discharge in standby within 24 hours!
To track this – you can create a Battery Report. Open an elevated command prompt on the SurfaceRT device, and type in:
powercfg /batteryreport
This will save an HTML file as seen above.
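By default the report is written to the current directory; powercfg also accepts an output path if you want to control where the HTML file lands:

```
powercfg /batteryreport /output "C:\Temp\batteryreport.html"
```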
This is a pretty cool report that will show some interesting statistics about your battery. But it will also show your periods of use, and how much the battery drains during connected standby:
In the table above, I can see I entered standby at around 9pm on the 17th, and when I picked it back up around 4pm the next day, the battery was almost dead! The chart in the report shows this as well, pretty cool:
What's the fix???
If you look on this page: http://www.microsoft.com/Surface/en-US/support/hardware-and-drivers/battery-and-power
There is an interesting section at the bottom:
Surface RT only: Battery issue when updating from Windows RT 8.1 Preview
If you updated Surface RT from Windows RT 8.1 Preview to Windows RT 8.1, you may notice a decrease in battery life. During the update, the wireless adapter power policy isn’t migrated. Instead, the power policy is set to a default value that consumes more power both during use and in the connected standby state.
To restore the wireless adapter power policy to the correct settings, open an administrator command prompt:
Step 1: Swipe in from the right edge of the screen, and then tap Search. (If you're using a mouse, point to the lower-right corner of the screen, move the mouse pointer up, and then click Search.)
Step 2: In the search box, enter command prompt.
Step 3: Touch and hold (or right-click) Command Prompt to bring up the context menu. Tap or click Run as administrator.
Step 4: On the User Account Control dialog box, tap or click Yes.
Step 5: At the Administrator: Command Prompt, enter the following:
powercfg -setdcvalueindex SCHEME_CURRENT 19cbb8fa-5279-450e-9fac-8a3d5fedd0c1 12bbebe6-58d6-4636-95bb-3217ef867c1a 3
Step 6: Then enter powercfg -setactive scheme_current
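To confirm the new value took effect, you can query the current scheme for that same wireless adapter power settings subgroup (the GUID is the one from the Microsoft steps above):

```
powercfg /q SCHEME_CURRENT 19cbb8fa-5279-450e-9fac-8a3d5fedd0c1
```

After applying the fix, the "Current DC Power Setting Index" for the power saving mode setting should read 0x00000003 (Maximum Power Saving).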
Voila…. this fixed mine immediately. And yes, I did update from 8.1 preview to the full release of 8.1. My surface can now sit in connected standby mode for an entire day and only consume about 10% of the battery life.
Ok, not really an upgrade, but more of a “replacement”.
With the release of Windows Server 2012 R2 to MSDN which was recently announced HERE, it is time for me to upgrade my lab domain controllers to Windows Server 2012 R2.
I started by first “upgrading” my Hyper-V hosts to Windows Server 2012 R2. This would allow me to take full advantage of all the new benefits of 2012 R2 for Hyper-V. That was pretty simple, just shut down the OS, unplug all my additional storage in the machine which contains all my VM’s, and boot from my USB key that contained WS2012R2. Then, once I added the Hyper-V role back, I simply connect my storage back to the system, and import the previous VM’s I was running.
My next step in upgrading my VM’s is targeting the domain controllers. I have two DC’s, each running AD services, certificate services, DHCP, DNS, etc. Since I don’t want to risk messing up the complex configuration of each service, I choose to deploy two NEW VM’s for additional DC’s, and I will migrate these additional roles to the new DC’s later.
My first step is to deploy the two new VM’s. First decision I need to make is whether to use Gen1 or Gen2 VM’s:
Gen2 VM’s are a new feature of Hyper-V in Windows Server 2012 R2, and offer significant advantages over Gen1 VM’s, such as secure boot, discarding emulated devices like IDE and using SCSI disks even for the boot volumes, PXE capability on a standard NIC, etc. Read more about Gen2 VM’s here: http://technet.microsoft.com/en-us/library/dn282285.aspx
Installing Windows Server 2012 R2 is just like any other OS install. When it stops on the Activation Key screen, I decided to leverage another new feature for Windows Server 2012 R2 – Automatic VM Activation. You can use these new keys to activate servers when they are running on Windows Server 2012 R2 Hyper-V. Read more about Automatic VM Activation here: http://technet.microsoft.com/en-us/library/dn303421.aspx
I rename the VM’s with the correct server names, and join them to my domain.
The first step in promoting these new VM’s to Domain Controllers is to add that role, which you can perform from Server Manager. A walkthrough of the process is described here: http://technet.microsoft.com/en-us/library/jj574134.aspx
When the role is added – you will see a post-deployment task warning, to run the promotion:
The wizard will run AD forest prep, schema update, and domain prep for 2012 R2 when you promote the first DC on Windows Server 2012 R2.
When it is complete, you will see your new DC’s added to the domain controllers OU in Active Directory.
The next step in the process is to migrate the AD Operations Master (FSMO) roles. The simplest way to move these roles is via PowerShell. With the Server 2012 AD PowerShell module, this can be done from any machine. Simply run the following command to view your current configuration, then change them:
PS C:\> netdom query FSMO
Schema master               DC1.opsmgr.net
Domain naming master        DC1.opsmgr.net
PDC                         DC1.opsmgr.net
RID pool manager            DC1.opsmgr.net
Infrastructure master       DC1.opsmgr.net
Then use the Move-ADDirectoryServerOperationMasterRole cmdlet to move them. You can do this with a simple one-liner!
Move-ADDirectoryServerOperationMasterRole -identity "DC01" -OperationMasterRole 0,1,2,3,4
The identity is the server you want to transfer these roles to, and the 0-4 numerics represent each role to move. Read more about this cmdlet here: http://technet.microsoft.com/en-us/library/ee617229.aspx
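The numeric values map to role names, so the same one-liner can also be written with the roles spelled out, which reads better in scripts (a sketch; replace "DC01" with your target DC):

```powershell
# 0=PDCEmulator, 1=RIDMaster, 2=InfrastructureMaster, 3=SchemaMaster, 4=DomainNamingMaster
Move-ADDirectoryServerOperationMasterRole -Identity "DC01" `
    -OperationMasterRole PDCEmulator,RIDMaster,InfrastructureMaster,SchemaMaster,DomainNamingMaster
```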
When complete, you can run a “netdom query FSMO” again and ensure that your master roles have been moved successfully.
Then, you simply need to migrate any other roles or services running on the DC’s, then demote them when complete. To demote a domain controller on Server 2012, simply begin by removing the Active Directory Domain Services role, which will prompt you to demote first with a task link. Once demoted, you can remove the server from the domain.
The following article will cover a basic install of Data Protection Manager 2012 R2. A dedicated DPM server, and shared SQL server will be deployed. This is to be used as a template only, for a customer to implement as their own pilot or POC, or customized deployment guide. It is intended to be general in nature and will require the customer to modify it to suit their specific data and processes.
This is not an architecture guide or intended to be a design guide in any way. This is provided "AS IS" with no warranties, and confers no rights. Use is subject to the terms specified in the Terms of Use.
Server Names\Roles:
Windows Server 2012 R2 will be installed as the base OS for all platforms. All servers will be a member of the AD domain.
SQL 2012 with SP1 will be the base standard for all database and SQL reporting services.
High Level Deployment Process:
1. In AD, create the following accounts and groups, according to your naming convention:
DOMAIN\DPMAdmins – DPM Administrators group
DOMAIN\SQLSVC – SQL service account
2. Add the domain user accounts for yourself and your team to the “DPMAdmins” group.
3. Install Windows Server 2012 R2 to all server role servers.
4. Install Prerequisites and SQL 2012 with SP1.
5. Install the DPM Server
6. Install the DPM Central Console
7. Deploy Agents
8. Configure the Central Console
Prerequisites:
1. Install Windows Server 2012 R2 to all Servers
2. Join all servers to domain.
3. Install all available Windows Updates.
4. Add the “DPMAdmins” domain global group to the Local Administrators group on each server.
5. On the DPM server, .NET 3.5 SP1 is required. Setup will not be able to add this feature on Windows Server 2012. Open an elevated PowerShell session (run as an Administrator) and execute the following:
Add-WindowsFeature NET-Framework-Core
***Note – .NET 3.5 source files are removed from the WS2012 R2 operating system. You may need to supply a source path to the Windows Server 2012 R2 installation media, such as: Add-WindowsFeature NET-Framework-Core –Source D:\sources\sxs
6. On the SQL server, install the SQL Remote prep. http://technet.microsoft.com/en-us/library/hh758058.aspx Run the DPM Setup.exe, then from the screen choose “DPM Remote SQL Prep”.
7. On the DPM server, install SQL Management Studio. This is located on the media at \SCDPM\SQLSVR2012SP1\SQLManagementStudio_x64_ENU.exe. Execute this and walk through the wizard: Installation, New SQL installation, and accept defaults.
8. Install SQL 2012 with SP1 to the DB server role.
Step by step deployment guide:
1. Install the DPM Server role on SCDPM01. You can also refer to: http://technet.microsoft.com/en-us/library/hh758153.aspx
2. Install the Central Console.
3. Add DPM storage.
4. Install protection agents
5. Create a Protection Group
6. Protect SQL Server
7. Protect Hyper-V Virtual Machines
8. Protect SharePoint
9. Backup DPM with Windows Azure
Validate your protection is working. Look at protection groups, and view the monitoring jobs and alerts in the console.
After enough time has passed, you will see new data in the Central (SCOM) Console. Such as discovered disks, Protection groups, Protected servers, etc.
10. Enable End User Self Service Recovery
**Note – if you have already updated your schema previously for DPM in the past, you don’t need to do this step again.
There has always been a bit of confusion on when to run the DBCreateWizard.exe tool, or when to just use SetupOM.exe to create the Operational DB or Data Warehouse DB.
Historically.... in MOM 2005, we used the DBcreate Wizard in order to create the Onepoint database on Active/Active clusters..... or when SQL DBA teams refused to run a MSI based setup on one of their SQL servers. The DB create wizard was a better option for them.... since it did not have to install any binaries on a SQL server. In practice.... it was pretty rare to see this in widespread use.
In OpsMgr 2007, we haven't really documented all the scenarios for when you should run the DBcreate Wizard.... and I will try and do that here.
The DBCreateWizard is located on the CD - in the \SupportTools folder. It does require some additional files to run - these don't have to be "installed"; they just need to be copied over to the SQL DB server where you will run the wizard. Follow: http://support.microsoft.com/kb/938997/en-us
*** Note - the additional files required to run DBCreateWizard.exe are documented in the KB article above. They were also provided on the SP1 Select CD. However - the files provided on CD are for 32bit x86 only. If you are using the DBCreateWizard on a x64 platform - you MUST copy these files listed in the KB article from an x64 server.... any x64 server with the console installed will have them.
Note - there were some significant issues with the RTM version of this tool... in detecting the correct SQL instance on a multi-instance cluster, and leaving some table information blank (http://support.microsoft.com/kb/942865/en-us). When deploying SP1 - Use the SP1 version of this tool. If you MUST deploy the RTM version - I would recommend using SetupOM.exe for all installs.
Ok.... first, you will notice the OpsMgr Deployment guide instructs you to use the DBCreateWizard when installing the database on an Active/Passive cluster. That's pretty much our first introduction to this tool. While this isn't required (you can simply run SetupOM.exe on the active node) it is recommended to use DBCreateWizard. Essentially, our recommendation is that anytime you have a dedicated SQL server for the OpsDB role... with no other OpsMgr role present, you should use the DBCreateWizard to create the Operational database. The reason, from an internal discussion I have been involved in.... is that using SetupOM.exe will create some additional registry entries on the database server... and will change how updates are applied to the server from an OpsMgr perspective. Another scenario to leverage this tool is anytime your SQL DBA teams refuse to allow you to run an MSI based setup on their SQL servers/clusters.
Below, I will just walk through some of the scenarios where using this stand-alone tool really makes good sense.
Scenarios:
1. All in one role/shared roles. This is where a single server hosts SQL Server 2005 and the Operational Database role, along with the RMS role. In this case.... you might as well just run SetupOM.exe and create the database while installing the management group. You potentially could run the DBcreatewizard first.... but this would be an additional step and provides no value.
2. Split roles: Dedicated SQL server (Server A) and dedicated RMS (Server B). In this scenario - we recommend using DBcreatewizard.exe instead of just running SetupOM.exe on the SQL server. However - you certainly can do either one.... both are fully supported.
3. Split roles - clustered DB: Dedicated cluster for SQL (can be A/P or A/A or multi-instance or multi node.... doesn't matter) In this scenario - we recommend using DBcreatewizard.exe instead of just running SetupOM.exe on the SQL server. That said.... you can run SetupOM.exe on any node that owns the SQL instance you are creating the DB in.... we just favor using DBcreateWizard.
4. Draconian DBA's. In general.... DBA's are used to creating an empty database for an application, then granting permissions to the DB only.... then washing their hands of it. They don't like running setups... or even running tools on their SQL servers.... If they must have an application create a database as part of that application install - they MUCH prefer that all the DB creation be handled remotely. Unfortunately.... MOM 2005 and OpsMgr 2007 do not support what DBA's would most like to see. We must run our setup or tool on the database server/node in order to install that component. I suppose we could install the OpsDB using the DBCreateWizard on a test lab SQL box, then detach it.... then hand the files to a SQL team and have them drop it into a production environment to make them happier.... but I haven't really done much testing there. Anyway.... the DBCreateWizard is the best option when working with a rigid DBA team. Just follow the KB article listed above... and have the SQL team run the tool to create the database.... then they can delete the tool from the server. We will still require SA priv over the instance to complete the RMS setup.... but once that is done, they can remove these advanced rights, per my previous post: http://blogs.technet.com/kevinholman/archive/2008/04/15/opsmgr-security-account-rights-mapping-what-accounts-need-what-privileges.aspx
5. Multiple Operational Databases in the same SQL instance. It is possible, if you have multiple management groups, that you could place all the Operational DB's into a single SQL instance. Now - these had better be small environments (test/dev), or a beefy SQL server to handle all that I/O.... but just for grins.... let's say you are doing it. If you tried to run SetupOM.exe and install the database component multiple times.... it would detect it was already installed and ask you if you wish to repair or remove OpsMgr. No good. In comes the DBCreateWizard. This tool is the supported method for creating multiple OpsDB's in a single SQL instance.
I have had a few requests now for this, so I thought I would take the time to write up the process.
Let's say I have three support levels of servers:
Level 1 – servers critical to business operations (ex: customer facing web applications, SQL back-ends)
Level 2 – important servers (ex: messaging, internal apps)
Level 3 – non-essential servers (ex: non-critical or highly redundant internal apps)
Let's say we want to create overrides for certain rules… where we will page on anything in the Level 1 group, email notify on the Level 2 group, and simply alert for Level 3. Possibly we want to create views, and only see alerts for Level 1 servers. Perhaps we wish to scope users so they only see Level 1 and Level 2 servers in the console?
Well – the first step is to place these servers into groups.
Sure – we can do this manually, with explicit assignments to the group. But that is resource intensive over time, and we might miss one down the road. I’d prefer to dynamically create the groups of Windows Computers based on a name…. but this can be difficult sometimes – where we don't have a solid naming scheme, or other criteria to group by.
I will demonstrate another way to accomplish this… by coming up with a business process to use a registry key on your managed servers, and collect this registry attribute with SCOM. Then – use this Registry attribute for dynamic group memberships.
Ultimately – there are three simple steps to this process:
1. Create registry keys on agents.
2. Extend a class with an attribute, to discover the registry keys and values.
3. Create dynamic groups based on the attribute values from the registry.
It is just that simple.
To get started – lets talk about our custom registry key. For this example, I am going to create a new Key at HKLM\Software\ and call it “CompanyName”
Next – in that key – I will create a new DWORD Value, named “SupportLevel”
Lastly – I will assign a numeric value to “SupportLevel” on each server, either 1, 2, or 3.
In my environment…. my Hyper-V servers are critical. They host all of my VM’s, including many business critical applications. Therefore – they will get Level 1.
My Exchange 2007 servers handle all my mail traffic and notifications, so I will set their registry value to Level 2.
My Exchange 2003 servers have been retired – for MP testing only… so we will set those to Level 3.
Here is a table that shows what I am planning:
So – I get all my registry values set on all computers. This is a big job at first, but it is a one-time deal, and you can even script it if you are handy.
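For instance, a minimal PowerShell sketch to create the key and set the value on a server (run elevated; the key path and value name match the example above, and the -Value would differ per support level):

```powershell
# Create HKLM\Software\CompanyName and set the SupportLevel DWORD (1, 2, or 3)
New-Item -Path "HKLM:\SOFTWARE\CompanyName" -Force | Out-Null
New-ItemProperty -Path "HKLM:\SOFTWARE\CompanyName" -Name "SupportLevel" `
    -PropertyType DWord -Value 1 -Force | Out-Null
```

Run per server (or wrap it in a remoting loop), substituting the appropriate level for each machine in the table.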
Next… we need to discover these registry entries in SCOM, as attributes of a class. Then we can use that attribute to group objects. Since I want Windows Computer objects in my groups (Windows Computer is a good object for most overrides, scoping, notifications… etc.) we would like to have these attributes added to the Windows Computer class.
However – there is a problem. The Windows Computer object is in a sealed MP. We cannot just add information to that class as we would like. Therefore – OpsMgr allows us to “Extend” an existing class… and add our custom attributes to it. This “Extended” class is basically a copy of the existing class… it will have all the built-in attributes of Windows Computer, and will also have our custom attribute properties. It is easier to see it than to talk about it.
First – in the Ops console – authoring pane – go to Attributes. Create a new attribute. I am going to call this one “SupportLevel”
Next – choose “Registry” for the discovery type.
Next – We need to pick the Target class. We want Windows Computer. Note – this will create a new class, named “Windows Computer_Extended” by default. We can use this name, or you can rename this whatever you want. It is your class. I will leave it at the default.
Most important! Management Pack location.
This is CRITICAL. Spend some time making sure you are creating these attributes in the correct location. If you leave this MP as unsealed XML…. then any groups you create that use these attributes will have to be placed in this same MP. Then – if you use these groups for Overrides – those overrides will be forced to go in this same MP. There is a “cardinal rule” in SCOM… objects in one unsealed MP cannot reference another unsealed MP. So – we cannot have a group in one unsealed MP, and then use that group for an override in another unsealed override MP.
So – we have two choices.
1. Keep an unsealed MP… and live with the fact that attribute, group, and override will all have to be placed here.
2. Create the attribute and the dynamic group in the MP, then seal it. Then – you can use this group in ANY of your override MP’s… for Exchange, SQL, etc…
I strongly recommend option #2 for this exercise… but you can make this decision for yourself.
Ok…. I will choose Option #2 (seal the MP), so I will create a new MP just for this extended class, and groups.
On the next screen – we can put in our registry information:
In this example – I am looking for a registry Value (1, 2, or 3), and my attribute type is “Int” for integer.
For the frequency, set this to a reasonable interval to discover your machines as they come onto your network. Typically, once per day is sufficient (86400 seconds). Remember – this will run against ALL your Windows Computers… so never set this more frequently than once per hour… that creates unnecessary overhead.
Ok – lets examine our work!
Go to Monitoring, Discovered Inventory, and change target type to our new class “Windows Computer_Extended”
If you do this quickly – you may find it is empty. This is what is happening behind the scenes: All Windows Computers are now downloading our newly created MP. They are going to run the registry attribute discovery, and submit their discovery data to the management server. The Management Server will insert this discovery data in the database. Over time, you will start to see all your Windows Computers pop into this class membership. You will notice a new attribute now, in addition to all the existing Windows Computer attributes. This attribute is “SupportLevel” and will be 1, 2, 3, or empty… depending on what each agent finds in the registry.
Now – I set my registry discovery to once per day…. so I will need to wait 24 hours before I can expect all my healthy agents to show up in this list. To speed things up – I am going to bounce the HealthService on these example agents. (Agents run all discoveries when a HealthService restarts, and then on their frequency schedule)
Here is an example a few minutes after bouncing the HealthService on some agents:
Next on the list – create the groups. I will create these in the same MP that the attributes exist in.
I will call my first group “CompanyName – Support Level 1 Servers Group”. I like to append the word “Group” to all groups I create as a best practice. This helps us determine this group class is actually a group when we see it in the list of classes in the UI. I sure wish all MP authors would take this to heart, since every group is actually a singleton class.
On the dynamic members screen – I will find my “Windows Computer_Extended” class – and click Add. What we now see – is that we have a new attribute to use, “Support Level”
I will set this group to “SupportLevel Equals 1” and click OK.
Now – I can right-click my new group – and choose “View Group Members”
Yee-haw! It works! Now – I simply repeat this above step – creating groups for SupportLevel 2, and 3.
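Under the covers, the group wizard writes a GroupPopulator membership rule into the MP XML. For the curious, here is a rough sketch of what that rule looks like – the element IDs, relationship name, and property reference shown are hypothetical placeholders, and your generated MP will use its own:

```xml
<!-- Illustrative sketch only: IDs and names below are made up.
     The UI generates the real ones when you create the group. -->
<Discovery ID="CompanyName.SupportLevel1.Group.DiscoveryRule"
           Target="CompanyName.SupportLevel1.Group" Enabled="true">
  <DataSource ID="GroupPopulationDataSource"
              TypeID="SC!Microsoft.SystemCenter.GroupPopulator">
    <RuleId>$MPElement$</RuleId>
    <GroupInstanceId>$MPElement[Name="CompanyName.SupportLevel1.Group"]$</GroupInstanceId>
    <MembershipRules>
      <MembershipRule>
        <!-- members come from our extended class -->
        <MonitoringClass>$MPElement[Name="CompanyName.WindowsComputerExtended"]$</MonitoringClass>
        <RelationshipClass>$MPElement[Name="CompanyName.SupportLevel1.Group.Contains.WindowsComputerExtended"]$</RelationshipClass>
        <Expression>
          <SimpleExpression>
            <!-- this is the "SupportLevel Equals 1" criteria from the UI -->
            <ValueExpression>
              <Property>$MPElement[Name="CompanyName.WindowsComputerExtended"]/SupportLevel$</Property>
            </ValueExpression>
            <Operator>Equal</Operator>
            <ValueExpression>
              <Value>1</Value>
            </ValueExpression>
          </SimpleExpression>
        </Expression>
      </MembershipRule>
    </MembershipRules>
  </DataSource>
</Discovery>
```

Seeing this structure helps later if you need to hand-edit the criteria, or combine multiple expressions with And/Or groupings that the UI makes tedious.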
Now that is done. This is the point where I recommend we stop… take a breather… then seal the MP. If you seal the MP – we will be able to use the groups for overrides in any other override MP. If you choose not to seal the MP now – any overrides you use the groups for will be forced into this same MP. Please keep that in mind.
Since I am harping on sealing the MP…. I am going to do a quick example of just that. Jonathan Almquist has an excellent tutorial on sealing MP’s HERE and we will use his example.
**Note – when running the sn.exe commands to create our key… we only need to do this once… not every time we want to seal an MP. ***Critical note – you need to keep a backup of this key… because it will be required for making updates to this MP in the future, re-sealing, and keeping the ability to upgrade the existing MP in production.
So, I create the folders, create the key using sn.exe, copy over the referenced MP’s from the RMS, and now I am ready to seal.
MPSeal.exe c:\mpseal\input\CompanyName.SupportLevel.MP.xml /I "c:\mpseal\mp" /Keyfile "c:\mpseal\key\PairKey.snk" /Company "CompanyName" /Outdir "c:\mpseal\output"
Works great.
Now – I can delete my unsealed MP from the management group, and import my sealed MP.
Phew. All the heavy lifting is done. Now… I have my groups… I can start setting up overrides using these groups, or scoping notifications.
On my Support Level 1 group – I will use this to set up my pager Notification subscriptions to only page based on specific classes, and this group.
On my Support Level 2 group – I will use this to override important alerts to High Priority… because I am using High Priority as a filter for email notifications, per my previous blog post here: http://blogs.technet.com/kevinholman/archive/2008/06/26/using-opsmgr-notifications-in-the-real-world-part-1.aspx
On my Support Level 3 group – I will use this group for tweaking/disabling rules and monitors for the group… turning off discoveries so they don't discover lab servers, scoping views, etc.
Maybe in my next post…. I will build on this MP… and show a really simple way to add the Health Service Watcher objects to these dynamic groups… for each Windows Computer object that is in the group – so we can use these groups for Heartbeat failure notifications.
I didn’t come up with this idea…. I got it from Cameron Fuller who got it from Rory McCaw’s session at MMS last year. So credit goes to both of them for the idea and initially spreading it. As I was talking to many of my colleagues, I found out this is not a commonly known practice. So, maybe this is a good topic to write about to spread the information.
The default management pack is the first MP that shows up in the list when creating an override. However, it is a best practice to NEVER save anything to it. Admins will often forget this best practice, and accidentally save items here. Then a cleanup of this MP is often required, as documented here: Clean up the Default MP. This problem has even prompted some customers to monitor when changes are made to their default MP.
There is no supportable way to make this MP “read only” or sealed. However – we CAN rename this MP to provide a visual warning that might help you or your customer remember not to save things here. While we cannot rename the ID of an MP, we can rename the Display Name of any unsealed MP.
In the console, Administration pane, Management Packs, find your default MP. Bring up the properties – and you can rename it:
To something like this:
This will hopefully remind you or your customer not to save overrides to this MP. Below is what you will see when creating an override:
I recently wrote about My Experience Moving the Operations Database to New Hardware.
Something I noticed today – is that the application event log on the SQL server was full of 18054 events, such as below:
Log Name: Application
Source: MSSQL$I01
Date: 10/23/2010 5:40:14 PM
Event ID: 18054
Task Category: Server
Level: Error
Keywords: Classic
User: OPSMGR\msaa
Computer: SQLDB1.opsmgr.net
Description: Error 777980007, severity 16, state 1 was raised, but no message with that error number was found in sys.messages. If error is larger than 50000, make sure the user-defined message is added using sp_addmessage.
You might also notice some truncated events in the OpsMgr event log, on your RMS or management servers:
Event Type: Warning
Event Source: DataAccessLayer
Event Category: None
Event ID: 33333
Date: 10/23/2010
Time: 5:40:13 PM
User: N/A
Computer: OMMS3
Description: Data Access Layer rejected retry on SqlError:
Request: p_DiscoverySourceUpsert -- (DiscoverySourceId=f0c57af0-927a-335f-1f74-3a3f1f5ca7cd), (DiscoverySourceType=0), (DiscoverySourceObjectId=74fb2fa8-94e5-264d-5f7e-57839f40de0f), (IsSnapshot=True), (TimeGenerated=10/23/2010 10:37:36 PM), (BoundManagedEntityId=3304d59d-5af5-ba80-5ba7-d13a07ed21d4), (IsDiscoveryPackageStale=), (RETURN_VALUE=1)
Class: 16
Number: 18054
Message: Error 777980007, severity 16, state 1 was raised, but no message with that error number was found in sys.messages. If error is larger than 50000, make sure the user-defined message is added using sp_addmessage.
Event Type: Error
Event Source: Health Service Modules
Event Category: None
Event ID: 10801
Date: 10/23/2010
Time: 5:40:13 PM
User: N/A
Computer: OMMS3
Description: Discovery data couldn't be inserted to the database. This could have happened because of one of the following reasons:
- Discovery data is stale. The discovery data is generated by an MP recently deleted.
- Database connectivity problems or database running out of space.
- Discovery data received is not valid.
The following details should help to further diagnose:
DiscoveryId: 74fb2fa8-94e5-264d-5f7e-57839f40de0f
HealthServiceId: bf43c6a9-8f4b-5d6d-5689-4e29d56fed88
Error 777980007, severity 16, state 1 was raised, but no message with that error number was found in sys.messages. If error is larger than 50000, make sure the user-defined message is added using sp_addmessage.
After a little research, it appears this is caused by following the guide to move the Operations Database to new hardware.
Marnix blogged about this issue http://thoughtsonopsmgr.blogspot.com/2009/02/moving-scom-database-to-another-server.html which references Matt Goedtel’s article http://blogs.technet.com/b/mgoedtel/archive/2007/08/06/update-to-moving-operationsmanager-database-steps.aspx
Because this process simply restores the Operations Database ONLY, we do not carry over some of the modifications to the MASTER database that are made when the Database Installation runs during setup to create the original operations database.
For some OpsMgr events, which stem from database activity, we get the event data from SQL. If these messages do not exist in SQL – you see the above issue.
What is bad about this – is that it will keep some event rules from actually alerting us to the condition! For instance – the rule “Discovery Data Submission Failure” which will alert when there is a failure to insert discovery data – will not trigger now, because it is looking for specific information in parameter 3 of the event, which is part of the missing data:
To resolve this – we need to add the missing messages back into the MASTER database. If you are seeing both of the events shown above, then you are impacted. To resolve this – you should run the attached SQL script against the Master database of the SQL instance that hosts your OperationsManager Database. You should ONLY consider running this if you are 100% sure that you are impacted by this issue.
See attached: Fix_OpsMgrDB_ErrorMsgs.sql
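For context, sp_addmessage is the documented way to register a user-defined message in sys.messages, and a fix of this kind is essentially a series of such calls against MASTER. Below is an illustrative sketch only – the message text is a placeholder, and you should use the attached script, which contains the real message numbers and text:

```sql
-- Illustrative sketch: the attached Fix_OpsMgrDB_ErrorMsgs.sql contains
-- the real messages. This just shows the shape of what it does.
USE master;
EXEC sp_addmessage
    @msgnum   = 777980007,   -- message number seen in the 18054 events above
    @severity = 16,
    @msgtext  = N'<placeholder - real text comes from the attached script>',
    @replace  = 'replace';   -- overwrite the message if it already exists
```

Note the @replace = 'replace' parameter, which makes the call safe to re-run if a copy of the message already exists.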
The DNS Management pack has been updated. The current version as of this article is 6.0.7000.0
Get it from the download center:
http://www.microsoft.com/downloads/en/details.aspx?FamilyID=633B718F-5FE8-47D5-A395-8203F8EC354F
This is a GREAT update. Here are some key changes in this version:
That’s pretty self-explanatory. This version now fully supports the DNS service running on the Server 2008 R2 OS.
This is HUGE! The DNS MP was one of the primary causes of Config Churn which I wrote about here: http://blogs.technet.com/b/kevinholman/archive/2009/10/05/what-is-config-churn.aspx With this update – that churn is now resolved. The properties of PrimaryServer and SerialNumber no longer change on a frequent basis. This is a big improvement and the biggest reason to get this update in place ASAP.
Several changes were made here:
- Internal > Public
- Error > Warning
- Interval 900 > 913
- NSLookUp now uses the timeout parameter
These views were not present before – now you can spot-check the health of individual zones and forwarders quickly.
This enhancement allows scripts more time to complete on busy DNS servers, or on DNS servers with large numbers of components. The change wasn’t really from 30 > 300 on all workflows – rather, all workflows, regardless of their previous timeout, have been set to 300 seconds or more.
This allows you to be able to create and add recoveries and diagnostics on any monitor. When flagged as “internal” they cannot be referenced in an unsealed MP.
This allows you to create custom scoped views for any class in the MP – referencing them in another custom MP of your choice.
There are 4 new rules added which are disabled out of the box. These can be used to quickly replace the included manual reset monitors (with matching names) if your organization cannot use manual reset monitors due to lack of console use (e.g., an enterprise connector as the primary ticketing and notification system):
Microsoft.Windows.DNSServer.2008.EventCollection.RootHintsConfiguration.ConfigureRootHints
Microsoft.Windows.DNSServer.2008.EventCollection.RootHintsConfiguration.ConfigureRootHints.Warning
Microsoft.Windows.DNSServer.2008.EventCollection.RPCProtocolInitialization.RestartRPCService
Microsoft.Windows.DNSServer.2008.EventCollection.WINSNetbiosInitialization.ConfigureWINSRSettings
Some other changes were also made, in addition to what's in the guide. Most were setting a handful of monitors from Error to Warning (State and Alert) and changing the frequency of many workflows from 900 seconds to 913 seconds… this was likely done to keep multiple workflows from running at the same time and creating false alerts due to server load when multiple workflows trigger on the same frequency. Views were renamed to reflect the 2008 R2 support.
The business need:
It is a very common request to monitor a process on a given set of servers, and collect that data for reporting, or monitor it for a given threshold.
One thing you might notice when trying to monitor some performance counters, is that not all perf counters in perfmon behave the way you might assume.
For instance, I want to monitor “how much CPU a process is using”. Perhaps we wish to monitor our SQLServer.exe process on our SQL servers?
This is easy – because Perfmon already has a Performance Object, Counter, and Instance for that. In perfmon, we would use:
Process > % Processor Time > Sqlserver.exe
Easy enough!
So, we can quite easily create a performance threshold monitor, and a performance collection rule using this. Let’s say we set the monitor to alert anytime the SQLserver.exe process is consuming more than 80% of the CPU sustained for 5 minutes.
The issue:
However, quite quickly we might notice erratic behavior from our monitor and rule. The monitor is generating TONS of alerts from almost all our SQL servers, and then quickly closing them… essentially flip-flopping. When we check the performance data we have collected, we see the process is using up to 800% CPU!!! So – thinking something is wrong with OpsMgr – we inspect a busy SQL server in perfmon directly… but observe the exact same behavior:
As you can see – this process is using almost 400% CPU. Why? How is this possible?
This is because the Process monitoring counters in Windows are not multi-CPU aware. When a server has 4 CPU’s (like this one above does) a process can use more than one CPU at a time, provided it is spawning multiple threads. This way, it can be using up to 100% of each CPU or Core (logical processor). A process on a 4 processor server can consume up to 400% of that process counter. So if a process is really only consuming 20% of the total CPU, that will show up as 80% on a 4-core machine. Think about today’s hardware… many boxes have up to 16 cores these days, which would register as 320% processor utilization for something really only using 20% of the total CPU.
As you can see – this causes a BIG problem for monitoring processes. As an IT Pro – you need to know when a process is consuming more than (x) percent of the *total system resources*…. and every server will likely have a different number of processors.
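To make the math concrete, here is a small Python sketch (the function name is mine, not part of OpsMgr) of the normalization we want the monitoring workflow to perform:

```python
def normalize_process_cpu(raw_percent: float, logical_processors: int) -> float:
    """Convert the raw Process > % Processor Time value, which can reach
    100 * (number of logical processors), into a percent of TOTAL system CPU."""
    return raw_percent / logical_processors

# A process showing 320% in perfmon on a 16-core server is really using
# 20% of the total CPU; 400% on a 4-core box means the box is pegged:
print(normalize_process_cpu(320, 16))   # 20.0
print(normalize_process_cpu(400, 4))    # 100.0
```

This divide-by-processor-count is exactly what the <ScaleBy> function described below does inside the workflow.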
The solution:
In OpsMgr R2 – a new XML based function was created to help resolve this challenge. This is known as <ScaleBy>
The <ScaleBy> function essentially gives you the ability to take the monitoring data collected by something (that is an integer), and divide by something else (integer).
I can input a fixed value here, in integer form, or I can input a variable. For the variable, I can actually pull data from discovered properties of monitoring classes. This is GREAT in this instance, because we already discover the number of Processors a Windows Computer has. We can use this discovered data, along with this <ScaleBy> function, to fix our monitors and collection rules that need a little massaging to the data we get from perfmon.
Here are the Windows Computer class properties:
Let’s walk through an example using the authoring console.
We have to do this in XML, as no UI was added for this capability. Click “Edit” on the Configuration tab, which will pop up the XML of this configuration.
We are going to add a single line after <Frequency> which will be this line:
<ScaleBy>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/LogicalProcessors$</ScaleBy>
What this does – is tell the workflow to take the numeric value received from perfmon, and then divide by the numeric value that is a property of the Windows Computer class for number of logical processors. Then take THIS calculated output and use that for collection or threshold evaluation.
Here is my finished XML snippet:
<Configuration>
  <ComputerName>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
  <CounterName>% Processor Time</CounterName>
  <ObjectName>Process</ObjectName>
  <InstanceName>sqlservr</InstanceName>
  <AllInstances>false</AllInstances>
  <Frequency>60</Frequency>
  <ScaleBy>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/LogicalProcessors$</ScaleBy>
  <Threshold>80</Threshold>
  <Direction>greater</Direction>
  <NumSamples>5</NumSamples>
</Configuration>
Now – the authoring console was not updated to fully understand this new function, so you might see an error for this. Simply hit ignore.
Your new monitor configuration now looks like this:
You can do the exact same operation on a performance collection rule as well to “normalize” this counter into something that makes more sense for reporting.
Some other uses of this might be for situations where a counter is in bytes… and you want it reported in megabytes. You could hard-code a <ScaleBy> of 1000000 (one million). That way – if you wanted to report on how many megabytes a process was consuming over time – instead of representing this as 349,000,000 on a chart (bytes) you can represent this as a simple 349 megabytes. That XML would simply be:
<ScaleBy>1000000</ScaleBy>
Ok… I hope this made some sense…. this is a valuable method to normalize some perfmon data that might not be in what I call “human format”. Keep in mind – you can ONLY use this XML functionality on an R2 management group, and it will only be understood by an R2 agent.
You can quickly go back to your previously written process monitors, and add this single line of XML really easily, using your XML editor of choice.
One last thing I want to point out….. some of the previously delivered MP’s that Microsoft shipped might be impacted by this issue. For instance – in the current ADMP version 6.0.7065.0 there is a monitor “AD_CPU_Overload.Monitor” (AD Processor Overload (lsass) Monitor) which does not take into account the number of logical processors. This is often one of the MOST noisy monitors in my customer environments, especially on a busy domain controller. This is simply because MOST DC’s have more than one CPU – and this skews the ability for this monitor to work. The issue is – they could not add this <ScaleBy> functionality to this MP – because that would make the ADMP R2-only… which we don't want to do.
You have two workarounds for SP1 management groups: monitor processes using a script that queries WMI for the number of CPUs and handles the math (ugly), OR create groups of all Windows Computers based on their number of logical processors (easy) and then override these monitor thresholds with values relevant to each group’s processor count.
For R2 customers – I recommend disabling this monitor in the ADMP – and replacing it with a custom one that utilizes the <ScaleBy> functionality.
Microsoft started including Unix and Linux monitoring in OpsMgr directly in OpsMgr 2007 R2, which shipped in 2009. Some significant updates have been made to this for OpsMgr 2012. Primarily these updates are around:
This article will cover the discovery, agent deployment, and monitoring configuration of a Linux server in OpsMgr 2012. I am going to run through this as a typical user would – and show some of the pitfalls if you don’t follow the exact order of configuration required.
So what would anyone do first? They’d naturally run a discovery, just like they do for Windows agents. However – this will likely end up in frustration. There are several steps that you need to configure FIRST, before deploying Unix/Linux agents.
High Level Overview:
The high level process is as follows:
Import Management Packs:
The core Unix/Linux libraries are already imported when you install OpsMgr 2012, but not the detailed MP’s for each OS version. These are on the installation media, in the \ManagementPacks directory. Import the specific ones for the Unix or Linux Operating systems that you plan to monitor.
Create a resource pool for monitoring Unix/Linux servers
The FIRST step is to create a Unix/Linux Monitoring Resource pool. This pool will be associated with management servers dedicated to monitoring Unix/Linux systems in larger environments, or may include existing management servers that also manage Windows agents or Gateways in smaller environments. Regardless, it is a best practice to create a new resource pool for this purpose; it will ease administration and future scalability expansion.
Under Administration, find Resource Pools in the console:
OpsMgr ships 3 resource pools by default:
Let’s create a new one by selecting “Create Resource Pool” from the task pane on the right, and call it “Unix Linux Monitoring Resource Pool”
Click Add and then click Search to display all management servers. Select the Management servers that you want to perform Unix and Linux Monitoring. If you only have 1 MS, this will be easy. For high availability – you need at least two management servers in the pool.
Add your management servers and create the pool. In the actions pane – select “View Resource Pool Members” to verify membership.
Configure the Xplat certificates (export/import) for each management server in the pool
This process is documented here: http://technet.microsoft.com/en-us/library/hh287152.aspx
Operations Manager uses certificates to authenticate access to the computers it is managing. When the Discovery Wizard deploys an agent, it retrieves the certificate from the agent, signs the certificate, deploys the certificate back to the agent, and then restarts the agent.
To configure high availability, each management server in the resource pool must have all the root certificates that are used to sign the certificates that are deployed to the agents on the UNIX and Linux computers. Otherwise, if a management server becomes unavailable, the other management servers would not be able to trust the certificates that were signed by the server that failed.
We provide a tool to handle the certificates, named scxcertconfig.exe. Essentially, you must log on to EACH management server that will be part of a Unix/Linux monitoring resource pool, and export its SCX (cross-platform) certificate to a file share. Then import each other’s certificates so they are trusted.
If you only have a SINGLE management server, or a single management server in your pool, you can skip this step, then perform it later if you ever add Management Servers to the Unix/Linux Monitoring resource pool.
In this example – I have two management servers in my Unix/Linux resource pool, MS1 and MS2. Open a command prompt on each MS, and export the cert:
On MS1:
C:\Program Files\System Center 2012\Operations Manager\Server>scxcertconfig.exe -export \\servername\sharename\MS1.cer
On MS2:
C:\Program Files\System Center 2012\Operations Manager\Server>scxcertconfig.exe -export \\servername\sharename\MS2.cer
Once all certs are exported, you must IMPORT the other management server’s certificate:
On MS1:
C:\Program Files\System Center 2012\Operations Manager\Server>scxcertconfig.exe -import \\servername\sharename\MS2.cer
On MS2:
C:\Program Files\System Center 2012\Operations Manager\Server>scxcertconfig.exe -import \\servername\sharename\MS1.cer
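With more than two management servers in the pool, this becomes an “every server imports every other server” matrix. Here is a quick Python sketch (the server names and share path are made up) that prints the full list of commands to run:

```python
# Hypothetical helper: print every scxcertconfig.exe command needed so that
# each management server in the pool trusts every other server's SCX cert.
def cert_exchange_commands(servers, share):
    cmds = []
    for s in servers:                    # every server exports its own cert
        cmds.append(f"On {s}: scxcertconfig.exe -export {share}\\{s}.cer")
    for s in servers:                    # ...and imports everyone else's
        for other in servers:
            if other != s:
                cmds.append(f"On {s}: scxcertconfig.exe -import {share}\\{other}.cer")
    return cmds

servers = ["MS1", "MS2", "MS3"]          # your management servers
share = r"\\servername\sharename"        # a share all servers can reach
for cmd in cert_exchange_commands(servers, share):
    print(cmd)
```

For N servers this is N exports and N*(N-1) imports, which is why scripting the list is worth it for larger pools.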
If you fail to perform the above steps – you will get errors when running the Linux agent deployment wizard later.
Create and Configure Run As accounts for Unix/Linux
Next up we need to create our run-as accounts for Linux monitoring. This is documented here: http://technet.microsoft.com/en-us/library/hh212926.aspx
We need to select “UNIX/Linux Accounts” under administration, then “Create Run As Account” from the task pane. This kicks off a special wizard for creating these accounts.
Let’s create the Monitoring account first. Give the monitoring account a display name, and click Next.
On the next screen, type in the credentials that you want to use for monitoring the Linux system(s).
On the above screen – you have two choices. You can provide a privileged account for handling monitoring, or you can use an existing account on the Linux system(s) that is not privileged. Then – you can specify whether or not you want this account to be able to leverage sudo elevation. Since I am providing a privileged account in this case – I will tell it to not use elevation.
On the next screen, always choose more secure:
Now – since we chose More Secure – we must choose the distribution of the Run As account. Find your “Linux Monitoring Account” under the UNIX/Linux Accounts screen, and open the properties. On the Distribution Security screen, click Add, then select "Search by resource pool name” and click search. Find your Unix/Linux monitoring resource pool, highlight it, and click Add, then OK. This will distribute this account credential to all Management servers in our pool:
We would repeat the above process as many times as necessary for the number of different accounts we need. If all our Linux systems use the same credentials, then we need, at a minimum, ONE monitoring account that is privileged, and it can be associated with all three Run As profiles (covered in the next section).
However, what would be more typical, if all our systems had the same credentials and passwords, is to use THREE Run As accounts:
For the purposes of this demo, I am just going to create a SINGLE priv Run As account (root) that I will use for all three scenarios.
Next up – we must configure the Run As profiles. This is covered here: http://technet.microsoft.com/en-us/library/hh212926.aspx
There are three profiles for Unix/Linux accounts:
The Agent Maintenance Account is strictly for agent updates, uninstalls – anything that requires SSH. This will always be associated with a privileged account that has access via SSH, created using the Run As account wizard above but selecting “Agent Maintenance Account” as the account type. We won’t go into details on that here.
The other two Profiles are used for Monitoring workflows. These are:
Unix/Linux Privileged account
Unix/Linux Action Account
The Privileged Account profile will always be associated with a Run As account like we created above that is privileged (root or similar), OR an unprivileged account that has been configured with elevation via sudo. This is what any workflows that typically require elevated rights will execute as.
The Action account is what all your basic monitoring workflows will run as. This will generally be associated with a Run As account, like we created above, but would be used with a non-privileged user account on the Linux systems.
***A note on sudo elevated accounts:
For my example – I am keeping it very simple. I created a single Run As account, of the Monitoring type, which is the privileged root account and password credential. I will associate this Run As account to BOTH the Privileged and Action account. This will make all my workflows (both normal monitoring and elevated monitoring) run under this credential. This is not recommended as the “lowest priv” design, but being leveraged in this example just to keep things simple. Once we validate it is working, we can go back and change this configuration and experiment using low priv and sudo enabled elevation accounts, and associate them independently.
For more information on configuring sudo elevation for OpsMgr monitoring accounts, including some sample configurations for your sudoers files for each OS version: http://social.technet.microsoft.com/wiki/contents/articles/7375.configuring-sudo-elevation-for-unix-and-linux-monitoring-with-system-center-2012-operations-manager.aspx
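To give a flavor of what those sample configurations contain, a sudoers entry for a low-privilege monitoring account looks roughly like the sketch below. The account name is illustrative, and the exact command paths vary by OS and agent version – take the real entries from the linked wiki article:

```
# Illustrative only: use the entries from the wiki article for your OS.
# Allows the hypothetical account 'scomuser' to run an SCX agent command
# elevated, without a password prompt:
scomuser ALL=(root) NOPASSWD: /opt/microsoft/scx/bin/scxlogfilereader -p
# sudo must not require a TTY for this account, since workflows run non-interactively:
Defaults:scomuser !requiretty
```

The !requiretty line matters because OpsMgr workflows execute over a non-interactive session, and many distros ship with requiretty enabled by default.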
I will start with the Unix/Linux Action Account profile. Right click it – choose properties, and on the Run As Accounts screen, click Add, then select our “Linux Monitoring Account”. Leave the default of “All Targeted Objects” and click OK, then save.
Repeat this same process for the Unix/Linux Privileged Account profile.
Repeat this same process for the Unix/Linux Agent Maintenance Account profile.
Discover and deploy the agents
Run the discovery wizard.
Click “Add”:
Here you will type in the FQDN of the Linux/Unix agent, its SSH port, and then choose All Computers as the discovery type. (We have another option for discovery type – if you were manually installing the Unix/Linux agent (which is really just a simple provider) and then using a signed certificate to authenticate.)
Now – hit “Set Credentials”. If we do not want to provide a root account here, and wanted to use SSH key authentication, we support that on this screen now. For this example – I will simply type in my root account in order to use SSH to discover and deploy the Linux agent.
Notice above that you can tell the wizard if the account is privileged or not. Here is an explanation:
If you have to discover only UNIX and Linux computers that already have an agent installed, rather than installing an agent, you can use an unprivileged user account on the UNIX or Linux computer. If you have to install an agent, you must use a privileged account. If you do not have a privileged account, you can elevate an unprivileged account to a privileged account provided that the su or sudo elevation program has been configured on the UNIX or Linux computer for the user account.
So – if we had pre-installed the agent already – we could simply use an unprivileged account to authenticate and discover the system, bringing it into OpsMgr.
Or – we could provide an unprivileged account that was allowed elevation via a pre-existing sudo configuration on the Linux server.
Click save. On the next screen – select a resource pool. We will choose the resource pool that we already created.
Click Discover, and the results will be displayed:
Check the box next to your discovered system – and deploy the agent.
This will take some time to complete, as the agent is checked for the correct FQDN and SSL certificate, the management servers are inspected to ensure they all have trusted SCX certificates (that we exported/imported above) and the connection is made over SSH, the package is copied down, installed, and the final certificate signing occurs. If all of these checks pass, we get a success!
There are several things that can fail at this point. See the troubleshooting section at the end of this article.
Monitoring Linux servers:
Assuming we got all the way to this point with a successful discovery and agent installation, we need to verify that monitoring is working. After an agent is deployed, the Run As accounts will start being used to run discoveries, and start monitoring. Once enough time has passed for these, check in the Administration pane, under Unix/Linux Computers, and verify that the systems are not listed as “Unknown” but discovered as a specific version of the OS:
Next – go to the Monitoring pane – and select the “Unix/Linux Computers” view at the top. Verify that your systems are present and have a green healthy check mark next to them:
Next – expand the Unix/Linux Computers folder in the left tree (near the bottom) and make sure we have discovered the individual objects, like Linux Server State, Linux Disk State, and Network Adapter state:
Run Health explorer on one of the discovered disks. Remove the filter at the top to see all the monitors for the disk:
Close health explorer.
Select the Operating System Performance view. Review the performance counters we collect out of the box for each monitored OS.
Out of the box – we discover and apply a default monitoring template to the following objects:
Optionally, you can enable discoveries for:
I don’t recommend enabling additional discoveries unless you are sure that your monitoring requirements cannot be met without discovering these additional objects, as they will reduce the scalability of your environment.
Out of the box – for an OS like RedHat Enterprise Linux 5 – here is a list of the monitors in place, and the object they target:
There are also 50 rules enabled out of the box: 46 are performance collection rules for reporting, and 4 are event based, dealing with security. Two are informational, letting you know whenever a direct login is made using root credentials via SSH, or when su elevation occurs in a user session. The other two deal with failed attempts at SSH or su.
To get more out of your monitoring – you might have other services, processes, or log files that you need to monitor. For that, we provide Authoring Templates with wizards to help you add additional monitoring, in the Authoring pane of the console under Management Pack templates:
In the reporting pane – we also offer a large number of reports you can leverage, or you can always create your own using our generic report templates, or custom ones designed in Visual Studio for SQL reporting services.
As you can see, it is a fairly well rounded solution, bringing Unix and Linux monitoring into the same single pane of glass as your other systems – from the hardware, to the operating system, to the network layer, to the applications.
Partners and 3rd party vendors also supply additional management packs which extend our Unix and Linux monitoring, to discover and provide detailed monitoring on non-Microsoft applications that run on these Unix and Linux systems.
Troubleshooting:
The majority of troubleshooting comes in the form of failed discovery/agent deployments.
Microsoft has written a wiki on this topic, which covers the majority of these issues and how to resolve them:
http://social.technet.microsoft.com/wiki/contents/articles/4966.aspx
Agent verification failed. Error detail: The server certificate on the destination computer (rh5501.opsmgr.net:1270) has the following errors: The SSL certificate could not be checked for revocation. The server used to check for revocation might be unreachable.
The SSL certificate is signed by an unknown certificate authority. It is possible that: 1. The destination certificate is signed by another certificate authority not trusted by the management server. 2. The destination has an invalid certificate, e.g., its common name (CN) does not match the fully qualified domain name (FQDN) used for the connection. The FQDN used for the connection is: rh5501.opsmgr.net. 3. The servers in the resource pool have not been configured to trust certificates signed by other servers in the pool.
The server certificate on the destination computer (rh5501.opsmgr.net:1270) has the following errors: The SSL certificate could not be checked for revocation. The server used to check for revocation might be unreachable. The SSL certificate is signed by an unknown certificate authority. It is possible that: 1. The destination certificate is signed by another certificate authority not trusted by the management server. 2. The destination has an invalid certificate, e.g., its common name (CN) does not match the fully qualified domain name (FQDN) used for the connection. The FQDN used for the connection is: rh5501.opsmgr.net. 3. The servers in the resource pool have not been configured to trust certificates signed by other servers in the pool.
The solution to these common issues is covered in the Wiki with links to the product documentation.
Or you might see alerts in the console:
Alert: UNIX/Linux Run As profile association error event detected
The account for the UNIX/Linux Action Run As profile associated with the workflow "Microsoft.Unix.AgentVersion.Discovery", running for instance "rh5501.opsmgr.net" with ID {9ADCED3D-B44B-3A82-769D-B0653BFE54F9} is not defined. The workflow has been unloaded. Please associate an account with the profile.
This condition may have occurred because no UNIX/Linux Accounts have been configured for the Run As profile. The UNIX/Linux Run As profile used by this workflow must be configured to associate a Run As account with the target.
Either you failed to configure the Run As accounts, failed to distribute them, or chose a low-privilege account that is not properly configured for sudo on the Linux system. Go back and double-check your work there.
If you want to check whether the agent was deployed to a RedHat system, you can run the following command in a shell session:
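The command itself did not survive in this copy of the post, so here is a hedged stand-in: on an RPM-based distro like RedHat, the cross-platform agent package is conventionally named "scx", and you can query the package manager for it. A minimal Python sketch (both querying via rpm and the "scx" package name are assumptions, not taken from the original post):

```python
import subprocess

def scx_agent_installed() -> bool:
    """Check whether the OpsMgr cross-platform (SCX) agent RPM is present.

    Assumes an RPM-based distro (e.g. RedHat) and the conventional package
    name "scx" -- both assumptions, since the original command did not
    survive in this copy of the post.
    """
    try:
        result = subprocess.run(["rpm", "-q", "scx"],
                                capture_output=True, text=True)
    except FileNotFoundError:
        return False  # rpm itself absent: not an RPM-based system
    return result.returncode == 0
```

On the Linux host itself, the equivalent is simply querying rpm for the scx package from a shell session.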
My lab consists of 2 Dell Precision T7500 workstations, each configured with 96GB of RAM. These are each nodes in a Hyper-V 2012 cluster. They mount cluster shared volumes via iSCSI, some are SSD, and some are SAS RAID based disks, from a 3rd Dell Precision Workstation.
One of the things I have experienced, is that when I want to patch the hosts, I pause the node, and drain the roles. This kicks off a live migration of all the VM’s on Node1 to Node2. This can take a substantial amount of time, as these VM’s are consuming around 80GB of memory.
When performing a full live migration of these 18 VM’s across a single 1GB Ethernet connection, the Ethernet link was 100% saturated, and it took exactly 13 minutes and 15 seconds.
I recently got a couple of 10 gigabit Ethernet cards for my lab environment. I scored an awesome deal on eBay: 10 cards for $250, or $25 for each Dell/Broadcom 10GbE card! The problem I have now is that the CHEAPEST 10GbE switch on the market is $850. No way am I paying that for my lab. The good news is that these cards, just like 1GbE cards, support direct-connect auto MDI/MDIX detection, so you can form an old-school “crossover” connection using just a standard patch cable. I did order a CAT6A cable just to be safe.
Once I installed and configured the new 10GbE cards, I set them up in the cluster as a Live Migration network:
The same live migration over 10GbE took 65 SECONDS!
In summary -
1GbE live migration, 18 VM’s: 13m15s.
10GbE live migration, 18 VM’s: 65 seconds.
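These numbers line up with back-of-the-envelope math: ~80 GB of VM memory is ~640 gigabits, so a saturated link puts a hard floor on the migration time. A quick Python check (ignoring protocol overhead, which explains the extra time observed at 1GbE):

```python
def migration_floor_seconds(memory_gb: float, link_gbps: float) -> float:
    """Theoretical minimum time to move memory_gb gigabytes of VM memory
    over a link of link_gbps gigabits/sec (no overhead, no compression)."""
    return memory_gb * 8 / link_gbps

print(migration_floor_seconds(80, 1))   # 640.0 seconds (~10.7 min; observed: 13m15s)
print(migration_floor_seconds(80, 10))  # 64.0 seconds (observed: 65 s)
```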
In my case, I can drastically decrease the live migration latency, with minimal cost, by using a direct connection between two hosts in a cluster with 10 gigabit Ethernet. Aidan Finn, MVP – has a post with similar results: http://www.aidanfinn.com/?p=12228
Next up, I wanted to create a “converged” network by carving up my 10GbE NIC into multiple virtual NIC’s – connecting it to the Hyper-V virtual switch and then creating virtual adapters. Aidan has a good write-up on the concept here: http://www.aidanfinn.com/?p=12588
Here is a graphic that shows the concept from his blog:
The supported network configuration guide for Hyper-V clusters is located here:
http://technet.microsoft.com/en-us/library/ff428137(v=WS.10).aspx
Typically in the past, you would see 4 NIC’s: one for management, cluster, live migration, and virtual machines. The common alternative is to use a single 10GbE NIC (or two in a highly available team), create virtual network adapters on a Hyper-V switch, and use QoS weighting to carve up the bandwidth. In my case, I have a dedicated NIC for management (the parent partition/OS) and a dedicated NIC for Hyper-V virtual machines. On my 10GbE NIC, I want to connect it to a Hyper-V virtual switch, and then create virtual network adapters – one for Live Migration and one for Cluster/CSV communication.
We will be using the QoS guidelines posted at: http://technet.microsoft.com/en-us/library/jj735302.aspx
John Savill has also done a nice quick walkthrough of a similar configuration: http://savilltech.com/blog/2013/06/13/new-video-on-networking-for-windows-server-2012-hyper-v-clusters/
When I start – my current network configuration looks like this:
We will be attaching the 10GbE network adapter to a new Hyper-V switch, and then creating two virtual network adapters, then applying QoS to each in order to ensure that both channels have their sufficient required bandwidth in the case of contention on the network.
Open PowerShell.
To get a list of the names of each NIC:
Get-NetAdapter
To create the new switch, with bandwidth weighting mode:
New-VMSwitch "ConvergedSwitch" -NetAdapterName "10GBE NIC" -MinimumBandwidthMode Weight -AllowManagementOS $false
To see our new virtual switch:
Get-VMSwitch
You will also see this in Hyper-V manager:
Next up, Create a virtual NIC in the management operating system for Live Migration, and connect it to the new virtual switch:
Add-VMNetworkAdapter -ManagementOS -Name "LM" -SwitchName "ConvergedSwitch"
Create a virtual NIC in the management operating system for Cluster/CSV communications, and connect it to the new virtual switch:
Add-VMNetworkAdapter -ManagementOS -Name "Cluster" -SwitchName "ConvergedSwitch"
View the new virtual network adapters in powershell:
Get-VMNetworkAdapter -All
View them in the OS:
Assign a minimum bandwidth weighting to give QoS for both virtual NIC’s, but apply heavier weighting to Live Migrations in the case of contention on the network:
Set-VMNetworkAdapter -ManagementOS -Name "LM" -MinimumBandwidthWeight 90
Set-VMNetworkAdapter -ManagementOS -Name "Cluster" -MinimumBandwidthWeight 10
Set the weightings so that the total of all VMNetworkAdapters on the switch equals 100. The configuration above will (roughly) reserve ~90% for the LM network and ~10% for the Cluster network.
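Under contention, each vNIC's guaranteed floor works out to its weight divided by the total weight on the switch, times the physical link speed. A quick sanity check of the 90/10 split above (Python used just for the arithmetic):

```python
def bandwidth_floor_gbps(weight: float, total_weight: float, link_gbps: float) -> float:
    """Guaranteed minimum bandwidth for a vNIC under contention when the
    switch was created with -MinimumBandwidthMode Weight: the vNIC's
    share of the total weight, applied to the physical link speed."""
    return weight / total_weight * link_gbps

print(bandwidth_floor_gbps(90, 100, 10))  # LM vNIC: 9.0 Gbps floor
print(bandwidth_floor_gbps(10, 100, 10))  # Cluster vNIC: 1.0 Gbps floor
```

Either vNIC can still burst above its floor when the link is idle; the weights only matter when both are pushing traffic at once.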
To view the bandwidth settings of each virtual NIC:
Get-VMNetworkAdapter -All | fl
At this point, I need to assign IP address information to each virtual NIC, and then repeat this configuration on all nodes in my cluster.
After this step is completed, and you confirm the nodes can ping each other's new interfaces, you can configure the networks in Failover Cluster Manager. Rename each network appropriately, and configure the Live Migration and Cluster communication settings:
In the above picture – I don’t allow cluster communication on the live migration network – but this is optional, and you certainly can allow it as a fallback if the primary cluster network fails.
Test Live Migration and ensure performance and communications are working properly.
In Summary – here is all the PowerShell used:
Get-NetAdapter
New-VMSwitch "ConvergedSwitch" -NetAdapterName "10GBE NIC" -MinimumBandwidthMode Weight -AllowManagementOS $false
Get-VMSwitch
Add-VMNetworkAdapter -ManagementOS -Name "LM" -SwitchName "ConvergedSwitch"
Add-VMNetworkAdapter -ManagementOS -Name "Cluster" -SwitchName "ConvergedSwitch"
Get-VMNetworkAdapter -All | fl
Set-VMNetworkAdapter -ManagementOS -Name "LM" -MinimumBandwidthWeight 90
Set-VMNetworkAdapter -ManagementOS -Name "Cluster" -MinimumBandwidthWeight 10
This configuration worked. HOWEVER, it did expose a limitation. I noticed that using vNICs I was only able to sustain about 3Gb/s on the same live migrations, where I was achieving 10Gb/s before. This is because RSS is not exposed to virtual NIC’s on the host/management partition, which own the live migration networks. When using these virtual NIC’s to transfer a data stream from host to host, you will see a single CPU core pegged as it manages the traffic.
Here is the maximum traffic that could be sent using this configuration on my server:
Below you will see the single core that was pegged during the live migration:
If you are sharing a converged network design, this still might be acceptable: some of the bandwidth will be needed for all the VM’s on the host, some for management and client access traffic, and some for CSV and cluster communications. However, if you want a design with high-speed live migrations, you should plan on physical NIC’s for Live Migration and for CSV (in the case of redirected IO). These can use teaming for redundancy, but it is better to use SMB Multichannel in Server 2012 R2, as live migration there can leverage advanced SMB features like Multichannel and RDMA (SMB Direct).
I’d like to hear some community feedback on this….
In OpsMgr – deploying a SCOM agent to a DC often presents companies with a bit of a challenge. The reason is – in order to install software to a DC and manage it – we need rights on the DC to accomplish this. These rights are needed, anytime we are going to deploy an agent, hotfix an agent, or run a repair on a broken agent to keep the agent healthy.
When we push agents from the console, the default account used to perform the push is the Management Server Action Account. If this account does not have Domain Admin rights – the push will fail to a DC, with an Access Denied. We do allow the option to type in temporary (encrypted) credentials, which are used to deploy the agent, one time, and then are discarded. See the image below:
Here is a list of the most common options I have observed in place at customer sites… and potential custom options that could be developed. I’d be interested in any community feedback on options you are using that I don’t cover or haven’t seen before.
1. Grant the Management Server Action account Domain Admin or Builtin\Administrators.
a. Not recommended as a best practice: this gives the MSAA rights that are not required for day-to-day activities.
b. Con - SCOM Admins now control a domain admin account.
2. Grant a SCOM Administrator a special domain account, for this purpose, that is a domain admin.
a. This allows us to track the actions of that SCOM admin, when he/she uses that special privileged account.
b. That SCOM admin will be able to do repairs, hotfixes, and deployments for DC’s.
c. Con – Domain Admin teams often won’t delegate these rights, as they are tightly controlled.
3. The SCOM admin team delegates console based agent management to a Domain Administrator for DC agent health.
a. The domain admin must become a SCOM Admin, and therefore could potentially hurt the SCOM environment.
b. Pro – the admins in charge of the DC’s now have full responsibility to keep the agents healthy.
c. Con – the Domain Admins might not understand components of SCOM, and create something that impacts the monitoring environment.
4. The SCOM admin team must partner with the Domain Admin team, and have the Domain Administrator type in his credentials any time the SCOM administrator needs to deploy/hotfix/repair an agent on a domain controller.
a. This is a bit more labor intensive… because the SCOM admin must wait for a domain admin to be available to work on DC agents, but tight security boundaries are maintained.
5. All DC based agents will be manually installed/updated/repaired.
a. This is very common, when the two teams do not trust each other. The Domain Admin team is now required to manually deploy agents to domain controllers, and keep them up to date, and healthy.
6. Use a software deployment tool already in place to deploy/update/repair agents.
a. If a software deployment tool is already in place on DC’s, like SMS/SCCM, you can create packages to deploy, hotfix, and repair agents, similar to your patching of the OS today.
7. Customized solution: Create a Run-As account that is a domain admin, one time, for use in agent deployment/repair.
a. This involves the domain admin typing in credentials ONCE, into a RUN-AS account, which is stored securely and encrypted in the SCOM database.
b. This run-as account can be associated with a run-as profile, which is used by a custom task, which will remotely deploy the agent to the domain controller. This task will execute under the security context of the privileged run-as account.
c. The benefit is that the domain admin gets to control the password for this account, the SCOM admin does not need to know the account credentials.
d. The downside is that this run-as account could potentially be leveraged by some other workflow if a SCOM admin intentionally misused it…. similar to solution #2 above.
e. This is just an idea I had – curious if anyone has already developed a solution like this?
Here is an interesting little concept of how to use OpsMgr.
Because I have a lab, that is exposed to the internet over port 3389, I get a LOT of hacking attempts on this lab. Mostly the source is from bots running on other compromised systems. These bots just do brute force attacks against the typical Admin accounts and passwords via RDP. In this article, I am going to show how OpsMgr can not only alert on this condition, but also respond by configuring the Windows Firewall to block these attacks.
I will start by analyzing the Server 2008 event that occurs when someone tries to attack using my “Administrator” account:
Log Name: Security Source: Microsoft-Windows-Security-Auditing Date: 7/14/2009 12:44:05 PM Event ID: 4625 Task Category: Account Lockout Level: Information Keywords: Audit Failure User: N/A Computer: terminalserver.domain.com
Description: An account failed to log on.
Subject: Security ID: SYSTEM Account Name: TERMINALSERVER$ Account Domain: DOMAIN Logon ID: 0x3e7
Logon Type: 10
Account For Which Logon Failed: Security ID: NULL SID Account Name: administrator Account Domain: TERMINALSERVER
Failure Information: Failure Reason: Account locked out. Status: 0xc0000234 Sub Status: 0x0
Process Information: Caller Process ID: 0x14f0 Caller Process Name: C:\Windows\System32\winlogon.exe
Network Information: Workstation Name: TERMINALSERVER Source Network Address: 10.10.10.1 Source Port: 1261
Detailed Authentication Information: Logon Process: User32 Authentication Package: Negotiate Transited Services: - Package Name (NTLM only): - Key Length: 0
So… for starters, I want to alert on this condition… when ANYONE is trying, multiple times, to RDP into the server with a disabled account, a non-existent account, or a valid account with a bad password. Therefore – I will create a monitor: Windows Events > Repeated Event Detection > Timer Reset.
The idea here is to only respond when multiple bad passwords are entered in a short time period…. representing an attack. (I don't want to lock out or block access from my normal users who sometimes mis-type their password on a couple attempts.)
So I create the monitor, target “Windows Server Operating System”, set it to “Security” for the Parent Monitor, and UNCHECK the box enabling it. (I will later override this monitor and ONLY enable it for my entry terminal server.)
I create my event expression for the security event log, event 4625, and I only want the Logon Type of 10, which is from RDP:
Next – I will set up my monitor, to Trigger on Count (of events), Sliding. Compare count will be set to 5 (events) within a 3 minute interval. Therefore, as soon as 5 events are captured, in ANY sliding 3 minute “window”, the monitor will change state.
Next… since my goal is really to execute a script/command response (a lasting state change isn’t really what I’m after), I will set the timer reset to reset the state back to healthy after 2 minutes. This will free the workflow up to block any other source IP’s that might attack soon after.
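To make the trigger logic concrete, here is a small Python sketch of the same sliding-window detection: the monitor trips as soon as 5 matching events land inside any 3-minute window. (The 2-minute timer reset isn't modeled; this only illustrates the "Trigger on Count, Sliding" behavior.)

```python
from collections import deque

class RepeatedEventDetector:
    """Sliding-window repeated-event detection, mirroring the monitor's
    configuration: trip when `count` events fall inside any window of
    `window_seconds` seconds."""

    def __init__(self, count: int = 5, window_seconds: int = 180):
        self.count = count
        self.window = window_seconds
        self.times = deque()

    def record(self, timestamp: float) -> bool:
        """Record one failed-logon event; return True if the monitor trips."""
        self.times.append(timestamp)
        # Discard events that have slid out of the window.
        while self.times and timestamp - self.times[0] > self.window:
            self.times.popleft()
        return len(self.times) >= self.count
```

Normal users who mis-type a password a couple of times never accumulate 5 events in one window, so they never trip the monitor; a brute-force bot does almost immediately.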
I don't want to impact availability data, which assumes critical state = unavailable…. so I will use a Warning State:
Now – I will enable a unique alert for this condition. I want a critical, high priority alert in this case, and I will set this NOT to close the alert when we auto-resolve the state on the timer. I also will customize the alert description to give me a richer alert, based on the event details and my custom response. I talk more about these event parameters HERE. I will be adding:
$Data/Context/Context/DataItem/Params/Param[6]$ typed a bad password accessing directly from computer: $Data/Context/Context/DataItem/Params/Param[14]$ from IP: $Data/Context/Context/DataItem/Params/Param[20]$ The Windows Firewall will be modified to block this IP address in response to this monitor state.
Next – I will go back and find my monitor, and add a Recovery for the Warning State:
I will choose to Run Command. Give it a name “Modify Windows Firewall”
Next – for the command – I am going to run Netsh.exe which can configure the Windows Firewall running on the terminal server. Here is the command:
C:\Windows\System32\netsh.exe
advfirewall firewall set rule name="Block RDP" new remoteip=$Data/StateChange/DataItem/Context/DataItem/Context/DataItem/Params/Param[20]$
$Data/StateChange/DataItem/Context/DataItem/Context/DataItem/Params/Param[20]$ is based on an event parameter of the Server 2008 event, which I will pass to the command: it gathers the IP address of the attacker and feeds it to the command that configures the firewall rule. Getting this variable was the most complicated part for me….. Marius talked about how to derive this variable HERE. Just understand that the variables you use in an alert description are not the same as those used in a diagnostic or recovery.
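To make the parameter plumbing concrete, here is a purely illustrative Python sketch of how the attacker's IP (Param[20] of the event) ends up in the netsh arguments. The real recovery runs netsh.exe directly with OpsMgr doing the substitution; the helper below is hypothetical:

```python
def build_netsh_args(attacker_ip: str) -> list[str]:
    """Assemble the recovery command: modify the pre-created 'Block RDP'
    firewall rule so its remote-IP scope points at the attacker.

    attacker_ip stands in for the
    $Data/StateChange/.../Params/Param[20]$ variable that OpsMgr
    substitutes at runtime."""
    return [
        r"C:\Windows\System32\netsh.exe",
        "advfirewall", "firewall", "set", "rule",
        "name=Block RDP", "new", f"remoteip={attacker_ip}",
    ]
```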
Cool:
My Netsh.exe command modifies an existing custom rule in the Windows Firewall, so I need to make sure I create that and name it “Block RDP”.
Now – I will override this monitor and enable it for my published terminal server, and then test it… by attempting to log into my terminal server via RDP 5 times in a short period, using a disabled account. This will write the event to the security event log for each attempt, and eventually trip the repeated event detection monitor.
Alert generates:
Monitor changes state:
Recovery runs:
Windows Firewall rule gets modified:
Attack is stopped.
Pretty cool, eh?
Sometimes agents will not “talk” to the management server upon initial installation, and sometimes an agent can become unhealthy long after working fine. Agent health is an ongoing task in any OpsMgr admin’s life.
This post is NOT an “end to end” manual of all the factors that influence agent health…. but that is something I am working on for a later time. There are so many factors in an agent’s ability to communicate and work as expected. A few key areas that commonly affect this are:
How do you detect agent issues from the console? The problem might be that they are not showing up in the console at all! Perhaps it is a manual install that never shows up in Pending Actions? Or a push deployment that stays stuck in Pending Actions and never shows up under “Agent Managed”? Or even one that does show up under “Agent Managed” but never shows as being monitored, returning agent version data, etc.
One of the BEST things you can do when faced with an agent health issue… is to look on the agent, in the OperationsManager event log. This is a fairly verbose log that will almost always give you a good hint as to the trouble with the agent. That is ALWAYS one of my first steps in troubleshooting.
Another way of examining agent health is via the built-in views in OpsMgr. In the console, there is a view located at the following:
This view is important – because it gives us a perspective of the agent from two different points:
1. The perspective of the agent monitors running on the agent, measuring its own “health”.
2. The perspective of the “Health Service Watcher”, which is the agent being monitored from a Management Server.
If any of these are red or yellow – that is an excellent place to start. This should be an area that your level 1 support for Operations Manager checks DAILY. We should never have a high number of agents that are not green here. If we do – it is indicative of an unhealthy environment, or of the admin team not adhering to best practices (such as keeping up with hotfixes, using maintenance mode correctly, etc.).
Use Health Explorer on these views – to drill down into exactly what is causing the Agent, or Health Service Watcher state to be unhealthy.
Now…. the following are some general steps to take to “fix” broken agents. These are not in any definitive order; the right order really comes down to what you find in the logs after taking each step.
There are many things that can cause an agent issue, and many methods to troubleshoot. However – at a very general level, my typical steps are:
If an external issue is causing the problem (DNS, Kerberos, firewall), then these steps likely will not help you…. but clues to those issues should be available in the OpsMgr event log.
Also – make sure you see my other posts on agent health and troubleshooting during deployment:
Console based Agent Deployment Troubleshooting table
Agent discovery and push troubleshooting in OpsMgr 2007
Getting lots of Script Failed To Run alerts- WMI Probe Failed Execution- Backward Compatibility
Agent Pending Actions can get out of synch between the Console, and the database
Which hotfixes should I apply-
This is a simple overview of using a recovery for a custom monitor in OpsMgr.
Let's say we create a simple service monitor in OpsMgr... for this example – I will use the Print Spooler service:
Create a new monitor, unit monitor, and choose windows services - Basic Service Monitor:
Choose an appropriate management pack to save it to... such as a Base OS custom rule MP you create.
Give it a name - such as "Check Windows Spooler Service" - and choose a valid target, such as "Windows Server".
Browse the service name - and pick the Print Spooler (Spooler):
Accept defaults for health, and let it create an alert, or not - depending on your requirements.
Once the monitor is created.... open it up in the Authoring tab of the Ops console. Choose the "Diagnostic and Recovery" tab.
Under "Configure Recovery Tasks", add a recovery for Critical Health State. Choose "Run Command" and click Next.
Give the recovery a name.... such as "Restart service" and click Next.
For the command line settings... we need to provide a path to the file we want to run. For a simple service restart, we can use the "NET" command, as in "NET START (servicename)". For the path, just specify the original executable - do not add any command line switches - such as: "%windir%\system32\net.exe"
Under "Parameters" - this is where we will add the command line switches.... such as "start spooler" in this case:
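The split the wizard enforces – executable path with no switches, switches in the separate Parameters field – can be sketched like this (illustrative Python; the helper is hypothetical, just pairing the two values the wizard asks for):

```python
def recovery_command(service_name: str) -> tuple[str, str]:
    """Return the (path, parameters) pair for a 'Run Command' recovery
    that restarts a Windows service via NET START.

    The path carries no switches; the switches go in the separate
    Parameters field of the wizard."""
    path = r"%windir%\system32\net.exe"
    params = f"start {service_name}"
    return path, params

print(recovery_command("spooler"))  # ('%windir%\\system32\\net.exe', 'start spooler')
```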
Click "Create", then click OK.
Now - pick a managed agent - and stop the Spooler service. This will create a state change for the monitor. If you told the monitor to alert - it will also create an alert at this time. As soon as the state change occurs, our recovery will run.... which should restart the service.
Check the system event log to view the activity. I got the following two events:
Event Type: Information
Event Source: Service Control Manager
Event Category: None
Event ID: 7036
Date: 3/26/2008
Time: 1:24:44 AM
User: N/A
Computer: OMTERM
Description: The Print Spooler service entered the stopped state.
Event Type: Information
Event Source: Service Control Manager
Event Category: None
Event ID: 7036
Date: 3/26/2008
Time: 1:25:04 AM
User: N/A
Computer: OMTERM
Description: The Print Spooler service entered the running state.
So the service was down for about 20 seconds.... for the monitor to detect the unhealthy state, and then to run a recovery to restart the service.
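That ~20 seconds falls straight out of the two event timestamps above (Python used just for the arithmetic):

```python
from datetime import datetime

fmt = "%I:%M:%S %p"
stopped = datetime.strptime("1:24:44 AM", fmt)  # event 7036: entered stopped state
running = datetime.strptime("1:25:04 AM", fmt)  # event 7036: entered running state

downtime = (running - stopped).total_seconds()
print(downtime)  # 20.0
```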
Open health explorer for the computer object of the test machine, and find the "Check Windows Spooler Service" monitor. It should show up as healthy... if the recovery worked. Select this monitor, and then click the "State Change Events" tab. We should see the service is currently running as the last logged state change. Find the "Service is Not running" state change just below the current one.... and in the details pane, we should be able to see the output where the recovery task ran automatically and logged its results:
So what if we want a more advanced recovery? Perhaps we have a service that just doesn't always start reliably on the first try. Perhaps we want to try to start the service three times over a 3 minute period, and THEN create the alert? This can be done, but it requires a custom script that provides this logic, and then either creates the alert directly, or creates an event from which a rule generates the alert.
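For illustration, the retry logic such a script would provide might be sketched like this. This is a POSIX shell mock-up of the flow only; in practice you would implement it in VBScript, a batch file, or PowerShell using "net start" and "eventcreate", which are not shown here. The flaky_start function is a stand-in that fails twice and then succeeds, so we can watch the retries happen:

```shell
#!/bin/sh
# Sketch of "try N times, then alert" recovery logic.
# attempt_restart runs the given start command up to $2 times,
# sleeping $3 seconds between tries, and reports the outcome.

attempt_restart() {
  cmd=$1; max=$2; wait=$3
  try=1
  while [ "$try" -le "$max" ]; do
    if $cmd; then
      echo "restarted on attempt $try"
      return 0
    fi
    try=$((try + 1))
    sleep "$wait"
  done
  # In a real recovery, this is where you would create the event
  # (e.g. with eventcreate) that a rule alerts on.
  echo "failed after $max attempts"
  return 1
}

# Stand-in for "net start spooler": fails twice, then succeeds.
flaky_start() {
  count=$(cat /tmp/try_count 2>/dev/null || echo 0)
  count=$((count + 1))
  echo "$count" > /tmp/try_count
  [ "$count" -ge 3 ]
}

rm -f /tmp/try_count
attempt_restart flaky_start 3 0   # wait=0 for the demo; use ~60 in practice
# prints "restarted on attempt 3"
```

With wait set to 60 and three tries, you get roughly the 3-minute window described above before the failure path runs.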
Event Type: Warning
Event Source: OpsMgr SDK Service
Event Category: None
Event ID: 26371
Date: 12/13/2007
Time: 2:58:24 PM
User: N/A
Computer: RMSCOMPUTER
Description:
The System Center Operations Manager SDK service failed to register an SPN. A domain admin needs to add MSOMSdkSvc/rmscomputer and MSOMSdkSvc/rmscomputer.domain.com to the servicePrincipalName of DOMAIN\sdkaccount
This seems to appear in the RC1-SP1 build of OpsMgr.
Every time the SDK service starts, it tries to update the SPNs on the AD account that the SDK service runs under. It fails, because by default a user cannot update its own SPNs. Therefore we see this error logged.
If the SDK account is a domain admin – it does not fail – because a domain admin would have the necessary rights. Obviously – we don’t want the SDK account being a domain admin…. That isn’t required nor is it a best practice.
Therefore, to resolve this error, we need to grant the SDK service account rights to update the SPN. The easiest way is to go to the user account object for the SDK account in AD, and grant SELF full control.
A better, more granular way is to grant SELF only the right to modify the SPN:
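For example, a command-line sketch using dsacls (the distinguished name below is a placeholder for your SDK account's actual DN; "RPWP" grants Read Property and Write Property on the servicePrincipalName attribute only). This is a command fragment to adapt, not something to run as-is:

```shell
# Run from an elevated prompt on a domain-joined machine with the
# AD admin tools installed. The DN is a placeholder - substitute
# your SDK account's actual distinguished name.
dsacls "CN=sdkaccount,CN=Users,DC=DOMAIN,DC=COM" /G "SELF:RPWP;servicePrincipalName"
```

After running this, restart the SDK service and confirm the 26371 event no longer appears.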
To check SPNs:
The following command will show all the HealthService SPNs in the domain:
Ldifde -f c:\ldifde.txt -t 3268 -d DC=DOMAIN,DC=COM -r "(serviceprincipalname=MSOMHSvc/*)" -l serviceprincipalname -p subtree
To view SPNs for a specific server:
"setspn -L servername"
Here is a simplified view of how alerts are groomed.
Grooming of the ops DB is called once per day at 12:00 AM by the rule "Partitioning and Grooming". You can search for this rule in the Authoring space of the console, under Rules. It is targeted at the "Root Management Server" and is part of the System Center Internal Library.
It calls the “p_PartitioningAndGrooming” stored procedure, which calls p_Grooming, which calls p_GroomNonPartitionedObjects (Alerts are not partitioned) which inspects the PartitionAndGroomingSettings table… and executes each stored procedure. The Alerts stored procedure in that table is referenced as p_AlertGrooming which has the following sql statement:
SELECT AlertId INTO #AlertsToGroom
FROM dbo.Alert
WHERE TimeResolved IS NOT NULL
AND TimeResolved < @GroomingThresholdUTC
AND ResolutionState = 255
So, the criteria for what is groomed are pretty simple: in a resolution state of "Closed" (255), and older than the 7-day default setting (or your custom setting in the PartitionAndGroomingSettings table referenced above).
We won't groom any alerts that are in New (0), or in any custom resolution states (custom ID #). Those will have to be set to "Closed" (255) first, either by auto-resolution when a monitor returns to healthy, by direct user interaction, by the built-in auto-resolution mechanism, or by your own custom script.
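If you want to verify your current retention settings, you can query the same table the grooming procedures read. A minimal sketch using sqlcmd, assuming a default local instance and the default database name (adjust both to your environment; the column names reflect the table as it ships, so verify them against your own database):

```shell
# Query the grooming settings table that p_GroomNonPartitionedObjects reads.
# Server/instance and database names here are assumptions; adjust to yours.
# Drop the WHERE clause to see retention for every groomed data type.
sqlcmd -S localhost -d OperationsManager -Q "
  SELECT ObjectName, DaysToKeep, GroomingSproc, LastGroomingDateTime
  FROM dbo.PartitionAndGroomingSettings
  WHERE ObjectName = 'Alert'"
```

This is a read-only check; change retention through the console settings rather than by editing the table directly.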
Ok – that covers grooming.
However – I can see that brings up the question – how does auto-resolution work?
The interface setting specifically states "alerts in the new resolution state". I don't think that is completely correct:
Auto-resolution is driven by the rule "Alert Auto Resolve Execute All", which runs p_AlertAutoResolveExecuteAll once per day at 4:00 AM. This calls p_AlertAutoResolve twice: once with a variable of "0" and once with a "1".
Here is the sql statement:
IF (@AutoResolveType = 0)
BEGIN
SELECT @AlertResolvePeriodInDays = [SettingValue]
FROM dbo.[GlobalSettings]
WHERE [ManagedTypePropertyId] = dbo.fn_ManagedTypePropertyId_MicrosoftSystemCenterManagementGroup_HealthyAlertAutoResolvePeriod()
SET @AutoResolveThreshold = DATEADD(dd, -@AlertResolvePeriodInDays, getutcdate())
SET @RootMonitorId = dbo.fn_ManagedTypeId_SystemHealthEntityState()
-- We will resolve all alerts that have green state and are un-resolved
-- and haven't been modified for N number of days.
INSERT INTO @AlertsToBeResolved
SELECT A.[AlertId]
FROM dbo.[Alert] A
JOIN dbo.[State] S
ON A.[BaseManagedEntityId] = S.[BaseManagedEntityId] AND S.[MonitorId] = @RootMonitorId
WHERE A.[LastModified] < @AutoResolveThreshold
AND A.[ResolutionState] <> 255
AND S.[HealthState] = 1
<snip>
ELSE IF (@AutoResolveType = 1)
WHERE [ManagedTypePropertyId] = dbo.fn_ManagedTypePropertyId_MicrosoftSystemCenterManagementGroup_AlertAutoResolvePeriod()
-- We will resolve all alerts that are un-resolved
AND ResolutionState <> 255
So we are basically checking that ResolutionState <> 255, not specifically "New" (0) as the wording in the interface would lead you to believe. There are simply two types of auto-resolution: resolve all alerts where the object has returned to a healthy state in "N" days, and resolve all alerts no matter what, as long as they haven't been modified in "N" days.
So.... Say I am an Exchange Administrator in a global company.... in the good old USA.
My company has recently implemented OpsMgr 2007 to monitor our Exchange servers. I am going to configure my notification subscriptions so I can get an email anytime one of my Exchange servers has an issue.
Try #1: I start by creating a notification subscription, and I don't scope it by groups or classes (all groups, all classes). I think this sounds fine. However, I instantly find I am flooded with email notifications from every single alert coming into the console. This is NOT good!
Try #2: Therefore, I decide I really need to see only Exchange alerts. I scope the notification *classes* down to just the Exchange classes. This will ensure I only receive notifications from Exchange target classes. Good? Nope. I soon find that when an alert comes in from the base OS, heartbeat, or hardware, I won't get it. We need to add those classes back. But if we add the heartbeat (Health Service Watcher) class, we will now get heartbeat failures for ALL machines, not just the Exchange servers. No good.
Try #3: So – we need to scope the subscription using groups. We create a group with all our Exchange Server Windows Computer objects in it. We can manually add these in (Explicit) or we can use a dynamic rule based on criteria - I chose NetBIOS name, and used a naming standard of EX* (all my exchange servers start with "ex"). I used an "OR" statement since the wildcard is case sensitive.
Now I create a subscription, scope it to this group, and choose ALL classes, thinking that this way we should get ALL notifications, including base OS, Exchange, and heartbeat alerts… right?
Nope. Because of the object oriented monitoring model – we will only receive alerts from a rule/monitor with a target class that has a child relationship to the Windows Computer class. This is the only class type in the group we created. So – using the model in #3, we will get notifications from pretty much any class needed – except heartbeats. These come from the Health Service Watcher class, and have no relation to the Windows Computer class.
Try #4: I am thinking, we must add the class type to our group – and any instances of that class we are interested in. Since most object classes are a child of Windows Computer, there should not be many of these that we will have to do.
In the group – add the Health Service Watcher display name instances, in the same way we add the Windows Computer NetBIOS names:
The AND/OR verbiage is misleading. This was opened as a bug and then closed, because it is "as designed".
Essentially, the OR group at the top will include ANY of the AND groups below it, so BOTH the Windows Computer objects AND the Health Service Watcher objects are included. (You can right-click any group and choose to show its members.)
I tested all kinds of Exchange alerts and heartbeat failures, and this works. It is possible there will be other alerts we won't get in this subscription, IF the rule or monitor that created the alert was using a target class that is unique and not a child of "Windows Computer".
I don’t think this will be a huge hassle moving forward… because MOST alerting is done on a target which is a child of Windows computer. If we find one that is not – we just need to go back and add that class’s instances to the groups we create for notifications.
Want alert by alert notifications? Where you can subscribe to a single alert, rule by rule, monitor by monitor? Check out:
http://code4ward.net/cs2/blogs/code4ward/archive/2007/09/19/set-notificationforalert.aspx
This is a continuation of my previous post on determining which agents are missing a hot-fix:
How do I know which hotfixes have been applied to which agents?
I wrote up a report that allows you to paste a KB article number into the report as a parameter; it will then show all agents that are potentially missing that hotfix. This will help you easily find agents that need to be patched but were missed for some reason.
You can run this report if you create the SQL reporting data source as specified in my previous post:
Creating a new data source for reporting against the Operational Database
Once imported, it will show up in the console. Open the report, and paste in any KB article number for an OpsMgr hotfix you have applied. The number MUST begin and end with "%", such as %951380% as shown:
The report is attached below:
System Center 2012 SP1 has shipped Update Rollup 5.
http://support.microsoft.com/kb/2904730/en-us
There are updates available for OpsMgr 2012 SP1, VMM 2012 SP1, Orchestrator 2012 SP1 and DPM 2012 SP1, in this release.
See the KB article for full details of each, with links to the individual updates and downloads.
KB Article: http://support.microsoft.com/kb/2904678/en-us
Download catalog site: http://catalog.update.microsoft.com/v7/site/Search.aspx?q=2904678
Key fixes:
Issue 1 - An error occurs when you run the p_DataPurging stored procedure. This error occurs when the query processor runs out of internal resources and cannot produce a query plan.
Issue 2 - Data warehouse BULK INSERT commands use an unchangeable, default 30-second time-out value that may cause query time-outs.
Issue 3 - Many 26319 errors are generated when you use the Operator role. This issue causes performance problems.
Issue 4 - The diagram component does not publish location information in the component state.
Issue 5 - Renaming a group works correctly on the console. However, the old name of the group appears when you try to override a monitor or scope a view based on the group.
Issue 6 - SCOM synchronization is not supported in the localized versions of Team Foundation Server.
Issue 7 - An SDK process deadlock causes the Exchange correlation engine to fail.
Issue 8 - The "Microsoft System Center Advisor monitoring server" reserved group is visible in a computer or group search.
Issue 9 - Multiple Advisor Connectors are discovered for the same physical computer when the computer hosts a cluster.
Issue 10 - A Dashboard exception occurs if the criteria that are used for a query include an invalid character or keyword.
Xplat updates:
Issue 1 - On a Solaris-based computer, an error message that resembles the following is logged in the Operations Manager log. This issue occurs if a Solaris-based computer that has many monitored resources runs out of file descriptors and stops monitoring the resources. Monitored resources may include file systems, physical disks, and network adapters. Note: the Operations Manager log is located at /var/opt/microsoft/scx/log/scx.log.
errno = 24 (Too many open files)
This issue occurs because the default user limit on Solaris is too low to allocate a sufficient number of file descriptors. After the update rollup is installed, the updated agent overrides the default user limit by using a user limit of 1,024 for the agent process.
Issue 2 - If Linux Container (cgroup) entries in the /etc/mtab path on a monitored Linux-based computer begin with the "cgroup" string, a warning that resembles the following is logged in the agent log. Note: when this issue occurs, some physical disks may not be discovered as expected.
Warning [scx.core.common.pal.system.disk.diskdepend:418:29352:139684846989056] Did not find key 'cgroup' in proc_disk_stats map, device name was 'cgroup'.
Issue 3 - Physical disk configurations that cannot be monitored, or failures in physical disk monitoring, cause failures in system monitoring on UNIX and Linux computers. When this issue occurs, logical disk instances are not discovered by Operations Manager for a monitored UNIX-based or Linux-based computer.
Issue 4 - A monitored Solaris zone that is configured to use dynamic CPU allocation with dynamic resource pools may log errors in the agent logs as CPUs are removed from the zone, and may not identify the CPUs currently in the system. In rare cases, the agent on a Solaris zone with dynamic CPU allocation may hang during routine monitoring. Note: this issue applies to any monitored Solaris zones that are configured to use dynamic resource pools and a "dedicated-cpu" configuration that involves a range of CPUs.
Issue 5 - An error that resembles the following is generated on Solaris 9-based computers when the /opt/microsoft/scx/bin/tools/setup.sh script does not set the library path correctly. When this issue occurs, the omicli tool cannot run.
ld.so.1: omicli: fatal: libssl.so.0.9.7: open failed: No such file or directory
Issue 6 - If the agent does not retrieve process arguments from the getargs subroutine on an AIX-based computer, the monitored daemons may be reported incorrectly as offline. An error message that resembles the following is logged in the agent log: Calling getargs() returned an error
Issue 7 - The agent on AIX-based computers considers all file cache to be available memory and does not treat minperm cache as used memory. After this update rollup is installed, available memory on AIX-based computer is calculated as: free memory + (cache – minperm).
Issue 8 - The Universal Linux agent is not installed on Linux computers that have OpenSSL versions greater than 1.0.0 if the library file libssl.so.1.0.0 does not exist. An error message that resembles the following is logged: /opt/microsoft/scx/bin/tools/.scxsslconfig: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
I have seen *several* customers having issues with the OpsDB grooming/purging process, so that looks like a good one to get implemented, especially if this was affecting you.
Let's get started.
From reading the KB article – the order of operations is:
There are no agent updates in UR1. Agents will be placed into pending; however, since there are no updates, you must reject the agents in pending.
Now we need to add another step: if we are using Xplat monitoring, we need to update the Linux/UNIX MP's and agents.
4. Update Unix/Linux MP’s and Agents.
1. Management Servers
Since there is no RMS anymore, it doesn't matter which management server I start with; there is no need to begin with whoever holds the RMSe role. I simply make sure I patch only one management server at a time, to allow for agent failover without overloading any single management server.
I can apply this update manually via the MSP files, or I can use Windows Update. I have 3 management servers, so I will demonstrate both. I will do the first management server manually. This management server holds 3 roles, and each must be patched: Management Server, Web Console, and Console.
The first thing I do when I download the updates from the catalog, is copy the cab files for my language to a single location:
Then extract the contents:
Once I have the MSP files, I am ready to start applying the update to each server by role.
***Note: You MUST log on to each server role as a Local Administrator, SCOM Admin, AND your account must also have System Administrator (SA) role to the database instances that host your OpsMgr databases.
My first server is a management server, and the web console, and has the OpsMgr console installed, so I copy those update files locally, and execute them per the KB, from an elevated command prompt:
This launches a quick UI which applies the update, and it will bounce the SCOM services as well. The update does not provide any feedback about success or failure; you can check the application log for the MsiInstaller events for that.
You can also spot check a couple DLL files for the file version attribute.
Next up – run the Web Console update:
This runs much faster. A quick file spot check:
Lastly – install the console update (make sure your console is closed):
A quick file spot check:
Secondary Management Servers:
I now move on to my secondary management servers, applying the server update, then the console update.
On this next management server, I will use Windows Update. I check online, and make sure that I have configured Windows Update to give me updates for additional products:
This shows me two applicable updates for this server:
I apply these updates (along with some additional Windows Server updates I was missing), and reboot each management server, until all management servers are updated.
Updating Gateways:
I can use Windows Update or manual installation.
The update launches a UI and quickly finishes.
Then I will spot check the DLL’s:
2. Apply the SQL Script
In the path on your management servers, where you installed/extracted the update, there is a SQL script file:
%SystemDrive%\Program Files\System Center 2012\Operations Manager\Server\SQL Script for Update Rollups
Open a SQL management studio query window, connect it to your Operations Manager database, and then open the script file. Make sure it is pointing to your OperationsManager database, then execute the script.
****Note: at the time of this writing, the KB article says to run this against the Data Warehouse. The KB article is in error.
Click the "Execute" button in SQL Management Studio. The execution could take a considerable amount of time, and you might see a spike in processor utilization on your SQL database server during this operation.
You will see the following (or similar) output:
3. Manually import the management packs?
We have four updated MP’s to import (MAYBE!).
The TFS MP bundles are only used for specific scenarios, such as DevOps scenarios where you have integrated APM with TFS, etc. If you are not currently using these MP’s, there is no need to import or update them. I’d skip this MP import unless you already have these MP’s present in your environment.
The Advisor MP’s are only needed if you are using System Center Advisor services.
However, the Image and Visualization libraries deal with Dashboard updates, and these need to be updated.
I import all of these without issue.
Reject the agent update
Agents are placed into pending actions by this update. HOWEVER – there are no updates for the agents in the Update Rollup. You must REJECT the agents in pending, using the console or PowerShell.
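If you prefer PowerShell over the console, the rejection can be sketched along these lines. Run it from a management server; the cmdlets are from the SCOM 2012 OperationsManager module, and you should review what Get-SCOMPendingManagement returns in your environment before denying anything:

```shell
# From a Windows command prompt on a management server (SCOM 2012).
# Lists pending management actions, then denies the agent-patch entries.
# Review the output of Get-SCOMPendingManagement first before denying.
powershell -Command "Import-Module OperationsManager; Get-SCOMPendingManagement | Where-Object { $_.AgentPendingActionType -eq 'PatchAgent' } | Deny-SCOMPendingManagement"
```

Filtering on the action type keeps genuine pending approvals (such as new manual agent installs) untouched.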
4. Update Unix/Linux MPs and Agents
Next up – I download and extract the updated Linux MP’s for SCOM 2012 SP1 UR2
http://www.microsoft.com/en-us/download/details.aspx?id=29696
7.5.101 is current at this time for SCOM 2012 R2. ****Note – take GREAT care when downloading – that you select the correct download for R2. You must scroll down in the list and select the MSI for 2012 R2:
Download the MSI and run it. It will extract the MP’s to C:\Program Files (x86)\System Center Management Packs\System Center 2012 R2 Management Packs for Unix and Linux\
Update any MP’s you are already using.
You will likely observe VERY high CPU utilization of your management servers and database server during and immediately following these MP imports. Give it plenty of time to complete the process of the import and MPB deployments.
Next up, you would upgrade the agents on your monitored UNIX/Linux systems. You can now do this straight from the console:
You can input credentials or use existing RunAs accounts if those have enough rights to perform this action.
I had an environmental issue that caused my Ubuntu server upgrade to fail.
5. Update the remaining deployed consoles
This is an important step. I have consoles deployed around my infrastructure – on my Orchestrator server, on my personal workstation, on all the other SCOM admins on my team, on a Terminal Server we use as a tools machine, etc. These should all get the UR1 update.
Review:
Now at this point, we would check the OpsMgr event logs on our management servers, check for any new or strange alerts coming in, and ensure that there are no issues after the update.
Known issues:
See the existing list of known issues documented in the KB article.
1. Many people are reporting that the SQL script is failing to complete when executed. You should attempt to run this multiple times until it completes without error. You might need to stop the Exchange correlation engine, stop the services on the management servers, or bounce the SQL server services in order to get a successful completion in a busy management group. The errors reported appear as below:
------------------------------------------------------
(1 row(s) affected)
Msg 1205, Level 13, State 56, Line 1
Transaction (Process ID 152) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
Msg 3727, Level 16, State 0, Line 1
Could not drop constraint. See previous errors.
--------------------------------------------------------
A new MP authoring tool was announced today. Read the release at http://blogs.technet.com/b/momteam/archive/2014/01/13/mp-blog-the-right-tool-for-the-right-job.aspx
This is a FREE tool which Silect is releasing. It essentially replaces the functionality of the previous "Visio Management Pack Designer". It is targeted at the IT pro who needs to create custom management packs and author new classes, discoveries, rules, monitors, etc., but is not a developer. This new tool will make simple work of creating a new class to monitor a specific application, quickly discovering it, and adding several types of monitoring.
You can register and download here: http://bridgeways.com/products/mp-author
One of the big benefits of this over the Visio tool is that it can open existing MP's and make changes to them, where the Visio tool was a "one way" solution. This new tool also expands on the types of workflows that can be created, compared to the Visio tool. If you were using the Visio MP Designer, I'd recommend migrating to this new solution immediately. If you considered but didn't like the Visio designer, try this one out.
This is the initial release, I imagine we will see additional capabilities as time progresses. Keep in mind – this is meant for SIMPLE management packs, not a full development suite. The Visual Studio authoring extensions are the right place for a more full featured management pack development environment.
Here are some simple examples of using MP Author:
Open MP Author. Click “New” to create a new MP.
Most fields come pre-populated, but are simple to change.
Provide a location for your new MP:
The MP Author automatically creates the necessary references, and you can add more if you need to reference classes in other MP’s:
Now we can choose what we want to create from common templates.
The MOST common choice should be "empty management pack". Even a "single server application" creates a class for our app, but it also creates an additional distributed application for each as well, and this is not commonly needed. I'd prefer that "single server application" only create a single, simple class based on Microsoft.Windows.LocalApplication, but that is open for discussion. When we choose to create an empty MP, we still have full use of wizards to help create our MP.
I choose Empty MP and click Next, Finish.
Now – what I want to do is to create a class (or “Target”) in this MP to represent an application that I need to discover and monitor. For this example, I will use the WINS server role.
Go to “Targets” and choose “New Registry Target”
Connect to an existing WINS server to browse the registry of that machine.
I will base the discovery of my class on the Registry value for the WINS service – in this case it is located at:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\WINS\Start
When Start = 2 (automatic) I consider that a WINS server. Click OK, Next.
Provide an ID and display name for the class, or accept the defaults:
Provide an ID and display name for the discovery, or accept the defaults:
Validate or modify the expression for class membership:
Set schedule for once every day, and Finish.
SAVE YOUR MP AT THIS POINT. We’d hate to lose all our work.
Now we can quickly add in some event rules, service monitors, performance monitors, etc. When happy, right-click the top-level folder for your MP in the left tree view, and choose Import Management Pack:
We provide a management server name and credentials. Mine popped up and said it could not validate my credentials, which was odd. The next screen shows which referenced MP's must also be imported. This seems a little odd to me, because these MP's are already imported in my environment anyway. This operation crashed on my PC, so there might be some issues to work out yet in this process. No bother, I'd rather import manually anyway. Unfortunately, I didn't save my work FIRST, so off I go to recreate what we just did.
I manually import my MP, and I can view my discovered servers using Discovered Inventory for my new class:
It could not be any easier to create classes for granular targeting of applications, and to create commonly authored workflows to rapidly provide monitoring.
Do you want auditing information on how many alerts are being closed or modified by your OpsMgr users?
You can use the following queries to get this information from the data warehouse, and I have attached some reports below as well:
To get all raw alert data from the data warehouse to build reports from:
select *
from Alert.vAlertResolutionState ars
inner join Alert.vAlertDetail adt on ars.alertguid = adt.alertguid
inner join Alert.vAlert alt on ars.alertguid = alt.alertguid
To view data on all alerts modified by a specific user:
select ars.alertguid, alertname, alertdescription, statesetbyuserid, resolutionstate,
       statesetdatetime, severity, priority, managedentityrowID, repeatcount
from Alert.vAlertResolutionState ars
inner join Alert.vAlert alt on ars.alertguid = alt.alertguid
where statesetbyuserid like '%username%'
order by statesetdatetime
To view a count of all alerts closed by all users:
select statesetbyuserid, count(*) as 'Number of Alerts'
from Alert.vAlertResolutionState ars
where resolutionstate = '255'
group by statesetbyuserid
order by 'Number of Alerts' DESC
In the reports I have attached, you can pick a date and a time window, and run these same basic queries.
Files attached below: