New IIS7 MP ships – support for Server 2008 R2 and IIS 7.5

Normally I don't spend a lot of time blogging on MP updates… unless they are a really big deal, but since the MP catalog is so messed up right now with Pinpoint, I will be blogging on these until it gets fixed, at least to a usable standard.

I am also including some recommendations at the bottom of this post for modifying the IIS MP’s.

 

 

2/8/2010 - Updated IIS management pack to support monitoring Windows Server 2008 SP2 and Windows Server 2008 R2, version 6.0.7600.0

http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=d351bca8-182b-4223-8c9e-627e184ba02b

 

The library, 2000, and 2003 MP’s have not changed – only the 2008 MP was updated, to add support for 2008 R2.

image

 

Also – interesting to note – there is NO GUIDE in the MSI anymore…. you need to download it as well.

 

Some points from the guide:

 

Note: In this guide, the term “Internet Information Services 7” applies equally to IIS 7.0 (which shipped with Windows Server 2008, Windows Server 2008 SP1, and Windows Server 2008 SP2), and IIS 7.5 (which shipped with Windows Server 2008 R2).

What’s New - Microsoft has updated the Management Pack for Internet Information Services 7 to support Windows Server 2008 SP2 and IIS 7.5 on Windows Server 2008 R2.

 

 

***Pay attention:

There is a section in the guide that is incorrect (outdated):

For Operations Manager agents that manage IIS 7 servers with more than 400 sites and application pools, you must override the Health Service Private Bytes Threshold monitor that is targeted to the Health Service. Override the Agent Performance Monitor Type—Threshold parameter to set it to 209715200 (the number of bytes=200 MB). If you do not override this threshold monitor, the agent might consume more than 100 MB of memory and be restarted automatically.

This entire section should be ignored…. because as long as you are running the most current core MP’s – the default value has been changed from 100 to 300MB.  If you followed the guide here – and set this to 200MB – you’d be going in the wrong direction.  Instead – just make sure you have the latest core MP updates documented HERE.
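To make the direction of that override concrete, here is a minimal arithmetic sketch. It only converts the megabyte figures quoted above into the byte values the override parameter expects, assuming 1 MB = 1024 * 1024 bytes (which matches the 209715200 figure in the guide). The variable names are illustrative only.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes, consistent with the guide's 209715200


def mb_to_bytes(megabytes: int) -> int:
    """Convert a threshold in MB to the byte value used in the override."""
    return megabytes * BYTES_PER_MB


old_default = mb_to_bytes(100)      # original Private Bytes threshold
guide_override = mb_to_bytes(200)   # what the outdated guide section tells you to set
current_default = mb_to_bytes(300)  # default in the latest core MP's

print(guide_override)                    # 209715200, the value quoted in the guide
print(guide_override < current_default)  # True: the guide's override would LOWER the threshold
```

In other words, following the guide on a current core MP would shrink the threshold by 100 MB instead of raising it.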

 

***Special note:

The IIS MP’s are one of the leading causes of CONFIG CHURN.  I recommend setting ALL discoveries in these MP’s to run once per DAY (or less frequent).  Once per day is 86,400 seconds.

The MP team obviously has taken note of this, because in this version of the MP – the discoveries for the 2008 OS were all changed from 3600 seconds to 14,400 seconds.  This change is not documented in the guide.  However, this is still too frequent for most large environments with substantial numbers of IIS servers being monitored.
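As a quick sanity check on those intervals, the sketch below converts each discovery interval into runs per day. The three intervals come straight from the text above; the labels are just mine for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def runs_per_day(interval_seconds: int) -> float:
    """How many times a discovery fires per day at the given interval."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return SECONDS_PER_DAY / interval_seconds


intervals = [
    ("previous 2008 default", 3600),
    ("new 2008 default", 14400),
    ("recommended override", 86400),
]

for label, seconds in intervals:
    print(f"{label}: every {seconds} s = {runs_per_day(seconds):g} runs per day")
```

So the new default still runs each discovery 6 times a day per agent, versus once a day with the recommended override.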

You can load the following discoveries up in your console – by creating a scope of:

  • IIS 2000 Server Role
  • IIS 2003 Server Role
  • IIS 7.0 Server Role
  • IIS 7.0 Web Site
  • IIS 7.0 Application Pool

 

image

image

image

image

image

Posted by kevinhol | 4 Comments

OpsMgr 2007 R2 CU1 rollup hotfix ships - and my experience installing it

The Cumulative Update 1 (CU1) for OpsMgr 2007 R2 has shipped.

Get it from the download Center:

http://www.microsoft.com/downloads/details.aspx?FamilyID=05d30779-2ddc-48dc-aa91-a23167ee2cad&displaylang=en

The KB article describing the fixes and changes:

http://support.microsoft.com/kb/974144

 

There are MANY fixes in this update for R2 – and I am not going to go through each one.  Just understand – I will recommend applying this to any customer running OpsMgr R2 as soon as your formal testing and change window cycles allow it.  I have added it to my Recommended Hotfix page.

 

So – first – I download it, and extract it.  I always install to the default location when it comes to a hotfix like this.  Then I STOP – and go find in the installed path – the README file.  Open it up – and use it to formulate your deployment and testing plan.

Based on the Readme notes.. my plan looks something like the following:

 

  1. Back up the Operations and Warehouse databases.
  2. Apply the hotfix to the RMS
  3. Run the SQL script update against the OpsDB
  4. Import the updated Management pack provided
  5. Apply the hotfix to all secondary Management Servers.
  6. Apply the hotfix to my agents by approving them from pending
  7. Apply the hotfix to my dedicated consoles (Terminal servers, desktop machines, etc…)
  8. Apply the hotfix to my Web Console server
  9. Apply the hotfix to my Audit collection servers
  10. Update manually installed agents…. well, manually.

 

Ok – looks like 10 easy steps.  This order is not set in stone – it is just a simple recommendation in the release notes.  For instance – if you wanted to update ALL your infrastructure before touching any agent updates – that probably makes more sense and should be fine.

OK, let's get started.

 

1.  I run a fresh backup on my OpsDB and Warehouse DB’s – just in case something goes really wrong.

 

2.  Since my RMS is running Server 2008 – I need to open an elevated command prompt to install any SCOM hotfixes.  That is just how it is.  So I launch that – and call the MSI I downloaded.  This will install the Hotfix Utility to the default location.  Then – a splash screen comes up:

image

 

I choose Run Server Update, and rock and roll.

Setup finishes very quickly – and I am presented with:

 

image

 

Which is odd – because typically in the past our hotfixes didn't require a restart.  Anyway - I hit yes – and saw a short error message about something failing to apply.  Ugh.  Probably a post-hotfix bootstrap process that updates something.  The more I think about it – the more I realize that hitting NO here would have been the right thing to do.  This allows the setup splash screen to finish all the post-hotfix processes that it runs.  Then – when I have closed all the screens and I am sure the hotfix applied, I can manually restart the OS.  So DON'T restart.  Say NO.  Then – when you are all done and happy – you can bounce the server/nodes that required this.

 

I can see the following files got updated on the RMS:

image

 

Next I check my \Agentmanagement folder.

Oops – it looks like I didn't get my agent hotfix files copied over.  Most likely because of the reboot I said “yes” to.

So – time to start over.  From the elevated command prompt – I have to run the MSI again – this time choosing to UNINSTALL the hotfix utility.  Next – run it again to reinstall it, and choose “Run Server Update” again from the menu.  This time – I wasn’t even prompted for a reboot (and I would have said “NO” if I saw one).  It completes, and the splash screen comes back up.

Now – I go check the \AgentManagement folders – and they are perfect:

 

image

 

In the AMD64, ia64, and x86 directories – I can now see the hotfix update for agents present there!  Time to move forward.

 

3.  Time to run the SQL script against the OperationsManager database.  This is simple enough – the script is located on the RMS in the \Program Files\System Center 2007 R2 Hotfix Utility\KB974144\SQLUpdate\ folder, named DiscoveryEntitySprocs.sql

I simply need to open this file with SQL management studio – or edit it with notepad – copy the contents – and paste it in a query window that is connected to my OpsDB.  I paste the contents of the file in my query window, it takes about 10 seconds to complete, and returns “Command Completed Successfully”.

 

4.  Next up – import the MP update.  That's easy enough.  It is located at \Program Files\System Center 2007 R2 Hotfix Utility\KB974144\ManagementPacks\ and is named Microsoft.SystemCenter.DataWarehouse.Report.Library.mp.  It takes a few minutes to import.

 

5.  Time to apply the hotfix to my management servers.  I have 3 secondary MS servers: one is Windows 2008 and the other two are older, running Windows 2003.  So on the 2008 server I open an elevated command prompt to apply the hotfix utility MSI, and just run it directly on the older servers.  Once the splash screen comes up I “Run Server Update”.  These all install without issue, and thankfully don’t prompt for a reboot.  I spot check the \AgentManagement directories and the DLL versions, and all look great.

 

6.  I check my pending actions view in the console – and sure enough – all the agents that are set to “Remotely Manageable = Yes” in the console show up here pending an agent update.  I approve all my agents (generally we recommend to patch no more than 200 agents at any given time.)
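If you have more than 200 agents pending, it helps to plan the approvals in waves. Here is a minimal sketch of that batching, assuming you have exported the pending agent names to a list; the agent names below are made up, and the approval itself still happens in the console.

```python
def batches(items, size=200):
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical list of agents sitting in Pending Management.
pending_agents = [f"agent{n:04d}.example.com" for n in range(1, 451)]

waves = list(batches(pending_agents))
print([len(wave) for wave in waves])  # [200, 200, 50]
```

Approve one wave, let those agents finish patching and check the Patchlist column, then move on to the next wave.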

After the agents update – I need to do a quick spot check to see that they are patched and good – so I use the “Patchlist” column in the HealthService state view to see that:

 

image

 

Looks good.  Note I will have to formulate a plan to go and update my manually installed agents (Remotely Manageable = No)

 

7.  I have a few dedicated consoles which need updating.  One is a desktop machine and the other is my terminal server which multiple people use to connect to the management group.  So – I kick off the installer – and just choose “Run Server Update” here as well.

Once again I was prompted to reboot the server.  I choose NO this time, and then this pops up:

image

 

I hit OK – and all is well.  I will plan a reboot at some point in the future or once my updates are all complete.

I do a spot check of the files – and see the following was updated on the terminal server:

 

image

 

8.  Next up – Web Consoles.  I actually have two – and both are running on management servers, which I have already patched.  So – I will simply just go check their DLL files to ensure they got updated:

From:   \Program Files\System Center Operations Manager 2007\Web Console\bin

image

Looks good!

 

9.  I don't have ACS set up at the moment – but at this point if I did – I would go hit those Management servers that have already been patched – but this time run the update and choose to “Run ACS Server Update”

 

image

 

10.  Manually installed agents.  I have a fair number of these… so I will do this manually, or set up a SCCM package to deploy them.  Most of the time you will have manually installed agents on servers behind firewalls, or when you use AD integration for agent assignment, or when you installed manually on DC’s, or as a troubleshooting step.

 

 

Now – the update is complete.  The next step is to implement your test plan steps.  You should build a test plan for any time you make a change to your OpsMgr environment.  This might include scanning the event logs on the RMS and all MS for critical and warning events… looking for anything new, or serious.  Test that reporting is working, check the database for any unreasonable growth, and run queries to see if anything looks bad from a most common alerts, events, perf, and state perspective.  Run a perfmon – and ensure your baselines are steady – and nothing is different on the database, or RMS.  If you utilize any product connectors – make sure they are functioning.

The implementation of a solid test plan is very important to change management.  Please don't overlook this step.

Posted by kevinhol | 60 Comments

The Windows Server Print Server role team is looking for your input

If you would like to give some feedback on what you would like to see in the next print server MP update – and how you use Print Services in your company – please take a little time and help make the MP’s better.

See:

http://blogs.technet.com/momteam/archive/2010/01/06/the-windows-server-print-server-role-team-is-looking-for-your-input.aspx

 

This is your chance to change the way new MP’s will be made!

Posted by kevinhol | 0 Comments

Understanding and modifying Data Warehouse retention and grooming

You will likely find that the default retention in the OpsMgr data warehouse will need to be adjusted for your environment.  I often find customers are reluctant to adjust these – because they don't know what they want to keep.  So – they assume the defaults are good – and they just keep EVERYTHING. 

This is a bad idea. 

A data warehouse will often be one of the largest databases supported by a company.  Large databases cost money.  They cost money to support.  They are more difficult to maintain.  They cost more to back up in time, tape capacity, network impact, etc.  They take longer to restore in the case of a disaster.  The larger they get, the more they cost in hardware (disk space) to support them.  The larger they get, the longer reports can take to complete.

For these reasons – you should give STRONG consideration to reducing your warehouse retention to your reporting REQUIREMENTS.  If you don't have any – MAKE SOME!

Originally – when the product released – you had to directly edit SQL tables to adjust this.  Then – a command line tool was released to adjust these values – making the process easier and safer.  This post is just going to be a walk through of this process to better understand using this tool – and what each dataset actually means.

Here is the link to the command line tool: 

http://blogs.technet.com/momteam/archive/2008/05/14/data-warehouse-data-retention-policy-dwdatarp-exe.aspx

 

Different data types are kept in the Data Warehouse in unique “Datasets”.  Each dataset represents a different data type (events, alerts, performance, etc..) and the aggregation type (raw, hourly, daily)

Not every customer will have exactly the same data sets.  This is because some management packs will add their own dataset – if that MP has something very unique that it will collect – that does not fit into the default “buckets” that already exist.

 

So – first – we need to understand the different datasets available – and what they mean.  All the datasets for an environment are kept in the “Dataset” table in the Warehouse database.

select * from dataset
order by DataSetDefaultName

This will show us the available datasets.  Common datasets are:

Alert data set
Client Monitoring data set
Event data set
Microsoft.Windows.Client.Vista.Dataset.ClientPerf
Microsoft.Windows.Client.Vista.Dataset.DiskFailure
Microsoft.Windows.Client.Vista.Dataset.Memory
Microsoft.Windows.Client.Vista.Dataset.ShellPerf
Performance data set
State data set

Alert, Event, Performance, and State are the most common ones we look at.

 

However – in the warehouse – we also keep different aggregations of some of the datasets – where it makes sense.  The most common datasets that we will aggregate are Performance data, State data, and Client Monitoring data (AEM).  The reason we have raw, hourly, and daily aggregations – is to be able to keep data for longer periods of time – but still have very good performance on running reports.

In MOM 2005 – we used to stick ALL the raw performance data into a single table in the Warehouse.  After a year of data was reached – this meant the perf table would grow to a HUGE size – and running multiple queries against this table would be impossible to complete with acceptable performance.  It also meant grooming this table would take forever, and would be prone to timeouts and failures.

In OpsMgr – now we aggregate this data into hourly and daily aggregations.  These aggregations allow us to “summarize” the performance, or state data, into MUCH smaller table sizes.  This means we can keep data for a MUCH longer period of time than ever before.  We also optimized this by splitting these into multiple tables.  When a table reaches a pre-determined size, or number of records – we will start a new table for inserting.  This allows grooming to be incredibly efficient – because now we can simply drop the old tables when all of the data in a table is older than the grooming retention setting.
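To illustrate why this makes grooming cheap, here is a small sketch of the idea: a table is only eligible to be dropped once its NEWEST row is older than the retention setting. This is not the product's actual grooming code; the table names and dates are hypothetical.

```python
from datetime import datetime, timedelta


def tables_to_drop(partitions, retention_days, now):
    """Return the names of tables whose newest row is older than the retention cutoff.

    `partitions` is a list of (table_name, newest_row_timestamp) pairs.
    """
    cutoff = now - timedelta(days=retention_days)
    return [name for name, newest_row in partitions if newest_row < cutoff]


now = datetime(2010, 2, 8)
partitions = [
    ("PerfHourly_example_1", datetime(2009, 9, 30)),   # everything in here is old
    ("PerfHourly_example_2", datetime(2009, 12, 15)),  # still holds data inside retention
    ("PerfHourly_example_3", datetime(2010, 2, 7)),    # currently being inserted into
]

print(tables_to_drop(partitions, retention_days=90, now=now))  # ['PerfHourly_example_1']
```

Dropping a whole table like this avoids the row-by-row deletes that made grooming the single MOM 2005 perf table so slow.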

 

Ok – that’s the background on aggregations.  To see this information – we will need to look at the StandardDatasetAggregation table.

select * from StandardDatasetAggregation

That table contains all the datasets, and their aggregation settings.  To help make more sense of this -  I will join the dataset and the StandardDatasetAggregation tables in a single query – to only show you what you need to look at:

SELECT DataSetDefaultName,
AggregationTypeId,
MaxDataAgeDays
FROM StandardDatasetAggregation sda
INNER JOIN dataset ds on ds.datasetid = sda.datasetid
ORDER BY DataSetDefaultName

This query will give us the common dataset name, the aggregation type, and the current maximum retention setting.

For the AggregationTypeId:

0 = Raw

20 = Hourly

30 = Daily

Here is my output:

DataSetDefaultName                                  AggregationTypeId  MaxDataAgeDays
Alert data set                                      0                  400
Client Monitoring data set                          0                  30
Client Monitoring data set                          30                 400
Event data set                                      0                  100
Microsoft.Windows.Client.Vista.Dataset.ClientPerf   0                  7
Microsoft.Windows.Client.Vista.Dataset.ClientPerf   30                 91
Microsoft.Windows.Client.Vista.Dataset.DiskFailure  0                  7
Microsoft.Windows.Client.Vista.Dataset.DiskFailure  30                 182
Microsoft.Windows.Client.Vista.Dataset.Memory       0                  7
Microsoft.Windows.Client.Vista.Dataset.Memory       30                 91
Microsoft.Windows.Client.Vista.Dataset.ShellPerf    0                  7
Microsoft.Windows.Client.Vista.Dataset.ShellPerf    30                 91
Performance data set                                0                  10
Performance data set                                20                 400
Performance data set                                30                 400
State data set                                      0                  180
State data set                                      20                 400
State data set                                      30                 400

 

You will probably notice – that we only keep 10 days of RAW Performance by default.  Generally – you don't want to mess with this.  This is simply to keep a short amount of raw data – to build our hourly and daily aggregations from.  All built in performance reports in SCOM run from Hourly, or Daily aggregations by default.

 

Now we are cooking!

Fortunately – there is a command line tool published that will help make changes to these retention periods, and provide more information about how much data we have currently.  This tool is called DWDATARP.EXE.  It is available for download HERE.

This gives us a nice way to view the current settings.  Download this to your tools machine, your RMS, or directly on your warehouse machine.  Run it from a command line.

Run just the tool with no parameters to get help:    

C:\>dwdatarp.exe

To get our current settings – run the tool with ONLY the -s (server\instance) and -d (database) parameters.  This will output the current settings.  However – it does not format well to the screen – so output it to a TXT file and open it:

C:\>dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW > c:\dwoutput.txt

Here is my output (I removed some of the vista/client garbage for brevity)

 

Dataset name                Aggregation name     Max Age  Current Size, Kb
Alert data set              Raw data             400      18,560 ( 1%)
Client Monitoring data set  Raw data             30       0 ( 0%)
Client Monitoring data set  Daily aggregations   400      16 ( 0%)
Configuration dataset       Raw data             400      153,016 ( 4%)
Event data set              Raw data             100      1,348,168 ( 37%)
Performance data set        Raw data             10       467,552 ( 13%)
Performance data set        Hourly aggregations  400      1,265,160 ( 35%)
Performance data set        Daily aggregations   400      61,176 ( 2%)
State data set              Raw data             180      13,024 ( 0%)
State data set              Hourly aggregations  400      305,120 ( 8%)
State data set              Daily aggregations   400      20,112 ( 1%)

 

Right off the bat – I can see how little space the daily performance aggregations actually consume.  I can also see how much space only 10 days of RAW perf data consumes.  I also see a surprising amount of event data consuming space in the database.  Typically – you will see that perf hourly will consume the most space in a warehouse.
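If you want to rank the datasets by size rather than eyeball them, a few lines of script will do it. The sketch below uses the sizes from my output above, typed in by hand as tuples; it does not parse the tool's output file, since I am not assuming anything about its exact layout.

```python
# (dataset, aggregation, size in KB), copied from the dwdatarp output above.
rows = [
    ("Alert data set", "Raw data", 18560),
    ("Client Monitoring data set", "Raw data", 0),
    ("Client Monitoring data set", "Daily aggregations", 16),
    ("Configuration dataset", "Raw data", 153016),
    ("Event data set", "Raw data", 1348168),
    ("Performance data set", "Raw data", 467552),
    ("Performance data set", "Hourly aggregations", 1265160),
    ("Performance data set", "Daily aggregations", 61176),
    ("State data set", "Raw data", 13024),
    ("State data set", "Hourly aggregations", 305120),
    ("State data set", "Daily aggregations", 20112),
]

total_kb = sum(size_kb for _, _, size_kb in rows)

# Largest consumers first.
ranked = sorted(rows, key=lambda row: row[2], reverse=True)

for dataset, aggregation, size_kb in ranked:
    share = 100 * size_kb / total_kb
    print(f"{dataset:<28}{aggregation:<21}{size_kb:>10,} KB  {share:5.1f}%")
```

In my lab the top two lines are Event raw data and Performance hourly aggregations, which together hold roughly 70% of the space.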

 

So – with this information in hand – I can do two things….

  • I can know what is using up most of the space in my warehouse.
  • I can know the Dataset name, and Aggregation name… to input to the command line tool to adjust it!

 

Now – on to the retention adjustments.

 

First thing – I will need to gather my Reporting service level agreement from management.  This is my requirement for how long I need to keep data for reports.  I also need to know “what kind” of reports they want to be able to run for this period.

From this discussion with management – we determined:

  • We require detailed performance reports for 90 days (hourly aggregations)
  • We require less detailed performance reports (daily aggregations) for 1 year for trending and capacity planning.
  • We want to keep a record of all ALERTS for 6 months.
  • We don't use any event reports, so we can reduce this retention from 100 days to 30 days.
  • We don't use AEM (Client Monitoring Dataset) so we will leave this unchanged.
  • We don't report on state changes much (if any) so we will set all of these to 90 days.

Now I will use the DWDATARP.EXE tool – to adjust these values based on my company reporting SLA:

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Performance data set" -a "Hourly aggregations" -m 90

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Performance data set" -a "Daily aggregations" -m 365

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Alert data set" -a "Raw data" -m 180

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "Event data set" -a "Raw data" -m 30

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "State data set" -a "Raw data" -m 90

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "State data set" -a "Hourly aggregations" -m 90

dwdatarp.exe -s OMDW\i01 -d OperationsManagerDW -ds "State data set" -a "Daily aggregations" -m 90
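If you keep the reporting SLA in one place, you can generate these command lines instead of typing each one. A minimal sketch, assuming the same server\instance and database names as above and using only the -s, -d, -ds, -a, and -m parameters already shown; it just prints the commands for review and does not run them.

```python
SERVER = r"OMDW\i01"              # server\instance from the examples above
DATABASE = "OperationsManagerDW"

# (dataset name, aggregation name, days to keep), taken from the reporting SLA.
retention_sla = [
    ("Performance data set", "Hourly aggregations", 90),
    ("Performance data set", "Daily aggregations", 365),
    ("Alert data set", "Raw data", 180),
    ("Event data set", "Raw data", 30),
    ("State data set", "Raw data", 90),
    ("State data set", "Hourly aggregations", 90),
    ("State data set", "Daily aggregations", 90),
]


def build_command(dataset, aggregation, max_age_days, server=SERVER, database=DATABASE):
    """Build one dwdatarp.exe command line for a dataset/aggregation pair."""
    return (
        f'dwdatarp.exe -s {server} -d {database} '
        f'-ds "{dataset}" -a "{aggregation}" -m {max_age_days}'
    )


commands = [build_command(*row) for row in retention_sla]
for command in commands:
    print(command)
```

Paste the printed lines into a command prompt on the machine where the tool lives, or save them to a .cmd file to keep with your change record.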

 

Now my table reflects my reporting SLA – and my actual space needed in the warehouse will be much reduced in the long term:

 

Dataset name                Aggregation name     Max Age  Current Size, Kb
Alert data set              Raw data             180      18,560 ( 1%)
Client Monitoring data set  Raw data             30       0 ( 0%)
Client Monitoring data set  Daily aggregations   400      16 ( 0%)
Configuration dataset       Raw data             400      152,944 ( 4%)
Event data set              Raw data             30       1,348,552 ( 37%)
Performance data set        Raw data             10       468,960 ( 13%)
Performance data set        Hourly aggregations  90       1,265,992 ( 35%)
Performance data set        Daily aggregations   365      61,176 ( 2%)
State data set              Raw data             90       13,024 ( 0%)
State data set              Hourly aggregations  90       305,120 ( 8%)
State data set              Daily aggregations   90       20,112 ( 1%)

 

Here are some general rules of thumb (might be different if your environment is unique)

  • Only keep the maximum retention of data in the warehouse per your reporting requirements.
  • Do not modify the performance RAW dataset.
  • Most performance reports are run against Perf Hourly data for detailed performance throughout the day.  For reports that span long periods of time (weeks/months) you should generally use Daily aggregation.
  • Daily aggregations should generally be kept for the same retention as hourly – or longer.
  • Hourly datasets use up much more space than daily aggregations.
  • Most people don't use events in reports – and these can often be groomed much sooner than the default of 100 days.
  • Most people don't do a lot of state reporting beyond 30 days, and these can be groomed much sooner as well if desired.
  • Don't modify a setting if you don't use it.  There is no need.
  • The Configuration dataset generally should not be modified.  This keeps data about objects to report on, in the warehouse.  It should be set to at LEAST the longest of any perf, alert, event, or state datasets that you use for reporting.
Posted by kevinhol | 5 Comments

Tuning tip: Do you have monitors constantly “flip flopping” ?

 

This is something I see in almost all clients when we perform a PFE Health Check.  The customer will have lots of data being inserted into the OpsDB from agents, about monitors that are constantly changing state.  This can have a very negative effect on overall performance of the database – because it can be a lot of data, and the RMS is busy handling the state calculation, and synching this data about the state and any alert changes to the warehouse.

Many times the OpsMgr admin has no idea this is happening, because the alerts appear, and then auto-resolve so fast, you never see them – or don’t see them long enough to detect there is a problem.  I have seen databases where the statechangeevent table was the largest in the database – caused by these issues.

 

Too many state changes are generally caused by one or both of two issues:

1.  Badly written monitors that flip flop constantly.  Normally – this happens when you target a multi-instance perf counter incorrectly.  See my POST on this topic for more information.

2.  HealthService restarts.  See my POST on this topic.

 

How can I detect if this is happening in my environment?

 

That is the right question!  For now – you can run a handful of SQL queries, which will show you the most common state changes going on in your environment.  These are listed on my SQL query blog page in the State section:

 

Noisiest monitors in the database: (Note – these will include old state changes – might not be current)

select distinct top 50 count(sce.StateId) as NumStateChanges, m.MonitorName, mt.typename AS TargetClass
from StateChangeEvent sce with (nolock)
join state s with (nolock) on sce.StateId = s.StateId
join monitor m with (nolock) on s.MonitorId = m.MonitorId
join managedtype mt with (nolock) on m.TargetManagedEntityType = mt.ManagedTypeId
where m.IsUnitMonitor = 1
group by m.MonitorName,mt.typename
order by NumStateChanges desc

 

The above query will show us which monitors are flipping the most in the entire database.  This includes recent, and OLD data.  You have to be careful looking at this output – as you might spend a lot of time focusing on a monitor that had a problem long ago.  You see – we will only groom out old state changes for monitors that are CURRENTLY in a HEALTHY state, AT THE TIME that grooming runs.  We will not groom old state change events if the monitor is Disabled (unmonitored), in Maintenance Mode, Warning State, or Critical State.

What?

This means that if you had a major issue with a monitor in the past, and you solved it by disabling the monitor, we will NEVER, EVER groom that junk out.  This doesn't really pose a problem, it just leaves a little database bloat, and messy statechangeevent views in HealthExplorer.  But the real issue for me is – it makes it a bit tougher to only look at the problem monitors NOW. 

To see if you have really old state change data leftover in your database, you can run the following query:

SELECT DATEDIFF(d, MIN(TimeAdded), GETDATE()) AS [Current] FROM statechangeevent

You might find you have a couple YEARS worth of old state data.

So – I have taken the built in grooming stored procedure, and modified the statement to groom out ALL statechange data, and only keep the number of days you have set in the UI.  (The default setting is 7 days).  I like to run this “cleanup” script from time to time, to clear out the old data, and whenever I am troubleshooting current issues with monitor flip-flop.  Here is the SQL query statement:

 

This script cleans up old StateChangeEvent data that is older than the defined grooming period, including state changes for monitors currently in a disabled, warning, or critical state.  By default we only groom monitor statechangeevents where the monitor is enabled and healthy at the time of grooming.

USE [OperationsManager]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
BEGIN

    SET NOCOUNT ON

    DECLARE @Err int
    DECLARE @Ret int
    DECLARE @DaysToKeep tinyint
    DECLARE @GroomingThresholdLocal datetime
    DECLARE @GroomingThresholdUTC datetime
    DECLARE @TimeGroomingRan datetime
    DECLARE @MaxTimeGroomed datetime
    DECLARE @RowCount int
    SET @TimeGroomingRan = getutcdate()

    SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, getdate())
    FROM dbo.PartitionAndGroomingSettings
    WHERE ObjectName = 'StateChangeEvent'

    EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT
    SET @Err = @@ERROR

    IF (@Err <> 0)
    BEGIN
        GOTO Error_Exit
    END

    SET @RowCount = 1  

    -- This is to update the settings table
    -- with the max groomed data
    SELECT @MaxTimeGroomed = MAX(TimeGenerated)
    FROM dbo.StateChangeEvent
    WHERE TimeGenerated < @GroomingThresholdUTC

    IF @MaxTimeGroomed IS NULL
        GOTO Success_Exit

    -- Instead of the FK DELETE CASCADE handling the deletion of the rows from
    -- the MJS table, do it explicitly. Performance is much better this way.
    DELETE MJS
    FROM dbo.MonitoringJobStatus MJS
    JOIN dbo.StateChangeEvent SCE
        ON SCE.StateChangeEventId = MJS.StateChangeEventId
    JOIN dbo.State S WITH(NOLOCK)
        ON SCE.[StateId] = S.[StateId]
    WHERE SCE.TimeGenerated < @GroomingThresholdUTC
    AND S.[HealthState] in (0,1,2,3)

    SELECT @Err = @@ERROR
    IF (@Err <> 0)
    BEGIN
        GOTO Error_Exit
    END

    WHILE (@RowCount > 0)
    BEGIN
        -- Delete StateChangeEvents that are older than @GroomingThresholdUTC
        -- We are doing this in chunks in separate transactions on
        -- purpose: to avoid the transaction log to grow too large.
        DELETE TOP (10000) SCE
        FROM dbo.StateChangeEvent SCE
        JOIN dbo.State S WITH(NOLOCK)
            ON SCE.[StateId] = S.[StateId]
        WHERE TimeGenerated < @GroomingThresholdUTC
        AND S.[HealthState] in (0,1,2,3)

        SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT

        IF (@Err <> 0)
        BEGIN
            GOTO Error_Exit
        END
    END   

    UPDATE dbo.PartitionAndGroomingSettings
    SET GroomingRunTime = @TimeGroomingRan,
        DataGroomedMaxTime = @MaxTimeGroomed
    WHERE ObjectName = 'StateChangeEvent'

    SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT

    IF (@Err <> 0)
    BEGIN
        GOTO Error_Exit
    END 
Success_Exit:
Error_Exit:   
END

 

Once this is cleaned up – you can re-run the DATEDIFF query – and you should see only the same number of days as set in your UI retention setting for database grooming.

Now – you can run the “Most common state changes” query – and identify which monitors are causing the problem.

 

Look for monitors at the top with MUCH higher numbers than all others.  This will be “monitor flip flop” and you should use Health Explorer to find that monitor on a few instances – and figure out why it is changing state so much in the past few days.  Common conditions for this one are badly written monitors that target a single instance object, but monitor a multi-instance perf counter.  You can read more on that HERE.  Also – just poor overall tuning can cause this – or poorly written custom script based monitors.

If you see a LOT of similar monitors at the top, with very similar state change counts, this is often indicative of HealthService restarts.  The Health service will submit new state change data every time it starts up.  So if the agent is bouncing every 10 minutes, that is a new state change for ALL monitors on that agent, every 10 minutes.  You can read more about this condition at THIS blog post.
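If you export the query results, you can apply both of these rules of thumb with a small script. This is only a rough heuristic of my own, not anything built into OpsMgr: the 10x-median factor, the 10% tolerance, the top-5 window, and the monitor names in the sample data are all assumptions you should tune.

```python
from statistics import median


def flip_flop_suspects(rows, factor=10.0):
    """Monitors whose state-change count is at least `factor` times the median count."""
    if not rows:
        return []
    typical = median(count for count, _ in rows)
    if typical <= 0:
        return []
    return [name for count, name in rows if count >= factor * typical]


def looks_like_restarts(rows, top_n=5, tolerance=0.10):
    """True when the top_n noisiest monitors all sit within `tolerance` of the highest count."""
    top = sorted((count for count, _ in rows), reverse=True)[:top_n]
    if len(top) < top_n or top[0] == 0:
        return False
    return all(count >= top[0] * (1 - tolerance) for count in top)


# Hypothetical result sets: (NumStateChanges, MonitorName).
one_bad_monitor = [
    (52000, "Example.Disk.Monitor"), (310, "Example.A"), (290, "Example.B"),
    (305, "Example.C"), (150, "Example.D"),
]
bouncing_agents = [
    (1440, "Example.A"), (1438, "Example.B"), (1441, "Example.C"),
    (1439, "Example.D"), (1440, "Example.E"), (12, "Example.F"),
]

print(flip_flop_suspects(one_bad_monitor))   # ['Example.Disk.Monitor']
print(looks_like_restarts(one_bad_monitor))  # False
print(looks_like_restarts(bouncing_agents))  # True
```

Treat the output as a pointer for where to open Health Explorer, not as a diagnosis.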

Posted by kevinhol | 1 Comments

The new and improved guide on HealthService Restarts. Aka – agents bouncing their own HealthService

I have written many articles in the past on HealthService restarts.  A HealthService restart is when the agent breaches a pre-set threshold of Memory use, or handle count use, and OpsMgr bounces the agent HealthService to try and correct the condition.

The Past:

Here are a few of the previous articles:

http://blogs.technet.com/kevinholman/archive/2009/03/26/are-your-agents-restarting-every-10-minutes-are-you-sure.aspx

http://blogs.technet.com/kevinholman/archive/2009/06/22/health-service-and-monitoringhost-thresholds-in-r2-how-this-has-changed-and-what-you-should-know.aspx

 

Generally – this is a good thing.  We expect the agent to consume a limited amount of system resources, and if this is ever breached, we assume something is wrong, so we bounce the agent.  The problem is that if an agent NEEDS more resources to do its job – it can get stuck in a bouncing loop every 10-12 minutes, which means there is very little monitoring of that agent going on.  It also can harm the OpsMgr environment, because if this is happening on a large scale, we flood the OpsMgr database with state change events.  You will also see the agent consume a LOT of CPU resources during the startup cycle – because each monitor has to initialize its state at startup, and all discoveries without a specific synch time will run at startup.

 

However, sometimes it is NORMAL for the agent to consume additional resources.  (within reason)

The limits at OpsMgr 2007 RTM were set to 100MB of private bytes, and 2000 handles.  This was enough for the majority of agents out there.  Not all though, especially since the release of Server 2008 OS, and the use of 64bit Operating systems.  Many server roles require some additional memory, because they run very large discovery scripts, or discover a very large instance space.  Like DNS servers, because they discover and monitor so many DNS zones.  DHCP servers, because they discover and monitor so many scopes.  Domain controllers, because they can potentially run a lot of monitoring scripts and discover many AD objects.  SQL servers, because they discover and monitor multiple DB engines, and databases.  Exchange 2007 servers, etc…

 

What’s new:

At the time of this writing, two new management pack updates have been released.  One for SP1, and one for R2.  EVERY customer should be running these MP updates.  I consider them critical to a healthy environment:

R2 MP Update version 6.1.7533.0

SP1 MP Update version 6.0.6709.0

What these MP updates do – is to synchronize both versions of OpsMgr to work exactly the same – and to bump up the resource threshold levels to a more typical amount.  So FIRST – get these imported if you don't have them.  Yes, now.  This alone will solve the majority of HealthService restarts in the wild.  These set the Private Bytes to 300MB (up from 100MB), and the Handle Count to 6000 (up from 2000) for all agents.  This is a MUCH better default setting than we had previously.

 

How can I make it better?

I’m glad you asked!  Well, there are two things you can do, to enhance your monitoring of this very serious condition. 

  1. Add alerting to a HealthService Restart so you can detect this condition when it still exists.
  2. Override these monitors to higher thresholds for specific agents/groups.

Go to the Monitoring pane, Discovered Inventory, and change target type to “Agent”. 

Select any agent present – and open Health Explorer.

Expand Performance > Health Service Performance > Health Service State.

image

 

This is an aggregate rollup monitor.  If you look at the properties of this top level monitor – you will see the recovery script to bounce the HealthService is on THIS monitor…. it will run in response to ANY of the 4 monitors below it which might turn Unhealthy.

 

image

 

So – we DON'T want to set this monitor to also create the alerts.  Because – this monitor can only tell us that “something” was beyond the threshold.  We actually need to set up alerting on EACH of the 4 monitors below it – so we will know if it is a problem with the HealthService or MonitoringHost, and either memory (private bytes) or Handle Count.

First thing – is to inspect the overrides on each monitor, to make sure you haven't already adjusted this in the past.  ANY specific overrides LESS than the new default of 300MB and 6000 handles should be deleted.  (The exchange MP has a sealed override of 5000 handles and this is fine)

What I like to do – is to add an override, “For all objects of Class”.  Enable “Generates Alert”.  I also ensure that the default value for “Auto-Resolve Alert” is set to false.  It is critical that auto-resolve is not set to True for this monitor, because we will just close the alert on every agent restart and the alert will be worthless.  What this will do – is generate an alert and never close it, anytime this monitor is unhealthy.  I need to know this information so I can be aware of very specific agents that might require a higher value:

image

 

Repeat this for all 4 monitors.

 

One thing to keep in mind – if you ever need to adjust this threshold for specific agents that are still restarting – 600MB of private bytes (double the default) is generally a good setting.  It is rare to need more than this – unless you have a very specific MP or application that guides you to set this higher for a specific group of agents.
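If the override expects the threshold in bytes (the IIS guide quoted 200 MB as 209715200, which suggests it does), a tiny helper saves you from arithmetic slips.  This is only a sketch in Python; the function and constant names are my own and are not part of OpsMgr:

```python
def mb_to_bytes(megabytes: int) -> int:
    """Convert megabytes to bytes, using 1 MB = 1024 * 1024 bytes."""
    return megabytes * 1024 * 1024


# Illustrative names only; these are not OpsMgr identifiers.
DEFAULT_PRIVATE_BYTES = mb_to_bytes(300)  # default after the MP update
DOUBLED_PRIVATE_BYTES = mb_to_bytes(600)  # double the default, for busy agents

print(DEFAULT_PRIVATE_BYTES)  # 314572800
print(DOUBLED_PRIVATE_BYTES)  # 629145600
```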

Also – be careful overriding this value across the board… because Management Servers also have a “HealthService” and you could inadvertently set this to be too low for them.  Generally – the default settings are very good now – and you should only be changing this for very specific agents, or a very specific group of agents.

Now – you can use these alerts to find any problem agents out there.  I really strongly recommend setting this up for any management group out there.  You NEED to know when agents are restarting on their own.

Posted by kevinhol | 3 Comments

Tuning tip – turning off some over-collection of events

We often think of tuning OpsMgr by way of tuning “Alert Noise”…. by disabling rules that generate alerts that we don't care about, or modifying thresholds on monitors to make the alert more actionable for our specific environment.

However – one area of OpsMgr that often goes overlooked, is event overcollection.  This has a cost… because these events are collected and create LAN/WAN traffic, agent overhead, OpsDB size bloat, and especially, DataWarehouse size bloat.  I have worked with customers who had a data warehouse that was over one third event data….. and they had ZERO requirement for this nor did they want it.  They were paying for disk storage, and backup expense, plus added time and resources on the framework, all for data they cared nothing about.

MOST of these events, are enabled out of the box, and are default OpsMgr collect rules from the “System Center Core Monitoring” MP.  These events are items like "config requested”, “config delivered”, “new config active”.  They might be interesting, but there is no advanced analysis included to use these to detect a problem.  In small environments, they are not usually a big deal.  But in large agent count environments, these events can account for a LOT of data, and provide little value unless you are doing something advanced in analyzing them.  I have yet to see a customer who did that.

 

At a high level – here is how I like to review these events:

  1. Review the Most Common Events query that your OpsDB has.
  2. Create a “My Workspace” view for each event that has a HIGH event count.
  3. Examine the event details for value to YOU.
  4. View the rule that collected the event.
    1. Does the rule also alert or do anything special, or does it simply collect the event?
    2. Do you think the event is required for any special reporting you do?
  5. Create an Override, in an Override MP for the rule source management pack, to disable the rule.
  6. Continue to the next event in the query output, and evaluate it.

 

So, what I like to do – is to run the “Most Common Events” query against the OpsDB, and examine the top events, and consider disabling these event collection rules:

Most common events by event number and event publishername:

SELECT top 20 Number as EventID, COUNT(*) AS TotalEvents, Publishername as EventSource
FROM EventAllView eav with (nolock)
GROUP BY Number, Publishername
ORDER BY TotalEvents DESC
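
If you want to see what that grouping does without touching a real OpsDB, here is a minimal sketch that runs the same GROUP BY against an in-memory SQLite table from Python.  The table is only a stand-in for EventAllView, the sample rows are made up, and SQLite uses LIMIT where SQL Server uses TOP:

```python
import sqlite3

# Made-up sample rows: (event number, publisher name).
SAMPLE_EVENTS = [
    (21024, "OpsMgr Connector"),
    (21024, "OpsMgr Connector"),
    (21024, "OpsMgr Connector"),
    (21025, "OpsMgr Connector"),
    (21025, "OpsMgr Connector"),
    (1210, "HealthService"),
]


def most_common_events(rows, limit=20):
    """Return (EventID, TotalEvents, EventSource) tuples, busiest first."""
    conn = sqlite3.connect(":memory:")
    # Stand-in table with only the two columns the query needs.
    conn.execute("CREATE TABLE EventAllView (Number INTEGER, PublisherName TEXT)")
    conn.executemany("INSERT INTO EventAllView VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT Number AS EventID, COUNT(*) AS TotalEvents, PublisherName AS EventSource
        FROM EventAllView
        GROUP BY Number, PublisherName
        ORDER BY TotalEvents DESC
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
    conn.close()
    return result


for event_id, total, source in most_common_events(SAMPLE_EVENTS):
    print(event_id, total, source)
```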

The trick is – to run this query periodically – and to examine the most common events for YOUR environment.  The easiest way to view these events – to determine their value – is to create a new Events view in My Workspace, for each event – and then look at the event data, and the rule that collected it:  (I will use a common event 21024 as an example:)

 

image

 

image

 

What we can see – is that this is a very typical event, and there is likely no real value for collecting and storing this event in the OpsDB or Warehouse.

Next – I will examine the rule.  I will look at the Data Source section, and the Response section.  The purpose here is to get a good idea of where this collection rule is looking, what events it is collecting, and if there is also an alert in the response section.  If there is an alert in the response section – I assume this is important, and will generally leave these rules enabled.

If the rule simply collected the event (no alerting), is not used in any reports that I know about (rare condition) and I have determined the event provides little to no value to me, I disable it.  You will find you can disable most of the top consumers in the database.

 

Here is why I consider it totally cool to disable these uninteresting event collection rules:

  • If they are really important – there will be a different alert-generating rule to fire an alert.
  • They fill the databases, agent queues, agent load, and network traffic with unimportant information.
  • While troubleshooting a real issue – we would examine the agent event log – we wouldn’t search through the database for collected events.
  • Reporting on events is really slow – because we cannot aggregate them, so any views or reports don't work well with events.
  • If we find we do need one later – simply remove the override.

 

Here is an example of this one:

image

 

So – I create an override in my “Overrides – System Center Core” MP, and disable this rule “for all objects of class”.

 

Here are some very common event IDs whose corresponding event collection rules I generally end up disabling:

 

1206
1210
1215
1216
10102
10401
10403
10409
10457
10720
11771
21024
21025
21402
21403
21404
21405
29102
29103

 

I don't recommend everyone disable all of these rules… I recommend you periodically view your top 10 or 20 events… and then review them for value.  Just knocking out the top 10 events will often free up 90% of the space they were consuming.
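
To check how much of the total your own top events really account for, you can paste the query output into something like the sketch below.  The counts are invented for illustration and the function name is mine; your percentages will differ:

```python
def top_n_share(event_counts, n):
    """Fraction of all collected events contributed by the n busiest event IDs."""
    total = sum(event_counts.values())
    if total == 0:
        return 0.0
    top = sorted(event_counts.values(), reverse=True)[:n]
    return sum(top) / total


# Invented counts keyed by event ID.
sample_counts = {21024: 500, 21025: 300, 1210: 100, 10102: 60, 29102: 40}
print(f"{top_n_share(sample_counts, 2):.0%}")  # 80%
```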

The above events are the ones I run into in most of my customers… and I generally turn these off, as we get no value from them.  You might find you have some other events as your top consumers.  I recommend you review them in the same manner as above – methodically.  Then revisit this every month or two to see if anything changed.

I’d also love to hear if you have other events that you see as your top consumer that aren't on my list above… SOME events are created from script (conversion MP’s) and unfortunately you cannot do much about those, because you would have to disable the script to fix them.  I’d be happy to give feedback on those, or add any new ones to my list.

Posted by kevinhol | 22 Comments

Writing monitors to target Logical or Physical Disks

This is something a LOT of people make mistakes on – so I wanted to write a post on the correct way to do this properly, using a very common target as an example.

When we write a monitor for something like “Processor\% Processor Time\_Total” and target “Windows Server Operating System”…. everything is very simple.  “Windows Server Operating System” is a single instance target…. meaning there is only ONE “Operating System” instance per agent.  “Processor\% Processor Time\_Total” is also a single instance counter…. using ONLY the “_Total” instance for our measurement.  Therefore – your performance unit monitors for this example work just like you’d think.

However – Logical Disk is very different.  On a given agent – there will often be MULTIPLE instances of “Logical Disk” per agent, such as C:, D:, E:, F:, etc…   We must write our monitors to take this into account. 

For this reason – we cannot monitor a Logical Disk perf counter, and use “Windows Server Operating System” as the target.  The only way this would work, is if we SPECIFICALLY chose the instance in perfmon.  I will explain:

Bad example #1:

I want to monitor for the perf counter Logical Disk\% Free Space\<All Instances> so that I can get an alert when any logical disk is below 50% in free space.

I create a new monitor > unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold. 

image

I target a generic class, such as “Windows Server Operating System”.

I choose the perf counter I want – and select all instances:

image

And save my monitor.

The problem with this workflow – is that we targeted a multi-instance perf counter, at a single instance target.  This workflow will load on all Windows Server Operating Systems, and parse through all discovered instances.  If an agent only has ONE instance of “Logical Disk” (C:) then this monitor will work perfectly…. if the C: drive does not have enough free space – no issues.  HOWEVER… if an agent has MULTIPLE instances of logical disks, C:, D:, E:, AND those disks have different threshold results… the monitor will “flip-flop” as it examines each instance of the counter.  For example, if C: is running out of space, but D: is not… the workflow will examine C:, turn red, generate an alert, then immediately examine D:, and turn back to green, closing the alert. 

This is SERIOUS.  This will FLOOD your environment with statechanges, and alerts, every minute, from EVERY Operating System.
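
To make the flip-flop concrete, here is a rough simulation of one single-threshold monitor being fed every instance of the counter.  It is only a model of the behavior described above, not how the agent is implemented, and the sample values are invented:

```python
def count_state_changes(samples, threshold):
    """Feed perf samples through one single-threshold monitor.

    samples: (instance_name, percent_free) pairs in arrival order.
    Returns the number of healthy/unhealthy transitions.
    """
    state = "healthy"
    changes = 0
    for _instance, percent_free in samples:
        # The monitor ignores which instance the sample came from.
        new_state = "unhealthy" if percent_free < threshold else "healthy"
        if new_state != state:
            changes += 1
            state = new_state
    return changes


# Invented data: three polling cycles, C: is low on space, D: is fine.
samples = [("C:", 10), ("D:", 80)] * 3
print(count_state_changes(samples, threshold=20))  # 6
```

Six state changes from three polling cycles on a single agent, and every one of them is noise.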

A quick review of Health Explorer will show what is happening:

This monitor went “unhealthy” and issued an alert at 10:20:58AM for the C: instance:

image

Then went “healthy” in the same SECOND from the _Total Instance:

image

Then flipped back to unhealthy, at the same time – for the D: instance.

image

 

I think you can see how bad this is.  I find this condition all the time, even in “mature” SCOM implementations… it just happens when someone creates a simple perf threshold monitor but doesn't understand the class model, or multi-instance perf counters.  In an environment with only 500 monitored agents – I can generate over 100,000 state changes – and 50,000 alerts, in an HOUR!!!!

 

Ok – lesson learned – DON'T target a single-instance class, using a multi-instance perf counter.  So – what should I have used?  Well, in this case – I should use something like “Windows 2008 Logical Disk”.  But we can still screw that up!  :-)

Bad example #2:

I want to monitor for the perf counter Logical Disk\% Free Space\<All Instances> so that I can get an alert when any logical disk is below 20% in free space.

I create a new monitor > Unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold.

image

I have learned from my mistake in Bad Example #1, so I target a more specific class, such as “Windows Server 2008 Logical Disk”.

I choose the perf counter I want – and select all instances:

image

And save my monitor.

Ack!  The SAME problem!  Why????

The problem is – now, instead of each Operating System instance loading this monitor, and then parsing and measuring each instance, now EACH INSTANCE of logical disk is doing the SAME THING.  This is actually WORSE than before…. because the number of monitors loaded is MUCH higher, and will flood me with even more state changes and alerts than before.

Now if I look at Health Explorer – I will likely see MULTIPLE disks have gone red, and are “flip-flopping” and throwing alerts like never before.

image

 

When you dig into Health Explorer – you will see – that they are being turned Unhealthy – and it isn't even their drive letter!  I will examine the F: drive monitor:

I can see it was turned unhealthy because of the free space threshold hit on the D: drive!

image

and then flipped back to healthy due to the available space on the C: instance:

image

This is very, very bad.  So – what are we supposed to do???

 

We need to target the specific class (Windows 2008 Logical Disk) AND then use a Wildcard parameter, to match the INSTANCE name of the perf counter to the INSTANCE name of the “Logical Disk” object.  Make sense?  Such as – match up the “C:” perf counter instance – to the “C:” Device ID of the Logical Disk discovered in SCOM.  This is actually easier than it sounds:

 

Good example:

 

I want to monitor for the perf counter Logical Disk\% Free Space\<All Instances> so that I can get an alert when any logical disk is below 20% in free space.

I create a new monitor > Unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold.

image

I have learned from my mistake in Bad Example #1, so I target a more specific class, such as “Windows Server 2008 Logical Disk”.

I choose the perf counter I want – and INSTEAD of select all instances, I learn from my mistake in Bad Example #2.  Instead – this time I will UNCHECK the “All Instances” box, and use the “fly-out” on the right of the “Instance:” box:

image

 

This fly-out will present wildcard options, which are discovered properties of the Windows Server 2008 Logical Disk class.  You can see all of these if you viewed that class in discovered inventory.  What we need to do now – is use discovered inventory to find a property, that matches the perfmon instance name.  In perfmon – we see the instance names are “C:” or “D:”

image

In Discovered Inventory – looking at the Windows Server 2008 Logical Disk, I can see that “Device ID” is probably a good property to match on:

image

 

So – I choose “Device ID” from the fly-out, which inserts this parameter wildcard, so that the monitor on EACH DISK will ONLY examine the perf data from the INSTANCE in perfmon that matches the disk drive letter.

image

 

The wildcard parameter is actually something like this:

$Target/Property[Type="MicrosoftWindowsLibrary6172210!Microsoft.Windows.LogicalDevice"]/DeviceID$

This simply is a reference to the MP that defined the “Device ID” property on the class.
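
The effect of that match can be sketched as one monitor per discovered disk, each reading only the samples whose perf instance name equals its own Device ID.  This is a simplified model with invented numbers, not the agent's actual code:

```python
def evaluate_disks(device_ids, samples, threshold):
    """Return the final health state of one monitor per discovered disk.

    Each monitor only reads samples whose perf instance name equals
    the disk's Device ID, mimicking the wildcard match described above.
    """
    states = {}
    for device_id in device_ids:
        state = "healthy"
        for instance, percent_free in samples:
            if instance != device_id:
                continue  # sample belongs to another instance; ignore it
            state = "unhealthy" if percent_free < threshold else "healthy"
        states[device_id] = state
    return states


# Invented values: only D: is below the 20% threshold.
samples = [("C:", 55), ("D:", 12), ("F:", 70), ("_Total", 45)]
print(evaluate_disks(["C:", "D:", "F:"], samples, threshold=20))
# {'C:': 'healthy', 'D:': 'unhealthy', 'F:': 'healthy'}
```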

 

Now – no more flip-flopping, no more statechangeevent floods, no more alert storms opening and closing several times per second.

 

 

You can use this same process for any multi-instance perf object.  I have a (slightly less verbose) example using SQL server HERE.

 

To determine if you have already messed up…. you can look at “Top 20 Alerts in an Operational Database, by Alert Count” and “Historical list of state changes by Monitor, by Day:” which are available on my SQL Query List.  These should indicate lots of alerts, and monitor flip-flop, and should be investigated.

Posted by kevinhol | 2 Comments

29106 event on RMS – Index was out of range. Wait. What?

Was working with a customer on this one – figured it might help others.

Saw a lot of these VERY SPECIFIC 29106 events on the RMS, specifically with the text: 

System.ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection.

 

Here is the full event:

Event Type:      Warning
Event Source:    OpsMgr Config Service
Event Category:  None
Event ID:        29106
Date:            11/10/2009
Time:            12:43:24 PM
User:            N/A
Computer:        AGENTNAME
Description:
The request to synchronize state for OpsMgr Health Service identified by "3688d65d-a16c-2be6-7e84-5faf8a9cffe0" failed due to the following exception "System.ArgumentOutOfRangeException: Index was out of range. Must be non-negative and less than the size of the collection.
Parameter name: index

What we found was – that we could look up these health service ID’s – by pasting them in the following SQL query:

select * from MTV_HealthService
where BaseManagedEntityId = '3688d65d-a16c-2be6-7e84-5faf8a9cffe0'

This would give us the name of the agent.
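
If you have a lot of these events, copying the ID's out by hand gets old.  Here is a small sketch that pulls the GUIDs out of the event text with a regular expression, so you can paste them into the query.  The helper name is mine, and it assumes the ID always appears in the standard 8-4-4-4-12 format shown in the event above:

```python
import re

GUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def health_service_ids(event_descriptions):
    """Return the unique GUIDs found in the event text, in first-seen order."""
    seen = []
    for text in event_descriptions:
        for guid in GUID_PATTERN.findall(text):
            guid = guid.lower()
            if guid not in seen:
                seen.append(guid)
    return seen


description = (
    'The request to synchronize state for OpsMgr Health Service identified by '
    '"3688d65d-a16c-2be6-7e84-5faf8a9cffe0" failed due to the following exception'
)
print(health_service_ids([description, description]))
# ['3688d65d-a16c-2be6-7e84-5faf8a9cffe0']
```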

In the console, under Agent Managed – we found all of these agents were in “Unmonitored” state – on the agents themselves, they were stuck.  They looked like they got installed, but could not get config.  We deleted them from agent managed, waited a few minutes, and let them show back up in Pending Management.  Approved them – then they were able to come back in and work properly.  These looked for the most part like orphaned machines, and several were computers that were renamed, or old DC’s that were demoted.

Posted by kevinhol | 2 Comments

How to force the Web Console to open a specific view, instead of the default Monitoring Overview

 

When you open the Web Console – the default view will open to a “Monitoring Overview” pane.  This view, in very large environments, can take considerable time to finish loading, before you can select any other views.  Sometimes, loading this view may time out as well.

 

image

 

 

Here is a way to load a specific view by default.  This helps when you have customers that only need to see a very specific pre-configured alert or state view.

The format is: 

 

http://webconsoleserver:51908/default.aspx?ViewID=8DB1F5A7-F3F3-2646-6C6B-E34672F7ED98&ViewType=AlertView

 

Let's break that all down:

 

The first part of the URL is a constant – this should be self explanatory:

http://webconsoleserver:51908/default.aspx?ViewID=

The next part is an ID for the view.  These will be constant for default built in views and views from Microsoft MP’s.  Your custom MP’s will have their own unique view ID’s.  I will talk more about how to find these ID’s below.  My example ID is for the “Active Alerts” view at the top of the console view list.

8DB1F5A7-F3F3-2646-6C6B-E34672F7ED98

The last part is the ViewType.  This tells the console whether we are dealing with an AlertView, StateView, or PerformanceView.

&ViewType=AlertView
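
If you hand out a lot of these links, the three parts can be assembled with a few lines of Python.  This is just a sketch; the function name is my own, and the server name and port are the example values used in this post:

```python
from urllib.parse import urlencode


def web_console_view_url(server, view_id, view_type, port=51908):
    """Build a Web Console URL that opens directly on one view."""
    query = urlencode({"ViewID": view_id, "ViewType": view_type})
    return f"http://{server}:{port}/default.aspx?{query}"


print(web_console_view_url(
    "webconsoleserver",
    "8DB1F5A7-F3F3-2646-6C6B-E34672F7ED98",
    "AlertView",
))
# http://webconsoleserver:51908/default.aspx?ViewID=8DB1F5A7-F3F3-2646-6C6B-E34672F7ED98&ViewType=AlertView
```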

 

Here is a SQL query – to get all the view ID’s from the OperationsManager (OpsDB) for any view:

select vv.id as 'View Id',
vv.displayname as 'View DisplayName',
vv.name as 'View Name',
vtv.DisplayName as 'ViewType',
mpv.FriendlyName as 'MP Name'
from ViewsView vv
inner join managementpackview mpv on mpv.id = vv.managementpackid
inner join viewtypeview vtv on vtv.id = vv.monitoringviewtypeid
--where mpv.FriendlyName like '%default%'
--where vv.displayname like '%Service%'
order by mpv.FriendlyName, vv.displayname

 

Here are some examples of some built in views:

 

“Active Alerts” view at the top:

http://webconsoleserver:51908/default.aspx?ViewID=8DB1F5A7-F3F3-2646-6C6B-E34672F7ED98&ViewType=AlertView

“Windows Computers” state view:

http://webconsoleserver:51908/default.aspx?ViewID=E3D720DE-F6DD-185C-6FDC-0832377D910A&ViewType=StateView

“Operating System Performance” view from the BaseOS (Microsoft Windows Server) MP:

http://webconsoleserver:51908/default.aspx?ViewID=9B216021-6E88-EF6D-2A97-9E3EA1D6AD3B&ViewType=PerformanceView

 

Now – you can give a specific user this URL for their favorites – if they want to open the Web Console on this specific view.

Posted by kevinhol | 4 Comments

OpsMgr 2007 SP1 cumulative rollup hotfix has shipped!

If you cannot or will not upgrade to OpsMgr 2007 R2 anytime soon – then this hotfix is for you!

Available at:   http://support.microsoft.com/kb/971541

This updates OpsMgr 2007 SP1 to 6.0.6278.100.  This is a rollup covering many new issues, plus most of the previously released critical hotfixes for OpsMgr.  I recommend this rollup hotfix for anyone running OpsMgr 2007 SP1 that doesn't have very near term plans in place to upgrade to OpsMgr 2007 R2.

 

Overview

The Update Rollup for Operations Manager 2007 Service Pack 1 (SP1) combines previous hotfix releases for SP1 with additional fixes and support of SP1 roles on Windows 7 and Windows Server 2008 R2. This update also provides database role and SQL Server Reporting Services upgrade support from SQL Server 2005 to SQL Server 2008.
The Update Rollup includes updates for the following Operations Manager Roles:

  • Root Management Server, Management Server, Gateway Server
  • Operations Console
  • Operations Management Web Console Server
  • Agent
  • Audit Collection Server (ACS Server)
  • Reporting Server
The following tools and updates are provided within this update which may be specific to a scenario:
  • Support Tools folder – Contains SRSUpgradeTool.exe and SRSUpgradeHelper.msi (Enables upgrade of a SQL Server 2005 Reporting Server used by Operations Manager Reporting to SQL Server 2008 Reporting Server)
  • Gateway folder – Contains a MSI transform and script to update MOMGateway.MSI for successful installation on Windows Server 2008 R2
  • ManagementPacks folder – Contains an updated Microsoft.SystemCenter.DataWarehouse.mp which requires manual import

Feature Summary:

  • Providing a rollup that supersedes nearly all SP1 binary hotfixes in a single package (~50 fixes) . See KB971541 for exceptions.
  • Support for Windows 7 and Windows Server 2008 R2 - See KB974722 which will be updated to include data around the release of KB971541
  • Operational and DataWarehouse database support for upgrade to SQL 2008.
  • Additional stability hotfixes
  • SCCM Monitoring via our 64-bit agent in the latest SCCM MP. See the latest SCCM MP guide for details.
  • Exchange 2010 MP support
  • Fix for Ops console crashes seen on Vista and Windows 7

The Supported Configurations Guide  and Upgrade Guide have also been updated.

 

My experience upgrading a lab management group to the SP1 rollup:

First – I’d recommend making a plan…. like reading the KB article and known issues, and planning out the order of update operations.  The KB article doesn't state a specific order, so I will probably do something like this:

  • Root Management Server (includes web console)
  • all secondary Management Servers (includes any that are audit collectors)
  • SCOM Reporting Server
  • any stand alone OpsMgr consoles
  • Agents (both from pending and manually installed)

 

Ok – getting started…..

  • COPY the SystemCenterOperationsManager2007-SP1-KB971541-X86-X64-IA64-ENU.MSI file locally to the RMS.
  • Run the update from the MSI – install to default locations
  • If executing on Windows Server 2008 – run the MSI from an elevated command prompt.
  • READ the release notes (you can copy these out to Word to make them more readable)
  • Note the new splash screen for this hotfix:

image

 

  • We will “Run Server Update”
  • The hotfix installs with no further user interaction.
  • The installer will finish – you can click “Finish”.  However – another installer will kick off immediately afterward.  This is by design – documented in the release notes, and is for installing localization updates.  Then click “Finish” on the second update screen.
  • At this point, you can click “Exit” on the Software Update splash screen.
  • Continue applying the updates to the different roles – as documented in the release notes.

Some interesting things you might notice:

We actually clean up all the old hotfixes from the agent files – and move them to the root of \AgentManagement folder:

image

 

This is good – we won't try and re-apply them to subsequent agent installs/updates.  Now – there will be two new hotfix files in the \x86, \AMD64, and \ia64 folders:

 

image

 

These include the agent hotfix update, plus a localization update which will vary based on your localization settings.

 

As for double-checking the update applied successfully – you can add “File Version” column to windows Explorer:

image

 

You will notice several management packs got updated in the console: 

 

image

 

***Note – per the release notes and KB article

Import the following management pack from the ManagementPacks folder:

Microsoft.SystemCenter.DataWarehouse.Reports.MP

The location is a bit confusing – the full path would be:  \Program Files\System Center 2007 Hotfix Utility\KB971541\ManagementPacks\

You can import this at any time… I’d recommend importing this after you are done updating all the server roles, including reporting.

 

In the console – you will also note that any agents that were not manually installed – will require an agent update.  I would hold off updating any agents until your management group server roles are fully updated, and then only update around 200 agents at a time.  This process will cause significant database and management server activity – so I’d advise doing the agent updates during off-peak use hours for large management groups:

 

image
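
If you script your agent updates, splitting the pending list into groups of about 200 is simple.  Here is a minimal sketch; the agent names are placeholders, and how you actually approve each batch is up to your own tooling:

```python
def batches(items, size=200):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder agent names.
agents = [f"agent{n:04d}.example.net" for n in range(450)]
print([len(batch) for batch in batches(agents)])  # [200, 200, 50]
```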

 

  • Next – I update my management servers, including any that run ACS (there is a special ACS update for those too in the hotfix installer splash screen).

 

  • Next up – reporting:

image

 

  • I kick this off on the SRS/SCOM Reporting server.  Not prompted for anything… it completes in seconds.

 

  • Next on the list – update agents.  They all updated just fine via pending actions except for one…. this happens to be the same server that is hosting the OpsDB.  Interesting.  I got the following error:

The MOM Server failed to perform specified operation on computer DB.opsmgr.net.

Operation: Agent Install

Install account: OPSMGR\localadmin

Error Code: 8007064A

Error Description: The configuration data for this product is corrupt. Contact your support personnel.

This turned out to be due to some bad data in my HKEY_CLASSES_ROOT\Installer\Products\C9A0067E2876122489E4BA987C08CDD2\Patches\Patches REG_MULTI_SZ value.  I am not sure how this got messed up – probably due to me testing a bunch of SP1 hotfixes previously and fat fingering a registry edit – so I would not assume this will be a common error.  Once I fixed this registry entry – this agent updated just fine as well.

 

How can I be sure all my agents got updated?

One of the simplest ways – is to look at the “Patchlist” column on a Health Service State view.  Create a new State View in “My Workspace”.  Target “Health Service”:

 

image

 

On the Display Tab – select only Name, and Patch List:

 

image

 

Voila!

 

image

Posted by kevinhol | 20 Comments

Native Exchange 2007 MP 6.0.6741.0 ships for OpsMgr 2007 SP1 users

If for some reason you cannot or will not upgrade to OpsMgr 2007 R2, then this update is for you!

This Native MP replaces the old “converted” Exchange 2007 MP.  If you are running the old conversion MP then I STRONGLY suggest making the effort to transition to this MP.

 

This MP version 6.0.6741.0 is for SP1 users ONLY!  If you are running OpsMgr 2007 R2 – there is an R2-Only version of this MP for you on the catalog.

 

Some notes:

This MP is essentially designed after the R2-only version – with the R2-only enhancement features removed. 

This MP is NOT UPGRADE COMPATIBLE with ANY previously public released Exchange 2007 MP.  This means you need to remove your existing Exchange MP before you import this one…. and start a new overrides MP for this one, for tuning.  Believe me – it will be worth it.  Technically – you can run both MP’s side by side…. here is a blurb from the guide:

 

  • This new Exchange Server 2007 Management Pack for Operations Manager 2007 SP1 does not support an upgrade from the previously released, converted Exchange Server 2007 Management Pack for Operations Manager 2007. We recommend that you do not run both management packs in parallel. However, running both management packs in parallel is a supported scenario. If you decide to run both management packs in parallel, before importing this management pack, disable any synthetic transaction in the converted management pack. For more information about migrating custom settings from the previously released, converted management pack to the new, native management pack, see Appendix: Migrating from the Previously-Released Converted Exchange Server 2007 Management Pack.

 

Here is what's new from the guide:

 

· Reports. This management pack provides a set of reports specific to Exchange 2007. For the list of reports and for more information about the reports, see Appendix: Reports.

· Improved disk monitoring. The management pack improves Exchange disk monitoring by providing support for mount points and by discovering three types of disks. This improvement enables you to establish a disk monitoring standard across all Exchange 2007 servers using fewer overrides. The types of disks discovered are Database (on the Mailbox server role), Log (on the Mailbox server role), and Queue (on the Hub and Edge server roles). For information about how to configure Exchange 2007 Disk Monitoring, see Configure Disk Monitoring. For information about classes in this management pack, see Classes.

· A significant number of rules and monitors that are not actionable or may be noisy are disabled. Note that many of these rules are still in the management pack so that you can enable them if necessary. For a list of rules and monitors that are disabled by default in this management pack, see Appendix: Monitors and Rules Disabled by Default.

· A number of performance collection rules are disabled. For a list of disabled performance collection rules, see Performance Collection Rules Disabled by Default.

· Support for monitoring any number of Exchange organizations using a single Operations Manager 2007 management group. There are no specific requirements imposed by this management pack for monitoring multiple Exchange organizations.

· Full support for clustered configurations using Windows Clustering technology. For more details, see Monitoring Clustered Mailbox Servers and Log Shipping.

· An extensive class model, showing the relevant Exchange 2007 server roles and components, as well as service-centric components, allowing you to measure availability or performance at a granular level. The class model fully supports the Distributed Application Designer, allowing you to create custom distributed applications using the appropriate Exchange 2007 components. The appropriate classes and monitors are declared public, enabling you to extend the monitoring if needed. For hierarchical diagrams of the Exchange 2007 management pack classes, see Classes.

· Improved low-privilege support. This management pack supports installing the agent with the minimum rights required by the Operations Manager agent.

· Improved tasks. The management pack includes a number of tasks that simplify troubleshooting and reduce the amount of time to resolve the alert.

· Improved topology discovery. This management pack introduces a number of improvements to topology discovery.

    · There is no central discovery script. Each agent is responsible for discovering its piece of the topology. This ensures that potential permissions and trust issues are minimized. There are no requirements on trusts in order to generate the topology.

    · The topology will show only agent-managed Exchange 2007 servers, ensuring that the topology consists only of servers that the Exchange administrator has chosen to monitor.

· Improved synthetic transactions. This management pack improves the support for the Exchange 2007 synthetic transactions in several ways:

    · The management pack supports local mail flow transactions. For more information, see the Configuring Mail Flow Synthetic Transactions section.

    · The management pack supports running the following Exchange 2007 synthetic transactions:

        · Test-Mailflow (http://go.microsoft.com/fwlink/?LinkID=137740) (local mail flows only)

        · Test-OwaConnectivity (http://go.microsoft.com/fwlink/?LinkID=137732) (internal only)

        · Test-MapiConnectivity (http://go.microsoft.com/fwlink/?LinkID=137739)

        · Test-ReplicationHealth (http://go.microsoft.com/fwlink/?LinkID=148797)

· The discovery of Exchange 2007 server roles is disabled by default and minimal Exchange 2007 monitoring is applied. This allows you to discover and monitor your servers gradually, as well as tune the management pack as you bring more agent-managed Exchange 2007 servers into the Operations Manager environment. For more information, see Enable Exchange 2007 Server Role Discovery.

Posted by kevinhol | 0 Comments

Making groups of logical disks – an example from simple to advanced

I have been seeing this question come up a lot lately, as customers try to create groups of their disks in order to create overrides for “certain” disks.  So – I am creating this post to give some real world examples.

 

Well – I will start this simply.  Say we want to create a group of all logical disks, with the drive letters of C: and D:?

I would start with creating a new group – and adding the “Windows Server 2003 Logical Disk” class.  Now – I could just use the parent class of “Logical Disk” instead of the OS specific class if I wanted.  The only issue with that is that most monitors targeting a disk – are OS specific – and duplicated three times.  So it is best to create specific groups for these – but totally not required.

Ok – so in the Dynamic Members query builder – I click add, and pick a property.  Since I know “Device Name” contains the drive letter – this will do nicely.  I select device name “Equals” “C:”. 

image

 

Now – I want to also include D:.  There are many ways to do this, and I will go through them.  First – I could simply insert a new line for Windows Server 2003 Logical Disk, and replicate the line I have, adding one for D:

 

image

 

Only one problem – this is an “AND” grouping, and no disk can be both C: and D: at the same time.  I really need this to be an “OR” grouping to include both C: and D: drives.  You can switch this grouping in the UI: just right click the word “AND” and change it to an OR grouping:

 

image

 

Voila!

 

image

 

This formula now looks like: 

( Object is Windows Server 2003 Logical Disk AND ( Device Name Equals C: ) OR ( Device Name Equals D: ) )

Save your group – then right click it – and choose “View Group Members”.  This will ensure we are cooking with gas.  It should contain all your Windows 2003 based C: and D: volumes.

image
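To see why the AND grouping comes back empty while the OR grouping works, here is a minimal Python sketch. This is not anything OpsMgr runs; the list of device names is made up purely for illustration.

```python
# Hypothetical sample of discovered logical disks (device names only).
disks = ["C:", "D:", "E:", "C:", "F:"]

# AND grouping: a single disk would have to be C: and D: at the same time.
and_group = [d for d in disks if d == "C:" and d == "D:"]

# OR grouping: a disk qualifies if it is either C: or D:.
or_group = [d for d in disks if d == "C:" or d == "D:"]

print(and_group)  # []
print(or_group)   # ['C:', 'D:', 'C:']
```

The same device name shows up more than once in the OR result because each server contributes its own C: disk object.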

 

 

So far – so good.

Now – what if I ONLY want C: and D: disks that are HOSTED by specific Windows Computers?  I can do that too!  Let's say I want a group of all the C: and D: logical disks, on servers that begin with the name “SR______”

If you look at the bottom of the list of properties for Logical Disks – you will see (Host=Windows Computer).  From here – we can pick any attribute of the Windows Computer class as well to add to our expression – to limit our logical disks in our group to very specific Computers. 

 

image

 

 

Go back to the properties of your group, edit the Dynamic Members, and you can construct something like this:

 

image

 

Which translates to the following formula:

( Object is Windows Server 2003 Logical Disk AND ( Windows Computer.NetBIOS Computer Name Matches wildcard sr* ) AND ( ( Device Name Equals C: ) OR ( Device Name Equals D: ) ) )
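The nesting in that formula is the part that is easy to get wrong, so here is a rough Python sketch of the same logic. The (computer, device) pairs are invented sample data, and `fnmatchcase` on a lowercased name is only my stand-in for the “Matches wildcard” operator; treating the wildcard as case insensitive is an assumption, not something this sketch proves about OpsMgr.

```python
from fnmatch import fnmatchcase

# Hypothetical (computer name, device name) pairs standing in for discovered disks.
disks = [
    ("SRV01", "C:"),
    ("SRV01", "E:"),
    ("srv02", "D:"),
    ("DC01", "C:"),
]

def in_group(computer, device):
    # Host condition AND (C: OR D:). The inner parentheses keep the OR together.
    # Assumption: the wildcard comparison ignores case, so we lowercase first.
    host_ok = fnmatchcase(computer.lower(), "sr*")
    return host_ok and (device == "C:" or device == "D:")

members = [d for d in disks if in_group(*d)]
print(members)  # [('SRV01', 'C:'), ('srv02', 'D:')]
```

Drop the inner parentheses and every D: disk on any computer would sneak into the group, which is exactly the kind of mistake the UI makes easy.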

Now – I will be honest – getting all the “ands” and “ors” in the right place using the UI is a big pain.  It is very easy to screw it up.  I like to simplify this to the fewest lines possible – using Regex.

Using Regular Expressions – we can use modifiers to create very advanced expressions.  My favorites are ^, which anchors the match to the beginning of the string, and |, the “pipe” symbol, which means “or”.

 

So a simple way to accomplish the same example above – without all the complexity – is this:

 

image

 

WAY simpler!

However – you might notice this doesn't work right.  This is because Regex is case sensitive.  If the server NetBIOS name is discovered in all CAPS, a lowercase expression won't match it.  I talk a little about this issue in this post:  http://blogs.technet.com/kevinholman/archive/2009/04/21/quick-tip-using-regular-expressions-in-a-dynamic-group.aspx

So – based on that post's example – there is a simple way to make a RegEx case insensitive:  (?i:blah)
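Here is a quick way to convince yourself of the difference. This sketch uses Python's `re` module as a stand-in for the regex engine OpsMgr uses (an assumption on my part; the `(?i:...)` group syntax happens to work in both), and the server name is hypothetical.

```python
import re

name = "SRV01"  # hypothetical NetBIOS name, discovered in upper case

plain = re.search(r"^sr", name)            # case sensitive: no match
insensitive = re.search(r"(?i:^sr)", name)  # case-insensitive group: match

print(plain is not None)        # False
print(insensitive is not None)  # True
```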

Using that as an example – we can now make very advanced groupings, quite easily:

 

image

 

(?i: to make it case insensitive.  ^ to anchor the match at the beginning of the string.   Here is the formula now:

 

( Object is Windows Server 2003 Logical Disk AND ( Device Name Matches regular expression (?i:^C|^D) ) AND ( Windows Computer.NetBIOS Computer Name Matches regular expression (?i:^sr) ) )
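If you want to sanity check expressions like these before putting them in a group, a small Python sketch does the job. Again, Python's `re` is only standing in for the engine OpsMgr uses, and the disk list below is invented sample data.

```python
import re

# The two expressions from the group formula.
DEVICE_RE = re.compile(r"(?i:^C|^D)")
HOST_RE = re.compile(r"(?i:^sr)")

# Hypothetical (computer name, device name) pairs.
disks = [
    ("SRV01", "C:"),
    ("srv02", "d:"),
    ("SRV01", "E:"),
    ("DC01", "C:"),
    ("MSSR01", "D:"),
]

# A disk is a member only if BOTH expressions match.
members = [(host, dev) for host, dev in disks
           if DEVICE_RE.search(dev) and HOST_RE.search(host)]

print(members)  # [('SRV01', 'C:'), ('srv02', 'd:')]
```

Note that MSSR01 is left out: the ^ anchor means “sr” has to be at the very start of the name, not just somewhere in it.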

 

Check it out:

 

image

 

Victory!

 

What if I wanted all logical disks that are NOT hosted by a Virtual Machine?  Easy!

 

image

 

( Object is Logical Disk AND ( Windows Computer.Virtual Machine Equals False ) AND True )

This reveals a group of ALL logical disks hosted by a Windows Computer with the attribute of Virtual Machine = False:

 

image

 

As you can see – using the Hosting relationship of the disk – to the Windows Computer object, there is much more you can do with groups.

Updated Active Directory ADMP Management Pack released – Version 6.0.7065.0

This is now available on the catalog.

http://technet.microsoft.com/en-us/opsmgr/cc539535.aspx

 

Changes in this update:

  • Support for monitoring Windows Server® 2008 R2 server operating systems as well as Windows® 7 client operating systems
  • Support for monitoring the Active Directory Web Service (ADWS) in Windows Server 2008 R2 as well as the Active Directory Management Gateway Service in Windows Server 2008 and Windows Server 2003

There were a few other minor fixes as well, such as the AD LSASS CPU overload monitor.  This release also replaces a recently released ADMP version 6.0.7050.0 that was pulled to fix the aforementioned monitor.

Posted by kevinhol | 5 Comments

Why do my group memberships for Windows Computers have machines that don't belong there?

Here is a little tip if you find that your Windows Computer Groups (and state views scoped by groups) contain computers that should not be there.

 

Have you noticed that you have state views or Windows Computer Groups that contain servers that you don't expect?  Like Exchange Servers in your SQL Computers Group?  Or SQL Servers in your Exchange 2007 computer group?  Or maybe Hyper-V host servers in a LOT of your groups?  If so – you are probably running Hyper-V, and using the Hyper-V MP version 6.0.6633.0.

Have a look at the example below:  My SQL Management Pack “Computers” view contains domain controllers, Exchange servers… even XP clients… plus it also includes the Hyper-V host (VS3).

image

 

 

This can wreak havoc on your management group…. because we use groups of Windows Computers for Overrides, and for scoping console views. 

 

The good news is – there is a very simple workaround:  There is a relationship in this MP that we need to disable.  This relationship attempts to associate the Windows Computer objects of a guest to its host – however it doesn't work properly, and isn't necessary.

 

Open Authoring in the console.  Select “Object Discoveries”.  Scope to “Hyper-V Virtual Machine”.  Find the discovery named: “Hyper-V 2008 Guest Computer Relationship Discovery”

 

image

 

Create an override for this – “for all objects of class:  Root Management Server”.  Set this discovery to disabled.

 

Now that that is disabled – we need to run a little cleanup on aisle 7.  We have a nifty little cmdlet in the OpsMgr command shell, named Remove-DisabledMonitoringObject.  This cmdlet removes discovered objects whose discovery has been explicitly disabled with an override.  Since that is what we just did (disabled a discovery), this will quickly delete any discovered relationships which previously existed.

 

image

 

Now – when I look at my state views scoped to SQL Computers group – I only see SQL servers, AND – the Hyper-V host.  We don't want the Hyper-V host tho…..  apparently the cmdlet cleanup doesn't take care of those.  To resolve that membership – I generally bounce the HealthService on the Hyper-V hosts, and then the HealthService on the RMS, and in a few minutes they will be gone.

 

image

 

Voila!

Posted by kevinhol | 12 Comments