Kevin Holman's System Center Blog

Posts in this blog are provided "AS IS" with no warranties, and confer no rights. Use of included script samples is subject to the terms specified in the Terms of Use.

Which servers are DOWN in my company, and which just have a heartbeat failure, RIGHT NOW?

In OpsMgr 2007, when an agent experiences a heartbeat failure, several things happen.  Diagnostics, and possibly recoveries, are run.  Alerts, and possibly notifications, go out.

But what happens if my Operations team misses one of these alerts?  What can I do to "spot check" agents with issues?

Well, any time an agent has a heartbeat failure, we gray out the state icon showing the agent's last known state in each state view.

However - you CAN create a State view that will turn Red or Yellow just like any other state view.  Simply create a new State View, and scope the class to Health Service Watcher (Agent).

I called mine Heartbeat State View:

image

This view will show us when any of the agent health service watcher monitors are unhealthy:  In my case - OWA and EXCH1 have issues.  OWA is DOWN, while EXCH1 agent healthservice is stopped.

image

However - here is the issue.  This view shows us when ANY monitor rolls up an unhealthy state.... this includes heartbeat failures AND computer unreachable (server IP stack is down):

image

What if I want a State View - to ONLY show me computers that are DOWN.... as in... not heartbeating AND not responding to any PING?  Most customers consider this their "most critical situation".  Well, I haven't found an easy way to do that.... so I wrote a report which handles it.  This report will query the OpsDB for the state of the "Computer Not Reachable" monitor, and only display those servers.  It is based on the following query:

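-- List every agent whose "Computer Not Reachable" monitor is critical (HealthState 3).
-- LastModified is stored in UTC; -5 hours matches US Central daylight time, so change the offset for your own time zone.
SELECT bme.DisplayName, s.LastModified AS LastModifiedUTC, DATEADD(hh,-5,s.LastModified) AS 'LastModifiedCST (GMT-5)'
FROM State AS s
JOIN BaseManagedEntity AS bme ON s.BaseManagedEntityId = bme.BaseManagedEntityId
WHERE s.MonitorId IN (SELECT MonitorId FROM Monitor WHERE MonitorName = 'Microsoft.SystemCenter.HealthService.ComputerDown')
AND s.HealthState = '3' AND bme.IsDeleted = '0'
ORDER BY s.LastModified DESC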
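
To see what the query's filters do without touching a live OpsDB, here is a minimal Python sketch that runs a query of the same shape against an in-memory SQLite database. The three tables are toy stand-ins that contain only the columns the query uses, the sample rows and the monitor name Some.Other.Monitor are made up, and SQLite's datetime() stands in for T-SQL's DATEADD. Treat it as an illustration of the logic, not as something to point at OpsMgr.

```python
import sqlite3

# Toy stand-ins for the OpsDB tables; only the columns the query touches.
SCHEMA = """
CREATE TABLE BaseManagedEntity (BaseManagedEntityId INTEGER, DisplayName TEXT, IsDeleted INTEGER);
CREATE TABLE Monitor (MonitorId INTEGER, MonitorName TEXT);
CREATE TABLE State (BaseManagedEntityId INTEGER, MonitorId INTEGER, HealthState INTEGER, LastModified TEXT);
"""

# Same shape as the report query; datetime() replaces DATEADD for SQLite.
QUERY = """
SELECT bme.DisplayName,
       s.LastModified AS LastModifiedUTC,
       datetime(s.LastModified, '-5 hours') AS LastModifiedLocal
FROM State AS s
JOIN BaseManagedEntity AS bme ON s.BaseManagedEntityId = bme.BaseManagedEntityId
WHERE s.MonitorId IN (SELECT MonitorId FROM Monitor
                      WHERE MonitorName = 'Microsoft.SystemCenter.HealthService.ComputerDown')
  AND s.HealthState = 3 AND bme.IsDeleted = 0
ORDER BY s.LastModified DESC
"""

def servers_down():
    """Load made-up sample rows and return what the query selects."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO Monitor VALUES (?, ?)", [
        (1, "Microsoft.SystemCenter.HealthService.ComputerDown"),
        (2, "Some.Other.Monitor"),  # hypothetical second monitor
    ])
    conn.executemany("INSERT INTO BaseManagedEntity VALUES (?, ?, ?)", [
        (10, "OWA", 0), (11, "EXCH1", 0), (12, "OLDBOX", 1),  # OLDBOX is a made-up deleted entity
    ])
    conn.executemany("INSERT INTO State VALUES (?, ?, ?, ?)", [
        (10, 1, 3, "2008-06-27 15:00:00"),  # ComputerDown is critical: listed
        (11, 1, 1, "2008-06-27 14:00:00"),  # ComputerDown is healthy: not listed
        (11, 2, 3, "2008-06-27 14:05:00"),  # a different monitor is critical: not listed
        (12, 1, 3, "2008-06-27 13:00:00"),  # entity is deleted: not listed
    ])
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows

print(servers_down())
```

Only OWA comes back: EXCH1 is filtered out because its ComputerDown monitor is not critical, and the deleted entity is dropped by the IsDeleted check.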

You can import this report if you have created a data source as shown in my previous post: 

http://blogs.technet.com/kevinholman/archive/2008/06/27/creating-a-new-data-source-for-reporting-against-the-operational-database.aspx

Import this report into your custom folder... and run it.  You can schedule it to receive it first thing every day... if you like the output:

image

*****  Update 6-30-08:  I removed a section of the original query relating to maintenance mode.  We found that if a down server had never been in maintenance mode, it would not show up in the report.  The query and report download have been updated to address this.
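
The removed maintenance-mode section is not shown here, so the following is only a guess at the mechanism: rows disappearing for servers that were never in maintenance mode is the classic symptom of an inner join against a table that has no matching row. This Python sketch reproduces that effect with toy SQLite tables. The DownServer and MaintenanceMode layouts, the IsInMode column, and the server SQL1 are assumptions for illustration, not the original query.

```python
import sqlite3

def down_servers(join_kind):
    """Return down servers that are not in maintenance mode, using the given join type."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE DownServer (ServerId INTEGER, Name TEXT);
        CREATE TABLE MaintenanceMode (ServerId INTEGER, IsInMode INTEGER);
        INSERT INTO DownServer VALUES (1, 'OWA'), (2, 'EXCH1'), (3, 'SQL1');
        -- OWA has never been in maintenance mode, so it has no row here.
        INSERT INTO MaintenanceMode VALUES (2, 0), (3, 1);
    """)
    rows = conn.execute(f"""
        SELECT d.Name
        FROM DownServer AS d
        {join_kind} JOIN MaintenanceMode AS mm ON mm.ServerId = d.ServerId
        WHERE COALESCE(mm.IsInMode, 0) = 0
        ORDER BY d.Name
    """).fetchall()
    conn.close()
    return [name for (name,) in rows]

print(down_servers("INNER"))  # OWA is silently dropped
print(down_servers("LEFT"))   # OWA is kept
```

With the inner join, a server with no maintenance-mode history never makes it into the result set. The left join keeps it, and COALESCE treats the missing row as "not in maintenance mode".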

Report is attached below:

Attachment: Servers_Down_Report.rdl
Comments
  • With MOM 2005, we can accomplish this quite easily using the following approach:

    A SERVER DOWN alert can be generated in response to an internally-generated ping failure event created by a MOM Agent ping script which is part of a MOM Agent connectivity rule.

    This rule monitors for the internal failure event and will generate a "Service Unavailable" alert indicating that the Agent Computer is most likely down (or has lost network connectivity).

    Note that 12 ping attempts over a 90 second period along with an additional ping after 15 minutes must all have failed before this alert is generated.

  • One note to add - in OpsMgr you will get a distinct alert whenever an agent does not respond to ping, in addition to the heartbeat failure alert.  What we don't have - is a state view JUST for computers that are down...

    You could easily write a custom monitor that runs a ping script - and build your own state view for this in the console... and not need this report.  The benefit of the report is being able to schedule it and deliver via email or sharepoint.

  • Hi Kevin

    Trouble is by creating the monitor you mention you are actually duplicating work that OpsMgr is doing. It sort of highlights the lack of logic in some functionality.

    To me, it makes no sense that I have to do a ping script as a monitor when OpsMgr has a much more powerful solution - agent heartbeat with an associated ping of servers on which the agent heartbeat has been missed. I just need to get that information into the console .... and the fact that OpsMgr can't is something of a design flaw.

    As I mentioned on the newsgroups, I don't think the report is feasible for near real time info in a large environment.

    Cheers

    Graham

  • Ahhh .. didn't read that properly before I posted!! Meant the fact that agent health state couldn't be incorporated into the computer state view is something of a flaw ... realise there are the agent health state views as per my posting in the newsgroup ;-)

  • Here is a unique way to use web page views in the OpsMgr console. You can create a web page view in the

  • Hi,

    I do not understand how to IMPORT the report in to the new Custom report folder.

    I notice that the reports on my reporting server are *.rpdl but this attachment is *.rdl.

    How do I get this report into the new folder?

    Thx,

    John Bradshaw

  • Hi,

    I have a different problem to the same topic. If a server goes down I do not receive any alerts. When I open Health Explorer with the above settings, I see only white bullets under Availability except Local Health Service Availability. Computer not Reachable, ... are disabled in their sealed MP. What is wrong in our configuration and what do I have to change to get an alert when a server goes down?

    Thanks

    Hendrik

  • I tried to use this UDL file by following the steps mentioned on this site. When I run the report I get this error: "An error has occurred during report processing.

    Cannot create a connection to data source 'ops'.

    For more information about this error navigate to the report server on the local server machine, or enable remote errors ".

    Please advise.

    Ren

  • You are correct - It looks like in this RDL file I named my data source "Ops" instead of "OpsDB".

    Simply open the RDL file - edit that, and import..... or simply go to your imported report - edit it - change the data source to your live data source that points to the opsDB.

  • Do you know of any way to setup subscriptions for only "ping failed" notifications?  Right now every time a server fails a heart beat and cannot be pinged we receive two text messages.  One for the heart beat failure and one for the Ping failure.

  • YES!  I do.  :-)

    In R2 - this is super easy - because we can subscribe to alerts rule by rule - monitor by monitor.

    In SP1 - it is doable - just a bit more difficult.  Please see my how to post at:

    http://blogs.technet.com/kevinholman/archive/2008/10/12/creating-granular-alert-notifications-rule-by-rule-monitor-by-monitor.aspx

  • Nice report. Works well, and I learnt a thing or two about SRS along the way. I did have to refer to Marnix's blog about importing rdl files, but then it all came together.

    Thx Kevin

    John Bradshaw

    http://thoughtsonopsmgr.blogspot.com/2009/12/how-to-upload-rdl-file-for-sql-server.html

  • Nice Blog, thank you Kevin

    I wanted to import your report (Servers_Down_Report.rdl) but that was written with Report Server 3.

    Now I'm updating SQL 2008 SP1 to R2 but the setup failed with two errors:

    - RS_NoCustomAuthExtensions (The Report Server has some custom authentication extensions configured)

    - RS_NoCustomSecurityExtensions (The Report Server has some custom security extensions configured)

    can you please help?

    Thanks,

    Reto

  • SQL 2008 R2 is not supported at this time for upgrading existing installations.

  • Crud, is this still the case?  That you can't upgrade SQL 2008 to SQL 2008 R2 on a SCOM box?

    I'm getting the same error, failing on "No Custom Security Extensions" and "No Custom Authentication Extensions" during the "Upgrade Rules"  part of the upgrade.

    So what's the solution, to uninstall everything, reinstall SQL R2 & Reinstall/Reconfigure SCOM?  Or is an upgrade path on the horizon?
