by admin on May 06, 2006 11:36am

Monitoring Thresholds – Protectors of the Information Overload

British author and inventor Arthur C. Clarke* once said, “Any sufficiently advanced technology is indistinguishable from magic.” Case in point: the extent to which we can monitor, maintain and manage our environments today compared to the processes and methods of 30 years ago. Today we have more scripts, more monitoring tools and more ways to manage technology than ever before. Implementing key technologies on the right scale, with the right amount of training behind them for the folks in the trenches, could mean the difference between a disastrous outage averted and several hours of downtime. Falling back to my days in operations, when someone would ask me what the top three challenges of running a Production environment spread across six countries and eight datacenters were, I would say “Monitoring, Managing and Maintenance”, the three M’s of Technology.

You see, today we have more of everything – more applications, more platforms, more development, and all of it at a faster pace than ever before. Whether you’re a 10,000+ user environment or a startup site, “Uptime” is and will always be a key factor in determining the success of what you’re offering. Your data, service or whatever you’re hosting or offering to the customer is more than likely bound to an SLA (Service Level Agreement) of some type. SLAs are put in place to ensure that outages, issues or escalations of any kind are attended to within a time-bound framework. That puts an enormous amount of pressure on operations to ensure that any issue emerging from within the environment is escalated immediately. This means that managing the monitoring, alerting and maintenance of our environments is key, especially from a backend services standpoint.

So what’s the best approach – more monitoring tools and more alerts? No, exactly the opposite. The phrase “do more with less” has never been more relevant. The catch in implementing monitoring and management solutions is in the thresholds. This holds especially true for the hundreds of folks out there who work as Operational Analysts in Tier-1 support. These folks are the first ones to witness and respond to any alert that’s triggered by the monitoring tools within the environment. The larger the environment, the more carefully the triggers and thresholds have to be examined, customized, set and constantly managed to avoid desensitization.

What’s the best approach to take in such a scenario? Well, the one thing we always want to avoid in any Operations Control Center is “desensitization” to alarms or alerts of any kind. When a specific alarm or alert goes off too many times and you’re asked to ignore it, desensitization sets in, which means that further alarms and alerts will not get the attention or scrutiny they need. Thus the title of today’s blog – Monitoring Thresholds – Protectors of the Information Overload. This may not be a good example but it’s the best one I can come up with after a long week, so here goes: In the movies, you might remember one of those oft-played scenes where someone is trying to cut the wires to an explosive device, and cutting the right wire means life while the wrong one means death. That’s the best way I can describe the sensitivity behind setting thresholds. Why? Because the larger the environment, the more complex and difficult it is to retain sensitivity towards the environment.

Takeaway: Equilibrium in an ideal Ops world is when you have suppressed the “noise” without suppressing a genuine incident.
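
To make that equilibrium a little more concrete, here is a rough sketch in Python of one possible way to tune a threshold so that short blips stay quiet while real trouble still gets through. It is only an illustration under my own assumptions, not how any particular monitoring tool works: the names (ThresholdMonitor, warn_level, critical_level, sustain, alerts_for) and the numbers are all hypothetical. The idea is that a warning-level reading has to persist for several samples in a row before anyone is paged, while a critical-level reading always alerts right away.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThresholdMonitor:
    """Hypothetical threshold check that filters out short-lived blips."""
    warn_level: float        # assumed level that is only a problem if sustained
    critical_level: float    # assumed level that always deserves an alert
    sustain: int = 3         # consecutive breaching samples needed to alert
    _streak: int = field(default=0, init=False, repr=False)

    def observe(self, value: float) -> bool:
        """Return True if this sample should raise an alert."""
        if value >= self.critical_level:
            # A genuine incident: never suppress it.
            self._streak += 1
            return True
        if value >= self.warn_level:
            # Alert once, at the moment the breach has lasted long enough.
            self._streak += 1
            return self._streak == self.sustain
        # Back under the threshold: reset the streak.
        self._streak = 0
        return False


def alerts_for(samples: List[float], monitor: ThresholdMonitor) -> List[int]:
    """Return the positions of the samples that raised an alert."""
    return [i for i, value in enumerate(samples) if monitor.observe(value)]
```

With warn_level=90, critical_level=98 and sustain=3, the samples [70, 92, 75, 91, 93, 94, 95, 99, 60] raise alerts only at positions 5 and 7: the single spike to 92 is treated as noise, the sustained climb is flagged once, and the jump to 99 is flagged immediately. The right values for any real environment would have to come from watching that environment, which is exactly the ongoing tuning work described above.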

* My mistake on the quote, it was not Isaac Asimov.