Some have asked what good approaches exist for examining operational
complexity (or, put another way, how to associate operational staffing numbers
with a specific datacenter). Tough question.
Here is what I have used in the past to decompose this question (some stupid
approaches and some that worked better…).
Pure work-related activity:
1. Budget for a higher amount of operational staff, then train, automate and
optimize down to a smaller amount. (It’s never that easy, and it’s a cop-out
approach.) Dumb and not recommended.
2. Some take CMMI automation, process and structure audits from past 3rd party
consultants to justify head count. The common approach: the lower the CMMI
score, the higher the FTE count. (Again, never that easy, and often a newbie
approach that gets you into trouble later.) Not recommended.
3. The stuff/technology per person measurement. Example: staffing for
patching/monitoring by physical processor or box count is ridiculous with
today’s complexities. Why? Because the modern datacenter is more complicated.
Some examples of those complexities:
Virtual servers managed, virtual SANs managed, virtual networks managed, 3rd
party management systems, SLAs and OLAs (Operational Level Agreements), silos
managed, business applications managed, velocity of new apps introduced, old
apps retired, complexity of current regulations/laws, glue complexity between
data center services, security systems for authentication, authorization,
confidentiality, privacy, etc., risk tolerance levels, physical center
constraints (including HVAC), and the discipline and capabilities of
operational staff as well as the discipline, capabilities and knowledge base
of development staff. And of course: politics, organizational dynamics, and
fiscal budget models (capex/opex relationships, etc.).
Thinking about how everything impacts the OLAs:
OLAs (Operational Level Agreements) often address the systemic qualities for
specific business applications.
Generally, in most ISP datacenters, every extra 9 in the availability
measurement can increase operational FTE significantly. In past work with
organizations needing extremely high availability, triple-redundant everything
(examples: HBA cards, clusters, load balancers, trunking network cards, etc…)
to keep HA even during maintenance periods drove significant operational staff
requirements. This strategy produced (by sheer statistics) a significant
number of hardware component failures but kept HA continuously high.
Redundancy promotes an inverse relationship between component reliability and
solution availability: more components mean more individual failures to
repair even while the solution stays up, which equates to more resources
needed for higher availability rates. While the specific architecture was
continuously available, the amount of human work was nothing less than
extraordinary. This was further compounded by the requirement for mandatory
access control. Auditing all of these areas isn’t easy.
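The statistics behind that inverse relationship can be sketched in a few
lines. This is only a rough illustration: it assumes the redundant copies fail
independently, the 99% availability and 0.5 failures per year are made-up
numbers, and the function names are mine rather than from any tool.

```python
def solution_availability(component_availability: float, copies: int) -> float:
    """Availability of a redundant group that is up if at least one copy is up.

    Assumption: copies fail independently of each other.
    """
    return 1.0 - (1.0 - component_availability) ** copies


def expected_repairs_per_year(failures_per_copy_per_year: float, copies: int) -> float:
    """Expected number of component failures per year that someone must repair."""
    return failures_per_copy_per_year * copies


if __name__ == "__main__":
    # Hypothetical numbers: each copy is 99% available and fails 0.5 times a year.
    for copies in (1, 2, 3):
        print(
            copies,
            solution_availability(0.99, copies),
            expected_repairs_per_year(0.5, copies),
        )
```

Under these assumptions, every added copy pushes solution availability up while
the repair workload grows in step with the number of copies, which is the
staffing effect described above.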
If these approaches didn’t work all that
well, what has been working?
Measuring Operational Complexity
A better way: decompose how many specific elements would be measured for MTTR
(mean time to repair), often found in the OLA structure. An MTTR element can
be anything that an MTTR metric must address (software, hardware, etc…). The
higher the count of elements tracked for MTTR (hardware, systems, virtualized
stuff, etc…), the higher the complexity, and the more work operational staff
need to do to manage it responsibly. I’ve found this a fairly accurate
approach to measuring complexity, one that transcends the technologies and
trends of the day.
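As a rough sketch of the counting, assuming you can export the MTTR-tracked
elements from your OLAs as (name, category) pairs. The inventory, the category
labels and the function name below are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_mttr_elements(elements: Iterable[Tuple[str, str]]) -> Tuple[int, Dict[str, int]]:
    """Return the total number of MTTR elements and a per-category breakdown.

    Each element is a (name, category) pair for something an MTTR metric covers.
    """
    by_category = Counter(category for _name, category in elements)
    return sum(by_category.values()), dict(by_category)


if __name__ == "__main__":
    # Hypothetical inventory pulled from an OLA document.
    inventory = [
        ("db-cluster-node-1", "hardware"),
        ("db-cluster-node-2", "hardware"),
        ("order-entry-app", "software"),
        ("vm-farm-east", "virtual"),
    ]
    total, breakdown = count_mttr_elements(inventory)
    print(total, breakdown)
```

The total is the complexity number; the breakdown just shows where the work is
likely to land.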
Measuring Operational Scale:
(not the same as the Systemic Quality:
Scalability)
For each solution, application, etc…, count the diversity and number of
external entities utilizing the solution (some use the term actors from UML use
case diagrams, but this is not always as inclusive).
Then.
Calculate the growth rate (or shrink
rate) of the diversity and number of external entities utilizing each solution
in the datacenter.
Then.
Calculate the impact rate on common data center services from the growth rate
of the solutions (examples: DNS, NAT, routers, e-mail, middleware, backup and
recovery systems, disaster recovery operations, etc…).
It’s simple: greater degrees of scale and scope traditionally increase
operational staff requirements. The trick is accurately forecasting this with
the customer, based on lots of stated assumptions (hint).
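The three steps above can be sketched roughly as follows. The solution names,
entity counts and dependency map are hypothetical, and using the net change in
external entities as the "impact" on a shared service is my simplifying
assumption, not a standard formula.

```python
from typing import Dict, List, Tuple


def growth_rate(previous_count: int, current_count: int) -> float:
    """Fractional growth of external entities between two periods (negative = shrink)."""
    if previous_count <= 0:
        raise ValueError("previous_count must be positive")
    return (current_count - previous_count) / previous_count


def shared_service_impact(
    solutions: Dict[str, Tuple[int, int]],
    dependencies: Dict[str, List[str]],
) -> Dict[str, int]:
    """Sum the net change in external entities onto each shared service.

    solutions maps a solution name to (previous, current) entity counts.
    dependencies maps a solution name to the common services it relies on.
    """
    impact: Dict[str, int] = {}
    for name, (previous, current) in solutions.items():
        for service in dependencies.get(name, []):
            impact[service] = impact.get(service, 0) + (current - previous)
    return impact


if __name__ == "__main__":
    # Hypothetical quarter-over-quarter counts of external entities.
    solutions = {"billing": (100, 125), "portal": (40, 30)}
    dependencies = {"billing": ["DNS", "Backup"], "portal": ["DNS"]}
    for name, (previous, current) in solutions.items():
        print(name, growth_rate(previous, current))
    print(shared_service_impact(solutions, dependencies))
```

The stated assumptions matter more than the arithmetic: write down where the
counts came from and review them with the customer.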
The Marketing Stuff: the “automation solving world peace” promise of
datacenter automation:
Today, many companies, including us, are promoting significant automated
capabilities to reduce operational staffing needs as scale and complexity
increase (example: the “we can manage more servers per person than they can,
so you should build an SOA datacenter with us” sales pitch…). It’s true that
some automated technologies in isolation have helped datacenter management
for all operating system vendors. But research with the largest and most
advanced datacenters in the world has not yet demonstrated that the promise
of lower headcount or less work while delivering higher availability rates
has been realized (the opposite is often measured). And of course, servers
per headcount is a meaningless metric without understanding the real context
in which it was built (as most who read this blog already know :) ). We will
have to digest SLA strategies into highly automated and extremely focused OLA
elements that are MTTR measurable and manageable. Like the perfect product
promise, a magical process alone will also not solve this problem. It takes
the best architecture with the best products with the best processes and good
people in the right organization (the “if all the stars in the galaxy align”
argument), or at least a real commitment to align those stars (which is what
I recommend when really taking on an automation strategy). And this is one of
the many reasons why this promise is so often not realized.
Today, the best approaches automate to reduce cost and complexity for a
highly focused subset of MTTR elements. The sad humor in the datacenter: while
a couple of MTTR elements are often automated to levels requiring less human
activity, the sheer complexity of the automation system many times requires a
pile of new MTTR elements no one else thought of (of course). An interesting
way to measure the value of datacenter automation tools and techniques: the
ratio of the number of MTTR elements automated to the number of MTTR elements
introduced.
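A minimal sketch of that ratio, with a hypothetical function name. How to
treat the case where the tooling introduced no new elements is my own choice
(infinity if anything was automated, zero otherwise).

```python
def automation_value_ratio(elements_automated: int, elements_introduced: int) -> float:
    """Ratio of MTTR elements automated to MTTR elements the automation introduced.

    Above 1.0 the tooling removed more manual MTTR work than it added;
    below 1.0 it added more than it removed.
    """
    if elements_automated < 0 or elements_introduced < 0:
        raise ValueError("counts must not be negative")
    if elements_introduced == 0:
        # Assumption: nothing new to manage, so any automation is pure gain.
        return float("inf") if elements_automated > 0 else 0.0
    return elements_automated / elements_introduced


if __name__ == "__main__":
    # Hypothetical tool: automated 6 elements, brought 4 new ones with it.
    print(automation_value_ratio(6, 4))
```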
This leads to another approach: looking for datacenter architecture
stability, measured as the velocity of MTTR elements entering and exiting per
quarter. A high number can often indicate chaos (no matter how cool the stuff
they are buying).
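One rough way to compute that velocity, assuming you keep a snapshot of MTTR
element names at the end of each quarter. The element names and the function
name are hypothetical.

```python
from typing import Set


def mttr_element_velocity(previous_quarter: Set[str], current_quarter: Set[str]) -> int:
    """Number of MTTR elements that entered plus the number that exited in a quarter."""
    entered = current_quarter - previous_quarter
    exited = previous_quarter - current_quarter
    return len(entered) + len(exited)


if __name__ == "__main__":
    # Hypothetical end-of-quarter snapshots.
    q1 = {"dns", "san-a", "vm-farm"}
    q2 = {"dns", "vm-farm", "lb-1", "lb-2"}
    print(mttr_element_velocity(q1, q2))
```

What counts as a "high" number depends on the size of the datacenter, so it is
probably more useful to watch the trend than any single quarter.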
Just some of the approaches I’ve used in
the past.
Hope this helps…
Lewis