Some have asked about good approaches for examining complexity (or, more specifically, for figuring out how to associate operational staffing numbers with a specific datacenter). Tough question.

Here is what I have used in the past to decompose this question (some stupid approaches and some that worked better…).

 

Pure work-related activity:

1. Budget for a higher number of operational staff, then train, automate, and optimize down to a smaller number.   (It’s never that easy, and it’s a cop-out approach.)  Dumb and not recommended.

2. Some take past CMMI automation, process, and structure audits from 3rd party consultants to justify head count. The lower the CMMI score, the higher the FTE count is the common approach.   (Again, never that easy, and often a newbie approach that gets you into trouble later.)  Not recommended.

3. The stuff/technology-per-person measurement. Example: patching/monitoring per physical processor or box, which is ridiculous with today’s complexities.   Why?  Because the modern datacenter is more complicated:

Some examples of those complexities:

Virtual servers managed, virtual SANs managed, virtual networks managed, 3rd party management systems, SLAs and OLAs (Operational Level Agreements), silos managed, business applications managed, the velocity of new apps introduced and old apps retired, the complexity of current regulations/laws, the glue complexity between datacenter services, security systems for authentication, authorization, confidentiality, privacy, etc., risk tolerance levels, physical center constraints (including HVAC), and the discipline, capabilities, and knowledge base of both the operational staff and the development staff.   And of course politics, organizational dynamics, and fiscal budget models (capex/opex relationships, etc.).

Thinking about how everything impacts the OLAs

OLAs (Operational Level Agreements) often capture the systemic qualities for specific business applications.

Generally, in most ISP datacenters, every extra 9 in the availability measurement can increase operational FTE significantly. In past work with organizations needing extremely high HA, triple-redundant everything (examples: HBA cards, clusters, load balancers, trunking network cards, etc.) to preserve HA even during maintenance periods drove significant operational staff requirements.   That strategy produced (by sheer statistics) a steady stream of hardware component failures while keeping availability continuously high: redundancy trades more individual component failures for higher solution availability, which equates to more resources needed for higher availability rates.   While the architecture itself stayed continuously available, the amount of human work was nothing less than extraordinary.  This was further compounded by mandatory access control requirements.
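A minimal sketch of that trade-off, with made-up numbers (the 0.99 per-component availability and 0.2 failures/year figures are purely illustrative, and independent failures are assumed): redundancy pushes solution availability up while multiplying the individual component failures the operations staff has to absorb.

```python
# Illustrative only: hypothetical availability and failure-rate figures,
# independent component failures assumed.

def solution_availability(component_availability: float, redundancy: int) -> float:
    """Availability of N redundant components in parallel: at least one must be up."""
    return 1 - (1 - component_availability) ** redundancy

def expected_failures_per_year(annual_failure_rate: float, component_count: int) -> float:
    """More redundant hardware means more individual component failures to handle."""
    return annual_failure_rate * component_count

if __name__ == "__main__":
    comp_avail = 0.99       # hypothetical per-component availability
    annual_failures = 0.2   # hypothetical failures per component per year
    for n in (1, 2, 3):
        print(f"{n}x redundant: "
              f"solution availability {solution_availability(comp_avail, n):.6f}, "
              f"~{expected_failures_per_year(annual_failures, n):.1f} component failures/yr to handle")
```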

Auditing all of these areas isn’t easy.

If these approaches didn’t work all that well,  what has been working?

Measuring Operational Complexity

A better way: decompose how many specific elements will be measured for MTTR (mean time to repair), often within the OLA structure.   MTTR elements can be anything an MTTR metric must address (software, hardware, etc.).   The higher the element count tracked for MTTR (across hardware, systems, virtualized infrastructure, etc.), the higher the complexity, and the more work operational staff need to responsibly manage it.   I’ve found this a fairly accurate complexity measure that transcends the technologies and trends of the day.
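A minimal sketch of the counting idea, assuming each OLA simply lists the elements its MTTR metric must cover. The OLA names and element lists below are hypothetical, not a standard schema.

```python
# Hypothetical OLAs and the MTTR elements each one must measure.
olas = {
    "order-entry-app": ["web tier", "app cluster", "DB instance", "SAN LUN", "load balancer"],
    "email-service":   ["MTA", "mailbox store", "spam filter", "backup job"],
}

# The higher the total MTTR element count, the higher the operational
# complexity, and the more staff work needed to responsibly manage it.
total_elements = sum(len(elements) for elements in olas.values())
print(f"MTTR elements under management: {total_elements}")

for name, elements in sorted(olas.items(), key=lambda kv: len(kv[1]), reverse=True):
    print(f"  {name}: {len(elements)} elements")
```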

Measuring Operational Scale:

(not the same as the Systemic Quality: Scalability)

For each solution, application, etc., count the diversity and number of external entities utilizing the solution (some use the term actors, from UML use case diagrams, but that is not always as inclusive).

Then.

Calculate the growth rate (or shrink rate) of the diversity and number of external entities utilizing each solution in the datacenter.

Then.

Calculate the impact rate on common datacenter services from the growth rates of those solutions (a rough sketch of this calculation follows the examples below).

(example: DNS, NAT, Routers, E-Mail, Middleware, Backup and Recovery Systems, Disaster Recovery Operations, etc…)
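A rough sketch of the whole sequence, under invented assumptions: hypothetical external-entity counts for two quarters, and a hypothetical mapping of each solution to the shared services it leans on.

```python
from collections import defaultdict

# Hypothetical external-entity counts (users, partner systems, feeds) per solution.
entities_last_quarter = {"order-entry-app": 1200, "email-service": 5000}
entities_this_quarter = {"order-entry-app": 1500, "email-service": 5200}

# Assumed mapping of solutions to the common datacenter services they depend on.
shared_services = {
    "order-entry-app": ["DNS", "Middleware", "Backup and Recovery"],
    "email-service":   ["DNS", "E-Mail", "Backup and Recovery"],
}

impact = defaultdict(float)  # growth pressure accumulated per shared service
for solution, current in entities_this_quarter.items():
    previous = entities_last_quarter[solution]
    growth_rate = (current - previous) / previous
    print(f"{solution}: {growth_rate:+.1%} external-entity growth")
    for service in shared_services[solution]:
        impact[service] += growth_rate

for service, pressure in sorted(impact.items(), key=lambda kv: kv[1], reverse=True):
    print(f"  impact on {service}: {pressure:+.1%}")
```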

 

It’s simple: greater degrees of scale and scope traditionally increase operational staff requirements.

The trick is accurately forecasting this with the customer, based on lots of stated assumptions (hint).

 

 

 

The Marketing Stuff: the “automation will deliver world peace in the datacenter” promise:

Today, many companies, including us, are promoting significant automated capabilities to reduce operational staffing needs as scale and complexity increase (example: the “we can manage more servers per person than they can, so you should build an SOA datacenter with us” sales pitch).   It’s true that some automated technologies, in isolation, have helped datacenter management for all operating system vendors.   But research with the largest and most advanced datacenters in the world has not yet demonstrated the promised combination of fewer headcount (or less work) with higher availability rates; the opposite is often measured.  And of course, servers per headcount is a meaningless metric without understanding the real context in which it was built (as most who read this blog already know).

We will have to digest SLA strategies into highly automated, extremely focused OLA MTTR elements that are measurable and manageable.   Like the perfect-product promise, a magical process alone will not solve this problem.   It takes the best architecture with the best products with the best processes and good people in the right organization (the “if all the stars in the galaxy align” argument), or at least a real commitment to align those stars (which is what I recommend when really taking on an automation strategy).   This is one of the many reasons why the promise is so often not realized.

Today, the best approaches automate to reduce cost and complexity for a highly focused subset of MTTR elements.   The sad humor in the datacenter: while a couple of MTTR elements are often automated to levels requiring less human activity, the sheer complexity of the automation system itself often introduces a pile of new MTTR elements no one else thought of (of course).   An interesting way to measure the value of datacenter automation tools and techniques: the ratio of MTTR elements automated to MTTR elements introduced.
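A minimal sketch of that ratio, with hypothetical element lists: whatever the tool takes off the staff’s plate goes on one side, whatever the automation system itself adds goes on the other.

```python
# Hypothetical MTTR elements removed by an automation tool vs. introduced by it.
elements_automated = {"OS patching", "monitoring agent deploys", "log rotation"}
elements_introduced = {"orchestration engine", "workflow DB", "agent certificate renewal"}

ratio = len(elements_automated) / max(len(elements_introduced), 1)
print(f"automation value ratio: {len(elements_automated)}:{len(elements_introduced)} = {ratio:.2f}")
# A ratio near (or below) 1.0 suggests the tool moved complexity around
# rather than removing it.
```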

 

This leads to another approach: looking at datacenter architecture stability, measured as the velocity of MTTR elements entering and exiting per quarter.   A high number can often indicate chaos (no matter how cool the stuff they are buying).
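A minimal sketch of that stability measure, using made-up quarterly MTTR element sets.

```python
# Hypothetical MTTR element inventories for two consecutive quarters.
q1_elements = {"web tier", "app cluster", "DB instance", "SAN LUN", "tape library"}
q2_elements = {"web tier", "app cluster", "DB instance", "virtual SAN", "orchestration engine"}

entered = q2_elements - q1_elements   # new MTTR elements this quarter
exited = q1_elements - q2_elements    # MTTR elements retired this quarter
velocity = len(entered) + len(exited) # churn in MTTR elements per quarter

print(f"entered: {sorted(entered)}")
print(f"exited:  {sorted(exited)}")
print(f"stability velocity: {velocity} (a high number often indicates chaos)")
```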

Just some of the approaches I’ve used in the past.

 

Hope this helps…

 

Lewis