Posted by David Bills, chief reliability strategist, Trustworthy Computing

When we’re talking about cloud services, I’m a firm believer that service failures will occur – it’s not a matter of if, it’s strictly a matter of when. This is because the more complex a service becomes, the more challenging it is to anticipate and predict failures. As a result, designing services to withstand failure, and having a plan in place to recover the service quickly, is critical to building trust and maintaining long-term relationships with customers.

From my experience, at a basic level there are three main causes of cloud services failure:

1.   Human error

2.   Device and infrastructure failure

3.   Software vulnerabilities

If we accept that these failures will inevitably happen – that they are indeed a constant threat – but we maintain the organizational goals described in my earlier article, then we begin to see just how important it is to plan for service failure. Cloud service providers need to do everything they can to ensure that when failures occur, the impact on customers is minimized.

Recovery-oriented computing (ROC) defines the following six research areas, which can be adapted to cloud services design and implementation to help mitigate the issues that arise from these causes.

  1. Recovery process drills. Conduct routine recovery process drills to test repair mechanisms, both during development and in production.
  2. Diagnostic aids. Use diagnostic aids for root cause analysis of failures.
  3. Fault zones. Partition cloud services into fault zones so failures can be contained, enabling rapid recovery.
  4. Automated rollback. Create systems that provide automated rollback for most aspects of operations.
  5. Defense-in-depth. Use a defense-in-depth approach to ensure that a failure remains contained if the first layer of protection does not isolate it.
  6. Redundancy. Build redundancy into systems to survive faults. Design fail-fast components to enable redundant systems to detect failure quickly and isolate it during
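
To make items 3 and 6 a little more concrete, here is a minimal sketch of how redundant, fail-fast components spread across fault zones might behave. It is an illustration only, not taken from ROC or from any particular cloud platform: the names `Replica`, `ReplicaUnavailable`, `call_with_failover`, and the zone labels are all hypothetical, and a real service would add health probes, timeouts, and a repair path that brings an isolated replica back.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class ReplicaUnavailable(Exception):
    """Raised by a replica that detects it cannot serve a request."""


@dataclass
class Replica:
    zone: str                      # hypothetical fault-zone label
    handler: Callable[[str], str]  # hypothetical request handler
    healthy: bool = True

    def handle(self, request: str) -> str:
        # Fail fast: refuse immediately instead of hanging or returning bad data.
        if not self.healthy:
            raise ReplicaUnavailable(self.zone)
        return self.handler(request)


def call_with_failover(replicas: Sequence[Replica], request: str) -> str:
    """Try each replica in order, isolating any replica that fails fast."""
    failed_zones = []
    for replica in replicas:
        try:
            return replica.handle(request)
        except ReplicaUnavailable:
            # Isolate the failed replica so later calls skip it until repaired.
            replica.healthy = False
            failed_zones.append(replica.zone)
    raise RuntimeError(f"all replicas failed: {failed_zones}")


# Assumed setup: two replicas in separate fault zones, the first one already down.
replicas = [
    Replica("zone-a", lambda r: f"zone-a handled {r}", healthy=False),
    Replica("zone-b", lambda r: f"zone-b handled {r}"),
]
result = call_with_failover(replicas, "GET /status")
print(result)
```

Because the replica in the first zone refuses the request immediately, the caller learns about the failure right away and moves on to the second zone, so the failure stays contained in one partition instead of stalling the whole request.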

For more insights on these reliability topics, I encourage you to download the recent services reliability whitepaper.