Posted by David Bills, chief reliability strategist, Trustworthy Computing
When we’re talking about cloud services, I’m a firm believer in the idea that services failure will occur – it’s not a matter of if, it’s strictly a matter of when. This is because the more complex things become, the more challenging it is to anticipate and predict failures. As a result, designing services to withstand failure, as well as having a plan in place to recover the service quickly, is critical in building trust and maintaining long-term relationships with customers.
From my experience, at a basic level there are three main causes of cloud services failure:
1. Human error
2. Device and infrastructure failure
3. Software vulnerabilities
If we anticipate that these failures will invariably happen – that indeed they are a constant threat – but we maintain the organizational goals described in my earlier article, then we begin to see just how important it is to plan for services failure. Cloud service providers need to do everything they can to ensure that when failures occur, the impact on customers is minimized.
Recovery-oriented computing (ROC) defines six research areas that can be adapted to cloud services design and implementation to help mitigate potential issues arising from these causes.
For more insights on these reliability topics, I encourage you to download the recent services reliability whitepaper.