Posted by David Bills, Chief Reliability Strategist, Trustworthy Computing
 
Today we published a new video series, ‘Fundamentals of cloud service reliability’. Designing and delivering reliable services is complex, and this series highlights the fundamentals of designing for service reliability and complements our recent whitepaper ‘An introduction to designing reliable cloud services’.  Together, these pieces aim to be the catalyst for further discussions among services teams within organizations, as well as the industry itself.

The series consists of three short videos:
1. ‘What is cloud service reliability?’ discusses reliability and presents four goals cloud service providers should consider to make their customers happy.
2. ‘Addressing common cloud service issues’ discusses the common causes of service failure and core design principles that help reduce the likelihood of outages and their severity when they do happen.
3. ‘Designing for and responding to cloud service issues’ discusses a process to help cloud service providers design cloud services to meet customers’ expectations.

Those of you who have read my previous articles know I believe it’s not if an outage will occur; it’s strictly a matter of when. This means it’s critical for organizations to understand how best to design and deliver reliable cloud services and how to ensure that when things do go wrong, the impact to customers is minimized.

One of the techniques Microsoft uses to improve the reliability of our services is fault modeling. Just as threat modeling is an important step in the design process when security-related issues are being evaluated, fault modeling is an important step in the design process for building reliable cloud services. It’s about identifying the interaction points and dependencies of the service so the engineering team can see where investments should be made to ensure the service can be monitored effectively and issues detected quickly. In turn, it guides the engineering team toward effective coping mechanisms so the service is better able to withstand, or mitigate, faults.
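
To make the idea concrete, here is a minimal sketch of what a fault model could look like as a simple data structure. It is an illustration only, not Microsoft's actual process or tooling: the `InteractionPoint` class, the `find_gaps` function, and the example dependencies are all hypothetical names chosen for this sketch. It lists each interaction point with its possible faults, then flags the points that still lack a way to detect the fault or a coping mechanism.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class InteractionPoint:
    """One dependency or interaction point of the service (hypothetical structure)."""
    name: str
    possible_faults: List[str]
    detection: Optional[str] = None   # how a fault would be noticed, if at all
    mitigation: Optional[str] = None  # coping mechanism, if one exists


def find_gaps(model: List[InteractionPoint]) -> List[Tuple[str, List[str]]]:
    """Return (name, missing items) for points lacking detection or mitigation."""
    gaps = []
    for point in model:
        missing = []
        if point.detection is None:
            missing.append("detection")
        if point.mitigation is None:
            missing.append("mitigation")
        if missing:
            gaps.append((point.name, missing))
    return gaps


# Example fault model; every entry here is invented for illustration.
model = [
    InteractionPoint(
        "auth service",
        ["timeout", "invalid token response"],
        detection="latency alert",
        mitigation="cached sessions",
    ),
    InteractionPoint(
        "storage backend",
        ["write failure"],
        detection="error-rate alert",
    ),
    InteractionPoint("third-party payment API", ["outage"]),
]

for name, missing in find_gaps(model):
    print(f"{name}: needs {', '.join(missing)}")
```

In this sketch, the gaps reported by `find_gaps` are the places where an engineering team might consider investing in monitoring or coping mechanisms first.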

If you are designing or operating a cloud service, then I strongly encourage you to review these videos and more at www.microsoft.com/reliability.