Posted by: David Bills, chief reliability strategist, Trustworthy Computing

Cloud computing and cloud services are emerging as new solutions for many organizations seeking to reduce costs and increase productivity. It’s an exciting and challenging time for the services industry as more and more organizations move their applications and IT services to the cloud.

When I speak to customers about cloud services, security, privacy and reliability are the three topics they consistently ask about. Across the industry I see a number of organizations focused on improving security and privacy, but far less emphasis is being placed on reliability. Many seem to be still working out how to operate a highly reliable service.

Reliability is ultimately about customer satisfaction, which means managing reliability is a more nuanced challenge than simply measuring uptime. For example, imagine a service that never goes down but is regularly slow or difficult to use. I’d argue no one is going to be happy using that service, even if privacy principles are consistently applied and its security practices are among the best in the industry. In short, reliability is just as important as security and privacy, and it warrants an appropriate level of engineering investment from the service provider to truly satisfy all of the customers’ requirements.
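To make the gap between uptime and experience concrete, here is a minimal sketch in Python. The `Request` record, the function names, the 500 ms latency target and the sample data are all hypothetical, chosen only for illustration; they are not taken from the whitepaper.

```python
from dataclasses import dataclass


@dataclass
class Request:
    succeeded: bool    # did the service return a valid response?
    latency_ms: float  # how long the customer waited for it


def availability(requests):
    """Fraction of requests that received a successful response."""
    if not requests:
        raise ValueError("no requests to measure")
    return sum(r.succeeded for r in requests) / len(requests)


def good_experience_rate(requests, max_latency_ms=500.0):
    """Fraction of requests that succeeded AND met the latency target.

    The 500 ms default is an assumed target, not an industry standard.
    """
    if not requests:
        raise ValueError("no requests to measure")
    good = sum(r.succeeded and r.latency_ms <= max_latency_ms for r in requests)
    return good / len(requests)


# Hypothetical sample: the service never fails, but 3 of 10 requests are slow.
sample = [Request(True, 120.0)] * 7 + [Request(True, 4000.0)] * 3

print(availability(sample))          # 1.0
print(good_experience_rate(sample))  # 0.7
```

Measured by uptime alone, this service looks perfect, yet three in ten requests would leave a customer waiting.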

Although maintaining high levels of customer satisfaction is a multi-faceted challenge, reliability is the foundation upon which other aspects of customer satisfaction are built. Cloud-based services must be designed from the beginning with reliability in mind.

Today Microsoft released a new whitepaper titled “Deploying highly available and secure cloud solutions.” The paper presents real-world examples of deploying robust cloud solutions that maintain highly available and secure client connections, and it uses those examples to discuss scalability issues. In my experience, at a basic level there are three main causes of cloud service failure:
1. Device and infrastructure failures
2. Software vulnerabilities
3. Human errors

If we accept that these failures will inevitably happen, and that they are a constant threat, we need to design cloud services so that when something does go wrong, the impact on customers is avoided or minimized.

At a high level, each cloud session consists of a customer using a computing device to connect to an organization’s cloud-based service, which is hosted by an internal or external entity. When planning a highly available cloud service, it’s important to consider the expectations and responsibilities of each of these parties. Organizations also need to acknowledge the real-world limitations of technology and recognize that failures will occur. By applying the design principles needed to isolate and repair service failures quickly, and so avoid or minimize the impact on the service’s availability to users, a provider demonstrates its commitment to treating reliability as an essential element of cloud computing, one that customers view as just as important as security and privacy.
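One common way to isolate a failing dependency is the circuit breaker pattern. This post does not name a specific technique, so treat the following Python as a sketch of one option rather than the whitepaper's method; the class name, the thresholds and the `clock` parameter are all assumptions made for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are rejected because the dependency is isolated."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before isolating
        self.reset_after_s = reset_after_s  # cool-down before a trial call
        self.clock = clock                  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast instead of letting callers wait on a broken dependency.
                raise CircuitOpenError("dependency isolated; failing fast")
            # Cool-down elapsed: allow one trial call. A single failure
            # reopens the circuit; a success closes it fully.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

A caller would wrap each request to the dependency, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a hypothetical function, and serve a cached or degraded response when `CircuitOpenError` is raised.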

If you are designing or operating a cloud service, then I strongly encourage you to download the whitepaper to read more about successful techniques for deploying highly available and secure cloud services and creating an optimal overall user experience for your customers.