Every Solution Architect or IT Manager eventually needs to take on one of the most challenging tasks for an IT organization: providing the disaster recovery capabilities that support the business continuity requirements.

These concerns are no different when you are using a Cloud platform such as Windows Azure. You always need to bear in mind that anything, under your control or not, could fail at any moment (Black swan theory), so you must plan and design your Cloud solution from a disaster-proof perspective. The process for building resilient solutions is very similar to the one you would follow for an on-premises solution; the main difference is that with Windows Azure you don’t need to take care of expensive specialized hardware or build a nuke-proof bunker.

I define the process in five steps:


1. Get business requirements: gather from the different business sources the critical processes where IT needs to invest for business continuity, and exactly what is required to keep the business running. The basic idea is that redundancy is not necessary for every IT service. You need criteria to define which IT services need high availability and disaster recovery; I usually use the following:

a. Critical Systems: services that need to stay up and running under any circumstances (an outage of a critical system means business interruption and could potentially have financial and legal implications).

b. Essential Systems: services that need to stay up and running to support business operations and are typically integrated with Critical Systems. These services need priority in the disaster recovery plan.

c. Necessary Systems: services that help to improve business operations and provide productivity improvements for employees, but are not required for business continuity. Necessary Systems can be left in the background in the recovery plan.

d. Optional Systems: services that may or may not improve business productivity. Here I include test systems, historical data archiving, the Intranet (when not vital to the business), etc. These services may be excluded from the recovery plan.

2. Set business continuity IT requirements: here you need to define recovery time objectives for the previously classified systems, recovery point objectives for application data, how you are going to manage the process, and the IT support you need to provide resiliency for the systems. At the same time, you are balancing high availability and disaster recovery costs against the possible business losses in case of outages, because the number of 9’s in the availability percentage determines how much money you spend.

3. Define High Availability and Disaster Recovery strategies: this is a first planning phase where you need to define the strategy you are going to follow to minimize the impact of outages. The outage categories that you must cover are: software outages, operational error outages (these two account for approx. 50% of typical service outages), hardware outages, scheduled outages and facility outages.

4. Plan and Design the solution: at this step you need to plan for application data, state and infrastructure resiliency, following cloud design patterns and defining the IT service management and operational processes (including the disaster recovery protocol, etc.).

5. Build solution: now you are in the middle of the battlefield developing the solution, and this is where you can use Windows Azure services as the building blocks of your disaster-proof solution. You can use any of the application, data or infrastructure services to deploy a solution created with the features that I call the Cloud DNA: Health and Monitoring model, Automation, Availability sets (components grouped so that they are always available during scheduled outages), Security, Data and Compliance, component distribution and geo-dispersion according to business requirements, load balancing and autoscaling, Development for Operations and Configuration Management; all using the programming language and computing platform that best fits you.
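
To make the cost trade-off from step 2 more tangible, here is a minimal Python sketch of how an availability percentage translates into a yearly downtime budget, and how chaining components lowers the overall figure. The function names and the per-category targets in `TIER_TARGETS` are hypothetical placeholders that I chose for illustration, not values taken from any Windows Azure SLA, and the chain calculation assumes that components fail independently.

```python
from datetime import timedelta

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years

# Hypothetical availability targets per system category (placeholders,
# not recommendations); Optional Systems are left out of the plan.
TIER_TARGETS = {
    "Critical": 99.99,
    "Essential": 99.9,
    "Necessary": 99.0,
}


def allowed_downtime(availability_pct: float) -> timedelta:
    """Return the yearly downtime budget implied by an availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return timedelta(hours=HOURS_PER_YEAR * (1 - availability_pct / 100))


def serial_availability(*component_pcts: float) -> float:
    """Availability of a chain in which every component must be up,
    assuming the components fail independently."""
    combined = 1.0
    for pct in component_pcts:
        combined *= pct / 100
    return combined * 100


if __name__ == "__main__":
    # Print the yearly downtime budget for each illustrative category.
    for tier, target in TIER_TARGETS.items():
        budget = allowed_downtime(target)
        print(f"{tier}: {target}% -> {budget.total_seconds() / 3600:.2f} h/year")
```

Under these assumptions, a 99.9% target leaves roughly 8.76 hours of downtime per year, while 99.99% leaves a bit under 53 minutes. A chain of two components is always less available than its weakest member, which is why every extra 9 tends to require redundancy rather than just better components.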

Some useful links:

Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services

Cloud Design Patterns: Prescriptive Architecture Guidance for Cloud Applications

Windows Azure SDKs

Windows Azure Patterns & Practices

Windows Azure