Cloud Insights from Brad Anderson, Corporate Vice President, Enterprise & Client Mobility
Earlier in this series, I looked at four common Hybrid Cloud scenarios, including disaster recovery. In this post, I’ll look at the technology behind Microsoft’s disaster recovery (DR) solutions (in particular Windows Azure Hyper-V Recovery Manager), as well as the different VMM topologies, infrastructure models, and DR scenarios supported in HRM. I’ve also included some FAQs gathered from my last several customer meetings about our DR solution.
Reliable and cost-effective DR is a priority for both enterprises and service providers. Enterprises want to protect the mission-critical LOB apps for their internal departments, and service providers want to protect tenant workloads running in their datacenters (either in a dedicated/shared fabric or a pool of resources).
When implementing a DR solution, businesses face four primary challenges:
Microsoft has worked very hard to address these challenges. The solution is Windows Azure Hyper-V Recovery Manager (HRM).
HRM is a Windows Azure service for managing cross-site protection and recovery of data centers and/or apps in conjunction with System Center Virtual Machine Manager (SCVMM).
HRM uses the Windows Azure public cloud to orchestrate and manage the replication of your primary data center to a secondary site. This hybrid service lets you use off-premises automation (the Windows Azure Management Portal) to perform DR operations on an on-premises private cloud managed through VMM, whether that cloud is in an enterprise or is a hosted cloud in a service provider environment. To perform the replication orchestrated by the service, VMM uses Hyper-V Replica, a replication mechanism built into Hyper-V in Windows Server 2012.
There are two primary SC VMM server topologies supported in HRM:
In this topology, a single VMM server is used to manage both the primary DR site and secondary DR site. The primary DR site hosts the protection cloud (VMM Cloud) and the secondary DR site hosts the replication cloud (VMM Cloud). This topology is commonly adopted when the primary and secondary sites are located in close proximity and the scale of the Protection and Replication cloud can be managed through a single VMM server.
The diagram below shows how a single VMM server manages both the primary and secondary DR sites. In this topology, the SC VMM server can be deployed on either the primary DR site or the secondary DR site. The SC VMM server is connected to the HRM service running on Windows Azure and is authenticated using a management certificate (.cer and .pfx files).
To get a detailed overview of Single VMM Setup, check out this post from the VMM team.
In this topology each DR site (Primary and Secondary) is managed using a dedicated VMM server. The primary DR site hosts the protection cloud (VMM Cloud) and the secondary DR site hosts the replication cloud (VMM Cloud). This topology is commonly adopted where the primary and secondary DR sites are far from each other and the scale (i.e. the number of Hyper-V hosts managed through VMM) cannot be managed through a single VMM server.
The diagram below shows how a dedicated VMM server is used to manage the primary DR site and the secondary DR site separately. In this topology, an SC VMM server is deployed in each DR site, and each server connects to the HRM service running on Windows Azure (as noted above, the connection is authenticated using a management certificate with .cer and .pfx files).
In both cases, no Active Directory trust is required between the primary and secondary DR sites, because HRM uses a certificate issued by the VMM server for replication with Hyper-V Replica.
In this model, a pool of fabric resources (Hyper-V host clusters under Host Groups in VMM) is dedicated to each tenant on the primary DR site to carve out one or more clouds (VMM Clouds) for protection purposes, called Protection Clouds. Similarly, on the secondary DR site, a pool of fabric resources is dedicated to each tenant to carve out one or more clouds (VMM Clouds) for replication purposes, called Replica Clouds.
In this model, a pool of fabric resources (Hyper-V host clusters under Host Groups in VMM) is shared among multiple tenants on the primary DR site to carve out one or more clouds (VMM Clouds) for protection purposes, called Protection Clouds. Similarly, on the secondary DR site, a pool of fabric resources is shared among multiple tenants to carve out one or more clouds (VMM Clouds) for replication purposes, called Replica Clouds.
In both the dedicated and shared cases, the mapping of Protection Cloud to Replication Cloud is one-to-one, so that each cloud on the primary DR site is mapped to a cloud on the secondary DR site. However, a service provider that wants to share the same set of fabric resources on the secondary DR site for replication can carve out multiple clouds (i.e. Replication Clouds) on that same set of fabric resources and map each one to a primary DR site cloud.
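The one-to-one mapping rule described above can be sketched in a few lines of Python. This is only an illustrative model, not HRM or VMM code: the `Cloud` class, the `validate_mapping` function, and all cloud and fabric names are hypothetical and were made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cloud:
    """Hypothetical stand-in for a VMM cloud; not a real VMM or HRM type."""
    name: str
    site: str    # "primary" or "secondary"
    fabric: str  # label for the pool of Hyper-V host clusters behind the cloud


def validate_mapping(pairs):
    """Check that (protection, replication) cloud pairs are one-to-one.

    Returns a dict from protection cloud name to replication cloud name,
    or raises ValueError if a cloud is on the wrong site or reused.
    """
    mapping = {}
    used_replicas = set()
    for protection, replica in pairs:
        if protection.site != "primary" or replica.site != "secondary":
            raise ValueError(f"{protection.name} -> {replica.name}: wrong site")
        if protection.name in mapping:
            raise ValueError(f"{protection.name} is mapped more than once")
        if replica.name in used_replicas:
            raise ValueError(f"{replica.name} is already a replication target")
        mapping[protection.name] = replica.name
        used_replicas.add(replica.name)
    return mapping


# Two tenants with dedicated fabric on the primary site...
tenant_a = Cloud("TenantA-Protect", "primary", "fabric-a")
tenant_b = Cloud("TenantB-Protect", "primary", "fabric-b")
# ...each mapped to its own cloud carved from one shared secondary fabric.
replica_a = Cloud("TenantA-Replica", "secondary", "shared-fabric")
replica_b = Cloud("TenantB-Replica", "secondary", "shared-fabric")

mapping = validate_mapping([(tenant_a, replica_a), (tenant_b, replica_b)])
print(mapping)
```

Note that the two replication clouds share the same `fabric` label but are still distinct clouds, which is how the shared secondary fabric stays compatible with a one-to-one cloud mapping.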
There are two primary DR scenarios supported in HRM:
Managed DR Scenario: The tenant workload (virtual machines) is managed by the Hosting Service Provider (HSP) to provide DR as a Service for those tenants that opt for the service. In this scenario, because the HSP manages the end-to-end DR scenario on behalf of the tenants, the HSP subscribes to the HRM service on Windows Azure and accesses the Azure Management Portal to perform the HRM operations that administer and manage the DR plans for its tenants. The tenants simply send the HSP a request describing their DR requirements (e.g. virtual machines, DR drills, or planned failover).
Self Service DR Scenario: This is the scenario where tenants manage DR on their own, e.g. setting up a DR plan, performing DR drills and planned failovers, etc.
There are several key recovery actions supported in HRM:
Test Failover: This is a test DR action where the VM is recovered on the secondary DR site without affecting the primary DR site workload. The VM on the secondary DR site is recovered in an isolated environment to make sure the failover operation is smooth.
Planned Failover: In this DR action, the VM is recovered on the secondary DR site by safely turning off the virtual machine on the primary DR site after replicating the latest changes to the VM, which ensures there is no data loss. The VM boots up on the secondary DR site with its active location changed to the secondary site VMM server.
Unplanned Failover Without Primary Site Operations: In this DR action, when the primary DR site is no longer reachable, the recovery plan is executed to recover the VMs in order on the secondary DR site. Because the primary DR site is not reachable, some data loss is possible. The virtual machine boots up on the secondary DR site with its active location changed to the secondary site VMM server.
Unplanned Failover With Primary Site Operations: Here the Planned and Unplanned failover operations are combined to result in a better RTO (Recovery Time Objective).
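To make the "recover the VMs in order" idea from the unplanned failover case more concrete, here is a minimal Python sketch. It is a toy model and does not call HRM or VMM: the `RecoveryPlan` class, the notion of ordered groups, the `simulate_unplanned_failover` function, and the VM names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryPlan:
    """Hypothetical recovery plan: VM names grouped by recovery order."""
    name: str
    groups: list = field(default_factory=list)  # each group is a list of VM names

    def recovery_order(self):
        """Flatten the groups into the order the VMs would be recovered."""
        return [vm for group in self.groups for vm in group]


def simulate_unplanned_failover(plan):
    """Model an unplanned failover without primary site operations.

    Nothing is contacted here; the function only reports what the post
    describes: VMs come up in plan order on the secondary site, and data
    loss is possible because the primary site could not be reached.
    """
    return {
        "recovered_in_order": plan.recovery_order(),
        "active_site": "secondary",
        "data_loss_possible": True,
    }


# Example: bring the database up first, then the app tier, then the web tier.
plan = RecoveryPlan("lob-app", groups=[["sql-01"], ["app-01", "app-02"], ["web-01"]])
result = simulate_unplanned_failover(plan)
print(result["recovered_in_order"])
```

Grouping VMs this way keeps dependencies explicit, so a back-end tier is recovered before the tiers that rely on it.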
To find out more about HRM and learn about all of the capabilities it has, I recommend checking out these sites: