Earlier in this series, I looked at four common Hybrid Cloud scenarios, including disaster recovery. In this post, I’ll look at the technology behind Microsoft’s disaster recovery (DR) solutions (in particular Windows Azure Hyper-V Recovery Manager), as well as the different VMM topologies, infrastructure models, and DR scenarios supported in HRM. I’ve also included some FAQs gathered from my last several customer meetings about our DR solution.

Reliable and cost-effective DR is a priority for both enterprises and service providers. Enterprises want to protect the mission-critical LOB apps for their internal departments, and service providers want to protect tenant workloads running in their datacenters (either in a dedicated/shared fabric or a pool of resources).

When implementing a DR solution, businesses face four primary challenges:

  • Cost
    The need to reduce the cost of downtime has to be balanced with the cost of the disaster recovery solution. Some disaster recovery solutions with synchronous replication are expensive to implement and maintain.
  • Monitoring
    Continuously monitoring services can be challenging as it might involve more software, more costs, and more investments in training.
  • Automation
    Recovering a service or workload once it has failed can be extremely time-consuming and complex, and automating this process requires constant development and testing efforts to ensure procedures are up-to-date and practical.
  • Protection
    Some workloads that could benefit from protection might go unprotected due to the previous three factors.

Microsoft has worked very hard to address these challenges. The solution is Windows Azure Hyper-V Recovery Manager (HRM).

HRM is a Windows Azure service for managing cross-site protection and recovery of data centers and/or apps in conjunction with System Center Virtual Machine Manager (SCVMM).

Windows Azure Hyper-V Recovery Manager (HRM)

HRM is a service that uses the Windows Azure public cloud to orchestrate and manage the replication of your primary data center to a secondary site. This hybrid service lets you use off-premises automation (the Windows Azure Management Portal) to perform DR operations on an on-premises private cloud managed through VMM in an enterprise, or on a hosted cloud within a service provider environment. To perform the replication orchestrated by the service, VMM uses Hyper-V Replica, a replication mechanism built into Hyper-V in Windows Server 2012.

SC VMM Server Topologies Supported in HRM

There are four primary SC VMM server topologies supported in HRM:

  • Single SC VMM server managing multiple sites
  • Multiple SC VMM servers, one for each DR site
  • Dedicated infrastructure for each tenant
  • Shared multi-tenant infrastructure across tenants

Single SC VMM server managing multiple sites

In this topology, a single VMM server is used to manage both the primary DR site and the secondary DR site. The primary DR site hosts the protection cloud (VMM Cloud) and the secondary DR site hosts the replication cloud (VMM Cloud). This topology is commonly adopted when the primary and secondary sites are located in close proximity and the scale of the protection and replication clouds can be managed through a single VMM server.

The diagram below shows how a single VMM server manages both the primary and secondary DR sites. In this topology, the SC VMM server can be deployed on either the primary DR site or the secondary DR site. The SC VMM server is connected to the HRM service running on Windows Azure, and the connection is authenticated using a management certificate (.cer and .pfx files).

To get a detailed overview of Single VMM Setup, check out this post from the VMM team.

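[Diagram: a single VMM server managing the primary and secondary DR sites, connected to the HRM service on Windows Azure]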

Multiple SC VMM servers, one for each DR site

In this topology each DR site (Primary and Secondary) is managed using a dedicated VMM server. The primary DR site hosts the protection cloud (VMM Cloud) and the secondary DR site hosts the replication cloud (VMM Cloud). This topology is commonly adopted where the primary and secondary DR sites are far from each other and the scale (i.e. the number of Hyper-V hosts managed through VMM) cannot be managed through a single VMM server.

The diagram below shows how a dedicated VMM server is used to manage the primary DR site and the secondary DR site separately. In this topology, an SC VMM server is deployed in each DR site, and each server is connected to the HRM service running on Windows Azure. As in the single-server topology, the connection is authenticated using a management certificate (.cer and .pfx files).

[Diagram: a dedicated VMM server in each DR site, each connected to the HRM service on Windows Azure]

In both cases, no Active Directory trust is required between the primary and secondary DR sites, because HRM uses certificates issued by the VMM server for replication with Hyper-V Replica.

Dedicated infrastructure for each tenant

In this model, a pool of fabric resources (Hyper-V host clusters under Host Groups in VMM) is dedicated to each tenant on the primary DR site to carve out one or more clouds (VMM Cloud) for protection purposes, called Protection Clouds. Similarly, on the secondary DR site, a pool of fabric resources is dedicated to each tenant to carve out one or more clouds (VMM Cloud) for replication purposes, called Replica Clouds.

Shared multi-tenant infrastructure across tenants

In this model, a pool of fabric resources (Hyper-V host clusters under Host Groups in VMM) is shared among multiple tenants on the primary DR site to carve out one or more clouds (VMM Cloud) for protection purposes, called Protection Clouds. Similarly, on the secondary DR site, a pool of fabric resources is shared among multiple tenants to carve out one or more clouds (VMM Cloud) for replication purposes, called Replica Clouds.

In both the dedicated and shared cases, the mapping of Protection Cloud to Replication Cloud is one-to-one, so each cloud on the primary DR site is mapped to exactly one cloud on the secondary DR site. However, a service provider that wants to share the same set of fabric resources on the secondary DR site for replication can carve out multiple clouds (i.e. Replication Clouds) on that same set of fabric resources and map each of them to a primary DR site cloud.
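
The one-to-one rule can be checked mechanically. The following is a minimal Python sketch and not part of HRM or VMM: the cloud and host group names are hypothetical, and the `validate_cloud_mapping` helper is invented purely for illustration. It checks that no replication cloud is the target of more than one protection cloud, while still allowing several replication clouds to sit on the same secondary fabric.

```python
from collections import Counter


def validate_cloud_mapping(mapping, replication_fabric):
    """Check a protection-cloud -> replication-cloud mapping.

    mapping: dict of protection cloud name -> replication cloud name
    replication_fabric: dict of replication cloud name -> host group name
    Returns a list of problems; an empty list means the mapping is one-to-one.
    """
    problems = []
    # Each replication cloud may be the target of only one protection cloud.
    counts = Counter(mapping.values())
    for cloud, n in sorted(counts.items()):
        if n > 1:
            problems.append(f"{cloud} is mapped from {n} protection clouds")
    # Every target must be a cloud that exists on the secondary site.
    for source, target in sorted(mapping.items()):
        if target not in replication_fabric:
            problems.append(f"{source} maps to unknown cloud {target}")
    return problems


# Hypothetical example: two tenants share one secondary host group,
# but each tenant still gets its own replication cloud.
replication_fabric = {
    "TenantA-Replica": "Secondary-HostGroup-1",
    "TenantB-Replica": "Secondary-HostGroup-1",
}
mapping = {
    "TenantA-Protected": "TenantA-Replica",
    "TenantB-Protected": "TenantB-Replica",
}
print(validate_cloud_mapping(mapping, replication_fabric))  # []
```

Mapping both tenants' protection clouds to `TenantA-Replica` would instead return a single problem entry, which is the situation the one-to-one rule is meant to prevent.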

Key DR Scenarios Supported in HRM

There are two primary DR scenarios supported in HRM:

Managed DR Scenario
The tenant workload (virtual machines) is managed by the Hosting Service Provider (HSP), which provides DR as a Service to those tenants that opt in. Because the HSP manages the end-to-end DR scenario on behalf of the tenants, the HSP subscribes to the HRM service on Windows Azure and uses the Azure Management Portal to perform the HRM operations that administer and manage the DR plans for its tenants. The tenants simply send their DR requirements to the HSP (e.g. which virtual machines to protect, DR drills, or planned failovers).

Self Service DR Scenario
This is the scenario where tenants manage DR on their own, e.g. setting up a DR plan, performing DR drills and planned failovers, etc.

Recovery Actions Supported in HRM

There are several key recovery actions supported in HRM:

  • Test Failover
  • Planned Failover
  • Unplanned Failover without primary site operations
  • Unplanned Failover with primary site operations

Test Failover
This is a test DR action where the VM is recovered on the secondary DR site without affecting the primary DR site workload. The VM on the secondary DR site is recovered in an isolated environment to make sure the failover operation is smooth.

Planned Failover
In this DR action, the VM is recovered on the secondary DR site after the virtual machine on the primary DR site has been safely turned off and its latest changes have been replicated, which ensures there is no data loss. The VM boots up on the secondary DR site and its active location changes to the secondary site VMM server.

Unplanned Failover without primary site operations
In this DR action, when the primary DR site is no longer reachable, the recovery plan is executed to recover the VMs in order on the secondary DR site. Because the primary DR site is not reachable, some data loss is possible. The virtual machine boots up on the secondary DR site and its active location changes to the secondary site VMM server.

Unplanned Failover with primary site operations
Here the Planned and Unplanned failover operations are combined to result in a better RTO (Recovery Time Objective).
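
To make the differences between the recovery actions concrete, here is a small Python sketch. It is an illustration only and does not call any HRM or VMM API: the `failover_steps` function, the step wording, and the VM names are all hypothetical, and the exact interleaving of steps in a planned failover is an assumption. The combined "with primary site operations" action is left out because the steps it runs are not spelled out above.

```python
def failover_steps(action, vms_in_order):
    """Return the ordered steps for one recovery action, as described above.

    action: "test", "planned", or "unplanned" (without primary site operations)
    vms_in_order: VM names in the order the recovery plan should start them
    """
    if action == "test":
        # The primary workload is untouched; replicas start in isolation.
        return [f"start {vm} on secondary (isolated network)" for vm in vms_in_order]
    if action == "planned":
        steps = []
        for vm in vms_in_order:
            # Shut down first, then sync, so no changes are lost.
            steps.append(f"shut down {vm} on primary")
            steps.append(f"replicate latest changes for {vm}")
        steps += [f"start {vm} on secondary" for vm in vms_in_order]
        return steps
    if action == "unplanned":
        # The primary is unreachable, so no shutdown or final sync is possible.
        return [f"start {vm} on secondary (data loss possible)" for vm in vms_in_order]
    raise ValueError(f"unknown action: {action}")


# Hypothetical recovery plan: database tier first, then the web tier.
for step in failover_steps("planned", ["SQL-VM", "Web-VM"]):
    print(step)
```

The point of the sketch is the contrast: only the planned path touches the primary site, which is why it is the only one that can promise no data loss.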

Frequently Asked Questions About Managed DR Scenarios

  • Q1: As a hosting service provider, do I have to share my tenants’ identities with Windows Azure?
    • A1: No, the tenants’ identity is not shared or uploaded to Windows Azure.
  • Q2: Does the app data of my tenants go to the public cloud?
    • A2: No, app data never goes to Windows Azure. The data transfer takes place between the primary and secondary sites in encrypted form.
  • Q3: My Hosts and VMs don't have Azure connectivity. Can I use Hyper-V Recovery Manager for DR?
    • A3: Yes. Hyper-V Recovery Manager does not require any Azure connectivity for hosts or VMs; they can be totally isolated in your corporate network. The SC VMM server is the only server that needs Azure connectivity. That connectivity is outbound only and works through a proxy.
  • Q4: Do my tenants need to access the Windows Azure Portal for HRM service?
    • A4: For a managed DR scenario, the Hosting Service Provider (HSP) sets up the HRM service and performs the DR plan (protection and recovery) using the Azure Management Portal for HRM service on behalf of the tenants. Thus, the tenants are not required to subscribe to the HRM service to get access.
  • Q5: What exact tenant resource information is sent to Windows Azure as a part of the metadata?
    • A5: Tenant-specific data such as VM name, ID, and virtual network name are sent. The full list of metadata sent (including the HSP’s fabric information) is here. All metadata sent between the VMM servers in the primary and secondary sites and Windows Azure is encrypted over HTTPS.
  • Q6: Do I need to install DRP (Disaster Recovery Provider) on each Hyper-V host/guest?
    • A6: No. DRP is needed only on SCVMM servers in the primary and secondary sites.
  • Q7: What will happen if a disaster impacts both my primary site and ISP providing the Internet connection?
    • A7: During failover, HRM has no dependency on network connectivity to the primary DC. Thus, failover to the secondary DC can be done even if the HRM service on Azure cannot connect to the primary DC.
  • Q8: A tenant’s n-tier app is using SQL AlwaysOn. Can I get single-click app failover?
    • A8: Yes. HRM works with SQL AlwaysOn using simple scripts plugged into the recovery plan.
  • Q9: Do I need to have connectivity between primary and secondary site Hyper-V servers?
    • A9: Yes. Hyper-V servers on the primary site need to have connectivity to the secondary site. Hyper-V Replica needs to be enabled on both sides.
  • Q10: Do I need to have connectivity between primary and secondary site VMM servers?
    • A10: No. VMM-server-to-VMM-server connectivity between the primary and secondary sites is not required. However, both VMM servers need Internet connectivity to Windows Azure (either directly or through a proxy server) to transfer the metadata of the VMM clouds that need protection and recovery.
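
The connectivity answers above (A3, A9, and A10) can be summarized as a small lookup table. The Python sketch below is only an illustration of those answers: the component labels and the `missing_links` helper are hypothetical names, not HRM settings.

```python
# (source, destination) -> is the link required?  Derived from A3, A9, and A10.
REQUIRED_LINKS = {
    ("hyperv_host", "azure"): False,               # A3: hosts and VMs can stay isolated
    ("vmm_server", "azure"): True,                 # A3/A10: outbound, proxy allowed
    ("primary_hyperv", "secondary_hyperv"): True,  # A9: Hyper-V Replica traffic
    ("primary_vmm", "secondary_vmm"): False,       # A10: not required
}


def missing_links(available):
    """Return the required links that are absent from the available set."""
    return sorted(
        link
        for link, required in REQUIRED_LINKS.items()
        if required and link not in available
    )


# Hypothetical environment: VMM can reach Azure, but the Hyper-V sites
# cannot yet reach each other.
available = {("vmm_server", "azure")}
print(missing_links(available))  # [('primary_hyperv', 'secondary_hyperv')]
```

A check like this is a quick way to review firewall plans before enabling protection: only two paths have to be open, and neither involves the hosts or guests talking to Azure.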

Find out more

To find out more about HRM and learn about all of the capabilities it has, I recommend checking out these sites: