Cloud Insights from Brad Anderson, Corporate Vice President, Windows Server & System Center
Part 7 of a 9-part series. Today’s post is the 2nd of two sections; to read the first half, click here.
As an industry we have worked on disaster recovery (DR) and high availability (HA) solutions for – well, it seems like forever. There has been a lot of good work done in this space, but one thing that has always stood out to me has been the fact that, for any enterprise to truly establish a solid DR solution, the costs have been incredibly high. So high, in fact, that these costs have been totally prohibitive; you could argue that building a complete DR solution is a luxury reserved for only the largest organizations.
One thing that I admired about Microsoft before joining the company – and something that I have come to appreciate even more since working here – is the relentless effort we make to simplify challenges and problems, to make the solutions approachable for all, and to deliver the solutions at an economical price.
With Windows Server 2012 R2, with Hyper-V Replica, and with System Center 2012 R2 we have delivered a DR solution for the masses.
This DR solution is a perfect example of how the cloud changes everything.
Because Windows Azure offers a global, highly available cloud platform with an application architecture that takes full advantage of the HA capabilities – you can build an app on Azure that will be available anytime and anywhere. This kind of functionality is why we made the decision to build the control plane or administrative console for our DR solution on Azure. The control plane and all the meta-data required to perform a test, planned, or unplanned recovery will always be available. This means you don’t have to make the huge investments that have been required in the past to build a highly-available platform to host your DR solution – Azure automatically provides this.
(Let me make a plug here that you should be looking to Azure for all the new application you are going to build – and we’ll start covering this specific topic in next week’s R2 post.)
With this R2 wave of products, organizations of all sizes and maturity, anywhere in the world, can now benefit from a simple and cost-effective DR solution.
There’s also another other thing that I am really proud of here: Like most organizations, we regularly benchmark ourselves against our competition. We use a variety of metrics, like: ‘Are we easier to deploy and operate?’ and ‘Are we delivering more value and doing it a lower price?’ Measurements like these have provided a really clear answer: Our competitors are not even in the same ballpark when it comes to DR.
During the development of R2, I watched a side-by-side comparison of what was required to setup DR for 500 VMs with our solution compared to a competitive offering, and the contrast was staggering. The difference in simplicity and the total amount of time required to set everything up was dramatic. In a DR scenario, one interesting unit of measurement is total mouse clicks. It’s easy to get carried away with counting clicks (hey, we’re engineers after all!), but, in the side-by-side comparison, the difference was 10’s of mouse clicks compared to 100’s. It is literally a difference of minutes vs. days.
You can read some additional perspectives I’ve shared on DR here.
In yesterday’s post we looked at the new hybrid networking functionality in R2 (if you haven’t seen it yet, it is a must-read), and in this post Vijay Tewari (Principal Program Manager for Windows Server & System Center) goes deep into the architecture of this DR solution, as well this solution’s deployment and operating principles.
As always in this 2012 R2 series, check out the “Next Steps” at the bottom of this post for links to a variety of engineering content with hyper-technical overviews of the concepts examined in this post.
* * *
Disaster Recovery (DR) solutions have historically been expensive and complex to deploy. They require extensive configuration, and they do not work at cloud scale. Since current solutions fall short of addressing many DR needs due to their cost and complexity, workloads in need of protection are left vulnerable, and this exposes businesses to a range of compliance and audit violations. Moreover, because so many of these solutions are built using components from multiple vendors, any attempt to provide simplicity through a single pane of management is nearly impossible.
To address this need, Microsoft embarked on a journey to resolve these issues by delivering Hyper-V Replica (HVR) as a standard feature in Windows Server 2012. From the moment it entered the market, top reviewers and customers saw value in HVR and consistently ranked it amongst ‘the Top 10 features’ of the entire release.
In the R2 release we are following up on this success by providing a cloud-integrated solution for DR. This solution builds on some key enhancements in Windows Server 2012 R2 HVR around variable replication frequency, support for near-sync, and extended replication. From a management standpoint, we provide a DR solution via Windows Azure Hyper-V Recovery Manager (HRM), that is integrated with System Center Virtual Machine Manager (VMM). HRM is currently in limited preview and will be broadly available in the near future.
Our singular focus with HRM is democratizing Disaster Recovery by making it available for everybody, everywhere. HRM builds on the world-class assets of Windows Server, System Center, and Windows Azure and it is delivered via the Windows Azure Management Portal. Coupled with Windows Azure Backup (WAB), which offers data protection or backup, HRM completes the “Recovery Services” offering in Windows Azure. To know more about WAB, read this blog.
The key scenario that we want to address is to providing a simple, easily deployed and simple to operate DR solution that provides
There are five key tenets for the DR solution:
DR software is itself susceptible to the disasters that can hit the datacenters. We wanted to ensure that the core capability that provides DR is itself highly available, resilient and not subject to failure for the same reasons the customer’s workloads are failing. With these facts in mind, the control plane of our solution (HRM) is delivered as a cloud service we call DRaaS (Disaster Recovery as a Service). HRM has been built as a highly available service running on Windows Azure.
From an operations perspective, independent of which site you are administering, the recovery actions need to be taken only in a single place, i.e. the HRM portal. Since the metadata required for orchestration resides in the cloud, you are insured against losing the critical DR orchestration instructions even if your primary site is impacted, thereby addressing the common mistake wherein the monitoring system monitors itself.
By making the DR plans securely accessible everywhere, HRM drastically reduces the amount of lost time currently suffered by the business when initiating a disaster recovery. A customer DR plan may involve multiple sites. HRM manages multiple sites, as well as complex inter-site relationships, thereby enabling a customer to create a comprehensive DR plan. On the deployment side, HRM requires only one provider to be installed per VMM server (a single VMM server can manage 1,000 virtual hosts), thereby addressing the ever-present issue of complexity (the single most important blocker of DR deployments today).
Built on top of Virtual Machine Manager (VMM), HRM leverages existing investments that your fabric administrators have made in topology and management configurations. By leveraging your existing datacenter intelligence, HRM ensures there is no reason to worry about supporting/creating redundant configurations or managing multiple tools. Via deep VMM integration, HRM monitors environment changes in the datacenter and reacts appropriately to them.
HRM works at the logical abstractions of VMM, making it a truly cloud-scale solution. Built on cloud point design from a ground-up, the solution is elastic and can easily scale to deployments of large scale.
HRM is available 24x7 because, frankly, you don’t want to invest in a DR solution for your DR solution. The reality is that customers will have different clouds as targets – private, hosted or public clouds. The HRM solution is designed to deliver consistent experience across all clouds, while reducing the cost of re-training personnel for each new deployment/topology.
With its scripting support, HRM orchestrates the failover of applications even in scenarios where different tiers are protected by heterogeneous replication technologies.
For example, a typical app that you could protect with HRM can have a virtualized front-end paired with a backed in SQL protected via SQL AlwaysOn. Capabilities within HRM orchestration can seamlessly failover such applications with a single click.
The HRM solution is engineered to deliver an extensible architecture such that it can enable other scenarios in the future, as well as ensure that partners can enrich, extend, or augment those scenarios (e.g. newer storage types, additional recovery objectives, etc.). With increasing deployment complexity and with systemic disasters becoming increasingly commonplace, DR solutions must keep pace. The service approach ensures that we can get these innovations out to customers faster than ever before.
Given these benefits, you should be able to roll-out the solution and protect your first virtual machine in hours!
The following sections analyze how these tenets are delivered.
The HRM service is hosted in Windows Azure and orchestrates between private clouds built on the Microsoft cloud stack of Windows Server and System Center. HRM supports Windows Server 2012/Windows Server 2012 R2 and System Center 2012 SP1, System Center 2012 2012 R2 VMM.
The diagram below captures the high level architecture of the service. The service itself is in Windows Azure and the provider installed on the VMM servers sends the metadata of the private clouds to the service, which then uses it to orchestrate the protection and recovery of the assets in the private cloud. This architecture ensures the DR software is protected against disasters.
Since a key goal of this DR solution is extensibility, the replication channel that sends the data is highly flexible. Today, the channel supports Hyper-V Replica.
Within this architecture, it is important to note the considerable investments made towards security:
The goal of the HRM solution is to have your workloads protected in a couple of hours. The “How To” steps for doing this are here. Let’s examine how to plan, roll out and use HRM.
Before you begin setting up HRM, you should first plan your deployment from a capacity, topology and security perspective. In the early configure phase, you should map the resources of your primary and secondary sites such that during failover the secondary site provides the resources needed for business continuity. After that, you should protect Virtual Machines and then create orchestration units for failover. Finally you test and fail them over.
Before we look at these phases, here are a couple notes to consider
With these things in mind, here is how to plan your HRM deployment:
To deploy the solution, you need to download a single provider and run through a small wizard on the VMM Servers in NewYork and Chicago. The provider is downloaded from the Windows Azure portal from the Recovery Services tab.
The service can then orchestrate DR for workloads hosted by Hyper-V on the primary site. The provider, which runs in the context of VMM, communicates with HRM service. No additional agents, no software and no complex configurations are required.
To map the resources for compute and memory, you configure the “Protected Items”, which represent the logical clouds of VMMs. For example, to protect the Gold cloud in VMM-NewYork by the Gold-Recovery of VMM-Chicago, you choose values for simple configurations like replication frequency. You need to ensure that the capacity of the Gold-Recovery cloud will meet the DR requirements of virtual machines protected in the Gold cloud.
Once this is done, the system takes over and does the heavy lifting. It configures all the hosts in both the Gold and Gold-Recovery clouds with the required certificates and firewall rules - and it configures the Hyper-V Replica settings for both clusters and stand-alone hosts.
The diagram below shows the same process with hosts configured.
Note that the clouds, which are the resources for compute and memory, are shown in the tab “Protected Items” in the portal.
Once you have the cloud configured for protection, the next task is network mapping. As a part of that initial VMM deployment you have already created the networks on the primary and the corresponding networks on the recovery, now you map these networks on the service.
This mapping is used in multiple ways:
Network mapping works for the entire gamut of networks – VLANs or Hyper-V Network Virtualization. It even works for heterogeneous deployments, wherein the networks on primary and recovery sites are of different types.
The diagram below shows the tenant networks of the multi-tenanted Gold cloud as mapped to the tenant networks of the multi-tenanted Gold-Recovery cloud – the replica virtual machines are attached to the corresponding networks due to this mapping. For example, the replica virtual machine of Marketing is attached to Network Marketing Recovery since (a) the primary virtual machine is connected to Network Marketing and (b) Network Marketing in turn is mapped to Network Marketing Recovery.
For automated recovery, there are Recovery Plans, which is the orchestration construct that supports dependency groups and custom actions. Today, organizations create documents detailing their disaster recovery steps. These documents are cumbersome to maintain and even if someone made the effort to keep these documents up-to-date, they were prone to the risk of human errors by the staff hired to execute these plans.
RPs are the manifestation of our goal of ‘simplify orchestration’. With RPs, data is presented in the recovery plan view to help customer’s compliance/audit requirements. For example, in a quick glance customers can identify the last test failover of a plan or how long ago they did a planned failover of a recovery plan.
Recovery plans work uniformly for all the 3 types of failovers, namely Test Failover, Planned Failover, and Unplanned Failover. In a recovery plan, all virtual machines in a group fail over together thereby improving RTO. Across groups, the failover is in a sequence thereby preserving dependencies - Group1 followed by Group2, and so on.
In recovery, there is a continuum based on customer needs and expertise (as illustrated in the timeline below).
Organizations use DR drills, known as Test Failovers in the solution, for various purposes, such as compliance reasons, training staff around roles via simulated runs, verification of the system for patching, etc.
HRM leverages HVR and the networking support in VMM to deliver simplified DR drills. In most DR solutions, drills either impact production workloads or the protection of the workloads which makes it hard to carry out testing. HRM impacts neither, making regular testing a possibility.
To increase the quality of testing in an automated manner, and help customers focus on truly testing the app, a lot of background tasks are taken care of by the system. The DR Drill creates the required Test Virtual Machines in the right mode.
From the network perspective, the following options are available to run test failovers:
Once you signaled that DR drills are completed, the system cleans up whatever it has created.
In the following cases, you need planned failover. Some compliance requirements for organizations mandate the failover of workloads twice-a-year to the recovery site and then running it there for a week. A typical scenario is if the primary site requires some maintenance that prevents applications from being run on that site.
To help customers fulfill these scenarios, HRM has first-class support for planned failover through its orchestration tool, RP. As part of PFO, the Virtual Machines are shut-down, the last changes sent over to ensure zero data loss, and then virtual machines are brought up in order on the recovery site.
After a PFO you can re-protect the virtual machine via reverse replication, in which case the replication goes in the other direction – from VMM-Chicago to VMM-NewYork. Once the maintenance is completed or the compliance objectives are met, you do a PFO in a symmetric manner to get back to VMM-NewYork. Failback is single click gesture which executes planned failover in the reverse direction.
Much like insurance, we hope a customer never needs to use this! But, in eventualities such as natural disasters, this ensures that designated applications can continue to function. In the event of unplanned failovers, HRM attempts to shut down the primary machines in case some of the virtual machines are still running when the disaster strikes. It then automates their recovery on the secondary site as per the RP.
Despite all preparations, during a disaster things can go wrong. Therefore, an unplanned failover might not succeed in the first attempt and thus require more iteration. In the solution, you can re-run the same job and it will pick up from the last task that completed successfully.
Like in PFO, failback after unplanned failover is a single click gesture to execute planed failover in the reverse direction. This brings about a symmetric behavior in these operations.
Topologies evolve in an organization with growth, deployment complexities, changes in environments, etc. Unlike other solutions, HRM provides the ability to administer various different topologies in a uniform manner. Examples of this include an active-active wherein the NewYork site provides protection for Chicago and vice-versa, many sites being protected by one, complex relationships, or multiple branch offices going to a single head office. This enables capacity from secondary sites to be utilized and not reserved while not running the workloads from the primary site.
With DR solutions comes a strong need for a simple and scalable monitoring. HRM delivers on this by providing the right view for the right jobs. For example, when a user takes an action on the HRM portal to setup infrastructure, HRM recognizes that he would be monitoring the portal to ensure the action succeeded. Therefore a rich view is present on the portal in the “Jobs” tab to help this user monitor in-progress jobs and take action on those that are waiting for action. On the other hand, when the user is not proactively monitoring the portal and the system needs to draw attention, this is provided through integration in the Operations Manager (SCOM).
The table below captures this continuum and shows how all the required monitoring needs of the users are met.
Since all DR tasks are long-standing, workflows are created by the system. We have built a rich jobs framework that helps you monitor the jobs, query for previous jobs, and export the jobs. You can export the jobs to keep an audit book of the same, as well as track the RTO of the recovery plans to help, the jobs report task-level time granularity.
First, a highly available, or HA VMM, is a recommended deployment for VMM to protect against downtime due to maintenance/patching of VMM both on the primary and secondary site. HRM works seamlessly in this case: You can failover the VMM service from one VMM server to another and the management and orchestration will fail over seamlessly.
Second, there are scenarios wherein users use one VMM to manage both the sites – primary and secondary. An example of this is when one admin is managing both the sites (and these sites happen to be close by to each other), therefore wanting to see the virtual machines of both sites in one console. HRM supports a single VMM scenario and enables pairing of clouds administered by the same VMM Server.
Note that in the event of a complete failure of the primary site, a customer has to recover the VMM server itself on the secondary site before proceeding. Once VMM is up on the secondary site, the rest of the workloads can be failed over.
Microsoft’s DR solution addresses a key gap in the disaster recovery market around simplification and scale. Delivered as a service in Azure, HRM is designed to enable protection for many workloads that are currently lacking protection, thereby improving business continuity for organizations.
A solution like this is a game-changer for organizations of any size, and it is an incredibly exciting thing for the Windows Server, System Center, and Windows Azure teams to deliver.
To learn more about the topics covered in this post, check out the following articles:
It's, high level valuable information for the versed in the computer engineering ,it's too much for a beginner like me to really understand full;y to benefit myself.
I have to take issue with your statement about it being to high of a cost for enterprises to implement. I've done half a dozen with a cheap solution and all are very happy. The cost was extremely low and it was so easy.
The one weakness in this new model is the Enterprise SLA - if you look at the penalties for not hitting the availability targets, the penalty is 10%-25% discount credits on VMs for that month, pretty measly penalty for an "Enterprise" SLA.
John, what cheaper solution are you using?