
Every business leader has found themselves, at one time or another, dealing with something unforeseeable or unexpected, and, in the IT world, workload spikes can sometimes come out of nowhere and wreak havoc on an organization’s infrastructure.

Dealing proactively with these spikes – and even building an infrastructure capable of handling them – is surprisingly simple with a Hybrid environment. Rather than incurring capex for additional server capacity to handle infrequent workload peaks, a Hybrid environment leverages local IT infrastructure for most business requirements and then supplements it with public (or service provider) cloud resources whenever the local resources are stressed.

This Hybrid technique is called Cloud Bursting and it allows the enterprise to expand its capacity as needed while making efficient use of its existing resources.

Cloud bursting is a technically advanced topic that is rapidly evolving, and in this blog post I’ll introduce some design guidelines and technical insights we’ve developed to help address use cases related to building an auto-scaling infrastructure. And don’t worry, I’ll link away to our engineering blogs for the deeply technical content rather than get into the weeds – I promise.

Challenges & Advantages

Cloud Bursting is a critical way to meet user experience requirements and SLAs without needing to own massive on-prem resources, but that doesn’t mean the process is without challenges of its own.

Although cloud bursting has the potential to save money and provide a consistent level of performance, doing this in a non-Hybrid environment presents some real challenges. In particular, app components running in a burst public cloud need access to secure data, and these apps must be designed from the outset to scale or they cannot take advantage of cloud bursting – retrofitting apps is time-consuming and expensive.

Additionally, depending on where you burst to, the other platform may not be compatible with the platform on which you did your dev/test – this can undo a lot of the progress made with the app and with your plans for using it. Other challenges include how to deal with differing APIs/policies/UI/tools, troubleshooting load balancing to other VMs, and maintaining control over external cloud resources during and after bursting. Compounding all of this is the fact that if these bursting actions are not well automated, doing them manually is difficult, time-consuming, and prone to error – all of which are fatal flaws when your infrastructure’s capacity is maxed out.

This is where a Hybrid Cloud can make a real impact – in fact, cloud bursting is built into the Hybrid Cloud architecture. A Hybrid Cloud is, after all, a deployment model in which two or more separate clouds are bound together. This built-in functionality enables some major advantages.

A really quick look at these advantages includes:

  • On-demand scale – capacity only when you need it, delivered fast.
  • Self-service functionality – you control how you want things to provision.
  • Flexibility – the bursting happens where, when, and how you want.
  • Fault tolerance – multiple locations can be used for redundancy.
  • Automation – limiting human error by taking manual steps out of the loop.

All of the above lead to big cost savings.

Cloud Bursting Scenarios

To see how cloud bursting can work in your organization, here are a couple of scenarios that show how you can do it efficiently with System Center 2012 R2 and Windows Server 2012 R2.

First, consider the needs of the fictitious Woodgrove Bank. Woodgrove’s payroll app is two-tiered (web front-end and SQL backend) and it deploys five front-end servers to handle peak loads throughout the month – but, aside from the first and last week of the month, those five servers are very under-utilized. The IT team at Woodgrove wants to scale down to a single front-end server for day-to-day operations (and to house the customer data side of the app), and get the scaled capacity they need during short-term peaks from a service provider.

To make this happen, Woodgrove’s IT team will need to address five things:

  • Monitoring the service state
  • Service scale orchestration
  • Service scaling management
  • Network connectivity between the private and public site
  • Data replication

Each of these five things can be implemented using various technologies, and the approach taken on each implementation determines how the bursting process is triggered and carried out. The goal is to allow the app or service to autonomously remain in compliance with its SLAs and, when necessary, add resources to meet demand.
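To make the goal concrete, here is a minimal sketch of the decision loop that ties these building blocks together: monitor a service-state KPI, then tell the orchestration layer whether to burst out, scale back, or hold. All function names, thresholds, and units here are illustrative, not part of any System Center API.

```python
# Illustrative scaling decision loop. Thresholds are made-up values; a real
# deployment would derive them from the SLA for the service.

SLA_MAX_RESPONSE_MS = 800   # breach above this -> burst out to the cloud
SLA_MIN_RESPONSE_MS = 300   # comfortably below this -> scale back

def scaling_decision(avg_response_ms, burst_instances):
    """Return the action the orchestration layer should take."""
    if avg_response_ms > SLA_MAX_RESPONSE_MS:
        return "scale_out"       # add capacity in the public cloud
    if avg_response_ms < SLA_MIN_RESPONSE_MS and burst_instances > 0:
        return "scale_back"      # tear down temporary cloud resources
    return "hold"                # within SLA, no change needed

print(scaling_decision(950, 0))   # -> scale_out
print(scaling_decision(200, 3))   # -> scale_back
print(scaling_decision(500, 0))   # -> hold
```

Note the asymmetric thresholds: leaving a gap between the scale-out and scale-back triggers keeps the loop from flapping when the load hovers near a single cut-off.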

Monitoring the Service State with System Center 2012 R2 – Operations Manager

The observable service state is critical in implementing an effective, dynamic IaaS architecture. If you don’t have a way of understanding what the delivered service is providing to the end users, you can’t react dynamically to that state.

To make the best decisions regarding how to adapt the system to provide the desired service level, you can use the end-user experience measure as a trigger for scaling your dynamic infrastructure up or down. For example, whenever the end-user experience falls below a threshold, capacity is added, and, as the measure rises back above the threshold, capacity is removed.
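One practical wrinkle when using end-user experience as the trigger is that a single slow sample shouldn’t fire a burst. A common approach – sketched below with invented names and made-up window/threshold values – is to smooth the KPI over a short window before acting on it:

```python
# Illustrative only: smooth the end-user experience KPI (here, response time)
# with a simple moving average before treating it as a scaling trigger.

from collections import deque

class KpiMonitor:
    def __init__(self, threshold_ms, window=5):
        self.threshold_ms = threshold_ms
        self.samples = deque(maxlen=window)   # keeps only the latest samples

    def record(self, response_ms):
        self.samples.append(response_ms)

    def breached(self):
        # Only signal a breach once a full window's average crosses the line.
        if len(self.samples) < self.samples.maxlen:
            return False
        return sum(self.samples) / len(self.samples) > self.threshold_ms

mon = KpiMonitor(threshold_ms=800)
for ms in (400, 1200, 500, 450, 480):   # one spike, otherwise healthy
    mon.record(ms)
print(mon.breached())   # -> False: a lone spike doesn't trigger a burst
```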

The drawback of using only an end-user monitoring approach is that it only tells you that your end users have a performance problem. What it doesn’t do is give your system any information about where the problem is or what is causing it. To overcome these shortcomings, a truly intelligent dynamic IaaS service considers the end-user experience as its key performance indicator (KPI) but still relies on infrastructure and application metrics to provide the causal analysis data.

One way to get the necessary visibility is with System Center Operations Manager 2012 R2 (SCOM) – a tool built to provide end-to-end, service-level monitoring, as well as insight into the health and performance of line-of-business apps.

Service Scale Orchestration with System Center 2012 R2 – Orchestrator and SMA

Once your KPIs have been determined and the proper thresholds are established, the next step is to ensure you have the necessary automation to burst your private cloud service into an on-demand public cloud. That automation should take the exact set of services and application configurations you have on premises and push them to the cloud.

The two key scenarios for automation are:

  • Scaling out to a public cloud to support load increases.
  • Scaling Back (i.e. Tearing Down) those resources in the public cloud when they are no longer needed.

Scaling Out to Your Public Cloud:

In the graphic below, an alert has been triggered according to a defined threshold that requires you to execute automation for bursting to your public cloud.  In that graphic, three key things are happening:

  1. SCOM identifies a threshold has been met.
  2. Orchestrator responds to this configured threshold breach by executing an SMA Runbook in the off-premise hosted facility.
  3. SMA Runbooks execute the automation to deploy a pre-configured Gallery Resource that extends your web front end service to this offsite location – thus supporting your on demand scale needs.

[Graphic: SCOM raises a threshold alert, Orchestrator executes an SMA Runbook, and the Runbook deploys a Gallery Resource to the off-premise cloud]
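The three steps above can be sketched as a simple chain of handlers. This is a hedged illustration of the flow, not the product APIs – the runbook and resource names are invented; in the real pipeline SCOM raises the alert, Orchestrator invokes the SMA Runbook, and the Runbook deploys the pre-configured Gallery Resource.

```python
# Hypothetical sketch of the scale-out chain (all names are invented).

def on_threshold_breached(alert):
    """Step 2: Orchestrator reacts to the SCOM alert by invoking a Runbook."""
    return invoke_runbook("Deploy-BurstFrontEnd", site=alert["burst_site"])

def invoke_runbook(name, site):
    """Step 3: the SMA Runbook deploys the Gallery Resource off premises."""
    return {"runbook": name, "site": site, "resource": "WebFrontEnd-Gallery"}

# Step 1: SCOM has identified that a threshold has been met and raised an alert.
alert = {"metric": "avg_response_ms", "value": 950,
         "burst_site": "ServiceProviderCloud"}
result = on_threshold_breached(alert)
print(result["resource"])   # -> WebFrontEnd-Gallery
```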

Note: A very similar automated process can be leveraged on a schedule to support bursting into the public cloud for predetermined peak times (and then scale back when demand is predicted to decrease).

Scaling Back or “Tearing Down” Public Cloud Resources:

To keep costs to a minimum, rebalancing compute resources back to your on-prem data center is necessary. Whether this is executed on demand or on a schedule, automation can be leveraged to streamline the scale-back process, which has three primary steps:

  1. SCOM identifies a threshold has been met that allows for scale back of bursted resources.
  2. Orchestrator responds to this configured threshold stabilization by executing an SMA Runbook in the off-premise hosted facility.
  3. An SMA Runbook executes the automation to tear down the temporary resources in your off premise cloud.

[Graphic: SCOM detects that the threshold has stabilized, Orchestrator executes an SMA Runbook, and the Runbook tears down the temporary public cloud resources]
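The teardown side can be sketched the same way. One detail worth noting: if the scale-out automation records the instances it created, the teardown Runbook can remove exactly those temporary off-premise resources and nothing else. Again, all names here are invented for illustration.

```python
# Illustrative teardown sketch for the three scale-back steps above.
# Only the tracked burst instances are removed; on-prem capacity is untouched.

burst_instances = ["burst-fe-01", "burst-fe-02", "burst-fe-03"]

def on_threshold_stabilized(instances):
    """Steps 2-3: run the teardown Runbook against each temporary instance."""
    torn_down = []
    for vm in list(instances):     # iterate over a copy while mutating
        instances.remove(vm)       # deprovision the temporary VM
        torn_down.append(vm)
    return torn_down

removed = on_threshold_stabilized(burst_instances)
print(len(removed), len(burst_instances))   # -> 3 0
```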

For more information on how Orchestrator and SCOM can be leveraged together when responding to these events please see: Operations Manager 2012 Integration Pack Overview for Orchestrator 2012.

For information on automating the scaling process of Gallery Resources within Windows Azure Pack, please check out this blog series about the mechanics of provisioning VM Gallery Resources through PowerShell automation: Automation: The New World of Tenant Provisioning with Windows Azure Pack (Part 1).  Also, pay close attention to one of my favorite blogs Building Clouds and the Automation Track for future posts on how to automate your hybrid cloud.

Service Scaling Management

The scaling management process is driven by directives contained within the Virtual Machine Role (VM Role) Resource Definition (ResDef). To get deeper into how to define those directives, click here.

Network connectivity between the private and public site

To use the newly added resources, the networking components first need to be configured. When auto scaling a service, the consumption of the new resources should be transparent to the user, and, for this reason, it’s necessary for access routes to be automatically updated with the appropriate configuration changes so that resources can be consumed.

To learn more about our hybrid connectivity solution, check out this analysis of Software-defined Networking.

Data bursting

The bursting process is one thing, but bursting data is a completely different concept.

Bursting data is one of the hardest challenges with cloud bursting today, and there are basically two options when dealing with data in a cloud bursting scenario:

  • Round trip to the primary site – keep the private cloud as the primary data site and point all burst activity at it.
  • Move the data closer to the scaled compute instances.

The first approach incurs heavy latency penalties, since each computation must make a round trip to the primary site to get its data. The second approach continuously streams replicated data between the sites, making the data available in the cloud as soon as the peak occurs. In this scenario, you can spin up compute instances and immediately start redirecting load.
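A back-of-the-envelope comparison shows why the round-trip penalty dominates. All numbers below are illustrative assumptions, not measurements:

```python
# Rough latency comparison of the two data approaches (illustrative numbers).

WAN_ROUND_TRIP_MS = 40    # private site <-> burst cloud, per data access
LOCAL_READ_MS = 1         # read against a replica in the same cloud
QUERIES_PER_REQUEST = 10  # data accesses a single page view performs

round_trip_cost = QUERIES_PER_REQUEST * WAN_ROUND_TRIP_MS
replica_cost = QUERIES_PER_REQUEST * LOCAL_READ_MS

print(round_trip_cost)  # -> 400 (ms of data latency per request)
print(replica_cost)     # -> 10  (ms when data is replicated near compute)
```

Even with a modest 40 ms WAN round trip, chatty apps pay it on every query, which is why streaming replication to the burst site is usually worth the extra complexity.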

To learn more about SQL Server replication types, see SQL Server Replication.