Exploring VMM Live Migration as part of a patching workflow

This post goes under the heading "Sometimes you have to learn the hard way"… or maybe "you don't know what you don't know". As I was building the workflows for patching VMM clusters using Opalis and the Configuration Manager Integration Pack, I wanted to make absolutely sure that what I was building was really going to work. I didn't want to just pull a few scripts off the web, drop them into some half-baked demo environment, and "hope" they worked. I wanted to be sure, so I set about creating a clustered VMM environment in my lab. We already have a demo environment built on Hyper-V virtual machines that includes all the System Center products, including a VMM server, so I thought I'd just create a *virtual* VMM cluster in that environment and save myself some hardware setup effort. I spent a couple of days getting the VMs installed, figuring out how to configure the cluster, and finding an iSCSI target to install on another machine for the shared storage. When I finally had VMM running in a virtual cluster configuration, I went to turn on a VM hosted on the virtual cluster… and of course it failed.

The reason it failed is that Hyper-V needs to talk directly to the hardware, and it can't do that from inside a VM. So while you can enable the Hyper-V role and get VMM running inside a virtual machine, you can't actually start a VM running in a VM. Ouch. So I went back to using physical hardware – I now have three laptops stacked on my desk running my VMM cluster. Lesson learned.


I didn't call this post "part 4" of the series because it isn't really about the workflows themselves. What I cover here does affect how you build your workflows, though, because patching a cluster is more involved than simply live migrating the VMs, putting the cluster node into maintenance mode, patching, and then enabling it again. As I ran through a number of scenarios, I found some interesting things I'd like to relay so you can account for them in your workflows.

Rule 1: Your VMs must be "highly-available" to be "live migrated" to another cluster node

When you have a clustered VMM host environment, your VMs can be placed there in a number of different ways (via Hyper-V, VMM, or PowerShell), but generally they will exist as either "highly available" (HA) or "not highly available". To be HA, the VM has to be stored on a cluster shared storage disk. If you import or create a VM and store it on one of the cluster node's local disks (one not shared by the other cluster nodes), that VM will not be HA, and therefore cannot be moved to another cluster node using live migration.

If you run the "Disable-VMHost" command or put a host into maintenance mode, any non-HA VM will automatically be put into a saved state. This instantly cuts off anyone connected to that VM until it comes back online. Given that the patch download and install process can take anywhere from a few minutes to an hour (depending on the number and size of the patches), this could be a bad thing for your users. You will need to account for these VMs before running your Disable-VMHost command and manually move them off to other hosts if you want to minimize the downtime.

Here's a quick PowerShell script that can help you determine if there are non-HA VMs running on the cluster (any node of the cluster):

 $vms = Get-VMHostCluster -Name "clustername" -VMMServer "vmmserver" | Get-VMHost | Get-VM | Where-Object {($_.IsHighlyAvailable -eq $false) -and ($_.Status -eq "Running")}

This single line of code will return a VM object (or objects, or nothing) representing all of the VMs on the cluster that are not HA and are also running. By running this prior to putting the cluster node into maintenance mode, you can make sure you are not accidentally cutting off access to VMs.
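If that query does turn up running non-HA VMs, you can relocate them on your own schedule instead of letting maintenance mode save them in the middle of your patch window. Here's a minimal sketch of what that move might look like – the target host name and destination path are placeholders I made up for illustration, and since a non-HA VM lives on local disk, Move-VM performs a network/SAN transfer rather than a live migration, so plan for a brief outage:

 $targetHost = Get-VMHost -ComputerName "otherhostname" -VMMServer "vmmserver"  # hypothetical target node
foreach ($vm in $vms)
{
    # Non-HA VMs sit on local disk, so this is a transfer to the target
    # host's storage, not a cluster live migration
    Move-VM -VM $vm -VMHost $targetHost -Path "D:\VMs" -RunAsynchronously  # "D:\VMs" is an assumed destination path
}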

Rule 2: Your VMs must be running to be "live migrated" to another cluster node

Even if your VM is highly available, if it's powered off or in a saved state, VMM will not make the effort to move it to another cluster node. Normally this is not a problem, but I mention it because there are instances where you want to use the same workflow to drain a VMM cluster node before doing something drastic like decommissioning the server, replacing hardware, or taking it out of service for an extended period. VMs on this server that are turned off are already inaccessible to anyone who wants to use them (until you turn them on), but if you take the cluster node down before moving them somewhere else, they'll be really inaccessible. So if you want the VMs to be available should someone want to turn them on, you will need to move them to another cluster node yourself.

Here's another one-liner to determine whether there are any HA VMs that are not powered on (only a slight difference from the code above):

 $vms = Get-VMHostCluster -Name "clustername" -VMMServer "vmmserver" | Get-VMHost | Get-VM | Where-Object {($_.IsHighlyAvailable -eq $true) -and ($_.Status -ne "Running")}
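
Once you've found these stopped or saved HA VMs, moving them yourself before the drain is straightforward. Because they are HA and sit on cluster shared storage, the move is a cluster ownership change rather than a file copy. Here's a minimal sketch, reusing the $vms list from the one-liner above and a made-up target host name:

 $targetHost = Get-VMHost -ComputerName "otherhostname" -VMMServer "vmmserver"  # hypothetical target node
foreach ($vm in $vms)
{
    # -UseCluster moves ownership between nodes via the cluster; the VM's
    # files stay put on the shared storage
    Move-VM -VM $vm -VMHost $targetHost -UseCluster -RunAsynchronously
}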

Rule 3: VMs that are automatically migrated to another node are not automatically migrated back

If all of your VMs were HA and running when you started maintenance mode, they all got automatically migrated off to other cluster nodes. Wonderful. You do your patching, bring the cluster node back online, and enable VMM placement again. What happens? Nothing. All of the VMs stay on the cluster nodes they were pushed to during live migration; they are not automatically migrated back. Now this may not be a problem for you – perhaps you just want them to go where they may. But I find that a lot of people feel they need to put things back the way they were. After all, those VMs were running on that cluster node for a reason, and leaving them where they landed may leave the cluster unbalanced.

To put all of the right VMs back on the original cluster node, you have to use a two-step process. First, when you begin maintenance mode, you get a list of all the VMs that are on the cluster node before moving them. Then, after maintenance mode is done, you check where those VMs are now and move them back to the original server. Luckily, this is easily accomplished using PowerShell. I could have written a more compact script and taken out a bunch of the blank space, but I opened it up to be more readable.

 # Identify the cluster and the node we are about to patch
$cluster = Get-VMHostCluster -Name "clustername" -VMMServer "vmmserver"
$vmhost1 = Get-VMHost -ComputerName "hostname" -VMHostCluster $cluster

# Record the names of the running HA VMs that are on this node right now,
# so we can put them back after patching
$vms = $vmhost1 | Get-VM | Where-Object {($_.IsHighlyAvailable -eq $true) -and ($_.Status -eq "Running")}
$vmNames = @()
if ($vms)
{
    foreach ($vm in $vms)
    {
        $vmNames += $vm.Name
    }
}

# Start maintenance mode, live migrating the HA VMs to other nodes, and
# wait for the job to complete or fail
Disable-VMHost $vmhost1 -JobVariable "disablejob" -MoveWithinCluster -RunAsynchronously
do
{
    Start-Sleep -Milliseconds 100
}
until (($disablejob.IsCompleted -eq $true) -or ($disablejob.ErrorInfo.DisplayableErrorCode -ne 0))

#
# Host drained - do patching here
#

# Take the host out of maintenance mode and wait for the job to complete or fail
Enable-VMHost $vmhost1 -JobVariable "enablejob" -RunAsynchronously
do
{
    Start-Sleep -Milliseconds 100
}
until (($enablejob.IsCompleted -eq $true) -or ($enablejob.ErrorInfo.DisplayableErrorCode -ne 0))

# Move each recorded VM back to the original node if it landed somewhere else
foreach ($vmname in $vmNames)
{
    $vm = Get-VMHostCluster -Name "clustername" -VMMServer "vmmserver" | Get-VMHost | Get-VM | Where-Object {$_.Name -eq $vmname}
    if ($vm)
    {
        if ($vm.HostName -ne $vmhost1.Name)
        {
            Move-VM -VM $vm -VMHost $vmhost1 -UseCluster -RunAsynchronously
        }
    }
}

As you can see, the patching happens in the middle of this script. In other words, the script would actually be broken into two parts, with one half run before the patching and the other half run after. We would simply save the $vmNames variable into the published data and subscribe to that data later in the workflow for the second half of the script.
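Since published data is flat text rather than a live PowerShell object, one simple way to carry the list between the two script activities (an illustration of the idea, not the only way to do it) is to flatten the array to a delimited string at the end of the first half and split it back apart at the start of the second:

 # End of part 1: flatten the array so it can be published as a single string
$vmNamesOut = $vmNames -join ";"

# Start of part 2: rebuild the array from the subscribed published data
# (the "{vmNamesOut}" placeholder stands in for wherever you subscribe to the value)
$vmNames = "{vmNamesOut}" -split ";"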

Hopefully this helps put some of the workflow concepts into perspective before I wrap up the four-part series and add all of the error handling and other pieces needed to make this a fully functional workflow.

Cheers!