Live Migration, Cluster Shared Volumes & Networks

The recommendation for people setting up live migration clusters is to isolate different kinds of traffic on their own networks:

  1. Public network to access the cluster and the virtual machines running on it
  2. “Private” cluster heartbeat network
  3. “live migration” network
  4. iSCSI network, if required to access shared storage

How do we determine what traffic goes where?

For public and private, the failover cluster manager tool is self-explanatory:

We select the appropriate cluster network properties. If we want to limit such a network to private traffic, we do not allow clients to connect through it.

If we don’t want the cluster to use such a network at all, e.g. because it is dedicated to iSCSI, we select the “Do not allow…” button.

How about the live migration traffic, though? It can be quite heavy, as we are copying memory pages from one host to another. We can select the order in which cluster networks are used for such traffic through the failover cluster manager.

The property requires some digging: expand “services and applications”, select the virtual machine in question, then in the main panel right-click on “virtual machine <name>” and you’ll see a tab called “network for live migration”. You can then select the networks that you want to use and sort them in order of priority. By default, live migration will select a network that is NOT used for CSV traffic. Note that you may have networks in this panel that were not selected for cluster use before. If you use iSCSI, de-select the relevant entry to make sure that the live migration traffic does not go through that network.

This brings me to cluster shared volumes. One of the great features of CSVs is that if the storage link (iSCSI, fibre) becomes unavailable for any reason on a node, storage traffic can be redirected over the cluster network to another node and hence to the storage device. But which cluster network?

Inter-node communications and CSV traffic will use the available network authorized for cluster use that has the lowest metric value. We can see the metrics with the old cluster.exe tool:

C:\Windows\system32>cluster net /prop
Listing properties for all networks:

T  Network              Name                     Value
-- -------------------- ------------------------ -----------------------
SR Cluster Network 1    Name                     Cluster Network 1
MR Cluster Network 1    IPv6Addresses            <cut on purpose> 
MR Cluster Network 1    IPv6PrefixLengths        <..>
MR Cluster Network 1    IPv4Addresses            <cut on purpose>
MR Cluster Network 1    IPv4PrefixLengths        <..>
SR Cluster Network 1    Address                  <..>
SR Cluster Network 1    AddressMask              <..>
S  Cluster Network 1    Description
D  Cluster Network 1    Role                     3 (0x3)
D  Cluster Network 1    Metric                   10001 (0x2711)
D  Cluster Network 1    AutoMetric               0 (0x0)
SR Cluster Network 2    Name                     Cluster Network 2
MR Cluster Network 2    IPv6Addresses
MR Cluster Network 2    IPv6PrefixLengths
MR Cluster Network 2    IPv4Addresses            <..>
MR Cluster Network 2    IPv4PrefixLengths        <..>
SR Cluster Network 2    Address                  <..>
SR Cluster Network 2    AddressMask              <..>
S  Cluster Network 2    Description
D  Cluster Network 2    Role                     1 (0x1)
D  Cluster Network 2    Metric                   1000 (0x3e8)
D  Cluster Network 2    AutoMetric               1 (0x1)

Note the 3 values:

  • Role: 1 for a private network, 0 for ignored by cluster, 3 for mixed traffic
  • Metric: the “weight” of the connection, generally in the 10,000 range for public networks, 1,000 for private ones. If a network has a default gateway, it is considered public; if not, private. Should there be more than one private or public network, the metric is incremented by 100 in order of enumeration (e.g. private network 2 will have a default metric of 1,100)
  • AutoMetric: 1 if the metric is set automatically by the cluster, 0 if you have set it manually.
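The default metric rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this post (10,000 base for networks with a default gateway, 1,000 for those without, plus 100 per additional network of the same kind); the function name and the network names are made up for the example, and the real cluster service may weigh other factors.

```python
def default_metrics(networks):
    """Assign default metrics to (name, has_default_gateway) pairs.

    Follows the rule of thumb from the text: networks with a default
    gateway are treated as public (base 10000), the others as private
    (base 1000); each further network of the same kind adds 100.
    """
    PUBLIC_BASE, PRIVATE_BASE, STEP = 10000, 1000, 100
    seen = {"public": 0, "private": 0}
    metrics = {}
    for name, has_gateway in networks:  # enumeration order matters
        kind = "public" if has_gateway else "private"
        base = PUBLIC_BASE if has_gateway else PRIVATE_BASE
        metrics[name] = base + STEP * seen[kind]
        seen[kind] += 1
    return metrics


# Hypothetical networks: two private ones and one public one.
example = default_metrics([
    ("Heartbeat", False),
    ("Public", True),
    ("Live Migration", False),
])
print(example)
```

With these inputs the second private network ends up at 1,100, which matches the example given in the list above.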

So in my simple case the heartbeat network will also be used for CSV traffic. If you have more than 1 private network and you want to prioritize them, you can set the metric with cluster.exe, e.g.

C:\Windows\system32>cluster net "Cluster Network 2" /prop metric=1001

C:\Windows\system32>cluster net "Cluster Network 2" /prop

Listing properties for 'Cluster Network 2':

T  Network              Name                     Value
-- -------------------- ------------------------ -----------------
SR Cluster Network 2    Name                     Cluster Network 2
MR Cluster Network 2    IPv6Addresses
MR Cluster Network 2    IPv6PrefixLengths
MR Cluster Network 2    IPv4Addresses                 <..>
MR Cluster Network 2    IPv4PrefixLengths             <..>
SR Cluster Network 2    Address                       <..>
SR Cluster Network 2    AddressMask                   <..>
S  Cluster Network 2    Description
D  Cluster Network 2    Role                     1 (0x1)
D  Cluster Network 2    Metric                   1001 (0x3e9)
D  Cluster Network 2    AutoMetric               0 (0x0)

Redirection of the traffic is automatic: if a network becomes unavailable, the next-lowest-metric one will be used. If another network with a lower metric becomes available, it will be used from that point onwards.
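To make the selection rule concrete, here is a small Python sketch of "lowest metric wins, with automatic fallback". It is a model of the behaviour described in this post, not the cluster service's actual code; the dictionary layout and the "up" flag are assumptions made for the example.

```python
def pick_cluster_network(networks):
    """Return the name of the usable network with the lowest metric.

    A network is usable if its Role allows cluster traffic (1 or 3)
    and it is currently reachable. Returns None if nothing qualifies.
    """
    usable = [n for n in networks if n["role"] in (1, 3) and n["up"]]
    if not usable:
        return None
    return min(usable, key=lambda n: n["metric"])["name"]


# Values taken from the cluster.exe listings above; "up" is hypothetical.
networks = [
    {"name": "Cluster Network 1", "role": 3, "metric": 10001, "up": True},
    {"name": "Cluster Network 2", "role": 1, "metric": 1001, "up": True},
]

first_choice = pick_cluster_network(networks)   # private network wins
networks[1]["up"] = False                       # private link fails
fallback = pick_cluster_network(networks)       # next-lowest metric
networks[1]["up"] = True                        # link comes back
restored = pick_cluster_network(networks)       # back to the lowest metric
print(first_choice, fallback, restored, sep=" -> ")
```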

In Summary

By default, live migration traffic will be put on the network with the second-lowest metric. CSV traffic will be put on the network with the lowest metric. In this simple example, I just have a public and a private network, so the public one is used for live migration and the private one for CSV and cluster traffic.

P2V with SCVMM – a quick tip

System Center Virtual Machine Manager (SCVMM) has been offering a relatively simple way of doing physical-to-virtual migrations (P2V) for a while. You just click on the “Convert Physical Server” icon and off you go. Despite the name, it also works with client target machines. It’s simple, if you do some preparation work beforehand.

In fact, VMM will ask you for the name or IP address of the machine in question and for administrator credentials on it. Those will be used to reach the machine and install a P2V agent on it. For the process to work correctly, the firewall of the target machine must let through:

  • WMI traffic
  • http
  • file and print
  • remote management

Also, make sure that the ADMIN$ share exists and start the Windows remote management service on the target machine.

By default, most of these ports and services are closed.

VHD Boot

With Windows 7 and Server 2008 R2 you get the opportunity to boot directly from a vhd file. The operating system in the vhd file will have direct access to the machine hardware. It will not run as a virtual machine with synthetic or emulated adapters, but as a “real machine”. VHD happens to be the format that is used to represent a disk to the o/s. The physical disk will contain a set of vhd files and still be visible as a disk to the o/s you boot. Thus, you won’t require a partition per o/s.

Assuming that you are running Windows 7 or 2008 R2, here’s how you can set it up:

1. Create a vhd file to contain your o/s.

I found that 15 GB are enough for Server 2008 R2 + Hyper-V role (you can enable any role on the o/s in that VHD). You can use diskpart from the command line or the disk management tool. Make sure to select a fixed disk size.

Mount that vhd file, e.g. to drive letter W:
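If you prefer the diskpart route, the whole of step 1 can be scripted. The Python sketch below only generates the diskpart script text; the VHD path is a made-up example, and the 15 GB size and drive letter W: are the values used in this post. You would save the output to a file and run it with diskpart /s <file> from an elevated prompt.

```python
def diskpart_script(vhd_path, size_mb, letter):
    """Build a diskpart script that creates, attaches and formats a fixed VHD."""
    return "\n".join([
        # Fixed-size virtual disk, size given in megabytes.
        f'create vdisk file="{vhd_path}" maximum={size_mb} type=fixed',
        f'select vdisk file="{vhd_path}"',
        "attach vdisk",
        # One NTFS partition covering the whole virtual disk.
        "create partition primary",
        "format fs=ntfs quick",
        f"assign letter={letter}",
    ])


# Hypothetical file location; 15 GB as suggested above.
script = diskpart_script(r"C:\vhd\r2.vhd", 15 * 1024, "W")
print(script)
```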

2. Apply a WIM image to the VHD file you just created.

You can generate a WIM image of a pre-installed “golden” machine with the imagex tool, part of the Windows Automated Installation Kit (WAIK).

Alternatively, you can use the WIM image provided with the Windows installation media in sources\install.wim.

If all you have is an iso file (e.g. a 2008 R2 evaluation version you’ve just downloaded), there are utilities like MagicISO which will let you mount it as a disk, so you can use the install.wim within the file.

The simplest way to apply the WIM image is to use the Install-WindowsImage PowerShell script, which you’ll find on MSDN. The installation media contains several versions of Windows, so make sure you select the one you are licensed for.

In powershell, type:

.\Install-WindowsImage.ps1 -WIM D:\Sources\Install.wim

to obtain a list of the available images on your installation DVD (D:\). Note the index number of the image you are interested in.

Type

.\Install-WindowsImage.ps1 -WIM D:\Sources\Install.wim -Apply -Index 3 -Destination W:\

to apply the 3rd image on the VHD drive you mounted previously (W:\)

3. Make the VHD bootable

Open a command prompt and type:

W:\windows\system32\bcdboot w:\windows

Bcdboot creates the boot configuration data (BCD) entry needed to boot Windows from the vhd file and sets it as the default option.

4. Check the bcd entry and set your preferred default

At an administrator’s command prompt, type:

bcdedit /v

Note the identifier of the entry that you want to use as default boot option.

Type:

bcdedit /default {identifier}

to set the default.

That’s it – you can now boot from your vhd file.

You need not stop here, however. There is no need to start from a regular o/s installation: you can configure vhd boot from the installation media and have all your Windows o/s boot from vhd files. Keith Combs explains how on his blog.

A Free Book on Microsoft Virtualization

Understanding Microsoft Virtualization Solutions from Microsoft Press is available as a FREE download.  

This 15MB E-Book gives an overview of all current Microsoft Virtualization technologies: Hyper-V, the Microsoft Enterprise Desktop Virtualization (MED-V), and VDI. It also describes which management solutions are available for them (e.g. System Center Virtual Machine Manager) and how they fit together. It is worth reading when planning the virtualization of your infrastructure.

You can find it here: http://csna01.libredigital.com/?urmvs17u33

Updated Infiniband on Server 2008 Paper

I have finally updated my notes on the installation of Infiniband on Windows Server 2008. They now cover the released version 2.0 of Mellanox WinOF stack. You can find the document in my skydrive public folder.

Let me know if you find it useful.

Powered by Qumana

Faking Networks

On a Windows HPC Server 2008 head node, that is...

1. No Infiniband on the head node

In many cases people want to save themselves some money by not installing an Infiniband adapter on the head node, thereby also sparing a port on that expensive infiniband switch. It makes a lot of sense, especially when you plan not to perform any calculations on such machine. So, how do we make the software believe it has an Infiniband adapter?

The HPC management tools do not care too much about the type of connection you have, as long as they can get an IP address to communicate with. So, you can install a "loopback adapter", give it a fixed IP address and pretend it is a real network card. Of course, you will not be able to use it to communicate with the compute nodes, but if all you want to carry on IB is MPI traffic amongst those, the trick will work.

The only caveat is that you lose the ability to use dhcp on the infiniband network, hence you will have to provide a mechanism to assign fixed IP addresses for IPoIB communication. Of course the subnet you use on the "fake" IB and the real one must be the same.

The easiest way is possibly to write a small script that uses the netsh command, then run it on all the compute nodes. You will need at least 1 private Ethernet network for management traffic across the cluster.

For instance, the command below will assign the ip address 192.168.3.100 and a 24-bit mask to the network connection called "Application"

netsh int ip set address "Application" static 192.168.3.100 255.255.255.0
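As a starting point for such a script, the Python sketch below generates one netsh command per compute node, handing out consecutive addresses on the IPoIB subnet. The node names, the starting host number and the subnet are assumptions chosen to match the example above; adapt them to your cluster and run each command on its own node.

```python
import ipaddress


def netsh_commands(node_names, subnet="192.168.3.0/24",
                   first_host=101, connection="Application"):
    """Map each node name to the netsh command that sets its static address."""
    net = ipaddress.ip_network(subnet)
    commands = {}
    for offset, node in enumerate(node_names):
        address = net.network_address + first_host + offset
        # Refuse addresses outside the subnet or reserved within it.
        if address not in net or address in (net.network_address,
                                             net.broadcast_address):
            raise ValueError(f"no usable address left for {node}")
        commands[node] = (
            f'netsh int ip set address "{connection}" '
            f"static {address} {net.netmask}"
        )
    return commands


# Hypothetical node names.
commands = netsh_commands(["NODE01", "NODE02", "NODE03"])
for node, command in commands.items():
    print(node, "->", command)
```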

2. No public ethernet

In several cases I found that the head node has only 1 ethernet card. Out of the box, our HPC software prevents the use of Windows Deployment Services and DHCP unless you have at least 2 adapters, in order to avoid conflicts with existing deployment solutions. You may choose to install a fake "public" network on a loopback adapter and thus enable WDS on the real "private" network.

3. No private ethernet

Another interesting case is the one you get with many pre-built clusters, which provide 1 Ethernet and 1 Infiniband network in the box.

Note that when you install an Infiniband stack (e.g. WinOF 2.0), you typically get an IP-over-IB protocol provider. Thus, it is possible to use the infiniband network to route private cluster traffic, with the exception of deployment (no PXE-boot over IB). For "heavy" mpi applications, you will want to keep the two networks separate anyway.

Live Migration in R2

I've got a lot of questions about Live Migration in 2008 R2. Rather than writing a long post on it, I thought I'd point you at some resources I found useful whilst setting up my test environment, so you can build one too:

 
I also recorded the steps to build such an environment in a series of screencasts that will be appearing on http://edge.technet.com. The first one is already there, so check it out. They are short out of necessity, so it will take a couple of weeks for all of them to appear.

Turning hyper-v on and off

I use hyper-v on my laptop. When I know I don't need VMs for the day, I can squeeze a bit more performance out of the machine by turning hyper-v off with:

bcdedit /set hypervisorlaunchtype off

and a reboot. To turn it back on:

bcdedit /set hypervisorlaunchtype auto

and reboot.

What is new in virtualization with Windows Server 2008 R2?

There are some quite interesting improvements in Windows Server 2008 R2 (what was wrong with W7 as a name?) that help us progress toward a dynamic infrastructure. Three of them are worthy of highlighting: live migration of virtual machines in hyper-v, cluster shared volumes and core parking.

1. Live Migration

Live migration refers to the ability to move a running virtual machine from one host server to another without loss of service. For this to happen, we have to transfer the current virtual machine state and memory pages between machines and we have to guarantee both servers the same level of access to the virtual machine files. The process can be summarized as follows:

  1. Create a virtual machine on the target server
  2. Copy the memory pages of the running virtual machine in question from the source to the target server via Ethernet. While we copy, those memory pages may change, so after an initial pass we have to go back and copy the changed set again, until a minimum threshold number of pages is reached. It is hard to fix the threshold: ideally, it will be the number of pages that can be copied within a TCP connection timeout, so the clients won’t notice.
  3. Pause the source machine; copy its state across.
  4. Resume the target machine and issue an ARP update so that network traffic is directed to the new host.
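The iterative copy in step 2 is easier to see with numbers. The Python sketch below is a toy model, not Hyper-V's algorithm: it assumes that a fixed percentage of the pages copied in each pass get dirtied again while that pass runs, and it stops when the remaining set drops to the threshold or a pass limit is hit. All parameter names and values are invented for the illustration.

```python
def precopy_rounds(total_pages, dirty_percent, threshold, max_rounds=10):
    """Simulate the pre-copy passes of a live migration.

    Returns (pages copied in each pass, pages left for the final
    copy that happens while the source machine is paused).
    """
    to_copy = total_pages
    copied_per_round = []
    while to_copy > threshold and len(copied_per_round) < max_rounds:
        copied_per_round.append(to_copy)
        # Pages dirtied during this pass must be sent again.
        to_copy = to_copy * dirty_percent // 100
    return copied_per_round, to_copy


rounds, final_pages = precopy_rounds(
    total_pages=100_000, dirty_percent=10, threshold=200)
print("passes:", rounds)
print("pages left for the paused copy:", final_pages)
```

In this model three passes bring the outstanding set from 100,000 pages down to 100, which is what gets copied in step 3 while the source machine is paused. The pass limit is there because a machine that dirties memory faster than the network can copy it would otherwise never converge.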

For (3) to happen quickly and transparently to the clients, the target server must have immediate access to the virtual machine files. It cannot wait for a disk volume to fail-over and possibly go through file system checks. That’s where cluster shared volumes come in.

2. Cluster Shared Volumes

Cluster Shared Volumes enable concurrent access to the same LUN by several nodes. Consequently, all the nodes see the same NTFS file-system and namespace. By the way, CSV is not a parallel or a cluster file system. It was designed with the live migration scenario in mind.

Since the host servers already mount the CSV, there is no need to arbitrate for disk access and fail over the volume hosting the virtual machine files. All you need to do is transfer ownership of those files and their locks to the target server.

CSVs are implemented via a filter driver mechanism, which is used to establish the access path to the underlying LUNs. This also enhances our fail-over ability, as file system requests will be redirected over the network to another server if a direct SAN access is no longer available.

3. Core idling or parking

Changes in Windows 7 power management allow for “density” scheduling, i.e. minimizing the number of processor cores on which work is done, hence maximizing their utilization. The idle cores can be put to sleep (low-power state Cx under the ACPI specifications), thus reducing power consumption. Hyper-V can take advantage of this feature and schedule its virtual machines accordingly. Power management policies can be controlled via WMI, policies and scripts.
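The idea behind “density” scheduling can be illustrated with a simple packing exercise. The Python sketch below uses a first-fit-decreasing heuristic to place thread loads (expressed as a percentage of one core) on as few cores as possible; it is purely an illustration of the concept and makes no claim about how the Windows scheduler or Hyper-V actually decide. The load figures are invented.

```python
def pack_onto_cores(loads, core_count, capacity=100):
    """Place loads on the lowest-numbered cores that still have room.

    Returns the utilization of each core; cores left at 0 are
    candidates for parking.
    """
    cores = [0] * core_count
    for load in sorted(loads, reverse=True):  # biggest loads first
        for i in range(core_count):
            if cores[i] + load <= capacity:
                cores[i] += load
                break
        else:
            raise ValueError("not enough core capacity for all loads")
    return cores


cores = pack_onto_cores([30, 20, 50, 40, 10], core_count=4)
parked = [i for i, used in enumerate(cores) if used == 0]
print("utilization per core:", cores)
print("cores that could be parked:", parked)
```

Here five loads that a "spread evenly" policy would scatter over all four cores fit on two, leaving the other two idle.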

If you combine “density” scheduling with the ability to move virtual machines among hosts, you achieve quite a scalable, efficient and dynamic solution to the distributed resource allocation problem. Now, all that remains to do is automate it. Stay tuned.

4. References

ACPI explanation on Wikipedia

WinHEC 2008 conference whitepapers

Engineering Windows 7 blog

The Windows blog

The Windows Server 2008 R2 Reviewers' Guide http://www.microsoft.com/windowsserver2008/en/us/r2.aspx 

Upgrading from an evaluation version

I have received a few questions about upgrades from the evaluation version that you can download from microsoft.com/hpc to a full version.

The good news is that the evaluation version is fully functional, so you won't need a complete re-installation. The only thing you need to do is obtain a full licence key, then:

- To upgrade the hpc pack tools you have to run “upgrade.exe” on the head node. The hpc pack CD contains the upgrade.exe file.

- To upgrade the o/s, you have to obtain a full licence key for all the nodes, then run slmgr.vbs -ipk <new licence> across the cluster. You can do that from the command line (clusrun /all) or via the GUI.

You can also use slmgr.vbs to extend the evaluation period by another 60 days. When you are approaching the end of the evaluation, simply run slmgr.vbs -rearm across the cluster. Note that the evaluation does not require activation, but a full licence does.

Please see http://support.microsoft.com/kb/948472 for more information.

Proxies and Compute Nodes

You’ve prepared your templates, configured your network, your firewalls and everything you could think of, yet your automated provisioning takes forever and eventually fails…

Well, check if you have a patching task in your node template. If you do, you’ll need a way to reach the Microsoft Update service and download any patches. You may need to set a proxy on the nodes for that. Alas, the GUI does not offer you an option to do that. Also, any proxy setting that you specify in Internet Options is effective just for the logged-in user. So, how can you set a proxy for Windows Update to use?

The Windows Update service uses WinHTTP for its connections. You can set a WinHTTP-level proxy with:

netsh winhttp set proxy proxy-server="http=<your proxy:port>" bypass-list="<local>"

Where <local> is typed literally <local>. You could have that command line run before the patching task in the template.

Alternatively, you could deploy the nodes without the patching task, run that command across the cluster, then apply a template with a patching task.

Last but not least, you could set up a Windows Update Server on your corporate network and then use group policies to direct the update service on the nodes to that server.

Anyway, if your nodes go anywhere near the Internet, please keep them patched!

Upgrading to HPC Server 2008 RC1

Well, there is no upgrade path, so the quickest way is to re-image.
Download the RC build of the HPC software from connect.microsoft.com. If you have Infiniband cards, download the latest WinIB-ND drivers (1.4.0.2577) from http://www.mellanox.com. 
 
1. Re-image the head node and install the latest HPC pack.
2. Unzip the WinIB package on the head node, e.g. to c:\ib.
3. Open device manager and update the drivers for the Infiniband adapter, then the openib adapter.
4. Point the wizard to c:\ib\inf to select the appropriate IB driver. Add any other drivers as required.
5. Build a new o/s image (or re-use a previously built one) 
6. Click on "manage drivers" in the to-do list. Point the wizard to c:\ib\inf to add the drivers to the image
7. Create a new node template with the image you built
8. Reboot the compute nodes and wait for re-deployment to complete
9. Use clusrun to copy c:\ib from the head node to the compute nodes
10. Use clusrun to run c:\ib\inf\ndinstall -i on all nodes and thus install the new ND provider.
 
If you are using OpenSM, you'll find a new version of it in c:\ib\tools.
You may also have to reboot the Infiniband switch after the driver update. I haven't figured out why yet, but just after the driver update IPoIB worked without problems while MPI over ND did not. Rebooting the switch seemed to fix this.
 
 
Stop Climate Change?! – Part 2

I have been investigating some more in the area of Green IT, S+S. Some ideas and a lot of questions have come to mind. Please read on and let me know if they make any sense.

By the way, Part 1 is here :-)

1. How do you understand the status quo?

This may prove to be the most difficult part of the job. There aren’t many tools available. System Center Operations Manager, plus a few OEM management packs, are a starting point. Alas, you must build your own model to establish correlations between power utilization measured and applications over time. From those, you can derive a measure of efficiency. A few 3rd-party applications (e.g. Verdiem’s Surveyor, Avocent and APC InfrastruXure) do a better job of establishing the baseline, although again they do little for the correlation analysis.

IBM’s Active Energy Manager goes a step further (on IBM hardware) by allowing you to study trends and to take action on specific energy-related conditions. Again, it is not a complete “IT intelligence” tool.

2. How do you design your infrastructure and applications to optimize consumption?

Once you understand what type of load consumes what power (no small feat):

1. Can you reduce the physical tiers of your architecture? For instance, if you have a memory-intensive application and a CPU-intensive one, you may want to co-host them, thus using all the available cores and saving a few machines’ worth of power. This will only work from a performance point of view if you manage resource allocation tightly to avoid contention. In our example, you would run a thread belonging to the memory intensive application on 1 core and a thread of the cpu-intensive one on the other core of the same CPU socket. Before embarking on such a consolidation exercise, you will want to estimate the costs and the savings, in terms of power and money. Also keep in mind that as a consequence of the changed workload, you may require different hardware (e.g. “whole machines” rather than just blades) to optimize your power consumption profile over time.

2. Can you reduce the logical tiers of your architecture? Here’s an example: your application may use Sharepoint as a front-end, windows workflow to manage business logic, SQL for data processing, all running on separate hardware. Sharepoint can host workflows. SQL handles workflows in Integration Services and it can host an in-process CLR. With some clever re-architecting of your application, you may be able to get rid of the middle tier by using some combination of the two workflow services. The whole area of “power-conscious” applications is yet to be explored. We’re investigating.

3. Can you offload a tier of your architecture? Here’s where Software + Services comes into play. For instance, you may consider using an on-line storage service (e.g. SQL Server Data Services, aka CloudDB or Sitka) instead of hosting your own SQL. If you have a compute-intensive application, you may want to farm it out to a HPC provider and pay by CPU cycles utilized (Microsoft will offer such a service, now in pilot stage with a few ISVs). If your provider is able to consolidate several users’ workloads on its servers and charge for capacity consumed, the overall carbon footprint may be reduced – along with your costs.

4. If you do offload a function, how do you measure its performance against SLAs? This is actually the most difficult point. Technology is available to do all of the above (although not necessarily on Windows). Capacity-on-demand, for instance, has been a feature of certain Mainframes and Unix systems for years. Hosted service offerings are widely available. However, different security boundaries and political pressures make it difficult to build tools that monitor an application across companies, let alone countries.

5. Can you offer or trade computing capacity? If you know how much you need, when and where, why not “sell” spare capacity? Again, S+S comes into play here. Grid computing is possibly the best example of implementation of a similar concept today.

3. What tools & techniques are available?

Hyper-V sounds like an obvious answer, but we are at risk of sounding like the proverbial person with just 1 hammer in the toolbox, to whom everything looks like a nail.

Virtualization is one powerful tool, but it must be used appropriately. One must carefully choose which workloads to virtualize, then which of those virtualized workloads can be combined on a single physical tier. Again, given a workload profile, that physical tier may look entirely different from your current one. Also, most often we speak only of host virtualization. For a complete solution, we must find the best combination of host, storage and network virtualization.

A caveat to keep in mind is that virtualization may be self-defeating without proper management practices. The ease of deploying virtual machines may lead operators to spawn far more than necessary. I have seen a few examples of this in large deployments.

Regulation (in the form of prescriptive guidance) may address some of the problem, but charging money is more effective. The idea of trading computing power may become useful in this scenario: imagine that you planned and budgeted for 200 VMs, but find out that you’re running just 150. You could sell the capacity for the remaining 50 to another part of your organization that requires it. They wouldn’t even need to buy or host servers. Who said that market economy principles cannot be applied to IT governance?

Co-hosting is another technique to optimize resource consumption, often neglected on Windows. If you can virtualize two workloads and run them together without a significant impact on performance, you may be able to gain even more by running them on the same o/s instance. The applications must of course be compatible (able to coexist). Thus, you eliminate the overhead of virtualization. Tools like WSRM allow you to change resource allocation dynamically, adapting to workload requirements. Unix and Mainframes have been doing this for decades, along with virtualization.

IIS6 and 7, for instance, are classical examples where co-hosting of several websites works very well. SQL2005 and 2008 are good examples too, where you can co-host several databases in one instance and several instances on one machine.

As for capacity optimization tools, I could not find a silver bullet. I mentioned a few so far; here’s a quick summary:

- System Center Virtual Machine Manager, with its workload analysis and placement functions, is instrumental in devising the best resource allocation.

- The Microsoft Assessment and Planning Toolkit is a useful, free instrument to plan for virtualization (amongst other things).

- System Center Capacity Planner is also very useful in designing the target architecture for certain workloads (Exchange, Sharepoint, Operations Manager).

- For a far more sophisticated (and expensive) capacity management and planning suite, you may want to look at tools like SAS.

- System Center Operations Manager, plus management packs provided by OEMs, is useful to obtain a baseline of resource utilization.

- IBM’s Active Energy Manager is a great example of what we can do with the data.

4. Further reading

Here are a few pointers that may help inform a discussion:

- Lewis Curtis’s blog: http://blogs.technet.com/lcurtis/

- Little Miss Enviro-Geek http://blogs.technet.com/lmeg/default.aspx

- The Green Datacenter Blog: http://www.greenm3.com/2008/07/new-coal-electr.html

- MAP Toolkit

- Microsoft’s Environment web page

- IBM Green Datacenter paper

- IBM Active Energy Manager

- Windows Server 2008 Power Savings

- Green Computing Paper

- The Green Grid

- Infrastructure Planning and Design

I'll be there in 2 microseconds!

Fantastic news! Mellanox has released the beta 2 version of their WinIB 1.4 stack, which works with HPC Server 2008 beta 2 and has Network Direct providers for their latest ConnectX cards. The results announced at ISC 08 are outstanding:

- 2 microseconds' latency

- 2 GB/s throughput

Another outstanding result for HPC Server 2008 is the Umea cluster, at no. 39 in the Top 500 list:

- 46.04 TFlops

- 85.6% efficiency
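For readers new to the Top 500 figures: efficiency there is the measured Linpack result (Rmax) divided by the theoretical peak (Rpeak). The short Python calculation below derives the peak implied by the two numbers quoted above; the result is back-of-the-envelope arithmetic from rounded inputs, not an official figure.

```python
# Figures quoted above for the Umea cluster.
rmax_tflops = 46.04   # measured Linpack performance
efficiency = 0.856    # Rmax / Rpeak

# Rearranging efficiency = Rmax / Rpeak gives the implied peak.
implied_rpeak_tflops = rmax_tflops / efficiency
print(f"implied theoretical peak: about {implied_rpeak_tflops:.1f} TFlops")
```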

Hats off to the Network Direct team. Now Windows HPC Server 2008 plays with the big boys :-)

Teched session on HPC in top 20!

Phil Pennington and I presented a session on cluster performance optimization at Teched 2008 in Orlando. It made it in the top 20 list by customer satisfaction!!!

To all those who were there and voted for us: Thank you!!!

To all those who were not there but would still like to know about it: leave a comment!
