TechNet Blogs
A Free Book on Microsoft Virtualization

Understanding Microsoft Virtualization Solutions from Microsoft Press is available as a FREE download.  

This 15 MB e-book gives an overview of the current Microsoft virtualization technologies: Hyper-V, Microsoft Enterprise Desktop Virtualization (MED-V) and VDI. It also describes the management solutions available for them (e.g. System Center Virtual Machine Manager) and how they fit together. It is worth reading when planning to virtualize your infrastructure.

You can find it here: http://csna01.libredigital.com/?urmvs17u33

Updated Infiniband on Server 2008 Paper

I have finally updated my notes on installing Infiniband on Windows Server 2008. They now cover the released version 2.0 of the Mellanox WinOF stack. You can find the document in my SkyDrive public folder.

Let me know if you find it useful.

Powered by Qumana

Faking Networks

On a Windows HPC Server 2008 head node, that is...

1. No Infiniband on the head node

In many cases people want to save some money by not installing an Infiniband adapter on the head node, which also spares a port on that expensive Infiniband switch. It makes a lot of sense, especially when you plan not to run any calculations on that machine. So, how do we make the software believe it has an Infiniband adapter?

The HPC management tools do not care too much about the type of connection you have, as long as they can get an IP address to communicate with. So, you can install a "loopback adapter", give it a fixed IP address and pretend it is a real network card. Of course, you will not be able to use it to communicate with the compute nodes, but if all you want to carry on IB is MPI traffic amongst those, the trick will work.

The only caveat is that you lose the ability to use DHCP on the Infiniband network, so you will have to provide a mechanism to assign fixed IP addresses for IPoIB communication. Of course, the "fake" IB adapter and the real IB network must use the same subnet.

The easiest way is probably to write a small script that uses the netsh command, then run it on all the compute nodes. You will need at least one private Ethernet network for management traffic across the cluster.

For instance, the command below assigns the IP address 192.168.3.100 and a 24-bit mask (255.255.255.0) to the network connection called "Application":

netsh int ip set address "Application" static 192.168.3.100 255.255.255.0
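Here is a minimal sketch of such a script, written in Python rather than as a batch file. It generates one netsh command per compute node and hands out consecutive static addresses. Several details are assumptions for illustration only: the node names (NODE01 and so on), the starting address 192.168.3.101 (leaving .100 for the head node's fake adapter), and the use of the same "Application" connection name on every node. Each generated line still has to be run on the node it belongs to.

```python
import ipaddress


def ipoib_commands(nodes, first_address, netmask="255.255.255.0",
                   connection="Application"):
    """Map each node name to a netsh command giving it a static address.

    Addresses are assigned consecutively, starting at first_address.
    """
    network = ipaddress.IPv4Network(f"{first_address}/{netmask}", strict=False)
    start = ipaddress.IPv4Address(first_address)
    commands = {}
    for offset, node in enumerate(nodes):
        address = start + offset
        # Refuse addresses outside the subnet, or its network/broadcast address.
        if (address not in network
                or address in (network.network_address, network.broadcast_address)):
            raise ValueError(f"{address} is not a usable host address in {network}")
        commands[node] = (f'netsh int ip set address "{connection}" '
                          f"static {address} {netmask}")
    return commands


if __name__ == "__main__":
    # Hypothetical node names; .100 is left for the head node's fake adapter.
    for node, command in ipoib_commands(["NODE01", "NODE02", "NODE03"],
                                        "192.168.3.101").items():
        print(node, command)
```

The function refuses to run past the end of the subnet. That keeps a long node list from silently spilling into addresses that the real IB network cannot reach.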

2. No public ethernet

In several cases I found that the head node has only one Ethernet card. Out of the box, our HPC software prevents the use of Windows Deployment Services (WDS) and DHCP unless you have at least two adapters, in order to avoid conflicts with existing deployment solutions. You may choose to install a fake "public" network on a loopback adapter and thus enable WDS on the real "private" network.

3. No private ethernet

Another interesting case is the one you get with many pre-built clusters, which provide 1 Ethernet and 1 Infiniband network in the box.

Note that when you install an Infiniband stack (e.g. WinOF 2.0), you typically get an IP-over-IB (IPoIB) protocol provider. Thus, it is possible to route private cluster traffic over the Infiniband network, with the exception of deployment (there is no PXE boot over IB). For "heavy" MPI applications, you will want to keep the two networks separate anyway.


Live Migration in R2

I've got a lot of questions about Live Migration in 2008 R2. Rather than writing a long post on it, I thought I'd point you at some resources I found useful whilst setting up my test environment, so you can build one too:

 
I also recorded the steps to build such an environment in a series of screencasts that will be appearing on http://edge.technet.com. The first one is already there, so check it out. They are necessarily short, so there are quite a few of them, and it will take a couple of weeks for all of them to appear.

Turning Hyper-V on and off

I use Hyper-V on my laptop. When I know I don't need VMs for the day, I can squeeze a bit more performance out of the machine by turning the hypervisor off (from an elevated command prompt) with:

bcdedit /set hypervisorlaunchtype off

and a reboot. To turn it back on:

bcdedit /set hypervisorlaunchtype auto

and reboot.
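If you switch often, a tiny wrapper saves retyping. The Python sketch below only assembles the bcdedit arguments unless you explicitly ask it to run them. It uses the values auto and off for hypervisorlaunchtype and assumes an elevated process when it does run. The function names are hypothetical.

```python
import subprocess


def hypervisor_launch_args(enabled):
    """bcdedit arguments that enable ("auto") or disable ("off") Hyper-V at boot."""
    return ["bcdedit", "/set", "hypervisorlaunchtype", "auto" if enabled else "off"]


def set_hypervisor(enabled, dry_run=True):
    """Return the command; run it only when dry_run is False.

    Running it requires an elevated prompt on Windows, and the change only
    takes effect after a reboot.
    """
    args = hypervisor_launch_args(enabled)
    if not dry_run:
        subprocess.run(args, check=True)
    return args


if __name__ == "__main__":
    print(" ".join(set_hypervisor(enabled=False)))  # dry run: just show the command
```

Defaulting to a dry run is a deliberate choice, because editing the boot configuration by accident is easy to do and annoying to undo.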

What is new in virtualization with Windows Server 2008 R2?

There are some quite interesting improvements in Windows Server 2008 R2 (what was wrong with W7 as a name?) that help us progress toward a dynamic infrastructure. Three of them are worth highlighting: live migration of virtual machines in Hyper-V, Cluster Shared Volumes and core parking.

1. Live Migration

Live migration refers to the ability to move a running virtual machine from one host server to another without loss of service. For this to happen, we have to transfer the current virtual machine state and memory pages between machines, and we have to guarantee both servers the same level of access to the virtual machine files. The process can be summarized as follows:

  1. Create a virtual machine on the target server
  2. Copy the memory pages of the running virtual machine from the source to the target server over Ethernet. While we copy, those memory pages may change, so after an initial pass we have to go back and copy the changed set again, repeating until the number of changed pages falls below a threshold. It is hard to fix the threshold: ideally, it is the number of pages that can be copied within a TCP connection timeout, so that clients won’t notice.
  3. Pause the source machine; copy its state across.
  4. Resume the target machine and send an unsolicited ARP, so that the network switches learn the virtual machine's new location.

For step 3 to happen quickly and transparently to the clients, the target server must have immediate access to the virtual machine files. It cannot wait for a disk volume to fail over and possibly go through file system checks. That’s where Cluster Shared Volumes come in.
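To make the iteration in step 2 concrete, here is a toy Python model of the copy rounds. It is not how Hyper-V decides when to stop. It assumes that the VM re-dirties a fixed fraction (dirty_ratio) of whatever was sent in each round. Copying stops once the remaining dirty pages fall to a threshold or a round limit is hit. All names and numbers are hypothetical.

```python
def precopy_schedule(total_pages, dirty_ratio, threshold, max_rounds=10):
    """Toy model of the iterative memory copy in a live migration.

    total_pages: memory pages in the running VM
    dirty_ratio: assumed fraction of the pages sent in a round that the VM
                 modifies while that round is in flight
    threshold:   stop copying live once this many dirty pages, or fewer, remain
    max_rounds:  give up iterating if the copy does not converge
    Returns (pages sent in each live round, pages left for the paused copy).
    """
    rounds = []
    remaining = total_pages  # the first pass sends every page
    while remaining > threshold and len(rounds) < max_rounds:
        rounds.append(remaining)
        # Pages dirtied while this round was in flight must be sent again.
        remaining = int(remaining * dirty_ratio)
    return rounds, remaining


if __name__ == "__main__":
    live_rounds, paused_copy = precopy_schedule(262_144, 0.1, 1_000)
    print(live_rounds, paused_copy)
```

Take 1 GiB of 4 KiB pages (262,144 pages), with 10% re-dirtied per round and a threshold of 1,000 pages. Then `precopy_schedule(262_144, 0.1, 1_000)` returns `([262144, 26214, 2621], 262)`: three live rounds, then 262 pages copied while the VM is paused. With dirty_ratio at 1.0 the copy never converges and only max_rounds ends it, which is why some cap is needed.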

2. Cluster Shared Volumes

Cluster Shared Volumes enable concurrent access to the same LUN by several nodes. Consequently, all the nodes see the same NTFS file-system and namespace. By the way, CSV is not a parallel or a cluster file system. It was designed with the live migration scenario in mind.

Since the host servers already mount the CSV, there is no need to arbitrate for disk access and fail over the volume hosting the virtual machine files. All you need to do is transfer ownership of those files and their locks to the target server.

CSVs are implemented via a filter driver mechanism, which is used to establish the access path to the underlying LUNs. This also improves fail-over capability, as file system requests are redirected over the network to another server if direct SAN access is no longer available.

3. Core idling or parking

Changes in Windows 7 power management (shared by Windows Server 2008 R2) allow for “density” scheduling, i.e. concentrating work on as few processor cores as possible, hence maximizing their utilization. The idle cores can be put to sleep (the low-power Cx states defined by the ACPI specification), thus reducing power consumption. Hyper-V can take advantage of this feature and schedule its virtual machines accordingly. Power management policies can be controlled via WMI, Group Policy and scripts.
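Density scheduling is essentially a packing problem. The sketch below uses a generic first-fit-decreasing heuristic in Python to show the idea: put the work on as few cores as possible so that the rest can be parked. It is an illustration only, not the algorithm used by the Windows scheduler or Hyper-V. Expressing loads as integer percentages of one core is an assumption.

```python
def pack_onto_cores(loads, n_cores, core_capacity=100):
    """First-fit decreasing: put each load on the first busy core with room.

    loads: CPU demand of each thread, in percent of one core (assumed units)
    Returns (utilization of each busy core, number of cores left idle).
    """
    busy = []
    for load in sorted(loads, reverse=True):
        if load > core_capacity:
            raise ValueError(f"a single load of {load} exceeds one core")
        for i, used in enumerate(busy):
            if used + load <= core_capacity:
                busy[i] = used + load
                break
        else:
            # No busy core has room left, so wake up another one.
            if len(busy) == n_cores:
                raise ValueError("total demand exceeds the available cores")
            busy.append(load)
    return busy, n_cores - len(busy)


if __name__ == "__main__":
    print(pack_onto_cores([50, 30, 20, 40, 10], n_cores=8))
```

`pack_onto_cores([50, 30, 20, 40, 10], n_cores=8)` returns `([100, 50], 6)`. The five threads fit on two cores, and six cores could drop into a low-power state.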

If you combine “density” scheduling with the ability to move virtual machines among hosts, you achieve quite a scalable, efficient and dynamic solution to the distributed resource allocation problem. Now, all that remains to do is automate it. Stay tuned.

4. References

ACPI explanation on Wikipedia

WinHEC 2008 conference whitepapers

Engineering Windows 7 blog

The Windows blog

The Windows Server 2008 R2 Reviewers' Guide http://www.microsoft.com/windowsserver2008/en/us/r2.aspx 

Upgrading from an evaluation version

I have received a few questions about upgrading from the evaluation version (which you can download from microsoft.com/hpc) to a full version.

The good news is that the evaluation version is fully functional, so you won't need a complete re-installation. The only thing you need to do is obtain a full licence key, then:

- To upgrade the HPC Pack tools, run "upgrade.exe" on the head node. You will find upgrade.exe on the HPC Pack CD.

- To upgrade the o/s, obtain a full licence key for all the nodes, then run slmgr.vbs -ipk <new licence> across the cluster. You can do that from the command line (clusrun /all) or via the GUI.

You can also use slmgr.vbs to extend the evaluation period by another 60 days. When you are approaching the end of the evaluation, simply run slmgr.vbs -rearm across the cluster. Note that the evaluation does not require activation, but a full licence does.

Please see http://support.microsoft.com/kb/948472 for more information.

Proxies and Compute Nodes

You’ve prepared your templates, configured your network, your firewalls and everything you could think of, yet your automated provisioning takes forever and eventually fails…

Well, check whether you have a patching task in your node template. If you do, the nodes will need to reach the Microsoft Update service and download patches, so you may need to set a proxy on them. Alas, the GUI does not offer an option to do that, and any proxy setting you specify in Internet Options applies only to the logged-in user. So, how can you set a proxy for Windows Update to use?

The Windows Update service uses WinHTTP, so you can set a machine-wide WinHTTP proxy with:

netsh winhttp set proxy proxy-server="http=<your proxy:port>" bypass-list="<local>"

Type <local> literally, including the angle brackets. You could have that command run before the patching task in the template.
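Typographic quotes pasted from a web page break this command, so it can help to generate it instead. Below is a small Python sketch, assuming a plain HTTP proxy given as host and port. The host proxy.example.com and the function name are hypothetical.

```python
def winhttp_proxy_command(proxy_host, proxy_port, bypass_list="<local>"):
    """Build the netsh command that sets a machine-wide WinHTTP proxy.

    Only plain ASCII double quotes are emitted: typographic quotes would be
    passed to netsh as part of the values instead of delimiting them.
    """
    port = int(proxy_port)
    if not proxy_host or not 0 < port < 65536:
        raise ValueError("need a host name and a port between 1 and 65535")
    return (f'netsh winhttp set proxy proxy-server="http={proxy_host}:{port}" '
            f'bypass-list="{bypass_list}"')


if __name__ == "__main__":
    # proxy.example.com:8080 is a placeholder for your own proxy.
    print(winhttp_proxy_command("proxy.example.com", 8080))
```

You could paste the printed line into the template's command task, or run it across the cluster once the nodes are deployed.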

Alternatively, you could deploy the nodes without the patching task, run that command across the cluster, then apply a template with a patching task.

Last but not least, you could set up a Windows Server Update Services (WSUS) server on your corporate network and use Group Policy to point the update service on the nodes at that server.

Anyway, if your nodes go anywhere near the Internet, please keep them patched!

Upgrading to HPC Server 2008 RC1

Well, there is no upgrade path, so the quickest way is to re-image.

Download the RC build of the HPC software from connect.microsoft.com. If you have Infiniband cards, download the latest WinIB-ND drivers (1.4.0.2577) from http://www.mellanox.com.
 
1. Re-image the head node and install the latest HPC pack.
2. Unzip the WinIB package on the head node, e.g. to c:\ib.
3. Open Device Manager and update the drivers for the Infiniband adapter, then for the openib adapter.
4. Point the wizard to c:\ib\inf to select the appropriate IB driver. Add any other drivers as required.
5. Build a new o/s image (or re-use a previously built one) 
6. Click on "manage drivers" in the to-do list. Point the wizard to c:\ib\inf to add the drivers to the image
7. Create a new node template with the image you built
8. Reboot the compute nodes and wait for re-deployment to complete
9. Use clusrun to copy c:\ib from the head node to the compute nodes
10. Use clusrun to run c:\ib\inf\ndinstall -i on all nodes and thus install the new ND provider.
 
If you are using OpenSM, you'll find a new version of it in c:\ib\tools.
You may also have to reboot the Infiniband switch after the driver update. I haven't figured out why yet, but immediately after the driver update IPoIB worked without problems while MPI over ND did not. Rebooting the switch seemed to fix this.
 
 
Stop Climate Change?! – Part 2

I have been investigating Green IT and Software + Services (S+S) some more. Some ideas and a lot of questions have come to mind. Please read on and let me know if they make any sense.

By the way, Part 1 is here :-)

1. How do you understand the status quo?

This may prove to be the most difficult part of the job. There aren’t many tools available. System Center Operations Manager, plus a few OEM management packs, is a starting point. Alas, you must build your own model to correlate measured power consumption with application load over time. From those correlations, you can derive a measure of efficiency. A few 3rd-party applications (e.g. Verdiem’s Surveyor, Avocent and APC InfrastruXure) do a better job of establishing the baseline, although again they do little for the correlation analysis.

IBM’s Active Energy Manager goes a step further (on IBM hardware) by allowing you to study trends and take action on specific energy-related conditions. Again, it is not a complete “IT intelligence” tool.

2. How do you design your infrastructure and applications to optimize consumption?

Once you understand what type of load consumes what power (no small feat):

1. Can you reduce the physical tiers of your architecture? For instance, if you have a memory-intensive application and a CPU-intensive one, you may want to co-host them, thus using all the available cores and saving a few machines’ worth of power. From a performance point of view, this will only work if you manage resource allocation tightly to avoid contention. In our example, you would run a thread of the memory-intensive application on one core and a thread of the CPU-intensive one on the other core of the same CPU socket. Before embarking on such a consolidation exercise, you will want to estimate the costs and the savings, in terms of power and money. Also keep in mind that the changed workload may require different hardware (e.g. “whole machines” rather than just blades) to optimize your power consumption profile over time.

2. Can you reduce the logical tiers of your architecture? Here’s an example: your application may use SharePoint as a front-end, Windows Workflow Foundation to manage business logic and SQL Server for data processing, all running on separate hardware. SharePoint can host workflows. SQL Server handles workflows in Integration Services and can host the CLR in-process. With some clever re-architecting of your application, you may be able to get rid of the middle tier by using some combination of the two workflow services. The whole area of “power-conscious” applications is yet to be explored. We’re investigating.

3. Can you offload a tier of your architecture? Here’s where Software + Services comes into play. For instance, you may consider using an online storage service (e.g. SQL Server Data Services, aka CloudDB or Sitka) instead of hosting your own SQL Server. If you have a compute-intensive application, you may want to farm it out to an HPC provider and pay for the CPU cycles you use (Microsoft will offer such a service, now in pilot stage with a few ISVs). If your provider is able to consolidate several users’ workloads on its servers and charge for capacity consumed, the overall carbon footprint may be reduced, along with your costs.

4. If you do offload a function, how do you measure its performance against SLAs? This is actually the most difficult point. Technology is available to do all of the above (although not necessarily on Windows). Capacity-on-demand, for instance, has been a feature of certain mainframes and Unix systems for years, and hosted service offerings are widely available. However, different security boundaries and political pressures make it difficult to build tools that monitor an application across companies, let alone countries.

5. Can you offer or trade computing capacity? If you know how much you need, when and where, why not “sell” spare capacity? Again, S+S comes into play here. Grid computing is possibly the best example of implementation of a similar concept today.

3. What tools & techniques are available?

Hyper-V sounds like an obvious answer, but we are at risk of sounding like the proverbial person with just 1 hammer in the toolbox, to whom everything looks like a nail.

Virtualization is one powerful tool, but it must be used appropriately. One must carefully choose which workloads to virtualize, then which of those virtualized workloads can be combined on a single physical tier. Again, given a workload profile, that physical tier may look entirely different from your current one. Also, most often we speak only of host virtualization. For a complete solution, we must find the best combination of host, storage and network virtualization.

A caveat to keep in mind is that virtualization may be self-defeating without proper management practices. The ease of deploying virtual machines may lead operators to spawn far more than necessary. I have seen a few examples of this in large deployments.

Regulation (in the form of prescriptive guidance) may address some of the problem, but charging money is more effective. The idea of trading computing power may become useful in this scenario: imagine that you planned and budgeted for 200 VMs, but find out that you’re running just 150. You could sell the capacity for the remaining 50 to another part of your organization that requires it. They wouldn’t even need to buy or host servers. Who said that market economy principles cannot be applied to IT governance?

Co-hosting is another technique to optimize resource consumption, often neglected on Windows. If you can virtualize two workloads and run them together without a significant impact on performance, you may gain even more by running them on the same o/s instance, thus eliminating the overhead of virtualization. The applications must of course be compatible (able to coexist). Tools like Windows System Resource Manager (WSRM) allow you to change resource allocation dynamically, adapting to workload requirements. Unix and mainframes have been doing this for decades, along with virtualization.

IIS 6 and 7, for instance, are classic examples where co-hosting several websites works very well. SQL Server 2005 and 2008 are good examples too: you can co-host several databases in one instance and several instances on one machine.

As for capacity optimization tools, I could not find a silver bullet. I mentioned a few so far; here’s a quick summary:

- System Center Virtual Machine Manager, with its workload analysis and placement functions, is instrumental in devising the best resource allocation.

- The Microsoft Assessment and Planning Toolkit is a useful, free instrument to plan for virtualization (amongst other things).

- System Center Capacity Planner is also very useful in designing the target architecture for certain workloads (Exchange, SharePoint, Operations Manager).

- For a far more sophisticated (and expensive) capacity management and planning suite, you may want to look at tools like SAS.

- System Center Operations Manager, plus management packs provided by OEMs, is useful to obtain a baseline of resource utilization.

- IBM’s Active Energy Manager is a great example of what we can do with the data.

4. Further reading

Here are a few pointers that may help inform a discussion:

- Lewis Curtis’s blog: http://blogs.technet.com/lcurtis/

- Little Miss Enviro-Geek http://blogs.technet.com/lmeg/default.aspx

- The Green Datacenter Blog: http://www.greenm3.com/2008/07/new-coal-electr.html

- MAP Toolkit

- Microsoft’s Environment web page

- IBM Green Datacenter paper

- IBM Active Energy Manager

- Windows Server 2008 Power Savings

- Green Computing Paper

- The Green Grid

- Infrastructure Planning and Design

I'll be there in 2 microseconds!

Fantastic news! Mellanox has released the beta 2 version of their WinIB 1.4 stack, which works with HPC Server 2008 beta 2 and has Network Direct providers for their latest ConnectX cards. The results announced at ISC 08 are outstanding:

- 2 microseconds' latency

- 2 GB/s throughput

Another outstanding result for HPC Server 2008 is the Umea cluster, at No. 39 in the Top 500 list:

- 46.04 TFlops

- 85.6% efficiency

Hats off to the Network Direct team. Now Windows HPC Server 2008 plays with the big boys :-)


TechEd session on HPC in top 20!

Phil Pennington and I presented a session on cluster performance optimization at TechEd 2008 in Orlando. It made it into the top 20 list by customer satisfaction!!!

To all those who were there and voted for us: Thank you!!!

To all those who were not there but would still like to know about it: leave a comment!


How fast is this thing, really?!

Microsoft entered the HPC market a couple of years ago with a value proposition based on ease of integration, use and management. The comment we received most often sounded more or less like this: "This is all well and good, but how fast is this thing, really?". Well, here are a couple of impressive answers:

1. The NCSA cluster, running Windows HPC Server 2008 CTP on 1184 nodes (9472 cores), achieved 64.48 TFlops and 77.7% efficiency. This places it at No. 23 in the June 2008 Top500 list.

2. The Aachen cluster, running the same build on 262 nodes, achieved 18.81 TFlops and 76.5% efficiency, which places it at No. 100 in that list.

Happy now? ;-) If you want to read more about the details, have a look at http://www.microsoft.com/hpc
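For reference, Top500 efficiency is the measured Linpack result (Rmax) divided by the theoretical peak (Rpeak). This short Python snippet works backwards from the figures quoted in these posts to the peak each system implies. The derived Rpeak values are plain arithmetic, not numbers taken from the list itself.

```python
def implied_rpeak(rmax_tflops, efficiency_percent):
    """Theoretical peak implied by a Linpack result and its efficiency.

    Top500 efficiency is Rmax / Rpeak, hence Rpeak = Rmax / efficiency.
    """
    return rmax_tflops / (efficiency_percent / 100)


# Rmax (TFlops) and efficiency (%) as quoted in the posts above.
RESULTS = {"NCSA": (64.48, 77.7), "Aachen": (18.81, 76.5), "Umea": (46.04, 85.6)}

if __name__ == "__main__":
    for name, (rmax, eff) in RESULTS.items():
        print(f"{name}: Rmax {rmax} TFlops at {eff}% -> "
              f"implied Rpeak {implied_rpeak(rmax, eff):.1f} TFlops")
```

The implied peaks come out at roughly 83.0, 24.6 and 53.8 TFlops respectively. The efficiency figure is what tells you how well the interconnect and software stack keep the cores busy.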


A Hybrid OS Cluster Solution

Thomas Varlet, of Microsoft France, and Dr. Patrice Calegari, of BULL SAS, have written an excellent paper on how to build hybrid clusters, i.e. clusters where 2 or more operating systems can be run at the same time. It is recommended reading, in my opinion, for those of us who use both Linux and Windows HPC solutions. You'll find the paper here.

UK HPC User Group Meeting

The UK HPC user group is meeting in London on June 26th, for what promises to be an interesting day at the Imperial War Museum.

This meeting is intended for customers, partners and developers to “meet and mingle”, compare notes and provide Microsoft with direct input into our product and offerings.  
On the day the attendees will hear:
- The latest news on Windows HPC 2008
- New software solutions in Finance, Engineering and Defence
- The latest in MS enabling technology like Microsoft ESP & MS Robotics
- The winners of the UK HPC student competition
- Customer stories.
Attendees will also have the opportunity to tour the Imperial War Museum.

Please register here.  

The event is being organised by the UK Microsoft HPC User Group, chaired by Professor Simon Cox, School of Engineering Sciences, University of Southampton.

