Well, there is no upgrade path, so the quickest way is to re-image.
Download the RC build of the HPC software from connect.microsoft.com. If you have Infiniband cards, download the latest WinIB-ND drivers (1.4.0.2577) from http://www.mellanox.com.
1. Re-image the head node and install the latest HPC pack.
2. Unzip the WinIB package on the head node, e.g. to c:\ib.
3. Open device manager and update the drivers for the Infiniband adapter, then the openib adapter.
4. Point the wizard to c:\ib\inf to select the appropriate IB driver. Add any other drivers as required.
5. Build a new o/s image (or re-use a previously built one)
6. Click on "manage drivers" in the to-do list. Point the wizard to c:\ib\inf to add the drivers to the image
7. Create a new node template with the image you built
8. Reboot the compute nodes and wait for re-deployment to complete
9. Use clusrun to copy c:\ib from the head node to the compute nodes
10. Use clusrun to run c:\ib\inf\ndinstall -i on all nodes and thus install the new ND provider.
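Steps 9 and 10 can be scripted from the head node. The sketch below only composes the two clusrun command lines and prints them for review; it does not run anything. The /nodes: selector, the node names, and copying through the head node's c$ admin share with xcopy are assumptions for illustration, so adjust them to your cluster.

```python
# Sketch only: node names, the head node name and the use of the c$ admin
# share are assumptions for illustration, not part of the procedure above.

def clusrun(nodes, command):
    """Compose a clusrun command line targeting the given compute nodes."""
    return "clusrun /nodes:" + ",".join(nodes) + " " + command


def ib_update_commands(head_node, nodes):
    """Return the command lines for steps 9 and 10, in that order."""
    # Step 9: copy c:\ib from the head node to each compute node.
    copy_cmd = r"xcopy \\{0}\c$\ib c:\ib /E /I /Y".format(head_node)
    # Step 10: install the new ND provider on each compute node.
    install_cmd = r"c:\ib\inf\ndinstall -i"
    return [clusrun(nodes, copy_cmd), clusrun(nodes, install_cmd)]


if __name__ == "__main__":
    for line in ib_update_commands("HEADNODE", ["NODE01", "NODE02"]):
        print(line)
```

Printing the commands first makes it easy to check the node list before anything touches the compute nodes.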
If you are using OpenSM, you'll find a new version of it in c:\ib\tools.
You may also have to reboot the Infiniband switch after the driver update. I haven't figured out why yet, but right after the driver update IPoIB worked without problems while MPI over ND did not. Rebooting the switch seemed to fix this.
I have been investigating some more in the area of Green IT, S+S. Some ideas and a lot of questions have come to mind. Please read on and let me know if they make any sense.
By the way, Part 1 is here :-)
1. How do you understand the status quo?
This may prove to be the most difficult part of the job. There aren’t many tools available. System Center Operations Manager, plus a few OEM management packs, is a starting point. Alas, you must build your own model to establish correlations between the power utilization you measure and the applications running over time. From those, you can derive a measure of efficiency. A few 3rd-party applications (e.g. Verdiem’s Surveyor, Avocent and APC InfrastruXure) do a better job of establishing the baseline, although again they do little for the correlation analysis.
IBM’s Active Energy Manager goes a step further (on IBM hardware) by allowing you to study trends and to take action on specific energy-related conditions. Again, it is not a complete “IT intelligence” tool.
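To make the "build your own model" step concrete, here is a minimal sketch of one possible correlation analysis: a least-squares line relating CPU load to power draw, plus Pearson's r. The sample values are made up, and a straight-line relationship is an assumption; real data would come from your monitoring tools and may need a richer model.

```python
from math import sqrt


def linear_fit(x, y):
    """Least-squares fit y = intercept + slope * x, plus Pearson's r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r


# Hypothetical samples: average CPU load (%) and power draw (W) per interval.
cpu_load = [10, 20, 30, 40]
power_w = [220, 240, 260, 280]

idle_w, w_per_pct, r = linear_fit(cpu_load, power_w)
print(f"idle draw ~{idle_w:.0f} W, {w_per_pct:.1f} W per % CPU, r = {r:.2f}")
```

The intercept approximates what a machine draws when idle and the slope what each extra unit of load costs, which is the kind of figure you need before comparing consolidation options.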
2. How do you design your infrastructure and applications to optimize consumption?
Once you understand what type of load consumes what power (no small feat):
1. Can you reduce the physical tiers of your architecture? For instance, if you have a memory-intensive application and a CPU-intensive one, you may want to co-host them, thus using all the available cores and saving a few machines’ worth of power. This will only work from a performance point of view if you manage resource allocation tightly to avoid contention. In our example, you would run a thread belonging to the memory intensive application on 1 core and a thread of the cpu-intensive one on the other core of the same CPU socket. Before embarking on such a consolidation exercise, you will want to estimate the costs and the savings, in terms of power and money. Also keep in mind that as a consequence of the changed workload, you may require different hardware (e.g. “whole machines” rather than just blades) to optimize your power consumption profile over time.
2. Can you reduce the logical tiers of your architecture? Here’s an example: your application may use Sharepoint as a front-end, windows workflow to manage business logic, SQL for data processing, all running on separate hardware. Sharepoint can host workflows. SQL handles workflows in Integration Services and it can host an in-process CLR. With some clever re-architecting of your application, you may be able to get rid of the middle tier by using some combination of the two workflow services. The whole area of “power-conscious” applications is yet to be explored. We’re investigating.
3. Can you offload a tier of your architecture? Here’s where Software + Services comes into play. For instance, you may consider using an on-line storage service (e.g. SQL Server Data Services, aka CloudDB or Sitka) instead of hosting your own SQL. If you have a compute-intensive application, you may want to farm it out to a HPC provider and pay by CPU cycles utilized (Microsoft will offer such a service, now in pilot stage with a few ISVs). If your provider is able to consolidate several users’ workloads on its servers and charge for capacity consumed, the overall carbon footprint may be reduced – along with your costs.
4. If you do offload a function, how do you measure its performance against SLAs? This is actually the most difficult point. Technology is available to do all of the above (although not necessarily on Windows). Capacity-on-demand, for instance, has been a feature of certain Mainframes and Unix systems for years. Hosted service offerings are widely available. However, different security boundaries and political pressures make it difficult to build tools that monitor an application across companies, let alone countries.
5. Can you offer or trade computing capacity? If you know how much you need, when and where, why not “sell” spare capacity? Again, S+S comes into play here. Grid computing is possibly the best example of implementation of a similar concept today.
3. What tools & techniques are available?
Hyper-V sounds like an obvious answer, but we are at risk of sounding like the proverbial person with just 1 hammer in the toolbox, to whom everything looks like a nail.
Virtualization is one powerful tool, but it must be used appropriately. One must carefully choose which workloads to virtualize, then which of those virtualized workloads can be combined on a single physical tier. Again, given a workload profile, that physical tier may look entirely different from your current one. Also, most often we speak only of host virtualization. For a complete solution, we must find the best combination of host, storage and network virtualization.
A caveat to keep in mind is that virtualization may be self-defeating without proper management practices. The ease of deploying virtual machines may lead operators to spawn far more than necessary. I have seen a few examples of this in large deployments.
Regulation (in the form of prescriptive guidance) may address some of the problem, but charging money is more effective. The idea of trading computing power may become useful in this scenario: imagine that you planned and budgeted for 200 VMs, but find out that you’re running just 150. You could sell the capacity for the remaining 50 to another part of your organization that requires it. They wouldn’t even need to buy or host servers. Who said that market economy principles cannot be applied to IT governance?
Co-hosting is another technique to optimize resource consumption, often neglected on Windows. If you can virtualize two workloads and run them together without significant impact on performance, you may be able to gain even more by running them on the same o/s instance. The applications must of course be compatible (able to coexist). Thus, you eliminate the overhead of virtualization. Tools like WSRM allow you to change resource allocation dynamically, adapting to workload requirements. Unix and Mainframes have been doing this for decades, along with virtualization.
IIS6 and 7, for instance, are classical examples where co-hosting of several websites works very well. SQL2005 and 2008 are good examples too, where you can co-host several databases in one instance and several instances on one machine.
As for capacity optimization tools, I could not find a silver bullet. I mentioned a few so far; here’s a quick summary:
- System Center Virtual Machine Manager, with its workload analysis and placement functions, is instrumental in devising the best resource allocation.
- The Microsoft Assessment and Planning Toolkit is a useful, free instrument to plan for virtualization (amongst other things).
- System Center Capacity Planner is also very useful in designing the target architecture for certain workloads (Exchange, Sharepoint, Operations Manager).
- For a far more sophisticated (and expensive) capacity management and planning suite, you may want to look at tools like SAS.
- System Center Operations Manager, plus management packs provided by OEMs, is useful to obtain a baseline of resource utilization.
- IBM’s Active Energy Manager is a great example of what we can do with the data.
4. Further reading
Here are a few pointers that may help inform a discussion:
- Lewis Curtis’s blog: http://blogs.technet.com/lcurtis/
- Little Miss Enviro-Geek http://blogs.technet.com/lmeg/default.aspx
- The Green Datacenter Blog: http://www.greenm3.com/2008/07/new-coal-electr.html
- MAP Toolkit
- Microsoft’s Environment web page
- IBM Green Datacenter paper
- IBM Active Energy Manager
- Windows Server 2008 Power Savings
- Green Computing Paper
- The Green Grid
- Infrastructure Planning and Design
Fantastic news! Mellanox has released the beta 2 version of their WinIB 1.4 stack, which works with HPC Server 2008 beta 2 and has Network Direct providers for their latest ConnectX cards. The results announced at ISC 08 are outstanding:
- 2 microseconds' latency
- 2 GB/s throughput
Another outstanding result for HPC Server 2008 is the Umea cluster, at n. 39 in the Top 500 list:
- 46.04 TFlops
- 85.6% efficiency
Hats off to the Network Direct team. Now Windows HPC Server 2008 plays with the big boys :-)
Powered by Qumana
Phil Pennington and I presented a session on cluster performance optimization at TechEd 2008 in Orlando. It made it into the top 20 list by customer satisfaction!!!
To all those who were there and voted for us: Thank you!!!
To all those who were not there but would still like to know about it: leave a comment!
Microsoft entered the HPC market a couple of years ago with a value proposition based on ease of integration, use and management. The comment we received most often sounded more or less like this: "This is all well and good, but how fast is this thing, really?". Well, here are a couple of impressive answers:
1. The NCSA cluster, running Windows HPC Server 2008 CTP on 1184 nodes (9472 cores), achieved 64.48 TFlops and 77.7% efficiency. This places it at n. 23 of the June 08 Top500 list.
2. The Aachen cluster, running the same build on 262 nodes, achieved 18.81 TFlops and 76.5% efficiency, which places it at n. 100 of that list.
Happy now? ;-) If you want to read more about the details, have a look at http://www.microsoft.com/hpc
Thomas Varlet, of Microsoft France, and Dr. Patrice Calegari, of BULL SAS, have written an excellent paper on how to build hybrid clusters, i.e. clusters where 2 or more operating systems can be run at the same time. It is recommended reading, in my opinion, for those of us who use both Linux and Windows HPC solutions. You'll find the paper here.
The UK HPC user group is meeting in London on June 26th, for what promises to be an interesting day at the Imperial War Museum.
This meeting is intended for customers, partners and developers to “meet and mingle”, compare notes and provide Microsoft with direct input into our product and offerings.
On the day the attendees will hear:
- The latest news on Windows HPC 2008
- New software solutions in Finance, Engineering and Defence
- The latest in MS enabling technology like Microsoft ESP & MS Robotics
- The winners of the UK HPC student competition
- Customer stories.
Attendees will also have the opportunity to tour the Imperial War Museum.
Please register here.
The event is being organised by the UK Microsoft HPC User Group, chaired by Professor Simon Cox, School of Engineering Sciences of the University of Southampton.
Hey all,
I'm doing another series of webcasts. Two of them have already aired; two more will follow shortly. Here are the topics:
- deployment and management
- high availability
- new scheduler features
- hpc server 2008 and linux
You'll find a link to register and summaries on:
http://www.microsoft.com/hpc/events.aspx
Did you notice that the latest CTP has introduced a new option for mpiexec? Using mpiexec -affinity you can affinitize each MPI rank to the core where it is started, thus avoiding context switches. Your application will determine whether you actually benefit from affinitization or not. Some applications show a good performance improvement, some do not. In particular, if you have an MPI application that is also multi-threaded, the affinity option may backfire, because the affinity mask that you set for the process is inherited by default by all its threads. Thus, all its threads may be stuck on one core. Windows offers other API calls (e.g. SetThreadAffinityMask) to set affinity per thread.
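The inheritance problem is easier to see with the bitmasks written out. The sketch below is pure arithmetic and does not call any Windows API: it treats an affinity mask as an integer in which bit i stands for logical processor i, and compares a rank pinned to one core with a rank given a whole socket. The 4-thread rank and the 4-core socket layout are hypothetical.

```python
def mask_for_cores(cores):
    """Build an affinity bitmask: bit i set means logical processor i is allowed."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


def cores_in_mask(mask):
    """List the logical processors allowed by an affinity bitmask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


def threads_per_core(mask, thread_count):
    """Average number of threads that must share each allowed core."""
    return thread_count / len(cores_in_mask(mask))


# A rank pinned to core 2: its 4 threads inherit the same single-core mask.
pinned = mask_for_cores([2])
# Hypothetical alternative: let the rank use all 4 cores of one socket.
socket = mask_for_cores([0, 1, 2, 3])

print(bin(pinned), threads_per_core(pinned, 4))
print(bin(socket), threads_per_core(socket, 4))
```

With the single-core mask all four threads compete for one core; with the wider mask each thread can have a core of its own.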
"Traditional", non multi-threaded MPI applications may be more straightforward. One important factor to take into account when deciding when to affinitize the process is the compute node architecture: is it NUMA or not? If it is, have you got enough RAM in the memory bank local to the core where the process will run? If not, you may incur frequent (and lengthy) remote memory accesses on the same hardware. In this case, it may be best to rely on the o/s scheduler to determine the ideal NUMA node for the thread.
I've recently been involved in a simple benchmarking exercise. Here are a few quick "rules of thumb" that have helped me:
- 4x4
A PCIe 4x slot is supposed to have 4 lanes capable of 250 MB/s each, for a total of 1 GB/s. An Infiniband SDR 4x card has 4 channels clocked at 2.5Gb/s, so a simple rule of thumb is: put an Infiniband card in the PCIe slot with the same number of channels. This is not a coincidence: Intel was part of the original Infiniband group.
BUT
Be aware that not all motherboards are equal, although in theory most of them use the same chipsets. In our case, we found out that the motherboard was not able to sustain more than about 600 MB/s on the PCIe 4x slot. We had to move the Infiniband cards to the 8x slots, where we could reach the expected 900 MB/s transfer rate of the card. The 8x slot on those motherboards is probably not capable of reaching its top speed either, but it is sufficient for the SDR 4x card.
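The arithmetic behind the 4x4 rule can be checked in a few lines. This sketch uses the 250 MB/s per PCIe lane and 2.5 Gb/s per Infiniband channel quoted above; the 8b/10b encoding overhead (8 data bits for every 10 bits on the wire) is not mentioned in the text and is added here as an assumption about how SDR links are signalled. The 600 and 900 MB/s figures are the ones measured in our case.

```python
def pcie_gen1_mb_s(lanes):
    """Nominal PCIe bandwidth in MB/s, at 250 MB/s per lane."""
    return lanes * 250


def ib_sdr_payload_mb_s(channels):
    """Usable Infiniband SDR bandwidth in MB/s.

    Assumes 8b/10b encoding: 8 data bits per 10 signalled bits.
    """
    raw_mbit_s = channels * 2500        # 2.5 Gb/s signalling per channel
    data_mbit_s = raw_mbit_s * 8 // 10  # strip the encoding overhead
    return data_mbit_s // 8             # bits to bytes


def slot_is_bottleneck(measured_slot_mb_s, card_mb_s):
    """True when the slot cannot sustain what the card can deliver."""
    return measured_slot_mb_s < card_mb_s


print(pcie_gen1_mb_s(4), ib_sdr_payload_mb_s(4))
print(slot_is_bottleneck(600, 900))
```

On paper the 4x slot and the SDR 4x card match, which is why a slot that only sustains about 600 MB/s holds back a card expected to reach 900 MB/s.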
- Snoop Filters
A snoop filter is a mechanism to reduce traffic between different memory bus segments. It is particularly useful in multi-cpu, multi-core machines. Applications generally benefit from it, but there are some cases where latency-bound applications are adversely affected. If you see erratic behaviours in your latency tests (e.g. "random" high latencies in an otherwise consistent benchmark) and you have quad-core machines (especially early Clovertowns), try disabling the snoop filter in the BIOS. It may (or may not) help. Again, motherboards affect the results, as different components (with or without snoop filter) were used by different manufacturers.
New quad-core machines (Harpertown) have a snoop filter, but do not seem to show the symptoms mentioned above (at least those I've seen).
- Dynamic Power Management
It is generally NOT a good idea when you're trying to squeeze the last FLOP out of the CPUs. Disable it in the BIOS.
- MPI traffic
You may want to make absolutely sure that your MPI applications are using Infiniband; or you may want to run them once on Ethernet and another time on Infiniband, then compare the results. In any case, you can specify the network where MPI traffic will go at run time:
mpiexec -env MPICH_NETMASK <address>/<mask> <other parameters> <exe>
You may also want to make absolutely sure that your MPI traffic uses Network Direct, not winsock. You can:
- remove the Winsock provider. Coarse, but effective:
clusrun /<nodes> installsp -r
- run your application with
mpiexec -env MPICH_DISABLE_SOCK 1 <other parameters> <exe>
Incidentally, you can install the Network Direct provider with
clusrun /<nodes> ndinstall -i
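If you script your benchmark runs, the MPICH_NETMASK value can be derived from the address of the interface you want MPI to use. The sketch below is one way to do it with Python's ipaddress module and only builds the argument list. Using the network address (rather than the host address) in front of the mask is an assumption, and the IP address and executable name are hypothetical.

```python
import ipaddress


def mpich_netmask(ip, netmask):
    """Return '<network address>/<mask>' for the subnet that contains ip."""
    network = ipaddress.ip_interface(f"{ip}/{netmask}").network
    return network.with_netmask


def mpiexec_args(ip, netmask, exe, disable_sock=False):
    """Build an mpiexec argument list that steers MPI traffic to one subnet."""
    args = ["mpiexec", "-env", "MPICH_NETMASK", mpich_netmask(ip, netmask)]
    if disable_sock:
        # Same switch as shown above: keep MPI traffic off winsock.
        args += ["-env", "MPICH_DISABLE_SOCK", "1"]
    return args + [exe]


# Hypothetical IPoIB address of a compute node and a made-up executable name.
print(" ".join(mpiexec_args("10.0.1.17", "255.255.255.0", "myapp.exe",
                            disable_sock=True)))
```

Running the same benchmark twice, once with the Ethernet subnet and once with the Infiniband one, then only requires changing the address passed in.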
In my last post I investigated how HPC can be used to build UFOs. This time, I've learned to my surprise that HPC can be used to make movies!
Digital media production follows a complex workflow, from initial sketch to wireframe model, to rendered 3D images, to movies. HPC is typically used in rendering, encoding or transcoding. I've done some research on the matter and posted the results here.
Let me know what you think of it.
To many people HPC is like UFOs: We know there's somebody somewhere, but we don't really know what they're doing and where they fit in the grand scheme of things.
Here's my attempt at explaining them (UFOs AND HPC). Happy reading.
If you think I've smoked one too many - please leave a comment. Equally, please let me know if the article makes sense to you.
We do like our jobs and are serious professionals, you know!!
After I published a document about the installation of Mellanox Infiniband on Server 2008, I have received some good feedback that deserves sharing. Note that:
The WinIB 1.4 beta available for download today on Mellanox's web site does not work with the HPC Server 2008 March CTP. We are working with Mellanox to fix that.
The procedure I illustrated in that document uses a "trick": Deploy the Mellanox package with msiexec first, add the Infiniband network to the cluster configuration later. This works both with our deployment tools and with 3rd parties'. However, one can exploit the built-in HPC Server 2008 tools better. Here's how:
1. Install the Mellanox WinIB package (currently WinIB_x86_1_4_0_2094.msi) on the head node. Set up the cluster network configuration to include Infiniband as the MPI network.
2. Create an o/s image and deployment template for the compute nodes.
3. In the Admin Console, right-click on the image and select Manage Drivers. You need 3 drivers for the card to be visible in the admin console, hence configurable:
- ib_bus.inf Mellanox InfiniBand Fabric driver
- mthca.inf InfiniBand Host Channel Adapter driver
- netipoib.inf Mellanox IP over Infiniband protocol driver
You will find the first 2 files in C:\Program Files\Mellanox\WinIB\Drivers on the head node after installation of the WinIB package. The last one will be in C:\Program Files\Mellanox\WinIB\IPoIB.
4. Copy the WinIB_x86_1_4_0_2094.msi package to %CCP_DATA%\InstallShare on the head node.
5. Edit the compute nodes deployment template and add an Installation->Unicast Copy from z:\<winib package>.msi to c:\<winib package>.msi; move the copy operation before the "Install CCP" task.
6. In the same template, add an Installation->Execute OS command: msiexec /i c:\<winib package>.msi /qn ADDLOCAL=ALL.
7. Deploy the compute nodes.
8. When they are deployed, you can install the Network Direct provider with clusrun /nodes:<list of nodes> "%WinIB_HOME%\IPoIB\NDI\ndinstall.exe" -i
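Before adding the drivers in step 3, it can save a failed deployment to confirm that all three .inf files are where you expect them. This is a small sketch, assuming the folder layout described above; the helper name is made up and the check is no substitute for the Manage Drivers wizard itself.

```python
from pathlib import Path

# Sub-folder of the WinIB install directory -> driver files expected there,
# as listed in step 3.
REQUIRED_DRIVERS = {
    "Drivers": ["ib_bus.inf", "mthca.inf"],
    "IPoIB": ["netipoib.inf"],
}


def missing_drivers(winib_root):
    """Return the names of required .inf files not found under winib_root."""
    root = Path(winib_root)
    missing = []
    for subdir, names in REQUIRED_DRIVERS.items():
        for name in names:
            if not (root / subdir / name).is_file():
                missing.append(name)
    return missing


if __name__ == "__main__":
    # Install path as given above; adjust if you installed elsewhere.
    print(missing_drivers(r"C:\Program Files\Mellanox\WinIB"))
```

An empty list means all three files were found.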
Hello again,
I'll be doing a series of webcasts soon, hopefully on the feature-complete beta 2 of HPC Server 2008. They will be mostly demonstrations, with a few slides for those concepts that are not evident in the software. Here is the schedule (all times are PST):
- 5/9/2008 08:00 AM: Windows HPC Server 2008: Management and Diagnostics in High-Performance Computing (Level 200)
- 5/23/2008 08:00 AM: Windows HPC Server 2008: High Availability and Diagnostics for High-Performance Computing (Level 200)
- 5/30/2008 09:30 AM: Windows HPC Server 2008: Job Scheduler and SOA in High-Performance Computing (Level 200)
Phil Pennington will also present on current efforts to develop a unified parallel programming model.
- 5/2/2008 08:00 AM: Future of Multi/Many-Core and the Convergence of Client and Cluster in Parallel Computing (Level 300)
Please click on the links to register.