I’ve been working on deploying LINPACK on our Windows HPC Server 2008 cluster: compiling the source code, setting up the environment on the machines, and tuning the LINPACK input parameters. I would like to share some of that experience with you here.
To run LINPACK on the Windows platform, we go through the following steps:
1. Find the right version of the source code and compile it.
There are several versions of LINPACK. The High-Performance Computing LINPACK Benchmark is called HPL, and its current version is HPL 1.0a; you can find the source archive “hpl.tgz” on the website http://www.netlib.org/benchmark/hpl/.
If your machine is Intel based, you can also find a prebuilt binary in Intel MKL, but take care to pick the one that suits your machines.
2. Set up running environment.
To run LINPACK, we need MPI and BLAS (Basic Linear Algebra Subprograms) libraries on our machines. First, install HPC Pack so that we can use MS-MPI; then choose among the BLAS libraries: GotoBLAS, ATLAS, ACML, ESSL, and MKL. Some libraries are machine specific, so find the suitable one from http://www.netlib.org/blas/faq.html. Here I take Intel MKL as my first choice; you can find it at http://www.intel.com/cd/software/products/asmo-na/eng/266857.htm. Install MKL on every node on which you want to run LINPACK.
3. Install CCP SDK
Download the CCP SDK from http://www.microsoft.com/downloads/details.aspx?FamilyID=D8462378-2F68-409D-9CB3-02312BC23BFD&displaylang=en; if you already have Windows HPC Server 2008 installed, the CCP SDK is included.
4. Configure paths
To run LINPACK on multiple nodes, we need shared folders for the input, the output, and the program executable. Taking my setup as an example: we create a new shared folder on the head node, named “Scratch”, and then make three directories in it: input, output, and bin. To run LINPACK we must provide a file named “HPL.DAT” containing the input parameters; this input file goes into the “input” directory. The output file containing the results will be placed in “output”, and the LINPACK executable lives in “bin”.
5. Estimate Results
To tune the input parameters well, we would like to know the performance efficiency under the current configuration. The theoretical maximum is calculated as: Clock Speed (GHz) × Flops per Cycle (multiplied by the total number of cores when benchmarking more than one core). “Flops per Cycle” is the number of floating-point operations per clock cycle: for Opteron and single-core Xeon the value is 2; for dual-core and quad-core Xeons it is 4. Your efficiency is then the measured result divided by this maximum.
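As a quick sanity check, the arithmetic above can be sketched in a few lines of Python (the function names are mine, not part of any LINPACK tooling):

```python
def theoretical_peak_gflops(clock_ghz, flops_per_cycle, cores=1):
    """Theoretical peak = clock speed (GHz) * flops per cycle * core count."""
    return clock_ghz * flops_per_cycle * cores

def efficiency(measured_gflops, peak_gflops):
    """Fraction of the theoretical peak actually achieved by the benchmark."""
    return measured_gflops / peak_gflops

# Example: a 3.0 GHz quad-core Xeon at 4 flops/cycle per core
peak = theoretical_peak_gflops(3.0, 4, cores=4)   # 48 Gflops
print(efficiency(36.0, peak))                     # 0.75
```

Well-tuned HPL runs typically reach a large fraction of peak, so an efficiency far below that is a sign the input parameters still need work.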
6. Submit jobs
· Input parameters: Modify the HPL.DAT file to suit the target configuration. First, decide the four major parameters N, NB, P, and Q, and leave the others at their default values. A standard input file looks like the following:
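For reference, a typical HPL.DAT follows the stock template shipped with the HPL source; the N, NB, P, and Q values below are only illustrative examples, not tuned settings:

```text
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
21504        Ns
1            # of NBs
224          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
4            Qs
16.0         threshold
1            # of panel fact
1            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
```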
· Submit job:
Use “job submit /numprocessors:P*Q /workdir:\\%CCP_CLUSTER_NAME%\Scratch\Linpack /stdout:hpl.log /stderr:hpl.err mpiexec -wdir \\%CCP_CLUSTER_NAME%\Scratch\Linpack\bin xhp.exe” to submit the job. You can then find it under “Job Management” in the Admin Console:
· View the benchmark results: after the job finishes, you will find results like the ones below:
7. Issues on input parameters
You may have heard that LINPACK has 29 input parameters, so deciding on these inputs is hard work, and it is always the most important part of running LINPACK. But we can start with 4 parameters: N, NB, P, and Q. N is the problem size; it should be large enough to reach the maximum performance, but not so large that it causes paging, which would reduce performance. It is recommended that the matrix use about 80% of total memory. In my experience, we can run some tests on the machine and monitor the available physical memory from the heat map:
If there is plenty of available physical memory, we can increase N, and vice versa. However, the best value is only found after several actual runs.
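The 80%-of-memory rule of thumb can be turned into a starting guess for N. An N × N double-precision matrix takes 8·N² bytes, so (a sketch; the helper name is mine):

```python
import math

def suggested_n(total_mem_gib, fraction=0.8):
    """Largest N whose N x N double-precision matrix (8 bytes per element)
    fits in `fraction` of the node's total memory."""
    bytes_avail = fraction * total_mem_gib * 1024**3
    return int(math.sqrt(bytes_avail / 8))

print(suggested_n(16))  # a 16 GiB node suggests N around 41000
```

In practice you would round this down to a multiple of your chosen NB, and then refine it with actual runs as described above.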
The value of NB should also be determined from real tests; a guideline is N mod NB = 0. Some published results suggest that for Intel Xeon processors NB should be 192, but according to my tests on our TYAN cluster with dual-core Xeons, 224 is a better choice. So I suggest fixing N and increasing NB by 16 each time until we find the maximum Gflops.
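A small helper (hypothetical, not part of HPL) can enumerate the NB values in such a sweep that also satisfy the N mod NB = 0 guideline:

```python
def nb_candidates(n, start=128, stop=256, step=16):
    """NB values in [start, stop], stepping by `step`, that divide N evenly."""
    return [nb for nb in range(start, stop + 1, step) if n % nb == 0]

print(nb_candidates(21504))  # [128, 192, 224, 256]
```

Each candidate would then be benchmarked in turn while N, P, and Q are held fixed.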
As for P and Q, I honestly do not know how to make a decision; the only thing I am sure of is that P * Q must equal the number of cores. I have read a lot of material written by different people: some say the values of P and Q should be close to each other, others say P should be as small as possible. I talked with Xavier, and he suggested starting with a small P, because that is how he got his best performance. Interestingly, though, when I tested on a four-core node, P = 4 with Q = 1 got the best result and P = 1 with Q = 4 performed much worse; the results are below:
But the situation changes a lot with 3 nodes and 12 cores: P = 12 with Q = 1 performs much worse than P = 1 with Q = 12; the results are below:
Maybe the only way to find the best combination is through your own exploration.
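To make that exploration systematic, one can simply enumerate every (P, Q) factorization of the core count and benchmark each; a sketch (the helper name is mine):

```python
def process_grids(cores):
    """All (P, Q) pairs with P * Q == cores, in increasing P."""
    return [(p, cores // p) for p in range(1, cores + 1) if cores % p == 0]

print(process_grids(12))  # [(1, 12), (2, 6), (3, 4), (4, 3), (6, 2), (12, 1)]
```

The list stays short even for larger clusters, so running HPL once per grid is usually feasible.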
So that is some of my experience from these past weeks. Although I have not yet achieved a satisfying efficiency, I am sure the performance can be improved in many ways. I am also very grateful to George for his guidance and to Xavier for his precious suggestions.
Lewis Liu 刘贤斐
PM Intern, Microsoft STB China HPC
Hello everyone, I am Renqi Zhu, a DEV on the HPC team. In the year since I joined Microsoft, I have been very fortunate to witness Windows HPC Server 2008 going from strength to strength, from the first Beta to the recently released Community Technology Preview (CTP); its features keep getting more powerful, which is truly exciting. Here I would like to share some of my experience with network configuration in Windows HPC Server 2008.
As everyone knows, configuring a cluster's network is usually a complex task that is a headache yet unavoidable. To simplify this work, Windows HPC Server 2008 provides a Network Wizard to help us complete the network configuration, as shown in the figure below.
As we can see in the figure above, Windows HPC Server 2008 supports five different network topologies. So how should we choose the topology that fits our own situation? Read on.
First, let us look at the three kinds of networks involved in these topologies:
· Enterprise network: Not only may the cluster's nodes connect to this network; other computers in the enterprise or organization are usually connected to it as well. Most users communicate over this network and do their daily work on it.
· Private network: This is the cluster's internal network. It carries traffic between the nodes inside the cluster, and computers outside the cluster usually cannot connect to it.
· Application network: This is a high-speed network inside the cluster, usually with very high bandwidth and very low latency, used to meet the communication needs of parallel MPI programs within the cluster. Common high-speed networks include Gigabit Ethernet, 10 Gigabit Ethernet, Myrinet©, and InfiniBand©.
The five topologies supported by Windows HPC Server differ in which subset of the above networks they include, and in which networks the compute nodes can connect to.
Before choosing among these five topologies, we may want to consider factors such as the following:
We can summarize the above factors in the following table:
(Table: rows include factors such as “Compute nodes connect to the Application network” and “Easier access to the nodes within the cluster”, with check marks (√) indicating which of the five topologies satisfy each factor.)
How about it: doesn't it now feel quite easy to make the right choice among the five network topologies of Windows HPC Server? :)
Renqi Zhu
DEV, Windows HPC
Shanghai, China