HPC China R&D Team

The Chinese-language blog of the High Performance Computing team, Server and Tools Business, Microsoft Asia-Pacific R&D Group.

May 2008

  • Step by step LINPACK guidance

    I’ve been working on deploying LINPACK on our Windows HPC Server 2008 cluster: compiling the source code, setting up the environment on the machines, and tuning LINPACK’s input parameters. I’d like to share some of that experience with you.

    To run LINPACK on the Windows platform, we go through the following steps:

    1. Find the right version of the source code and compile it.

    There are several versions of LINPACK; the High Performance Computing LINPACK Benchmark is called HPL, and its current version is 1.0a. You can download the source archive “hpl.tgz” from http://www.netlib.org/benchmark/hpl/.

    If your machines are Intel-based, you can also get a prebuilt binary from the Intel MKL, but make sure to pick the one that matches your hardware.

     

    2. Set up the running environment.

    To run LINPACK we need MPI and a BLAS (Basic Linear Algebra Subprograms) library on the machines. So first install HPC Pack, which gives us MS-MPI; then choose a BLAS library from among GotoBLAS, ATLAS, ACML, ESSL, and MKL. Some libraries are machine specific, so consult http://www.netlib.org/blas/faq.html to find a suitable one. Here I take Intel MKL as a first choice; you can find it at http://www.intel.com/cd/software/products/asmo-na/eng/266857.htm. Install MKL on every node that will run LINPACK.

    3. Install the CCP SDK.

    Download the CCP SDK from http://www.microsoft.com/downloads/details.aspx?FamilyID=D8462378-2F68-409D-9CB3-02312BC23BFD&displaylang=en; if you already have Windows HPC Server 2008 installed, the CCP SDK is included.

     

    4. Configure paths.

    To run LINPACK on multiple nodes, we need shared folders for the input, the output, and the program executable. Taking my setup as an example: I created a shared folder named “Scratch” on the head node and made three directories in it: input, output, and bin. LINPACK reads its parameters from a file named “HPL.DAT”, which goes in “input”; the results file is written to “output”, and the LINPACK executable lives in “bin”.

     

    5. Estimate results.

    To tune the input parameters, we want to know the performance efficiency under the current configuration. The theoretical maximum is calculated as Clock Speed (GHz) * Flops per Cycle (multiplied by the total number of cores for a cluster-wide peak). “Flops per cycle” is the number of floating-point operations per clock: for Opteron and Xeon the value is 2; for dual-core and quad-core Xeons it is 4. Your efficiency is then the measured result divided by this maximum.
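The estimate above can be written as a small helper (my own sketch; the numbers in the comment are illustrative):

```python
def peak_gflops(clock_ghz, flops_per_cycle, cores):
    """Theoretical peak: clock speed (GHz) x flops per cycle x core count."""
    return clock_ghz * flops_per_cycle * cores

def efficiency(measured_gflops, clock_ghz, flops_per_cycle, cores):
    """Measured result divided by the theoretical peak."""
    return measured_gflops / peak_gflops(clock_ghz, flops_per_cycle, cores)

# Example: a 2.0 GHz quad-core Xeon at 4 flops/cycle peaks at 32 GFlops,
# so a measured 24 GFlops would be 75% efficiency.
```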

    6. Submit jobs.

    • Input parameters: modify the HPL.DAT file to suit the target configuration. First decide the four major parameters N, NB, P, and Q, and leave the others at their default values. A standard input file looks like the following:

    [Figure: sample HPL.DAT input file]
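For reference, a representative HPL.DAT for HPL 1.0a might look like the sketch below; the values N = 10000, NB = 192, P = 2, Q = 2 are illustrative only, and the remaining lines are left at the defaults shipped with the HPL source:

```text
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
10000        Ns
1            # of NBs
192          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs
16.0         threshold
(... remaining algorithm-tuning lines kept at the HPL defaults)
```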

    • Submit the job:

    Use “job submit /numprocessors:P*Q /workdir:\\%CCP_CLUSTER_NAME%\Scratch\Linpack /stdout:hpl.log /stderr:hpl.err mpiexec -wdir \\%CCP_CLUSTER_NAME%\Scratch\Linpack\bin xhpl.exe” to submit the job, where P*Q stands for the actual number of processes. You can then find it under “Job Management” in the administration console:

    [Figure: the submitted job shown in Job Management]

    • View the benchmark results: after the job finishes, you will find results like the following:

    [Figure: sample LINPACK benchmark results]

    7. Issues with input parameters.

    You may have heard that LINPACK has 29 input parameters, so deciding on them is hard work, and it is always the most important part of running LINPACK. But we can start from four parameters: N, NB, P, and Q. N is the problem size; it should be large enough to reach maximum performance, but not so large that it causes paging, which reduces performance. It is recommended that the matrix use about 80% of total memory. In my experience, we can run some tests on the machine and monitor the available physical memory in the heat map:

    [Figure: heat map showing available physical memory]

    If there is too much available physical memory we can increase N, and vice versa. However, the best value is only found after several actual runs.
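Combining the 80%-of-memory rule with the guideline that N should be a multiple of NB, a rough sizing helper might look like this (my own sketch, not part of HPL):

```python
import math

def suggest_n(total_mem_gb, nb=192, mem_fraction=0.8):
    """Problem size N such that the N x N matrix of doubles uses about
    mem_fraction of total memory, rounded down to a multiple of NB
    (the N mod NB = 0 guideline)."""
    mem_bytes = total_mem_gb * 1024 ** 3
    n = int(math.sqrt(mem_fraction * mem_bytes / 8))  # 8 bytes per double
    return n - n % nb

# suggest_n(12) -> 35712 for a cluster with 12 GB of total memory
```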

    The value of NB should also come from real tests; a guideline is N mod NB = 0. Reported experience says NB should be 192 for Intel Xeon processors, but according to my tests on our TYAN cluster with dual-core Xeons, 224 is a better choice. So we can hold N fixed and increase NB by 16 each time until we find the maximum Gflops.
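The NB sweep just described can be sketched as follows; the (NB, Gflops) pairs in the comment are hypothetical measurements, not real results:

```python
def nb_sweep(start=128, stop=256, step=16):
    """Candidate NB values, stepping by 16 as suggested above."""
    return list(range(start, stop + 1, step))

def best_nb(results):
    """Pick the NB with the highest measured Gflops from (NB, Gflops) pairs."""
    return max(results, key=lambda r: r[1])[0]

# Hypothetical measurements on a dual-core Xeon node:
# best_nb([(192, 20.1), (208, 21.5), (224, 22.3), (240, 21.0)]) -> 224
```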

    As for P and Q, I really don’t know how to decide; the only thing I am sure of is that P * Q must equal the number of cores. I have read materials by different people: some say P and Q should be close to each other, others say P should be as small as possible. I talked with Xavier, and he suggested starting with a small P, because that is how he got his best performance. Funnily enough, in my test on a four-core node, P = 4 with Q = 1 got the best result while P = 1 with Q = 4 performed much worse:

    [Figure: P/Q results on a four-core node]

    But the situation changed a lot on 3 nodes with 12 cores: P = 12 with Q = 1 performed much worse than P = 1 with Q = 12:

    [Figure: P/Q results on 3 nodes with 12 cores]

    Maybe the only way to find the best combination is through your own exploration.
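Since P * Q must equal the number of cores, the candidate process grids to explore are just the factor pairs of the core count; a sketch:

```python
def process_grids(cores):
    """All (P, Q) pairs with P * Q equal to the number of cores."""
    return [(p, cores // p) for p in range(1, cores + 1) if cores % p == 0]

# process_grids(12) -> [(1, 12), (2, 6), (3, 4), (4, 3), (6, 2), (12, 1)]
```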

     

    That is some of my experience from these past weeks. Although I have not yet reached a satisfying efficiency, I am sure the performance can be improved in many ways. I am also very grateful to George for his guidance and to Xavier for his precious suggestions.

     

    Lewis Liu 刘贤斐

    PM Intern, Microsoft STB China HPC

  • A look at our product: choosing the right network topology

    Hello everyone, I am Renqi Zhu, a DEV on the HPC team. In my year since joining Microsoft, I have been fortunate to watch Windows HPC Server 2008 go from strength to strength, from the first Beta to the recently released Community Technology Preview (CTP), growing ever more capable, which is truly exciting. Here I would like to share some of my thoughts on network configuration in Windows HPC Server 2008.

    As everyone knows, configuring a cluster’s network is usually a headache-inducing but unavoidable chore. To simplify this work, Windows HPC Server 2008 provides a Network Wizard to guide us through network configuration, as shown below.

    [Figure: the Network Wizard]

    As the figure shows, Windows HPC Server 2008 supports five different network topologies. So how should we choose the one that suits our own situation? Read on.

    First, let’s look at the three kinds of networks involved in these topologies:

    • Enterprise network (called the Public network in Beta 1)

    Not only may the cluster’s nodes connect to this network; other computers in the enterprise or organization usually connect to it as well. Most users communicate and do their daily work over this network.

    • Private network

    This is the cluster’s internal network. It carries traffic between the nodes inside the cluster; computers outside the cluster usually cannot connect to it.

    • Application network (called the MPI network in Beta 1)

    This is a high-speed network internal to the cluster, usually with very high bandwidth and very low latency, serving the communication needs of parallel MPI programs inside the cluster. Common high-speed interconnects include Gigabit Ethernet, 10 Gigabit Ethernet, Myrinet©, InfiniBand©, and so on.

    The five topologies supported by Windows HPC Server differ in which subset of these networks they include, and in which networks the compute nodes can connect to.

    • Topology 1: the cluster has Enterprise and Private networks; compute nodes connect only to the Private network.
    • Topology 2: the cluster has Enterprise and Private networks; compute nodes connect to both.
    • Topology 3: the cluster has Enterprise, Private, and Application networks, but compute nodes connect only to the Private and Application networks.
    • Topology 4: the cluster has Enterprise, Private, and Application networks; compute nodes connect to all three.
    • Topology 5: the cluster has only the Enterprise network, and all nodes are on it. This is the simplest of the five topologies.

    Before choosing among these five topologies, we may want to consider the following factors:

    1. Do you want to use the new deployment tools in Windows HPC Server 2008 to deploy the cluster’s nodes? Windows HPC Server 2008 uses Windows Deployment Services (WDS) to simplify the tedious work of deploying cluster nodes, making cluster deployment very convenient (see “A look at our product: first experience deploying Windows HPC Server 2008”). If you want this convenient feature, topology 5 is no longer an option.
    2. Do you want to separate the cluster’s internal traffic from other traffic in the enterprise or organization, for more balanced and better network performance? If not, consider topology 5; otherwise the first four topologies are more suitable.
    3. Do you need a high-speed network to carry the traffic of parallel MPI programs for a further performance boost? If so, topologies 3 and 4 are the best choices. If none of your applications are built on an MPI library, the Application network is unnecessary, and you can consider the other three topologies.
    4. Do the compute nodes need heavy access to resources on the Enterprise network or the Internet? If so, we recommend topology 2, 4, or 5. In topologies 1 and 3 the compute nodes can reach external networks through the NAT service on the head node, but all external traffic then flows through the head node, which may become a performance bottleneck.
    5. The balance between security and convenience of access. Topologies 1 and 3 separate the cluster’s internal traffic from external traffic; nodes inside the cluster cannot be reached directly from outside, which improves security. The other three topologies connect all nodes directly to the Enterprise network, which makes developing and debugging applications on the cluster easier. So you will have to trade security against convenience of access.

    We can summarize these factors in the following table:

                                                     Topo 1  Topo 2  Topo 3  Topo 4  Topo 5
    Compute nodes on the Enterprise network                    ✓               ✓       ✓
    Compute nodes on the Private network                ✓      ✓       ✓       ✓
    Compute nodes on the Application network                           ✓       ✓
    Supports WDS deployment                             ✓      ✓       ✓       ✓
    Internal traffic separated from external traffic    ✓      ✓       ✓       ✓
    High-speed network for MPI program performance                     ✓       ✓
    Heavy compute-node access to outside resources             ✓               ✓       ✓
    Higher security                                     ✓              ✓
    Easier access to nodes inside the cluster                  ✓               ✓       ✓
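The decision process above can be encoded as a small helper (my own illustration, not part of the product): each answer to the five questions narrows the set of candidate topologies.

```python
def candidate_topologies(use_wds, separate_traffic, need_mpi_network,
                         heavy_external_access, prefer_security):
    """Narrow the five topologies down according to the five factors above."""
    topos = {1, 2, 3, 4, 5}
    if use_wds:
        topos -= {5}           # factor 1: WDS rules out topology 5
    if separate_traffic:
        topos &= {1, 2, 3, 4}  # factor 2: need a Private network
    if need_mpi_network:
        topos &= {3, 4}        # factor 3: need an Application network
    if heavy_external_access:
        topos &= {2, 4, 5}     # factor 4: avoid the head-node NAT bottleneck
    if prefer_security:
        topos &= {1, 3}        # factor 5: keep nodes off the Enterprise network
    return sorted(topos)

# Wanting WDS, traffic separation, an MPI network, and heavy external
# access leaves only topology 4.
```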

    Well, doesn’t it now feel easy to make the right choice among the five network topologies of Windows HPC Server? :)

     

    Renqi Zhu

    DEV, Windows HPC

    Shanghai, China