
The Windows HPC Team Blog

"Your guide to all things Windows HPC"
Warning: Not all InfiniBand HCAs have a PSID

If you read the documents posted by Mellanox about their new 2.0.5 build 4453 InfiniBand drivers, you may have noticed the advice to update your firmware. To do that, you first need to discover your PSID. This should be pretty straightforward: install the drivers, then use the Run a Command feature of the HPC Management console to run vstat on the node you wish to update. If you are lucky, you’ll see something like this:

NODE-08 -> Finished

-------------------------------------------------------------------------------------------------

 

        hca_idx=0

        uplink={BUS=PCI_E, SPEED=2.5 Gbps, WIDTH=x8, CAPS=2.5*x8}

        vendor_id=0x08f1

        vendor_part_id=0x6278

        hw_ver=0xa0

        fw_ver=4.08.0200

        PSID=VLT0040010001

        node_guid=0008:f104:0399:2054

        num_phys_ports=2

               port=1

               port_state=PORT_ACTIVE (4)

               link_speed=5.0 Gbps (2)

               link_width=4x (2)

               rate=20 Gbps

               port_phys_state=LINK_UP (5)

               active_speed=5.0 Gbps (2)

               sm_lid=0x0001

               port_lid=0x0009

               port_lmc=0x0

               max_mtu=2048 (4)

 

               port=2

               port_state=PORT_DOWN (1)

               link_speed=NA

               link_width=NA

               rate=NA

               port_phys_state=POLLING (2)

               active_speed=2.5 Gbps (1)

               sm_lid=0x0000

               port_lid=0x0000

               port_lmc=0x0

               max_mtu=2048 (4)

 

 

If, like me, you are unlucky, you will not have a PSID line in the output. Like this:

NODE-07 -> Finished

--------------------------------------------------------------------------------------

 

       hca_idx=0

       uplink={BUS=PCI_E, SPEED=2.5 Gbps, WIDTH=x8, CAPS=2.5*x8}

       vendor_id=0x066a

       vendor_part_id=0x6274

       hw_ver=0xa0

       fw_ver=0x100020000

       node_guid=0006:6a00:9800:f356

       num_phys_ports=1

             port=1

             port_state=PORT_ACTIVE (4)

             link_speed=5.0 Gbps (2)

             link_width=4x (2)

             rate=20 Gbps

             port_phys_state=LINK_UP (5)

             active_speed=5.0 Gbps (2)

             sm_lid=0x0001

             port_lid=0x000a

             port_lmc=0x0

             max_mtu=2048 (4)
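
Comparing vstat output by eye gets tedious on more than a couple of nodes. The following is a minimal Python sketch, assuming you have saved the vstat output of a node as text in the key=value form shown above. The function name find_psids is hypothetical and not part of any Mellanox or HPC Pack tool; it only reports, for each hca_idx, the PSID value or None when the line is missing.

```python
import re
from typing import Dict, Optional


def find_psids(vstat_output: str) -> Dict[int, Optional[str]]:
    """Map each hca_idx found in vstat output to its PSID (None if absent)."""
    psids: Dict[int, Optional[str]] = {}
    current_hca: Optional[int] = None
    for raw_line in vstat_output.splitlines():
        line = raw_line.strip()
        hca_match = re.fullmatch(r"hca_idx=(\d+)", line)
        if hca_match:
            # A new HCA section starts; assume no PSID until we see one.
            current_hca = int(hca_match.group(1))
            psids[current_hca] = None
            continue
        psid_match = re.fullmatch(r"PSID=(\S+)", line)
        if psid_match and current_hca is not None:
            psids[current_hca] = psid_match.group(1)
    return psids
```

For the two nodes above, this would report a PSID for hca_idx 0 on NODE-08 and None for hca_idx 0 on NODE-07.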

 

 

 

 

If this happens to you, don’t waste time trying to find the PSID of your HCA. As soon as I find a way to pick the right firmware upgrade for my own HCAs without PSIDs, I’ll post how to do it. Until then, you and I must run on whatever firmware we already have.

 

Sorry,

 

  Frankie

New Mellanox WinOF InfiniBand WHQLed Drivers: Driver Installation

Mellanox has released a new version of its WHQLed WinOF drivers: v2.0.5 build 4453.

See: http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=32&menu_section=34

Click the download tab in the middle of the page and select the MLNX WinOF MSI v2.0.5 for x64 Platforms shortcut  to begin the .msi download from http://www.mellanox.com/downloads/WinOF/MLNX_WinOF_2_0_5_wlh_x64_fre_2_0_5_4453.msi . I give the first link because it references other documents of interest to anyone installing these drivers.

This release includes both the .msi package mentioned above and an INF-compatible package.

To update drivers on a preinstalled system, use the HPC Management console Run Command feature on a group of nodes with the command:

msiexec /quiet /forcerestart  /i  \\headnode\Home\LocalAdmin\MLNX_WinOF_2_0_5_wlh_x64_fre_2_0_5_4453.msi

I was able to test this on a single node, and it had the somewhat disconcerting effect of claiming in the Run a Command window that the command had failed. This was because the command never completed cleanly due to the forced restart. The same is likely to apply when running this command from clusrun.

To use the INF install, first unzip the file MLNX_WinOF INF file v2.0.5 for x64 Platforms from the page you arrived at by clicking the Download tab above: http://www.mellanox.com/downloads/WinOF/MLNX_WinOF_HPC_x64_2_0_5.zip . Then use the HPC Management console Configure->To-do List->Manage Drivers link to point to the INF directory. You may first have to remove any references to the older version of the drivers left over from your earlier insertion of InfiniBand drivers.

Have fun and compute well,

 

  Frankie

HPC Server 2008 MPI Diagnostic Fails on Eager Message No Business Card Error

An HPC Server 2008 user reported that his cluster was up and running and that all nodes could ping each other over all networks but the built-in MPI diagnostic was failing with an uninformative message "Failed To Run".

He was using topology number three, with the head node connected to the Enterprise network and all compute nodes connected to the head node via Ethernet as the Private network and InfiniBand as the Applications network.

Please be aware that "Failed To Run" is a separate category from "Failure", and when a test doesn't succeed you may have to check both places in the Test Results branch of the Diagnostics tree. Once you find the test there, you don't get much information beyond the result "Failed To Run". However, if you click on the line labelled with the red !, the bottom pane lights up, though it still only says "Test Failed to Run". Look to the right side of that banner and you will see a bright red "Result" followed by a v in a circle. Click on the v and you get more information about the failure:

  • The test did not run. Please navigate to 'Progress of the test' to view log and error messages.

    So where is the Progress of the test to be found? Well, if like me you often don't have the Actions pane open, you had better click on the Actions tab near the top of the console. Now near the top of the Actions pane you will see the link to "Progress of the Test". This is progress, of a sort. You'll likely see just a single line with the red ! and a State of "Reverted". Now click on that line.

    Oh, boy, we're rockin' now. Here are the real error messages. This is so information rich it's almost embarrassing.

    Time Message
    6/29/2009 10:16:53 AM Reverted
    6/29/2009 10:16:53 AM The operation failed due to errors during execution.
    6/29/2009 10:16:53 AM The operation failed and will not be retried.
    6/29/2009 10:16:53 AM ---- error analysis -----
    6/29/2009 10:16:53 AM 
    6/29/2009 10:16:53 AM mpi has detected a fatal error and aborted mpipingpong.exe
    6/29/2009 10:16:53 AM [2] on NODE-03
    6/29/2009 10:16:53 AM 
    6/29/2009 10:16:53 AM ---- error analysis -----
    6/29/2009 10:16:53 AM 
    6/29/2009 10:16:53 AM [3-6] terminated
    6/29/2009 10:16:53 AM 
    6/29/2009 10:16:53 AM Check the local NetworkDirect configuration or set the MPICH_ND_ENABLE_FALLBACK environment variable to true.
    6/29/2009 10:16:53 AM There is no matching NetworkDirect adapter and fallback to the socket interconnect is disabled.
    6/29/2009 10:16:53 AM CH3_ND::CEnvironment::Connect(296): [ch3:nd] Could not connect via NetworkDirect to rank 1 with business card (port=58550 description="10.1.0.2 192.168.0.28 192.168.0.39 NODE-02 " shm_host=NODE-02 shm_queue=3204:428 nd_host="10.1.0.2:157 " ).
    6/29/2009 10:16:53 AM MPIDI_CH3I_VC_post_connect(426)...: MPIDI_CH3I_Nd_connect failed in VC_post_connect
    6/29/2009 10:16:53 AM MPIDI_CH3_iSendv(239).............:
    6/29/2009 10:16:53 AM MPIDI_EagerContigIsend(519).......: failure occurred while attempting to send an eager message
    6/29/2009 10:16:53 AM MPIC_Sendrecv(120)................:
    6/29/2009 10:16:53 AM MPIR_Allgather(487)...............:
    6/29/2009 10:16:53 AM MPI_Allgather(864)................: MPI_Allgather(sbuf=0x00000000001FF790, scount=128, MPI_CHAR, rbuf=0x0000000000B70780, rcount=128, MPI_CHAR, MPI_COMM_WORLD) failed
    6/29/2009 10:16:53 AM Fatal error in MPI_Allgather: Other MPI error, error stack:
    6/29/2009 10:16:53 AM [2] fatal error
    6/29/2009 10:16:53 AM 
    6/29/2009 10:16:53 AM [0-1] terminated
    6/29/2009 10:16:53 AM 
    6/29/2009 10:16:53 AM [ranks] message
    6/29/2009 10:16:53 AM job aborted:
    6/29/2009 10:16:53 AM  

    First it tells us that Node-03 had a problem. Then it tells us to look at the Node-03 local Network Direct connection. Then it tells us that the environment is set to not fall back to Winsock Direct or TCP/IP. This is because falling back when people are expecting Network Direct performance can cause applications to run very slowly and is hard to diagnose. Trust me. I missed sleep over that one.

    Then we have several lines of MPI error messages, which I generally summarize as the 'eager message no business card' error. You can ignore the rest of the message, but keep in mind that whenever you see the eager message no business card error you should suspect that your MPI network has a problem.
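
If you save the Progress of the Test messages to a text file, the telltale lines can be picked out automatically. The following is a minimal Python sketch, assuming the log text looks like the messages above; the function name summarize_mpi_failure and the keys of the returned dictionary are hypothetical, chosen only for illustration.

```python
import re
from typing import Dict


def summarize_mpi_failure(log_text: str) -> Dict[str, object]:
    """Pick the key facts out of an MPI diagnostic log like the one above."""
    # Lines such as "[2] on NODE-03" name the rank and node that hit the error.
    rank_on_node = re.findall(r"\[(\d+)\] on (\S+)", log_text)
    return {
        "failed_ranks": [(int(rank), node) for rank, node in rank_on_node],
        "eager_message_error": "attempting to send an eager message" in log_text,
        "network_direct_error": "Could not connect via NetworkDirect" in log_text,
        "fallback_disabled": "fallback to the socket interconnect is disabled" in log_text,
    }
```

For the log above, failed_ranks would contain rank 2 on NODE-03 and all three flags would be true, which points straight at the NetworkDirect setup on that node.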

    So, let's follow the advice at the beginning of the error messages and check the InfiniBand status on Node-03. I use the Run Command feature of the Management Console to run ndinstall -l. To make life easier, I copy the .exe for this tool to all of the nodes as C:\Windows\System32\ndinstall.exe . This tool is usually installed on the head node by the .msi install of the drivers. Search your system drive after you install the drivers and find this tool. Then put it on a head node share the compute nodes can see, and use clusrun or the Run Command GUI to copy it to all the compute nodes. Here's the output from Node-03 (bad node, no business card) and Node-02 (good node, pat pat).

    Node-03

    0000001001 - MSAFD Tcpip [TCP/IP]
    0000001002 - MSAFD Tcpip [UDP/IP]
    0000001003 - MSAFD Tcpip [RAW/IP]
    0000001004 - MSAFD Tcpip [TCP/IPv6]
    0000001005 - MSAFD Tcpip [UDP/IPv6]
    0000001006 - MSAFD Tcpip [RAW/IPv6]
    0000001007 - RSVP TCPv6 Service Provider
    0000001008 - RSVP TCP Service Provider
    0000001009 - RSVP UDPv6 Service Provider
    0000001010 - RSVP UDP Service Provider

    Node-02

    0000001001 - MSAFD Tcpip [TCP/IP]
    0000001002 - MSAFD Tcpip [UDP/IP]
    0000001003 - MSAFD Tcpip [RAW/IP]
    0000001004 - MSAFD Tcpip [TCP/IPv6]
    0000001005 - MSAFD Tcpip [UDP/IPv6]
    0000001006 - MSAFD Tcpip [RAW/IPv6]
    0000001007 - RSVP TCPv6 Service Provider
    0000001008 - RSVP TCP Service Provider
    0000001009 - RSVP UDPv6 Service Provider
    0000001010 - RSVP UDP Service Provider
    0000001011 - OpenIB Network Direct Provider

    Notice there is no 0000001011 OpenIB Network Direct Provider on Node-03. So, actually the diagnostic got it right immediately. I just took a long time to prove it. So now let's run ndinstall -i on Node-03. Again with the Run Command, eh? All we get is a Finished. Then run ndinstall -l again and verify that we get the 0000001011 OpenIB Network Direct Provider line. Yes we do, but do not be confused if, for you as for me, it is "0000001012 - OpenIB Network Direct Provider". The sequence number is not important.
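
Because the sequence number can differ from node to node, a check for the provider should match on the name, not the number. The following is a minimal Python sketch, assuming you have captured the ndinstall -l output of each node as text; the function name has_network_direct_provider is hypothetical.

```python
import re

# Matches lines such as "0000001011 - OpenIB Network Direct Provider",
# whatever the leading sequence number happens to be.
_ND_PROVIDER = re.compile(
    r"^\s*\d+\s*-\s*.*Network Direct Provider\s*$", re.MULTILINE
)


def has_network_direct_provider(ndinstall_output: str) -> bool:
    """Return True if the provider list contains a Network Direct provider."""
    return _ND_PROVIDER.search(ndinstall_output) is not None
```

Run against the two listings above, this would return False for Node-03 and True for Node-02.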

    And finally let's run the diagnostic again and look in Diagnostics->Test Results->Success. Ah, now that's sweet!

    Test Name                                Result     Test Suite        Target      Last Updated
    MPI Ping-Pong: Quick Check    Success  Performance   7 nodes    6/29/2009 10:54:58 AM

    That's it for now, "Transfer fast and prosper."

     Frankie

     

     

    Overview of Charts and Reports in V2

    Recently a customer asked me to create a document to briefly describe the charting and reporting functionality in V2. After completing the document I felt that it would make a good blog posting to share with the HPC community. The document is attached to this blog posting. Please feel free to provide feedback.

     Thanks.

     

    2009 Summer Scripting Games

     This might be of interest to HPC PowerShell users.

     

    --------------- 

      

    PowerShellCommunity.org Joins Forces with Microsoft Scripting Guys to Host 2009 Summer Scripting Games

     

    LOS ANGELES – At Microsoft TechEd 2009, PowerShellCommunity.org, an online community where script writers connect and share knowledge, today announced a key alliance with the Microsoft Script Center (aka Scripting Guys) and PoshCode.org to host the 2009 Scripting Games, June 15–26, 2009. 

    “We started the Games to challenge scripters everywhere, invite them in to become part of a fun community, and to learn in a cost-effective way,” said John Merrill, IT content evangelist and publishing manager in the Windows Server Division User Assistance group. “We are pleased to work together with PowerShellCommunity.org in helping deliver a premium scripting experience for the two weeks in June when we host the Games.”

    The Scripting Games are a chance for IT professionals to practice and test their scripting skills during 10 events using either Microsoft Windows PowerShell or Microsoft VBScript. The Games begin as a live event with contestants submitting entries that are judged and scored by the community.

    “We are looking forward to the Scripting Games and being part of the community in helping up to 1,000 or more script writers in showcasing their craft,” said Hal Rottenberg, director of PowerShellCommunity.org. “Sponsors of PowerShellCommunity.org like Idera, Quest Software, Inc., Compellent and SAPIEN Technologies, Inc. help us provide this venue for Windows PowerShell users to collaborate and communicate.”

    To enter the Scripting Games visit http://www.microsoft.com/technet/scriptcenter/funzone/games/.

     

     

    MPI Cluster Debugger in Visual Studio 2010 Beta 1

    If you have ever written an MPI program for a Windows HPC cluster, you should be familiar with the MPI Cluster Debugger in Visual Studio 2005/2008. You can also find plenty of resources about it online (blogs, white papers). Do you like debugging MPI programs on a cluster? Is the debugger easy to use? Visual Studio 2010 Beta 1 has now been released, and the HPC team has invested a lot of effort in improving the MPI Cluster Debugger. Let’s go through it.

    The MPI Cluster Debugger is in the same place as before (the Project Property Page). The difference is that there are many more properties now. Don’t worry: although there are about 20 properties, you will be familiar with them soon. In most cases we only need to set 3 of them. Default values are used for any properties left empty.


    The most important thing when debugging a program on a cluster is to specify the head node. “Run Environment” is the first mandatory property. Click “Edit Hpc Node…” and the “Node Selector” dialog pops up. Here we can specify the head node and choose compute nodes. We can either specify the total number of MPI processes or specify precisely the number of MPI processes on each selected node. This dialog also shows the real-time CPU usage of each node in the cluster. If we only need to debug the program on the local machine with 4 processes, we just enter “localhost/4”.


    Another mandatory property is “Working Directory”. It must be a local path. The MPI Cluster Debugger will create it for us if it doesn’t exist. The last mandatory property is “Application Command”. We can use the Visual Studio built-in macros there, such as “$(TargetFileName)”.


    “Deployment Directory” is optional; its default value is \\<HeadNode>\CcpSpoolDir\<UserName>. CcpSpoolDir is created during the installation of the Windows HPC cluster. If we don’t want to use the default value, we can enter our own. Make sure it is a shared path and that we have permission to read and write files there.

    We can select a different debugger engine through the “Debugger Type” property. To debug an MPI .NET program, choose “Managed Only”.


    Each property has an explanation at the bottom of that page, so I won’t go through them one by one. Let me know if anything is unclear.

    Once the mandatory properties are specified, we can use the basic features of the MPI Cluster Debugger. Press “F5”. After a while, the MPI processes are launched on the selected nodes and Visual Studio attaches to them. The “Output” view in Visual Studio shows what happened; if an error occurs, detailed information is printed there. The “Processes” view lists the MPI processes. A breakpoint in the source file is hit when a process reaches it.


    We have briefly gone through the MPI Cluster Debugger above. Some small changes may happen in Beta 2 or the RTM version. If you have any feedback or suggestions, reply to me. Thank you!

    Submit a job from SUA (Subsystem for UNIX-based Applications)

    SUA is a very nice tool for porting Unix/Linux applications, but it can also be used to submit jobs. You just have to add the path to the HPCS binaries to your PATH variable:

    With BASH :
      export PATH=$PATH:"/dev/fs/C/Program Files/Microsoft HPC Pack/Bin"

    With Korn Shell :
      set PATH $PATH:"/dev/fs/C/Program Files/Microsoft HPC Pack/Bin"

    And then you can submit a job (copied from a Korn shell window on my head node):

    $ set PATH $PATH:"/dev/fs/C/Program Files/Microsoft HPC Pack/Bin"
    $   job.exe submit  /workdir:\\\\hpcs-fr\\Shared /stdout:out-tv.txt /stderr:ett-tv.txt ping -n 50 hpcs-fr
    Job has been submitted. ID: 37.
    $ job.exe list
    Id         Owner                Name                                     State        Priority    Resource Request
    ---------- -------------------- ---------------------------------------- ------------ ----------- ------------------
    36         HPCSFR\Administrator                                          Running      Normal      *-* cores
    37         HPCSFR\Administrator                                          Running      Normal      *-* cores

    2 matching jobs for HPCSFR\administrator
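
Two details in the session above are easy to get wrong: the Windows install path has to be written in the /dev/fs/<drive> form, and backslashes in UNC paths have to be doubled so the shell does not swallow them. The following is a minimal Python sketch of both conversions, assuming the /dev/fs/C mapping shown above applies to other drive letters in the same way; the function names are hypothetical.

```python
import ntpath


def windows_to_sua_path(win_path: str) -> str:
    """Convert a drive-letter path such as C:\\dir to the /dev/fs/C/dir form."""
    drive, rest = ntpath.splitdrive(win_path)
    if len(drive) != 2 or drive[1] != ":":
        raise ValueError("expected a path that starts with a drive letter")
    return "/dev/fs/" + drive[0].upper() + rest.replace("\\", "/")


def escape_backslashes(arg: str) -> str:
    """Double every backslash so an unquoted shell argument keeps them."""
    return arg.replace("\\", "\\\\")
```

For example, the HPC Pack Bin directory maps to the value used in the PATH lines above, and \\hpcs-fr\Shared becomes the \\\\hpcs-fr\\Shared form passed to /workdir.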

    DNS Suffix vs. Active Directory Domain for HPC cluster (Part 2)

    In an earlier post, I described an approach for resolving names when the connection specific DNS suffix did not match the Active Directory domain name.  Recently, I realized that there is a much simpler solution.  The Default ComputeNode Template which is created when you run through the Node Template Creation Wizard contains a step for joining the AD domain, but unless you edit the template, the field for specifying the AD domain name is blank.  If that field is blank, the compute nodes will use the primary DNS suffix of the head node as the name of the AD domain they attempt to join, and if there is no such domain, that step will fail. So, the solution is simply to edit the Node Template and add the AD Domain name to that field.

    How to set environment variables on compute nodes

    ISV applications often require some environment variables to be set on the compute nodes.
    You can set these variables by hand, but with many nodes that takes a long time.
    So this post shows how to set them automatically.

     Use the HPCS tools:

    You can set cluster-wide variables by running cluscfg in an Administrator command-line window:
       "cluscfg setenvs name=value"

    ex.: cluscfg setenvs LM_LICENSE_FILE=license_server@74000

    Each time you submit a job, this variable will be set for the job.

     Use the command line:

    Select the nodes in the Node Management view and run the following command:
     "setx name value /M"
    This command adds the environment variable to the system environment of all the selected nodes.

    NOTE: /M sets the variable system-wide rather than for the current user.

    ex. : setx PATH "%PATH%;\\headnode\Apps\bin" /M
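
When there are several variables to set, it can help to generate the command lines from one list instead of typing them. The following is a minimal Python sketch that only builds the two command strings described above; it does not run them, and the function names are hypothetical. Quoting the setx value is an assumption made here so that values containing spaces or semicolons stay in one argument.

```python
def cluscfg_setenv_command(name: str, value: str) -> str:
    """Build the cluster-wide form: the variable is set for every submitted job."""
    return f"cluscfg setenvs {name}={value}"


def setx_machine_command(name: str, value: str) -> str:
    """Build the per-node form: /M writes to the system environment."""
    return f'setx {name} "{value}" /M'


if __name__ == "__main__":
    # Print the commands so they can be pasted into Run Command or a script.
    print(cluscfg_setenv_command("LM_LICENSE_FILE", "license_server@74000"))
    print(setx_machine_command("PATH", r"%PATH%;\\headnode\Apps\bin"))
```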

    How to Capture a “Master” Compute Node Image Using Node Templates

    Whenever I discuss image management capabilities of the HPC Cluster Manager, one question which invariably arises is, “How do I capture an image of a compute node which has been customized so that I can use it as the ‘master’ for deploying to the rest of the nodes?” My response has been to send an email with a link to the Windows Automated Installation Kit (AIK), and the Microsoft Deployment Toolkit (MDT).   While these kits include the necessary tools and techniques, it is very much like looking for a needle in a haystack because of the volume of information which does not pertain to the HPC environment.  The HPC Cluster Manager already leverages most of the tools described in the links above when it is provisioning new nodes from bare metal.  Having sifted through all of the extraneous material, the intent of this post is to describe a simple process for creating a master image and then deploying it using customized Node Templates. 

    Obtain a copy of imagex.exe

    The one tool which is required for capturing an image that is not already included with the HPC Cluster Manager is called imagex, and even though it is a free download, it is not easy to find.  It may be available elsewhere, but here is what I had to do:

    1.       Download the AIK iso from the link above.

    2.       Burn a DVD, then install the AIK (it doesn't matter what machine you install it on, you are only doing this to extract the imagex.exe file)

    3.       Navigate to C:\Program Files\Windows AIK\Tools\amd64\

    4.       Copy imagex.exe to the C:\Program Files\Microsoft HPC Pack\Data\InstallShare\ directory on the headnode.   

    Customize the "master" compute node

    After deploying at least one compute node with the Windows HPC Server 2008 image, install all of the applications, patches, drivers, etc. which are necessary for your environment onto one of the compute nodes. 

    Note: Be sure that the node was installed from Volume License media and not from retail media.

    Run sysprep on the "master" compute node

    First run the re-arm command which will reset the grace period for activation for the node by opening a command prompt and typing:

    c:\windows\system32\slmgr.vbs -rearm

    Then navigate to the C:\Windows\system32\sysprep folder, then right click on sysprep and select “Run as Administrator”.  In the window that pops up, select Enter System Out-of-Box Experience (OOBE) for the System Cleanup Action field, and be sure to check the Generalize box.  (The first time I ran through this process, I did not have that box checked and the resulting image would not install properly).  Also, change the Shutdown Options to Shutdown rather than Reboot, because you don’t want it to boot again until after you’ve captured the image.

    Note: This operation will leave the master compute node in a somewhat crippled state.  Before it can re-join the cluster, it must either be re-installed along with the rest of the compute nodes, or go through a manual process of adding back the information which was removed by the sysprep operation.

    Create the Image Capture Node Template

    Fortunately, booting to WinPE is built in to the process when you install a compute node from bare metal.  So, the only thing that needs to be added is the imagex command to capture the sysprep’ed image. This is done by creating a new Node Template.  To create a Node Template, first select Configuration from the main sections on the left side of the HPC Cluster Manager.  Then select Node Templates from the list above.

    You can import this sample by copying the xml below to a file named ImageCapture.xml.

    <?xml version="1.0" encoding="utf-8"?>

    <Template xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"

      Description="">

      <Item

        Name="UnicastCopy">

        <Parameter

          Name="Source"

          Value="imagex.exe" />

        <Parameter

          Name="Destination"

          Value="x:\imagex.exe" />

        <Parameter

          Name="Description"

          Value="Copies the imagex program to the temp drive on the compute node." />

      </Item>

      <Item

        Name="ExecuteCommand">

        <Parameter

          Name="Command"

          Value="x:\imagex.exe /capture c: z:\images\Goldenimage.WIM &quot;Golden Image&quot;" />

        <ParameterList

          Name="ErrorWhiteList" />

        <Parameter

          Name="Description"

          Value="Runs the capture function of Imagex." />

      </Item>

      </Template>

     

    Then, select Import on the Node Template Actions menu, navigate to the file you just created, then select it.
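
Copying XML out of a web page can easily damage it, so it may be worth checking the file before importing it. The following is a minimal Python sketch, assuming the template has the Template/Item/Parameter layout shown above; the function name list_template_steps is hypothetical and this is not an HPC Pack tool. It parses the XML and returns the steps in order, so a missing step or a mangled Command value shows up before the import.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


def list_template_steps(template_xml: str) -> List[Tuple[str, Dict[str, str]]]:
    """Return (item name, {parameter name: value}) for each step, in order."""
    root = ET.fromstring(template_xml)  # raises ParseError if not well-formed
    if root.tag != "Template":
        raise ValueError("root element is not <Template>")
    steps = []
    for item in root.findall("Item"):
        params = {
            param.get("Name"): param.get("Value")
            for param in item.findall("Parameter")
        }
        steps.append((item.get("Name"), params))
    return steps
```

For the image capture template above, this would list a UnicastCopy step followed by an ExecuteCommand step whose Command parameter is the imagex /capture line. To check a saved file, read its contents as bytes and pass them in, or use ET.parse on the file path.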

    Assign the Image Capture Node Template to the "master" compute node

     

    Once the Image Capture Node Template has been created, the process of capturing the image consists of simply booting up the master compute node and assigning the new template to it.

    First, make sure the node is set to PXE boot, and then boot it up.  Next, assign the image capture template to the node.  From the Node Management window, you can right click on the node, then select Assign Node Template …then choose the Image Capture template. You can monitor the process by selecting the node and clicking on the Provisioning Log tab.

     

    When this process is completed, it will have created an image named Goldenimage.WIM and copied it to the C:\Program Files\Microsoft HPC Pack\Data\InstallShare\images\ directory on the headnode. 

     

    As noted above, the master compute node will be left hanging and will need to be manually powered down and then either re-installed with the rest of the compute nodes or be manually reconfigured to restore the settings which were removed in the sysprep process.

     

    Add the new image

    This newly captured image will need to be added to the collection of images which are available for inclusion in the Node Templates.  To do this, first select Configuration from the main sections on the left side of the HPC Cluster Manager.

    Then Select Images from the list above.

    Then select Add Image from the Actions pane.

    Select Load an existing Operating System Image then click on the Browse button and navigate to the C:\Program Files\Microsoft HPC Pack\Data\InstallShare\images\ directory, select Goldenimage.WIM, then click Open, then OK.

    Create a new Node Template for deploying the master image

    Many of the steps which are included in the Default Node Template such as installing the .NET Framework and the HPC Pack are not necessary for deploying the master image, so we must create a new Node Template.  Copy the xml below to a file called “Deploy Captured Image.xml” then follow the steps outlined above to import this xml file to a Node Template.

    <?xml version="1.0" encoding="utf-8"?>

    <Template xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"

      Description="">

      <Item

        Name="CreateADAccount">

        <Parameter

          Name="Description"

          Value="Creates a computer account in Active Directory for the compute node." />

      </Item>

      <Item

        Name="UnicastCopy">

        <Parameter

          Name="Source"

          Value="config\diskpart.txt" />

        <Parameter

          Name="Destination"

          Value="x:\diskpart.txt" />

        <Parameter

          Name="Description"

          Value="Copies a file or folder from the head node using SMB protocol." />

      </Item>

      <Item

        Name="PartitionDisk">

        <Parameter

          Name="DiskPartScript"

          Value="x:\diskpart.txt" />

        <Parameter

          Name="Description"

          Value="Partitions the disks on the compute node using a script for Diskpart." />

      </Item>

      <Item

        Name="UnicastCopy">

        <Parameter

          Name="Source"

          Value="Images\Goldenimage.WIM" />

        <Parameter

          Name="Destination"

          Value="%INSTALLDRIVE%\Goldenimage.WIM" />

        <Parameter

          Name="Description"

          Value="Copies a file or folder from the head node using SMB protocol." />

      </Item>

      <Item

        Name="ExtractWim">

        <Parameter

          Name="WimPath"

          Value="%INSTALLDRIVE%\Goldenimage.WIM" />

        <Parameter

          Name="DestinationPath"

          Value="%INSTALLDRIVE%" />

        <Parameter

          Name="Description"

          Value="Extracts the files in a WIM file to a local disk on the compute node." />

      </Item>

      <Item

        Name="ExecuteCommand">

        <Parameter

          Name="Command"

          Value="del /F %INSTALLDRIVE%\Goldenimage.WIM" />

        <ParameterList

          Name="ErrorWhiteList" />

        <Parameter

          Name="ContinueOnFailure"

          Value="True" />

        <Parameter

          Name="Description"

          Value="Cleaning up WIM file" />

      </Item>

      <Item

        Name="WindowsSetup">

        <Parameter

          Name="Image"

          Value="Goldenimage.WIM\Golden Image" />

        <Parameter

          Name="ProductKey"

          Value="YQGMW-MPWTJ-34KDK-48M3W-X4Q6V" />

        <Parameter

          Name="AutogeneratedPassword"

          Value="False" />

        <Parameter

          Name="InstallDrive"

          Value="C" />

        <Parameter

          Name="Description"

          Value="Installs the Windows Server operating system on the compute node." />

      </Item>

      <Item

        Name="JoinDomain">

        <Parameter

          Name="Description"

          Value="Joins the compute node to an Active Directory domain." />

      </Item>

      <Item

        Name="Reboot">

        <Parameter

          Name="Description"

          Value="Restarts the compute node." />

      </Item>

      <Item

        Name="ActivateOsItem">

        <Parameter

          Name="Description"

          Value="Activates the operating system on the compute node." />

      </Item>

    </Template>

     

    Deploy new nodes using the master image

    These samples were generated using the Evaluation version of Windows HPC Server, so before you deploy your image, you will want to edit the template you just imported to change the Product Key and the Local Administrator password.

    Go through the steps of the wizard to add new compute nodes from bare metal using the Deploy Captured Image Template you just created.  When finished, all of the new nodes should be configured with all of the customizations which were performed on the master compute node.

    Accessing an NFS Server from Windows HPC compute nodes

    Because most HPC environments already have enterprise data stored on an NFS server, we often get asked how to enable the Windows HPC compute nodes to access the NFS server.  In Windows Server 2008, the NFS Client is included as part of the Services for Network File System (NFS), which is a Role Service under the File Server role.  There is a complete step-by-step guide available at this link:

    http://technet.microsoft.com/en-us/library/cc753302.aspx 

    Included on that page is a description of how to install Services for NFS using the Server Manager GUI.  But in an HPC cluster, it is not practical to log in to each compute node and run through those steps.  Fortunately, there is a command line method for performing the same function.  To install the Services for NFS, the following command should be run from the command line:

     

    servermanagercmd.exe -install FS-NFS-Services

     

    Once the Services for NFS have been installed, use the mount command to map the NFS fileshare to a logical drive letter.  Be aware that there is no automount option for the mount command, so to make this connection survive a reboot, you must execute this command first:

     

    net use /persistent:yes

     

    Then, you can execute the mount command using this syntax:

     

    mount -u:<UserName> -p:<Password> <ComputerName>:/<ShareName> <Drive Letter>

     

    The complete description of all of the options for the mount command is available at this link:

    http://technet.microsoft.com/en-us/library/cc733084.aspx

     

    All of these commands may be run from the head node using the clusrun command.  Or, to have everything set up automatically as part of the provisioning process for new compute nodes, these commands may be added to the Node Template. To do this, open the Node Template in the Editor, select Add Task, then Maintenance, then Post Install Command.  Then add the servermanagercmd command above to the row at the bottom of the properties section. Repeat these steps to add the other two commands to the Node Template, keeping them in the same order: servermanagercmd first, then net use, then mount.
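As a rough illustration of the clusrun route, the Python sketch below only builds the command lines; it does not run anything. The three per-node commands are the ones quoted in this post. The helper names, the sample server, share, and node names, and the use of clusrun's /nodes: option with a comma-separated node list are assumptions made for the example, so check them against your own cluster before using the output.

```python
def build_nfs_setup_commands(nfs_server, share, drive, user, password):
    """Return the three per-node commands from the post, in the order they must run."""
    return [
        # 1. Install Services for NFS (which includes the NFS client).
        "servermanagercmd.exe -install FS-NFS-Services",
        # 2. Make drive mappings persistent so the mount survives a reboot.
        "net use /persistent:yes",
        # 3. Map the NFS share to a drive letter.
        f"mount -u:{user} -p:{password} {nfs_server}:/{share} {drive}",
    ]


def wrap_with_clusrun(commands, nodes):
    """Prefix each command with clusrun so it can be launched from the head node.

    ASSUMPTION: clusrun is given the target nodes as /nodes:<comma-separated list>.
    """
    node_list = ",".join(nodes)
    return [f"clusrun /nodes:{node_list} {command}" for command in commands]
```

For example, wrap_with_clusrun(build_nfs_setup_commands("nfs01", "data", "Z:", "hpcuser", "<Password>"), ["NODE-07", "NODE-08"]) returns three clusrun lines that can be pasted into a command prompt on the head node. The unwrapped commands are the ones that would go into the Post Install Command tasks of the Node Template.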

    Windows HPC WCF/SOA tracing

    I was recently looking into a failure in a WCF client application that ran just fine with the debugger attached but failed when run standalone.

    The error message was as follows: 

    Microsoft.Hpc.ServiceBroker Warning: 0 : Service net.tcp://private.TestHeadNode:9088/346/1728/_defaultEndpoint failed. Error:The server did not provide a meaningful reply; this might be caused by a contract mismatch, a premature session shutdown or an internal server error.

        DateTime=2009-03-10T14:44:26.3899654Z

    Microsoft.Hpc.ServiceBroker Warning: 10000 : Request urn:uuid:729a078e-7c79-479d-af75-976a2a2b7219 from user Anonymous user has been given up because Request has failed more than 3 times, broker will not deliver it again

        DateTime=2009-03-10T14:44:28.2337272Z

     

    Following the steps in my previous blog post, I obtained the trace logs and analyzed them using the Service Trace Viewer Tool. It was quite evident from the traces that the message was larger than the service hosts on the compute nodes were able to handle, which led to the premature session shutdown.

     

    Following the directions on handling large messages here: http://msdn.microsoft.com/en-us/library/cc907051(VS.85).aspx#message_size solved the problem, and the application now runs as expected (with or without the debugger attached).

     

     

    Troubleshooting Windows HPC WCF/SOA Issues

    Windows HPC uses HPC sessions to support the service-oriented architecture (SOA) programming model based on Windows Communication Foundation (WCF). Troubleshooting errors from these SOA-based applications can be challenging. However, the tip I'm about to share should help you figure out exactly where the issue is coming from. Looking through the trace of the communication between the service hosts (running on the compute nodes) and the broker is often the key to identifying where the problem lies.

    You can use the Windows Communication Foundation (WCF) Service Trace Viewer Tool to analyze messages logged by WCF. Service Trace Viewer is included in the Microsoft Windows Software Development Kit (SDK) for Windows Vista and .NET Framework Runtime Components. You can download the Windows SDK from the Microsoft Download Center at http://go.microsoft.com/fwlink/?LinkID=75636. For more information about using this tool, see "Service Trace Viewer Tool (SvcTraceViewer.exe)" at http://go.microsoft.com/fwlink/?LinkId=88991.

    The following are the instructions to enable tracing.

    1.       Modify the system.diagnostics section of HpcServiceHost.exe.config in %CCP_HOME%\bin\ (*on each compute node*) as follows:

     

    <system.diagnostics>

        <sources>

          <source name="Microsoft.Hpc.HpcServiceHosting" switchValue="All">

            <listeners>

              <add name="Console" />

              <add name="ServiceHostTraceListener" />

            </listeners>

          </source>

        </sources>

        <sharedListeners>

          <add initializeData="\\<HEADNODE>\CcpSpoolDir\host.svclog" type="System.Diagnostics.XmlWriterTraceListener"

            name="ServiceHostTraceListener">

            <filter type="" />

          </add>

          <add type="System.Diagnostics.ConsoleTraceListener, System, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"

            name="Console" traceOutputOptions="DateTime, ThreadId">

            <filter type="" />

          </add>

        </sharedListeners>

        <trace autoflush="true" />

    </system.diagnostics>

     

    Then modify the system.diagnostics section of HpcWcfBroker.exe.config in %CCP_HOME%\bin (*on all broker nodes*) as follows:

     

      <system.diagnostics>

        <sources>

          <source name="Microsoft.Hpc.ServiceBroker" switchValue="All">

            <listeners>

              <add name="Console">

                <filter type="" />

              </add>

              <add name="WSLBTraceListener">

                <filter type="" />

              </add>

              <remove name ="Default" />

            </listeners>

          </source>

        </sources>

        <sharedListeners>

          <add type="System.Diagnostics.ConsoleTraceListener, System, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"

            name="Console" traceOutputOptions="DateTime, ThreadId">

            <filter type="" />

          </add>

          <add initializeData="\\<HEADNODE>\CcpSpoolDir\broker.svclog"

            type="System.Diagnostics.XmlWriterTraceListener, System, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"

            name="WSLBTraceListener" traceOutputOptions="Timestamp">

            <filter type="" />

          </add>

        </sharedListeners>

        <trace autoflush="true" />

      </system.diagnostics>

     

    2.       Replace <HEADNODE> in both files with the name of your head node.

    3.       Run your application until you see the errors.

    4.       All svclog files will be under \\<HEADNODE>\CcpSpoolDir\.
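Step 2 is easy to get wrong when there are many compute nodes to edit by hand. The Python sketch below shows one way to do the substitution on the text of a config file and to list where the logs should then appear. The function names and the sample head node name are made up for illustration. Reading each file from %CCP_HOME%\bin and writing it back is left out and would need to be added for your environment.

```python
PLACEHOLDER = "<HEADNODE>"


def set_head_node(config_text, head_node):
    """Return config_text with every <HEADNODE> placeholder replaced by head_node."""
    if PLACEHOLDER not in config_text:
        # Either the file was already edited or it is not one of the two
        # config files from this post; fail rather than guess.
        raise ValueError("no <HEADNODE> placeholder found")
    return config_text.replace(PLACEHOLDER, head_node)


def trace_log_paths(head_node):
    """Return the UNC paths of the two trace logs configured in this post."""
    share = rf"\\{head_node}\CcpSpoolDir"
    return [rf"{share}\host.svclog", rf"{share}\broker.svclog"]
```

With a hypothetical head node named HN01, trace_log_paths("HN01") gives the two files to open in the Service Trace Viewer once the application has reproduced the error.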

     

     

    Helpful tip for dealing with HPC Pack setup failures

    I once ran into an issue while installing the HPC Pack on a machine, and I wanted more information about the setup failure, preferably a log of some sort that would show the reason for it. Here is a tip that helped.

    The HPC Pack setup writes two logs to the %temp%\HPCSetupLogs directory of the user who ran the installer: one for the MSI-related installs and one for the setup program itself. If you run into errors while installing, more information will be in the setup-<date>-<time>.txt log.

    Simply navigate to your %temp% folder and look for the HPCSetupLogs directory and you can see a logged output of the setup.
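If you would rather not browse for the file, the lookup can be scripted. The Python sketch below finds the setup-*.txt logs under HPCSetupLogs, newest first, and pulls out lines that look like failures. The function names, the keyword list ("error", "fail"), and the use of the default text encoding are assumptions for the example; the post does not describe the format of the log, so adjust the search terms to what you actually see in yours.

```python
import glob
import os


def find_setup_logs(temp_dir=None):
    """Return the setup-*.txt logs under <temp_dir>\\HPCSetupLogs, newest first."""
    if temp_dir is None:
        # %temp% of the account that ran the installer.
        temp_dir = os.environ.get("TEMP", "")
    pattern = os.path.join(temp_dir, "HPCSetupLogs", "setup-*.txt")
    return sorted(glob.glob(pattern), key=os.path.getmtime, reverse=True)


def lines_mentioning(path, keywords=("error", "fail")):
    """Return (line number, text) pairs for lines containing any keyword.

    ASSUMPTION: the keywords are only a guess at what a failure looks like.
    """
    hits = []
    with open(path, "r", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if any(word in line.lower() for word in keywords):
                hits.append((number, line.rstrip()))
    return hits
```

Calling find_setup_logs() with no argument uses the TEMP environment variable, so run it as the same user who ran setup.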

    Hope someone finds this helpful someday.

     

     

    Authentication Failure error on Windows HPC caused by port conflict

    I recently ran into an issue where every attempt to run a cluster command from the command line resulted in an authentication failure. We were able to connect to the cluster from the GUI and PowerShell; however, all attempts to connect from the command line failed with the same simple error: Authentication failure.

    I was pretty stumped by this, as I could not understand why the failure occurred only on the command line and not via the GUI or PowerShell. The TechNet article at http://technet.microsoft.com/en-us/library/cc719008.aspx#BKMK_Firewall shows that specific ports are used for communication between the cluster services on the head node and the compute nodes. For example, the command line tools use port 5800 to communicate with the HPC Job Scheduler Service on the head node, and port 5969 is used by the client tools on the enterprise network to connect to the HPC Job Scheduler Service on the head node. If you're having trouble communicating with the Job Scheduler services on the head node, it is always a good idea to investigate which process is listening on which port. A useful tool for this is netstat.exe.

    Running netstat -ano displays all connections and listening ports, shows addresses and port numbers in numerical form, and lists the PID of the process that owns each connection. Compare this with the output from tasklist.exe, which lists each process name next to its PID, and you can work out which process is listening on which port.
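The comparison can also be done with a few lines of script. The Python sketch below takes the text printed by netstat -ano and returns the PIDs that are listening on a given TCP port. The function name is made up for illustration, and the parsing assumes the usual five-column layout (Proto, Local Address, Foreign Address, State, PID) with an English-language LISTENING state, so treat it as a starting point rather than a robust parser.

```python
def listeners_on_port(netstat_output, port):
    """Return the sorted PIDs that are LISTENING on the given TCP port.

    ASSUMPTION: each TCP row has five whitespace-separated fields:
    Proto, Local Address, Foreign Address, State, PID.
    """
    pids = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines, UDP rows (no State column), and
        # TCP rows that are not in the LISTENING state.
        if len(fields) != 5 or fields[0] != "TCP" or fields[3] != "LISTENING":
            continue
        # The local address looks like 0.0.0.0:5800 or [::]:5800;
        # the port is whatever follows the last colon.
        local_port = fields[1].rsplit(":", 1)[-1]
        if local_port == str(port):
            pids.add(int(fields[4]))
    return sorted(pids)
```

Save the netstat -ano output to a file, pass its contents to listeners_on_port with port 5800, and then look up the returned PID in the tasklist.exe output to get the name of the process.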

    Doing this in my scenario revealed that a different application (VNC Server) was listening on port 5800 and as a result, the command line interface was unable to connect to the scheduler service on that port. The solution to this was simply to reconfigure the VNC application to listen on a different port and then restart the HPC Job Scheduler service.

    After this, the command line interface to the job scheduler worked just as expected.

    I hope someone out there in Windows HPC land finds this post helpful someday.

     
