
The Windows HPC Team Blog

"Your guide to all things Windows HPC"
How to connect to an HPC Server cluster from a PC that is not in the AD

To manage, monitor, or use a cluster you don’t need to be logged on to the head node; you just have to install the HPC Pack on your Windows workstation.

Ideally your PC and the cluster will be in the same Active Directory domain, but if this is not the case, you will need to follow these steps:

1 - Install HPC Pack on your PC

2 - If the head node name cannot be resolved from your PC:

          Add the head node to your hosts file (located in c:\Windows\System32\drivers\etc\)

3 - Use an account from the cluster's AD when you talk to the head node:

         Add your AD credentials to the network password manager of your PC
         (these credentials will be used each time you “talk” to the head node):

  • For Vista :
    1. Go to "Control Panel \ User Accounts" and then select "Manage your network passwords"
    2. In the "Stored Credential Properties" window, enter the head node name and then the domain account name and password

           You can also see this tutorial: http://www.vistax64.com/tutorials/76585-stored-user-names-passwords.html

  • For XP/2003 :
    1.  If you are running Windows Server 2003, click Control Panel, click Stored User Names and Passwords, and then go to step 5. If you are running Windows XP, go to step 2.
    2. In Control Panel, click User Accounts.
    3. Click the Advanced tab.
    4. Click Manage Passwords.
    5. Click Add.
    6. Type the name of the head node (the KB article linked below, from which these steps are taken, refers to the SoftGrid Virtual Application Server here).
    7. Type the domain name and the user name in the following format:
      Domain\Username
    8. Type the password.
    9. Click OK, click Close, and then click OK.

        See: http://support.microsoft.com/kb/931172

 

Coming early November - New windowshpc.net web site!

What will change?

The http://windowshpc.net site will be migrated from its existing Community Server platform to Microsoft Office SharePoint Server 2007.

Existing users and community members will need to re-register using Microsoft Live ID.

We are planning to migrate as much of your user information to the new site as possible.  Due to privacy regulations we may not be able to move all information.  We appreciate your understanding and cooperation in this matter.

 

Why are you migrating the site?

We are going to implement new features such as RSS feeds with the new site.  Users will be able to receive new information through RSS feeds from other Microsoft premises.

Community members will be presented with information based on their registration profile.  Non-profile related content will still be searchable and available.

 

What will the new site look like?

 Stay tuned   

7 Minute Job Scheduling Thrill-Ride

Shahrokh was nice enough to film and post a video of me walking through the HPC Job Scheduler and some of its new features at SuperComputing 2008.  You can check out the video up on YouTube.

In the video I cover many of the different scheduler interfaces and policies, so it's a great primer for those unsure about what our scheduler offers.

MSMPI tracing and Parallel Dijkstra

MSMPI Tracing

Windows HPC Server 2008 (formerly known as Compute Cluster Server v2) includes all-new MPI tracing built on Event Tracing for Windows (ETW). The new MPI tracing is always available, and with the flick of a switch you can turn tracing on or off without the need to recompile or relink your MPI application. The ETW mechanism is pervasively available throughout Windows components. You can log system events alongside MPI tracing; moreover, an application can log its own ETW events into the same event file. The trace log files are stored locally on each compute node, so some form of clock synchronization is required to get a coherent view of these events.

 

While implementing MSMPI trace clock sync I needed a parallel Dijkstra shortest path algorithm. Searching the net, I found two main flavors of the parallel algorithm. The first finds the shortest path from every node to every other node by running a serial single-source Dijkstra algorithm for each vertex in parallel. This is an embarrassingly parallel algorithm where the processes do not need to communicate data. The second flavor parallelizes a single-source Dijkstra algorithm, but uses a complex graph partitioning scheme and algorithm that is too rich for my needs.

To read more about the Dijkstra algorithm see http://en.wikipedia.org/wiki/Dijkstra's_algorithm

 

I ended up putting together a parallel single source Dijkstra algorithm that is very close to the serial algorithm. With a simple twist, using MPI_Allreduce I turned the algorithm into a parallel one. The algorithm presented here is rather simple with very good performance characteristics.

The Parallel Algorithm

Let V be the set of vertices and E the set of edges in the graph G. Let P be the set of processes that parallelize the algorithm; thus |P| is the level of parallelism.

 

Divide the set of vertices V into |P| subsets, each containing about |V|/|P| vertices; we’ll call each subset VP. It does not matter how the graph is partitioned: any vertex can be in any subset. It is also okay if not all subsets have exactly the same number of vertices.

The edges go together with their associated vertices, but since adjacent vertices can belong to different subsets, each edge may be stored twice, and we end up with an average of 2|E|/|P| edges in each subset.

 

Process Data:

Each process holds a subset of vertices VP. Each vertex v in VP includes the following members

v.id                the vertex unique id in V

v.edge         the edge used for the shortest path to the source

v.dist            the vertex distance to the source

v.edges       the set of this vertex edges; including their weights and neighbor vertex id

 

Initialization:

Each vertex is loaded into VP together with its edges, which makes an average of |V|/|P| vertices and 2|E|/|P| edges per process.

The memory requirements are then O( (|V|+2|E|)/|P| ). Each v.dist is initialized with the maximum possible distance, MAX_DISTANCE.

Initialization time complexity is the same as the memory requirements.

 

Pseudo Code:

 

 

  function shortest_path(source, |V|, VP):
    v.id := source                    // the first vertex is the source
    v.dist := 0                       // and its distance is zero
    for ( n = |V|; n > 1; n-- ):      // iterate |V|-1 times
       remove_vertix( v, VP )         // if vertex v exists in VP remove it
       for each neighbor u of v:      // update all vertices distance to source
          alt := v.dist + u.edges(v.id)
          if alt < u.dist             // found better route
             u.dist := alt            // relax u
             u.edge := v.id           // and remember the path
       v = min(VP)                    // find best vertex in VP
       MPI_Allreduce( {v.dist, v.id, v.edge}, op_compare )  // find best vertex in V
       print "vertex " v.id " distance to source " source " is " v.dist " going through " v.edge

 

 

Every process iterates |V| - 1 times and with each iteration one vertex distance to the source is resolved. The MPI_Allreduce collective call compares the candidate vertex from all processes and returns the vertex with the shortest distance to the source.

The min(VP) call returns the vertex with the minimal distance to the source or a vertex with v.dist = MAX_DISTANCE if VP is empty.

 

 

  function op_compare(a, b):
    if a.dist < b.dist
       return a
    if a.dist = b.dist and a.id < b.id
       return a
    return b

 

 

The op_compare function chooses the vertex with the shortest distance to the source or, when several vertices share the same shortest distance, the vertex with the lowest id. Because the vertex ids are unique, the vertices to compare are totally ordered.
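
To make the flow of the pseudo code concrete, here is a minimal Python sketch of the same loop. It is not the MSMPI implementation: it runs in a single process, simulates the |P| processes as a list of partitions, and uses functools.reduce as a stand-in for MPI_Allreduce. The names Vertex, shortest_paths, NO_VERTEX and EXAMPLE_GRAPH are made up for this illustration, the round-robin partitioning is just one arbitrary choice, and the graph is assumed to be connected and undirected.

```python
from dataclasses import dataclass
from functools import reduce

MAX_DISTANCE = float("inf")
# Candidate reported by a "process" whose VP is empty (hypothetical sentinel).
NO_VERTEX = (MAX_DISTANCE, float("inf"), -1)


@dataclass
class Vertex:
    id: int
    edges: dict                  # neighbor id -> edge weight
    dist: float = MAX_DISTANCE   # best known distance to the source
    edge: int = -1               # previous vertex on the path (-1 = none)


def op_compare(a, b):
    """Pick the smaller candidate; candidates are (dist, id, edge) tuples."""
    return a if (a[0], a[1]) < (b[0], b[1]) else b


def shortest_paths(graph, source, num_procs):
    """Return {vertex id: (distance to source, previous vertex id)}."""
    ids = sorted(graph)
    # One VP per simulated process; any partitioning would do.
    parts = [
        {vid: Vertex(vid, dict(graph[vid])) for vid in ids[p::num_procs]}
        for p in range(num_procs)
    ]
    best = (0, source, -1)       # the first vertex is the source
    result = {source: (0, -1)}
    for _ in range(len(ids) - 1):            # iterate |V|-1 times
        v_dist, v_id, _unused = best
        candidates = []
        for vp in parts:                     # body executed by every process
            vp.pop(v_id, None)               # remove_vertix
            for u in vp.values():            # relax the local neighbors of v
                weight = u.edges.get(v_id)
                if weight is not None and v_dist + weight < u.dist:
                    u.dist, u.edge = v_dist + weight, v_id
            if vp:                           # min(VP)
                m = min(vp.values(), key=lambda u: (u.dist, u.id))
                candidates.append((m.dist, m.id, m.edge))
            else:
                candidates.append(NO_VERTEX)
        # Stand-in for MPI_Allreduce with op_compare as the reduction op.
        best = reduce(op_compare, candidates)
        result[best[1]] = (best[0], best[2])
    return result


# Small undirected example graph: {vertex id: {neighbor id: weight}}.
EXAMPLE_GRAPH = {
    1: {2: 7, 3: 9, 6: 14},
    2: {1: 7, 3: 10, 4: 15},
    3: {1: 9, 2: 10, 4: 11, 6: 2},
    4: {2: 15, 3: 11, 5: 6},
    5: {4: 6, 6: 9},
    6: {1: 14, 3: 2, 5: 9},
}
```

Calling shortest_paths(EXAMPLE_GRAPH, 1, 3) resolves one vertex per iteration, and the result does not depend on the number of simulated processes. In a real MPI program each rank would hold only its own VP and the reduce call would be replaced by the collective.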

 

The Algorithm Performance

It’s easy to see that the algorithm time complexity is O( |V| * (remove_vertix + update vertices + min + MPI_Allreduce) ).

 

O(remove_vertix)

Accessing and removing a vertex in the set VP can be implemented in O(1) using a hash table where v.id is the key. Alternatively it could be implemented as a simple array if the set V is partitioned in such a way that each VP includes a set of vertices with contiguous ids. Each vertex would have an extra field v.removed; and removing it would be a matter of flipping this bit to 1.

 

O(update vertices)

In the worst case each vertex in the set VP is visited, which makes the best implementation O( |V|/|P| ), but we still need to check whether u is a neighbor of v. That check can be implemented in O(1) using a hash table or a simple array (in the latter case the memory requirements grow significantly). Thus, the overall time complexity is O( |V|/|P| ).

 

O(min)

Finding the vertex with the minimal distance to the source can be implemented in O( |V|/|P| ) by walking the list of vertices in VP. Alternatively, the best vertex can be found while updating the vertices (if walking the entire set).

 

O(MPI_Allreduce)

The time complexity for MPI_Allreduce in many implementations is O( log|P| ).

 

Thus, the overall complexity of this algorithm is O( |V| * (1 + |V|/|P| + log|P|) ), which is

 

O( |V|^2/|P| + |V|log|P| ) => O( |V|^2/|P| )

 

Compared with the O( |V|^2 ) serial algorithm, this gives a linear parallel speedup of |P|, which is nice to have.

 

In some cases (like MSMPI trace clock sync) you can choose |P| that is proportional to |V|; that is, k|P| = |V|. In this case the parallel time complexity can be expressed as O( k^2|P| + k|P|log|P| ) => O( |P|log|P| ) or O( |V|log|V| ) which is the best time

 

Thanks,

.Erez

Creating Submission and Activation Filters
There were a lot of questions about how to use Activation and Submission Filters to help customize queue management and do things like license-aware scheduling.  That, on top of some changes made in a QFE, led us to publish an updated doc on using filters.  You can check it out here:
http://technet.microsoft.com/en-us/library/dd277833.aspx
 
It contains updated sample code and explanations.  Let us know if you think any information is missing!
Matlab Users among Actuaries?

We recently had the chance to speak with some modelers using Matlab to run some analysis for their Enterprise Risk Management program.

 

They were surprised to find out that Matlab was supported on the Windows Compute Cluster Server. Is anybody else out there running Matlab for Economic capital analysis or ERM?

 

Please let me know,

 

Weather Research and Forecast (WRF) Model Port to Windows

The Weather Research and Forecast (WRF) project is a multi-year/multi-institution collaboration to develop a next generation regional forecast model and data assimilation system for operational numerical weather prediction (NWP) and atmospheric research. Under this project – a collaboration between The National Center for Atmospheric Research (NCAR), Microsoft Corporation, Advanced Micro Devices, Inc., and The Portland Group, Inc. (PGI) – a prototype version of WRF has been developed and demonstrated running in parallel using MPI on an AMD Opteron™ dual-core processor based server running Windows Compute Cluster Server 2003 and the Microsoft Compute Cluster Pack. This preliminary report describes progress and issues encountered in porting WRF to an HPC cluster running Windows, using Microsoft Subsystem for UNIX-based Applications (SUA) and the PGI Fortran compiler.

Get the complete whitepaper @ http://archives.windowshpc.net/files/699/download.aspx

Integrating Windows HPC Server 2008 with Linux

We find many of our Windows HPC Server 2008 deployments are going into environments where there are existing Linux (and Linux HPC) solutions. It is possible to configure these two environments to achieve integrated authentication, file sharing and job submission.

 

The following technical documents provide step by step instructions on how to do a typical installation of Windows HPC and a Linux HPC distribution so as to achieve a single sign on environment, file sharing and to submit jobs from a Linux environment running Sun Grid Engine into the Windows HPC 2008 job scheduler using the HPC Basic Profile web service specification.

 

 

 

Installation of Fedora Samba for Windows AD Compatibility
    Details Page: http://www.microsoft.com/downloads/details.aspx?FamilyId=1C2C91A8-6D81-4BC2-94E9-448D68A7D06D&displaylang=en
    Direct Download: http://download.microsoft.com/download/1/6/9/16963418-6d06-4cb6-8b65-9fe3da11c583/Installation_of_Fedora-Samba_for_Windows_AD_Compatibility_Final.doc

Installation Instructions for Cluster Corp Rocks+ on the HP Proliant DL145 G2 Based Cluster
    Details Page: http://www.microsoft.com/downloads/details.aspx?FamilyId=7AE6D41D-4C86-4B34-9C62-466646915926&displaylang=en
    Direct Download: http://download.microsoft.com/download/5/b/5/5b55533c-12a6-4979-8849-f7b7a57eff61/Linux_Installation_Final.doc

Installation of Fedora 8 Linux to Access a Windows HPC Server 2008 Cluster
    Details Page: http://www.microsoft.com/downloads/details.aspx?FamilyId=25647D7B-CE6A-45D2-8472-12F4DB537951&displaylang=en
    Direct Download: http://download.microsoft.com/download/f/1/9/f190260a-a892-491b-ab2e-c884c72a9e7d/Linux_Installation_for_Windows_Cluster_Access_Final.doc

Installation Instructions for a Windows HPC Server 2008 Based Cluster on HP Proliant DL145 G2 Hardware
    Details Page: http://www.microsoft.com/downloads/details.aspx?FamilyId=1349438A-A05B-4E2B-91F8-8BF3058EB307&displaylang=en
    Direct Download: http://download.microsoft.com/download/0/1/a/01af1aba-0015-4236-a4ab-7498d2e51829/Microsoft_Installation_with_AD_Final.doc

The Windows HPC Server 2008 Cluster in a Linux Environment
    Details Page: http://www.microsoft.com/downloads/details.aspx?FamilyId=9E65676E-D34E-4671-B841-0D1DCA996A8B&displaylang=en
    Direct Download: http://download.microsoft.com/download/1/1/6/116be099-9c6b-424c-81e6-c9ce2455ae80/Windows HPC in Linux Environment_Final.doc

 

Why choose a single tool?

Last week at the Valuation Actuaries Symposium I was able to attend a session hosted by Milliman at the Embassy Suites Hotel during which the MG-ALFA modeling product and the MG-Triton Valuation product were discussed and reviewed in the context of how they contributed to their combined effectiveness.

One of the questions raised during the session focused on the possibility of using a single modeling tool for both deterministic and stochastic modeling. My interpretation of this question was: would it be possible in the future to have a single modeling tool for all purposes?

My reaction to this question was, why would we want to limit ourselves to a single tool? It might be easier to provision and train your team on one tool, but in an increasingly specialized world it seems that we would be missing out on a lot of expertise.

As an audience member I had the opportunity to add a comment to the answer. All I could say is that, as a representative of Microsoft, I felt my way to add value to this community is by working to make all the tools for actuaries available on a single, easily accessed and flexibly deployed platform.

I'd love to hear your opinions on this subject. Is it a worthy goal to get all the tools on a single compute platform?

Clusrunning with Windows HPC Server 2008

One of our most popular features in the Compute Cluster Pack was clusrun (known to you GUI users as “Remote Command Execution”), which allowed you to run a command line command across a set of cluster nodes in parallel, with their output piped back to you on the client.  Not content to rest on our laurels, we’ve made some additions to clusrun’s capabilities in Windows HPC Server 2008.  I’ll dig into some of them below.

 

But First, Clusrun Basics

At a basic level, clusrun runs a job with a task in it for each node that you specify.  This job completely bypasses the queue to start right away, and the tasks pipe their information back to the client machine.  This has a couple of requirements to work, namely:

·         All of the target machines must be nodes in the cluster (with the HPC pack installed and able to communicate with the head node), but they don’t have to be in the “Online” state

·         Your compute nodes must be able to write to a file share on the client computer; you can test this by logging into a node and attempting to connect to \\client\c$

·         Your job scheduler needs to be working

Assuming these requirements are met, you can run a clusrun command either from the command line (using the clusrun command) or from the HPC Cluster Manager (by right clicking some nodes and selecting “Run Command . . .”).  As a simple example, try running clusrun /all hostname.exe; each of the nodes in your cluster will print its name on your client:

PS> clusrun /all hostname.exe
Enter the password for 'REDMOND\jbarnard' to connect to 'JBarnardHN':
Remember this password? (Y/N)Y
-------------------------- JBARNARDCN01 returns 0 --------------------------
JBARNARDCN01
-------------------------- JBARNARDCN03 returns 0 --------------------------
JBARNARDCN03
-------------------------- JBARNARDHN returns 0 --------------------------
JBarnardHN
-------------------------- JBARNARDCN02 returns 0 --------------------------
JBARNARDCN02
-------------------------- Summary --------------------------
4 Nodes succeeded
0 Nodes failed
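
If you want to post-process this kind of output in a script, a small parser is enough. The sketch below is only an illustration: it assumes the output looks exactly like the sample above (a dashed "NODE returns CODE" header per node and a dashed "Summary" line at the end), and parse_clusrun_output is a made-up helper name, not part of the HPC Pack.

```python
import re

# Header printed before each node's output, e.g.
# "-------- JBARNARDCN01 returns 0 --------" (format assumed from the sample).
HEADER = re.compile(r"^-+\s*(\S+) returns (-?\d+)\s*-+$")
SUMMARY = re.compile(r"^-+\s*Summary\s*-+$")


def parse_clusrun_output(text):
    """Return {node name: (exit code, [output lines])} for each node section."""
    results = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if SUMMARY.match(line):
            break                       # the per-node sections are over
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            results[current] = (int(header.group(2)), [])
        elif current is not None and line:
            results[current][1].append(line)
    return results


def failed_nodes(results):
    """Names of the nodes whose command returned a non-zero exit code."""
    return sorted(name for name, (code, _) in results.items() if code != 0)
```

Lines printed before the first node header (such as the password prompt) are ignored, and failed_nodes gives you a quick list of the nodes worth a closer look.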

 

So What’s New?

There are a lot of new options for clusrun in HPCS 2008.  These include:

 

New Formatting Options: Sorted or Interleaved Output

By default, clusrun returns output as each node completes the command.  But you can override this by using either the /sorted or /interleaved flags.

/Sorted prints node output in alphabetical order, making it easier to find a specific node.  /Interleaved prints out lines of output as they come back, which is great for processing with a script or for determining just where things are going wrong.

 

Picking Your Nodes: Exclude, Job, Task

We’ve got some great new options for picking your nodes, including the ability to exclude a set of nodes with the /exclude flag.  So the command “clusrun /all /exclude:Node14 ipconfig” will return the IP configuration of every node other than Node14.

Next up are the /job and /task options, which are my personal favorites!  They allow you to run a clusrun command against all of the nodes which are (or were) assigned to a particular job or task.  For example, “clusrun /task:10.4 del /q SomeFile.txt” will delete SomeFile.txt from every node that ran task #10.4.

 

History Tracking

Clusrun jobs now live in the database just like regular jobs, making it easier to track what you’ve done and to uncover failures.  You can easily find them from the command line by running job list /jobname:"Remote command", or in the HPC Cluster Manager by selecting the “Clusrun Commands” node in the navigation pane.  Each node in the run will have a separate task (including exit code, error message, etc . . .) allowing you to more easily dig into the causes of failures.

 

Happy Clusrunning!

-Josh

MPI Process Placement with Windows HPC Server 2008

We get an awful lot of questions about how to go about getting the desired process placement across nodes in an MPI job (just check our forums if you don’t believe me), so I thought I’d post here to shed some light on the things that are possible.

 

Requesting Resources

The first thing you need to do is express the number of resources that you’d like allocated for your job.  This can be done at the node, socket, or core level.  Requesting 4-8 cores will assign your job as few as 4 and as many as 8 processors; requesting 4-8 nodes will get you as few as 4 and as many as 8 hosts, and so forth.  For more details on how this sort of request works, check out my previous post, How that Node/Socket/Core thing Works.

Why a Min and a Max?

You may be wondering why it’s necessary to specify a Min and a Max for your job.  The reason is that this enables you to help the Job Scheduler decide when and where to run your job.  The job scheduler will start your job as soon as it has at least the minimum number of processors.  If at that time there are more than the minimum number of resources available, the scheduler will give you up to your maximum number of resources. Thus, setting a small min will allow your job to start sooner, while setting a large min will cause your job to wait in the queue until more resources are available.  There is no best way for everyone: you should decide on the min and max that works best for you!  But the general guidelines are:

·         The smaller your Minimum, the sooner your job will run, so pick as small a Minimum as you’d reasonably accept

·         Set a Maximum that’s as large as your job can take advantage of to reduce its overall run time, so pick a Maximum which matches the largest your application scales with reasonable performance gains

·         You can always set the Minimum and Maximum to the same value to request a fixed number of nodes

This capability, and all of the MPI process placement features described below, are designed to allow you to specify how you want your job to run without needing to know ahead of time how many or which nodes your MPI application will end up running on.
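
The Min/Max rule described above can be written down in a few lines. This is a deliberately simplified sketch and not the scheduler's actual code: it ignores priorities, other queued jobs and everything else the real Job Scheduler considers, and the function name allocate is made up for the illustration.

```python
def allocate(available, minimum, maximum):
    """Return how many resources the job gets, or None if it keeps waiting.

    Simplified model of the rule in the text: the job starts as soon as at
    least `minimum` resources are free and is given up to `maximum` of them.
    """
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    if available < minimum:
        return None                     # not enough resources yet: stay queued
    return min(available, maximum)      # take what is free, capped at the max


# A job requesting 4-8 cores:
#   allocate(3, 4, 8)  -> waits (None)
#   allocate(6, 4, 8)  -> starts on 6 cores
#   allocate(20, 4, 8) -> starts on 8 cores
```

Setting the minimum and maximum to the same value turns this into an all-or-nothing request for a fixed size.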

 

Setting How Many Processes and Where They Will Run

Once you’ve figured out how many resources you want for your job, the next step is to figure out how you want your MPI ranks started across these nodes.  By default, ranks will be started on a “1-per resource requested” basis, so requesting 4 sockets will result in 4 MPI ranks, 1 per socket.  For example:

Command Line: C:\>job submit /numsockets:9 mpiexec MyApp.exe
Result: 9 ranks of MyApp.exe will be started across an unknown number of nodes, with no more than 1 rank per socket started on any node.

Command Line: C:\>job submit /numnodes:2-4 mpiexec MyApp.exe
Result: 2-4 ranks of MyApp.exe will be started across 2-4 physical nodes (depending on how many are available), with 1 rank per node.

Table 1: Submitting an MPI Job with Default Process Placement

 

Setting the Number of Cores per Node with -c

We provide a new mpiexec option, -cores (or -c), which allows you to specify the number of ranks to start on each node assigned to your job.  This is especially useful with node-level scheduling, allowing you to control the size and placement of your job with laser-like precision!  Adding some of the other node selection options (like corespernode) will make this even more powerful.  For example:

Command Line: C:\>job submit /numnodes:4-4 mpiexec -cores 2 MyApp.exe
Result: MyApp.exe will be started across 4 nodes, with 2 ranks per node (for a total of 8 ranks).

Command Line: C:\>job submit /numnodes:1-8 mpiexec -cores 3 MyApp.exe
Result: Between 3 and 24 ranks of MyApp.exe will be started, with 3 ranks per node spanning up to 8 nodes.

Command Line: C:\>job submit /numnodes:8 /corespernode:8 mpiexec -cores 7 MyApp.exe
Result: MyApp.exe will start on 8 nodes.  All 8 nodes must have at least 8 cores on them, and 7 ranks of MyApp.exe will be started on each of the nodes (for a total of 56 ranks).

Table 2: Submitting an MPI Job and Specifying the Number of Cores per Node

Note: The /corespernode option refers to the minimum number of cores which must be present on any node assigned to the job, not the number of cores to allocate on a node.
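
The rank counts for node-level requests are simple arithmetic: nodes allocated times ranks per node. As a quick sanity check before submitting, something like the sketch below can be used; rank_range is a made-up helper for this post and only models node-level requests (/numnodes) with an optional -cores value.

```python
def rank_range(min_nodes, max_nodes, ranks_per_node=1):
    """Return (fewest, most) MPI ranks for a /numnodes:min-max request.

    ranks_per_node models the mpiexec -cores value; without -cores, one
    rank is started per node requested.
    """
    if min_nodes < 1 or max_nodes < min_nodes or ranks_per_node < 1:
        raise ValueError("invalid node range or ranks per node")
    return (min_nodes * ranks_per_node, max_nodes * ranks_per_node)


# /numnodes:4-4 with -cores 2  -> rank_range(4, 4, 2) == (8, 8)
# /numnodes:1-8 with -cores 3  -> rank_range(1, 8, 3) == (3, 24)
```

The actual number of ranks inside that range depends on how many nodes the scheduler hands to the job when it starts.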

 

Setting the Number of Total Ranks with -n

You can use the -n argument to mpiexec to set the total number of ranks to start across the entire run, allowing even more fine-grained control.  For example:

Command Line: C:\>job submit /numcores:8 mpiexec -n 16 MyApp.exe
Result: 16 ranks of MyApp.exe will be started, 2 to a core over 8 cores.

Command Line: C:\>job submit /numnodes:4 mpiexec -n 8 MyApp.exe
Result: 8 ranks of MyApp.exe will be started across 4 nodes.

Table 3: Using the -n option to mpiexec

 

Now Set Affinity