We get an awful lot of questions about how to go about getting the desired process placement across nodes in an MPI job (just check our forums if you don’t believe me), so I thought I’d post here to shed some light on the things that are possible.
The first thing you need to do is express the number of resources you’d like allocated for your job. This can be done at the node, socket, or core level. Requesting 4-8 cores will assign your job as few as 4 and as many as 8 cores; requesting 4-8 nodes will get you as few as 4 and as many as 8 hosts, and so forth. For more details on how this sort of request works, check out my previous post, How that Node/Socket/Core thing Works.
You may be wondering why it’s necessary to specify both a Minimum and a Maximum for your job. The reason is that this helps the Job Scheduler decide when and where to run your job. The scheduler will start your job as soon as at least the minimum number of resources is free; if more than the minimum are available at that time, it will give you up to your maximum. Thus, setting a small Minimum will allow your job to start sooner, while setting a large Minimum will cause your job to wait in the queue until more resources are available. There’s no single best setting for everyone: you should decide on the Minimum and Maximum that work best for you! But the general guidelines are:
· The smaller your Minimum, the sooner your job will run, so pick as small a Minimum as you’d reasonably accept
· Set a Maximum as large as your job can actually take advantage of: pick the largest size at which your application still scales with reasonable performance gains
· You can always set the Minimum and Maximum to the same value to request a fixed number of nodes
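To make the allocation rule concrete, here’s a toy Python sketch (not the real Job Scheduler, just a model of the min/max behavior described above): the job waits until at least the minimum is free, then is granted up to its maximum of whatever is free.

```python
# Toy model of the Minimum/Maximum allocation rule (NOT the actual
# HPC Job Scheduler): a job starts once at least `minimum` resources
# are free, and is granted up to `maximum` of what is free right then.
def allocate(free: int, minimum: int, maximum: int):
    """Return the number of resources granted, or None if the job waits."""
    if free < minimum:
        return None            # not enough free resources yet; stay queued
    return min(free, maximum)  # take as much as possible, capped at maximum

print(allocate(3, 4, 8))   # None -> job keeps waiting
print(allocate(6, 4, 8))   # 6   -> starts with more than the minimum
print(allocate(12, 4, 8))  # 8   -> never exceeds the maximum
```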
This capability, and all of the MPI process placement features described below, are designed to allow you to specify how you want your job to run without needing to know ahead of time how many or which nodes your MPI application will end up running on.
Once you’ve figured out how many resources you want for your job, the next step is to figure out how you want your MPI ranks started across these nodes. By default, ranks will be started on a “1-per resource requested” basis, so requesting 4 sockets will result in 4 MPI ranks, 1 per socket. For example:
C:\>job submit /numsockets:9 mpiexec MyApp.exe
9 ranks of MyApp.exe will be started across an unknown number of nodes, with no more than 1 rank per socket started on any node.
C:\>job submit /numnodes:2-4 mpiexec MyApp.exe
2-4 ranks of MyApp.exe will be started across 2-4 physical nodes (depending on how many are available), with 1 rank per node.
Table 1: Submitting an MPI Job with Default Process Placement
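The default “1 rank per resource requested” behavior can be sketched in a few lines of Python (a toy illustration, not mpiexec’s actual implementation; the node names are made up):

```python
# Toy sketch of default placement: one rank per resource unit granted.
# `units_per_node` maps each allocated node to its granted units
# (e.g. sockets per node when you request /numsockets).
def default_placement(units_per_node: dict[str, int]):
    placement, rank = [], 0
    for node, units in units_per_node.items():
        for _ in range(units):
            placement.append((rank, node))
            rank += 1
    return placement

# /numsockets:4 satisfied by two 2-socket nodes -> 4 ranks, 2 per node
print(default_placement({"NODE1": 2, "NODE2": 2}))
```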
We provide a new mpiexec option, -cores (or -c), which allows you to specify the number of ranks to start on each node assigned to your job. This is especially useful with node-level scheduling, allowing you to control the size and placement of your job with laser-like precision! Adding some of the other node selection options (like /corespernode) makes this even more powerful. For example:
C:\>job submit /numnodes:4-4 mpiexec -cores 2 MyApp.exe
MyApp.exe will be started across 4 nodes, with 2 ranks per node (for a total of 8 ranks).
C:\>job submit /numnodes:1-8 mpiexec -cores 3 MyApp.exe
Between 3 and 24 ranks of MyApp.exe will be started, with 3 ranks per node spanning up to 8 nodes.
C:\>job submit /numnodes:8 /corespernode:8 mpiexec -cores 7 MyApp.exe
MyApp.exe will start on 8 nodes. All 8 nodes must have at least 8 cores on them, and 7 ranks of MyApp.exe will be started on each of the nodes (for a total of 56 ranks).
Table 2: Submitting an MPI Job and Specifying the Number of Cores per Node
Note: The /corespernode option refers to the minimum number of cores which must be present on any node assigned to the job, not the number of cores to allocate on a node.
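The interaction between /corespernode (an eligibility filter) and -cores (a per-node rank count) can be modeled with a short Python sketch (hypothetical node names and core counts, for illustration only):

```python
# Toy sketch: /corespernode filters which nodes are eligible for the job;
# -cores then fixes how many ranks start on each node actually assigned.
def plan(nodes: dict[str, int], cores_per_node_min: int, ranks_per_node: int):
    """Return {node: ranks} for nodes meeting the /corespernode minimum."""
    eligible = [n for n, cores in nodes.items() if cores >= cores_per_node_min]
    return {n: ranks_per_node for n in eligible}

# Node "C" has only 4 cores, so /corespernode:8 excludes it;
# the remaining nodes each get 7 ranks (as with "-cores 7").
print(plan({"A": 8, "B": 16, "C": 4}, 8, 7))  # {'A': 7, 'B': 7}
```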
You can use the -n argument to mpiexec to set the total number of ranks to start across the entire run, allowing even finer-grained control. For example:
C:\>job submit /numcores:8 mpiexec -n 16 MyApp.exe
16 ranks of MyApp.exe will be started, 2 to a core over 8 cores.
C:\>job submit /numnodes:4 mpiexec -n 8 MyApp.exe
8 ranks of MyApp.exe will be started across 4 nodes.
Table 3: Using the -n option to mpiexec
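The post later mentions that mpiexec distributes ranks in round-robin order across the allocated hosts; a toy Python sketch of that distribution for a fixed “-n total” (illustrative only, with made-up node names):

```python
# Toy sketch of round-robin placement of `-n total` ranks over the
# nodes allocated to the job, cycling through hosts in order.
def round_robin(nodes: list[str], total: int):
    return [(rank, nodes[rank % len(nodes)]) for rank in range(total)]

# -n 8 over 4 allocated nodes -> 2 ranks per node
print(round_robin(["N1", "N2", "N3", "N4"], 8))
```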
Setting affinity can result in huge performance improvements for MPI applications, and we’ve made it way easier for you to take advantage of it! How easy? Just pass the -affinity flag to mpiexec, and each rank of your MPI application will be locked to a single core (which can dramatically improve performance for certain applications). For example:
C:\>job submit /numnodes:2-4 mpiexec -cores 2 -affinity MyApp.exe
MyApp.exe will be started with 2 ranks on each of 2-4 nodes, for a total of 4-8 ranks. Each rank will be affinitized to one of the cores on its assigned node, so the two ranks sharing a node can’t step on each other’s toes. If the nodes in question have a NUMA architecture, the ranks on each node will automatically be placed on separate NUMA nodes.
C:\>job submit /numsockets:8 mpiexec -affinity MyApp.exe
8 ranks of MyApp.exe will be started across an unknown number of nodes, with each rank affinitized to its own socket, giving it a dedicated path to memory that no other job can use.
Table 4: Submitting an MPI Job with Affinity
Note: Mpiexec will automatically attempt to ensure that ranks are spaced as “far apart” as possible, e.g. on different sockets in a NUMA system.
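One simple way to model that “far apart” spacing is a greedy least-loaded assignment, sketched below in Python (a toy illustration of the idea, not mpiexec’s actual algorithm):

```python
# Toy sketch of spreading ranks "far apart": each rank goes to the
# socket currently holding the fewest ranks (ties broken by socket id).
def spread(ranks: int, sockets: int):
    load = [0] * sockets
    assignment = []
    for _ in range(ranks):
        s = min(range(sockets), key=lambda i: load[i])
        assignment.append(s)
        load[s] += 1
    return assignment

# 2 ranks on a 2-socket node land on different sockets
print(spread(2, 2))  # [0, 1]
print(spread(4, 2))  # [0, 1, 0, 1]
```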
You can run a very simple test to see how your placement worked: run mpiexec -l hostname.exe (plus any other arguments you need), and the output will list each MPI rank alongside the node it ran on. This lets you see the number of ranks started on each node, as well as the round-robin order that mpiexec uses.
That’s the scoop on process placement with Windows HPC Server 2008. Go try it out! And if you encounter any problems, please post up on our forums and we’ll be happy to help you out.