This week, I’d like to take some time to explain how a new feature, Multi Level Resource Allocation, can help you get the most out of your applications.

 

In short, when you create a job, you can choose the granularity at which it gets scheduled: Core, Socket, or Node.  This is as simple as picking from a drop-down in the UI, but as with most choices, it deserves a bit of thought!

Figure 1: Setting the resource unit type on a job

 

The first question that pops to mind is: what exactly do Core, Node, and Socket mean?

·         Node (a.k.a. host, machine, computer) refers to an entire compute node.  Each node contains 1 or more sockets.

·         Socket (a.k.a. NUMA node) refers to a collection of cores with a direct pipe to memory.  Each socket contains 1 or more cores.  Note that this does not necessarily refer to a physical socket, but rather to the memory architecture of the machine, which will depend on your chip vendor.

·         Core (a.k.a. processor, CPU, CPU core, logical processor) refers to a single processing unit capable of performing computations.  A core is the smallest unit of allocation available in HPC Server 2008.

 

Next, let me explain how resources actually get allocated to your job.  To do so, I’ll refer to this handy diagram (labeled as Figure 2 if I’ve got my post to publish correctly).

 

Figure 2: Multi Level Resource Allocation at work

In the above example, job J1 requested allocation at the Socket level.  This may mean it has a single task that requires 3 sockets, or many tasks which each require 1 socket.  The scheduler has reserved 3 sockets for it (and since it’s running on quad-core sockets, it’s implicitly been allocated 12 cores).  Assuming it is a job with many single-socket tasks, the scheduler will start a single task per socket in the job’s allocation.

Job J2, on the other hand, requested allocation at the Node level, and has been allocated a single node (and implicitly, 16 cores).  The scheduler will thus start 1 task on each node in the job’s allocation.  No other jobs or tasks can be started on that node, so it’s quite similar to using the task Exclusive property.

Job J3 has requested Core allocation, and as shown above, it has been allocated 4 cores.  The scheduler starts 1 task per core.
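
The arithmetic behind Figure 2 can be sketched in a few lines of Python.  This is only an illustration, not the scheduler's actual code: the `NodeShape` class and the `implied_cores` function are hypothetical names, and the sketch assumes every node has 4 quad-core sockets (16 cores), as in the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeShape:
    """Hypothetical description of one node's hardware layout."""
    sockets_per_node: int
    cores_per_socket: int


def implied_cores(unit_type: str, units: int, shape: NodeShape) -> int:
    """Return how many cores are implicitly covered by `units` of `unit_type`."""
    if unit_type == "Core":
        return units
    if unit_type == "Socket":
        return units * shape.cores_per_socket
    if unit_type == "Node":
        return units * shape.sockets_per_node * shape.cores_per_socket
    raise ValueError(f"unknown unit type: {unit_type!r}")


# Assumed hardware for Figure 2: 4 sockets per node, 4 cores per socket.
figure2_node = NodeShape(sockets_per_node=4, cores_per_socket=4)

j1_cores = implied_cores("Socket", 3, figure2_node)  # J1: 3 sockets
j2_cores = implied_cores("Node", 1, figure2_node)    # J2: 1 node
j3_cores = implied_cores("Core", 4, figure2_node)    # J3: 4 cores

print(j1_cores, j2_cores, j3_cores)
```

With these assumed shapes, the three results match the core counts quoted for J1, J2, and J3 above.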

 

When should I use each level?

When to use each of these settings will depend on your application, and some experimentation is necessary.  In general, the rule is:

·         Use core allocation if your application is CPU bound; the more processors you can throw at it the better!

·         Use socket allocation if memory access is what bottlenecks your application’s performance.  Because memory bandwidth is what limits the speed of the job, running more tasks on the same memory bus won’t result in a speed-up: all of those tasks end up fighting over the same path to memory.

·         Use node allocation if some node-wide resource is what bottlenecks your application.  This is the case with applications that rely heavily on access to disk or network resources.  Running multiple tasks per node won’t result in a speed-up since all of those tasks are waiting for access to the same disk or network pipe.
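
As a rough summary, the rule of thumb above could be written down as a small lookup.  This is a sketch only: `suggest_unit_type` and the bottleneck labels are made-up names for illustration, and the result is a starting point for experimentation, not a guarantee.

```python
def suggest_unit_type(bottleneck: str) -> str:
    """Map an application's dominant bottleneck to a starting unit type.

    The bottleneck labels ("cpu", "memory", "disk", "network") are
    hypothetical; they simply mirror the three rules in the list above.
    """
    rules = {
        "cpu": "Core",       # CPU bound: use as many cores as possible
        "memory": "Socket",  # memory bound: one task per path to memory
        "disk": "Node",      # node-wide resource: one task per node
        "network": "Node",   # node-wide resource: one task per node
    }
    try:
        return rules[bottleneck.lower()]
    except KeyError:
        raise ValueError(f"unknown bottleneck: {bottleneck!r}") from None


print(suggest_unit_type("memory"))
```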

 

Some key facts:

·         The unit type set on your job also applies to all tasks in that job (i.e. you can’t have a job requesting 4 nodes with a bunch of tasks requesting 2 cores each).

·         You can still use batch scripts or your application’s own mechanisms to launch multiple threads or processes on the resources that your job is allocated.

·         By using these correctly, you can improve your cluster utilization since jobs are more likely to get only the resources they need.  See Figure 2, where job J1 and job J3 can peacefully coexist on a node (J2, being node-allocated, has its node to itself).

·         This feature is explicitly designed to work with heterogeneous systems, namely those where your compute nodes have varying hardware.  So a socket allocation job will still get a dedicated pipe to memory for each task whether you are running single-core, dual-core, or quad-core processors.  A node allocation job will get a node per task, whether those nodes have 1 core or 16.
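
To make the heterogeneous case concrete, here is a small sketch of how many tasks would start on an otherwise idle node at each unit type.  The `Node` tuple, the `task_slots` function, and the three example machines are all hypothetical; the only behavior taken from this post is one task per core, per socket, or per node.

```python
from typing import NamedTuple


class Node(NamedTuple):
    """Hypothetical hardware description of a compute node."""
    name: str
    sockets: int
    cores_per_socket: int


def task_slots(node: Node, unit_type: str) -> int:
    """Tasks that fit on an otherwise idle node, one task per unit."""
    if unit_type == "Node":
        return 1
    if unit_type == "Socket":
        return node.sockets
    if unit_type == "Core":
        return node.sockets * node.cores_per_socket
    raise ValueError(f"unknown unit type: {unit_type!r}")


# A made-up cluster mixing single-, dual-, and quad-core processors.
cluster = [
    Node("single-core-box", sockets=2, cores_per_socket=1),
    Node("dual-core-box", sockets=2, cores_per_socket=2),
    Node("quad-core-box", sockets=4, cores_per_socket=4),
]

slots = {
    unit: {n.name: task_slots(n, unit) for n in cluster}
    for unit in ("Core", "Socket", "Node")
}

for unit, per_node in slots.items():
    print(unit, per_node)
```

Note that the Socket and Node rows do not change with the number of cores per socket: each task still gets its own path to memory, or its own node, whatever the hardware.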