MSMPI Tracing

Windows HPC Server 2008 (formerly known as Computer Cluster Server v2) includes all new MPI tracing featuring Windows ETW tracing. The new MPI tracing is always available and with a flick of a switch you can turn tracing on or off without the need to re-compile or re-link your MPI application. The ETW mechanism is pervasively available throughout Windows component. One can log system events alongside MPI tracing; moreover, an application can log its own ETW events into the same event file. The trace log files are stored locally on each compute node, thus some form of clock synchronization is required to get a coherent view of these events.

While implementing MSMPI trace clock sync I had a need for a parallel Dijkstra shortest path algorithm. Searching the net for such a parallel algorithm, I found two main flavors of the parallel algorithm. One finds the shortest path from every node to every other node. The parallel algorithm runs a serial single source Dijkstra algorithm for each vertex in parallel. This is an embarrassedly parallel algorithm where the processes do not need to communicate data. The second flavor parallelizes a single source Dijkstra algorithm, but uses a complex graph partitioning scheme and algorithm that is too rich for my needs.

To read more about the Dijkstra algorithm see http://en.wikipedia.org/wiki/Dijkstra's_algorithm

I ended up putting together a parallel single source Dijkstra algorithm that is very close to the serial algorithm. With a simple twist, using MPI_Allreduce I turned the algorithm into a parallel one. The algorithm presented here is rather simple with very good performance characteristics.

The Parallel Algorithm

Let V be the vertices set and E the edges set in the graph G. Let P be the process set that parallelize the algorithm and thus |P| is the parallelism level.

Divide the set of vertices V into |P| subsets, where each subset contains about |V|/|P| vertices, we’ll call the subset VP. It does not matter how the graph is partitioned, any vertex can be in any subset. It is also okay if not all subsets have exactly the same number of vertices.

The edges go together with their associated vertices, but since adjacent vertices can belong to different subsets, we end up with 2|E|/|P| average number of edges in each subset.

Process Data:

Each process holds a subset of vertices VP. Each vertex v in VP includes the following members

v.id                the vertex unique id in V

v.edge         the edge used for the shortest path to the source

v.dist            the vertex distance to the source

v.edges       the set of this vertex edges; including their weights and neighbor vertex id

Initialization:

Each vertex is loaded into VP including its edges which make a total of |V|/|P| vertices and 2|E|/|P| edges per process.

The memory requirements are then O( (|V|+2|E|)/|P| ) Each v.dist is initialized with the max possible distance MAX_DISTANCE.

Initialization time complexity is the same as the memory requirements.

Pseudo Code:

1 function shortest_path(source, |V|, VP):

2   v.id := source                    // the first vertex is the source

3   v.dist := 0                       // and its distance is zero

4   for ( n = |V|; n > 1; n-- ):      // iterate |V|-1 times

5      remove_vertix( v, VP )         // if vertex v exists in VP remove it

6      for each neighbor u of v:      // update all vertices distance to source

7         alt := v.dist + u.edges(v.id)

8         if alt < u.dist             // found better route

9            u.dist := alt            // relax u

10            u.edge := v.id           // and remember the path

11      v = min(VP)                    // find best vertex in VP

12      MPI_Allreduce( {v.dist, v.id, v.edge}, op_compare )  // find best vertex in V

13      print “vertex ” v.id “ distance to source “ source “ is “ v.dist “ going through “ v.edge

Every process iterates |V| - 1 times and with each iteration one vertex distance to the source is resolved. The MPI_Allreduce collective call compares the candidate vertex from all processes and returns the vertex with the shortest distance to the source.

The min(VP) call returns the vertex with the minimal distance to the source or a vertex with v.dist = MAX_DISTANCE if VP is empty.

1 function op_compare(a, b):

2   if a.dist < b.dist

3      return a

4   if a.dist = b.dist and a.id < b.id

5      return a

6   return b

The op_compare function chooses the vertex with the shortest distance to the source or, the vertex with the lowest id when there are several vertices with the same shortest distance. Thus, the vertices to compare are totally ordered as the vertex ids are unique.

The Algorithm Performance

It’s easy to see that the algorithm time complexity is O( |V| * (remove_vertix + update vertices + min + MPI_Allreduce ).

O(remove_vertix)

Accessing and removing a vertex in the set VP can be implemented in O(1) using a hash table where v.id is the key. Alternatively it could be implemented as a simple array if the set V is partitioned in such a way that each VP includes a set of vertices with contiguous ids. Each vertex would have an extra field v.removed; and removing it would be a matter of flipping this bit to 1.

O(update vertices)

In the worse case each vertex in the set VP is visited, which makes the best implementation O( |V|/|P| ), but still we need to check whether u is a is a neighbor of v. Checking that, can be implemented in O(1) using a hash table, or simple array (in the latter case the memory requirements grow significantly). Thus, the overall time complexity is O( |V|/|P| ).

O(min)

Finding the vertex with the minimal distance to the source can be implemented in O( |V|/|P| ) by walking the list of vertices in VP. Alternatively, the best vertex can be found while updating the vertices (if walking the entire set).

O(MPI_Allreduce)

The time complexity for MPI_Allreduce in many implementations is O( log|P| ).

Thus, the overall complexity of this algorithm is O( |V| * (1 + |V|/|P| + log|P|) ) which is,

O ( |V|2/|P| + |V|log|P| ) => O( |V|2/|P| )

This time complexity has a linear parallel speed up of |P|, which is nice to have.

In some cases (like MSMPI trace clock sync) you can choose |P| that is proportional to |V|; that is k|P| = |V|. In this case the parallel time complexity can be expressed as O( k2|P| + k|P|log|P| ) => O( |P|log|P ) or O( |V|log|V| ) which is the best time

Thanks,

.Erez