**MSMPI Tracing**

Windows HPC Server 2008 (formerly known as Compute Cluster Server v2) includes all-new MPI tracing built on Event Tracing for Windows (ETW). The new MPI tracing is always available, and with the flick of a switch you can turn tracing on or off without recompiling or relinking your MPI application. The ETW mechanism is pervasive throughout Windows components, so you can log system events alongside MPI tracing; moreover, an application can log its own ETW events into the same event file. The trace log files are stored locally on each compute node, so some form of clock synchronization is required to get a coherent view of these events.

While implementing MSMPI trace clock sync I needed a parallel Dijkstra shortest path algorithm. Searching the net, I found two main flavors. The first finds the shortest path from every node to every other node by running a serial single-source Dijkstra algorithm for each vertex in parallel. This is an embarrassingly parallel algorithm in which the processes do not need to communicate data. The second flavor parallelizes a single-source Dijkstra algorithm, but uses a complex graph partitioning scheme that is too rich for my needs.

To read more about the Dijkstra algorithm see http://en.wikipedia.org/wiki/Dijkstra's_algorithm

I ended up putting together a parallel single source Dijkstra algorithm that is very close to the serial algorithm. With a simple twist, using MPI_Allreduce I turned the algorithm into a parallel one. The algorithm presented here is rather simple with very good performance characteristics.

**The Parallel Algorithm**

Let V be the set of vertices and E the set of edges in the graph G. Let P be the set of processes that parallelize the algorithm; |P| is thus the level of parallelism.

Divide the set of vertices V into |P| subsets, each containing about |V|/|P| vertices; we’ll call each subset V_{P}. It does not matter how the graph is partitioned; any vertex can be in any subset. It is also okay if the subsets do not all have exactly the same number of vertices.

The edges go together with their associated vertices. Since adjacent vertices can belong to different subsets, each edge is stored twice (once with each endpoint), so we end up with an average of 2|E|/|P| edges in each subset.
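As a rough illustration of the partitioning described above, here is a minimal Python sketch. The round-robin assignment, the function names, and the small example graph are all assumptions made for illustration; any other assignment of vertices to subsets would work just as well.

```python
def partition_round_robin(vertex_ids, num_procs):
    """Assign vertex ids to num_procs subsets in round-robin order (one possible scheme)."""
    subsets = [[] for _ in range(num_procs)]
    for i, vid in enumerate(vertex_ids):
        subsets[i % num_procs].append(vid)
    return subsets


def edges_per_subset(subsets, edges):
    """Count the edge entries stored in each subset.

    Every undirected edge (a, b, weight) is stored once with a and once with b,
    so the counts always add up to 2|E|.
    """
    owner = {vid: p for p, subset in enumerate(subsets) for vid in subset}
    counts = [0] * len(subsets)
    for a, b, _weight in edges:
        counts[owner[a]] += 1
        counts[owner[b]] += 1
    return counts


# Hypothetical graph: 6 vertices, 7 undirected weighted edges, |P| = 3.
EDGES = [(0, 1, 7), (0, 2, 9), (0, 5, 14), (1, 2, 10), (1, 3, 15), (2, 3, 11), (2, 5, 2)]
subsets = partition_round_robin(range(6), 3)
counts = edges_per_subset(subsets, EDGES)
print(subsets, counts, sum(counts))
```

The per-subset counts are uneven for a graph this small, but they always sum to 2|E|, which is where the 2|E|/|P| average comes from.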

**Process Data:**

Each process holds a subset of vertices V_{P}. Each vertex v in V_{P} includes the following members

v.id – the vertex unique id in V

v.edge – the edge used for the shortest path to the source

v.dist – the vertex distance to the source

v.edges – the set of this vertex's edges, including their weights and neighbor vertex ids

**Initialization:**

Each vertex is loaded into V_{P} together with its edges, which makes a total of about |V|/|P| vertices and 2|E|/|P| edges per process.

The memory requirements are then O( *(|V|+2|E|)/|P|* ). Each v.dist is initialized with the maximum possible distance, MAX_DISTANCE.

Initialization time complexity is the same as the memory requirements.
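The per-process data and its initialization might look roughly like the following Python sketch. The `Vertex` class, the `load_subset` helper, the use of infinity for MAX_DISTANCE, and the `-1` marker for "no path edge yet" are assumptions for illustration only.

```python
from dataclasses import dataclass, field

# Assumption: any value larger than every real path length would do.
MAX_DISTANCE = float("inf")


@dataclass
class Vertex:
    id: int                                     # unique id in V
    edges: dict = field(default_factory=dict)   # neighbor id -> edge weight
    dist: float = MAX_DISTANCE                  # distance to the source
    edge: int = -1                              # illustrative "not set yet" marker


def load_subset(vertex_ids, edge_list):
    """Build one process's V_P: its vertices plus every edge touching them."""
    local = {vid: Vertex(vid) for vid in vertex_ids}
    for a, b, weight in edge_list:
        if a in local:
            local[a].edges[b] = weight
        if b in local:
            local[b].edges[a] = weight
    return local


# Hypothetical example: this process owns vertices 0 and 3.
EDGES = [(0, 1, 7), (0, 2, 9), (1, 3, 15)]
local = load_subset([0, 3], EDGES)
print(local[0].edges, local[3].edges)
```

Keying the local table by vertex id keeps lookups O(1), and each process only touches the edges attached to its own vertices.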

**Pseudo Code:**

**function** *shortest_path*(source, |V|, V_{P}):

v.id := source // the first vertex is the source

v.dist := 0 // and its distance is zero

**for** ( n = |V|; n > 1; n-- ): // iterate |V|-1 times

remove_vertix( v, V_{P} ) // if vertex v exists in V_{P} remove it

**for** **each** neighbor u of v: // update all vertices distance to source

alt := v.dist + u.edges(v.id)

**if** alt < u.dist // found better route

u.dist := alt // relax u

u.edge := v.id // and remember the path

v := min(V_{P}) // find best vertex in V_{P}

*MPI_Allreduce*( {v.dist, v.id, v.edge}, *op_compare* ) // find best vertex in V

**print** “vertex ” v.id “ distance to source ” source “ is ” v.dist “ going through ” v.edge

Every process iterates |V| - 1 times, and each iteration resolves one vertex's distance to the source. The MPI_Allreduce collective call compares the candidate vertices from all processes and returns, to every process, the vertex with the shortest distance to the source.

The min(V_{P}) call returns the vertex with the minimal distance to the source or a vertex with v.dist = MAX_DISTANCE if V_{P} is empty.
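To make the loop structure concrete, here is a minimal single-process Python simulation of the pseudocode. It is a sketch under stated assumptions, not an MPI program: the |P| processes are simulated by a list of local tables, a plain `reduce` over the per-process candidates stands in for MPI_Allreduce, vertices are assigned round-robin, and the loop stops early if the remaining vertices are unreachable. All names and the example graph are illustrative.

```python
from functools import reduce

MAX_DISTANCE = float("inf")


def op_compare(a, b):
    """Return the candidate (dist, id, edge) with the smaller (dist, id)."""
    return a if (a[0], a[1]) < (b[0], b[1]) else b


def shortest_path(source, edges, num_vertices, num_procs):
    """Simulate num_procs processes cooperating on single-source Dijkstra."""
    # One local table per simulated process: id -> dist, path edge, neighbor weights.
    procs = [{} for _ in range(num_procs)]
    for vid in range(num_vertices):
        procs[vid % num_procs][vid] = {"dist": MAX_DISTANCE, "edge": None, "edges": {}}
    for a, b, w in edges:
        procs[a % num_procs][a]["edges"][b] = w
        procs[b % num_procs][b]["edges"][a] = w

    resolved = {source: (0, None)}
    v_dist, v_id = 0, source              # known to every process
    for _ in range(num_vertices - 1):
        candidates = []
        for local in procs:               # each pass is one process's local work
            local.pop(v_id, None)         # remove v if this process owns it
            for u in local.values():      # relax the local neighbors of v
                weight = u["edges"].get(v_id)
                if weight is not None and v_dist + weight < u["dist"]:
                    u["dist"] = v_dist + weight
                    u["edge"] = v_id
            best = (MAX_DISTANCE, -1, None)   # what min(V_P) yields for an empty set
            for uid, u in local.items():
                best = op_compare((u["dist"], uid, u["edge"]), best)
            candidates.append(best)
        # Stand-in for MPI_Allreduce: every process learns the global best vertex.
        v_dist, v_id, v_edge = reduce(op_compare, candidates)
        if v_dist == MAX_DISTANCE:        # assumption: stop when the rest is unreachable
            break
        resolved[v_id] = (v_dist, v_edge)
    return resolved


# Hypothetical 6-vertex example graph.
EDGES = [(0, 1, 7), (0, 2, 9), (0, 5, 14), (1, 2, 10), (1, 3, 15),
         (2, 3, 11), (2, 5, 2), (3, 4, 6), (4, 5, 9)]
result = shortest_path(0, EDGES, 6, 3)
print(result)
```

The result does not depend on the number of simulated processes, including cases where some subsets are empty, because the reduction always selects the same global minimum.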

**function** *op_compare*(a, b):

**if** a.dist < b.dist

return a

**if** a.dist = b.dist and a.id < b.id

return a

return b

The *op_compare* function chooses the vertex with the shortest distance to the source or, when several vertices share the same shortest distance, the vertex with the lowest id. Because vertex ids are unique, the candidates are totally ordered.
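The total ordering matters because a reduction may combine the candidates in any order. The following Python sketch, which uses a hypothetical `Candidate` tuple and made-up values, checks that every combination order yields the same winner.

```python
from collections import namedtuple
from functools import reduce
from itertools import permutations

# Hypothetical record mirroring the {v.dist, v.id, v.edge} triple.
Candidate = namedtuple("Candidate", ["dist", "id", "edge"])


def op_compare(a, b):
    """Prefer the shorter distance; break ties by the lower vertex id."""
    if a.dist < b.dist:
        return a
    if a.dist == b.dist and a.id < b.id:
        return a
    return b


# Made-up candidates from four processes; two tie on distance 20.
candidates = [
    Candidate(20, 4, 5),
    Candidate(20, 3, 2),
    Candidate(float("inf"), 7, -1),   # a process with nothing useful to offer
    Candidate(22, 1, 0),
]

# Reduce in every possible order and collect the distinct winners.
winners = {reduce(op_compare, order) for order in permutations(candidates)}
print(winners)
```

Only one winner appears, the candidate with distance 20 and id 3, so every process ends up with the same vertex regardless of how the reduction is scheduled.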

**The Algorithm Performance**

It’s easy to see that the algorithm time complexity is O( *|V| · (remove_vertix + update vertices + min + MPI_Allreduce)* ).

O(remove_vertix)

Accessing and removing a vertex in the set V_{P} can be implemented in O(1) using a hash table where v.id is the key. Alternatively it could be implemented as a simple array if the set V is partitioned in such a way that each V_{P} includes a set of vertices with contiguous ids. Each vertex would have an extra field v.removed; and removing it would be a matter of flipping this bit to 1.
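The array variant could be sketched as follows in Python, assuming each process owns a contiguous id range. The class and method names are hypothetical.

```python
class LocalVertices:
    """Vertices with contiguous ids [first_id, first_id + count) kept in flat lists."""

    def __init__(self, first_id, count):
        self.first_id = first_id
        self.dist = [float("inf")] * count
        self.removed = [False] * count       # the extra v.removed flag

    def owns(self, vid):
        """True if vid falls inside this process's id range."""
        return 0 <= vid - self.first_id < len(self.removed)

    def remove(self, vid):
        """O(1) removal: flip the flag if this process owns vid."""
        if not self.owns(vid):
            return False
        self.removed[vid - self.first_id] = True
        return True


# Hypothetical process owning ids 10..14.
lv = LocalVertices(first_id=10, count=5)
print(lv.remove(12), lv.remove(3), lv.removed)
```

Subtracting `first_id` turns the global id into a local index, so no hash table is needed; the trade-off is that later scans must skip entries whose flag is set.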

O(update vertices)

In the worst case each vertex in the set V_{P} is visited, which makes the best implementation O( *|V|/|P|* ), but we still need to check whether u is a neighbor of v. That check can be implemented in O(1) using a hash table or a simple array (in the latter case the memory requirements grow significantly). Thus, the overall time complexity is O( *|V|/|P|* ).

O(min)

Finding the vertex with the minimal distance to the source can be implemented in O( *|V|/|P|* ) by walking the list of vertices in V_{P}. Alternatively, the best vertex can be found while updating the vertices (if walking the entire set).
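Folding the minimum search into the update pass might look like this Python sketch. The dictionary layout, the function name, and the sample values are assumptions for illustration.

```python
MAX_DISTANCE = float("inf")


def relax_and_find_min(local, v_id, v_dist):
    """Relax the neighbors of v and track the local minimum in a single walk.

    local maps vertex id -> {"dist", "edge", "edges"}; returns (dist, id, edge).
    """
    best = (MAX_DISTANCE, -1, None)          # returned unchanged if local is empty
    for uid, u in local.items():
        weight = u["edges"].get(v_id)        # O(1) neighbor check via hash lookup
        if weight is not None and v_dist + weight < u["dist"]:
            u["dist"] = v_dist + weight
            u["edge"] = v_id
        if (u["dist"], uid) < best[:2]:      # same ordering as op_compare
            best = (u["dist"], uid, u["edge"])
    return best


# Made-up local state: vertex 1 (distance 7) was just resolved globally.
local = {
    2: {"dist": 9, "edge": 0, "edges": {0: 9, 1: 10, 3: 11}},
    3: {"dist": MAX_DISTANCE, "edge": None, "edges": {1: 15, 2: 11}},
}
best = relax_and_find_min(local, v_id=1, v_dist=7)
print(best, local[3]["dist"])
```

The asymptotic cost is unchanged, but each iteration walks V_{P} once instead of twice.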

O(MPI_Allreduce)

The time complexity for MPI_Allreduce in many implementations is O( *log|P|* ).
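One textbook scheme that reaches log|P| rounds is recursive doubling, simulated below in Python. This is only an illustration of where the logarithm can come from; it assumes |P| is a power of two and a commutative operation, and it is not a description of how MSMPI or any particular MPI library implements MPI_Allreduce.

```python
def allreduce_recursive_doubling(values, op):
    """Simulate an allreduce across len(values) ranks (power of two only).

    Returns the per-rank results and the number of exchange rounds used.
    """
    n = len(values)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("this sketch needs a power-of-two number of ranks")
    vals = list(values)
    rounds = 0
    step = 1
    while step < n:
        # In each round, rank r exchanges with rank r XOR step and both combine.
        vals = [op(vals[r], vals[r ^ step]) for r in range(n)]
        step *= 2
        rounds += 1
    return vals, rounds


# Eight simulated ranks, each contributing one made-up distance.
results, rounds = allreduce_recursive_doubling([5, 3, 8, 1, 9, 2, 7, 4], min)
print(results, rounds)
```

With eight ranks, every rank holds the global minimum after three rounds, which is log2(8).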

Thus, the overall complexity of this algorithm is O( *|V| · (1 + |V|/|P| + log|P|)* ), which is

O( *|V|^{2}/|P| + |V|log|P|* ) => O( *|V|^{2}/|P|* )

as long as |V|/|P| grows at least as fast as log|P|. This time complexity has a linear parallel speedup of |P|, which is nice to have.

In some cases (like MSMPI trace clock sync) you can choose |P| proportional to |V|; that is, k|P| = |V| for some constant k. In this case the parallel time complexity can be expressed as O( k^{2}|P| + k|P|log|P| ) => O( |P|log|P| ) or O( |V|log|V| ) which is the best time

Thanks,

.Erez