## Large Graph Algorithms for Massively Multithreaded Architectures

by

Pawan Harish, Vibhav Vineet, P J Narayanan

Report No: IIIT/TR/2009/74



Centre for Visual Information Technology International Institute of Information Technology Hyderabad - 500 032, INDIA February 2009

1

# Large Graph Algorithms for Massively Multithreaded Architectures

Pawan Harish, Vibhav Vineet and P. J. Narayanan Center for Visual Information Technology International Institute of Information Technology Hyderabad, INDIA

harishpk@research.iiit.ac.in, vibhavvinet@research.iiit.ac.in, pjn@iiit.ac.in Technical Report Number IIIT/TR/2009/74

Abstract—The Graphics Processing Units (GPUs) provide high computation power at a low cost and is an important compute accelerator with a massively multithreaded architecture. In this paper, we present fast implementations of common graph operations like breadth-first search, st-connectivity, single-source shortest path, all-pairs shortest path, minimum spanning tree, and maximum flow for undirected graphs on the GPU using the CUDA programming model. Our implementations exhibit high performance, especially on large graphs. We use two data-parallel programming methodologies for these algorithms. One is an iterative, mask-based approach that processes valid data elements like vertices and edges using independent threads for each. The other is a divide-and-conquer approach that reduces the problem into smaller problems that are handled later using the same approach. Parallel algorithms for such problems have been reported in the literature before, especially on supercomputers. The massively multithreaded model of the GPU makes it possible to adopt the data-parallel approach even to irregular algorithms like graph algorithms, using O(V) or O(E) simultaneous threads. The algorithms and the underlying techniques presented in this paper are likely to be applicable to many irregular algorithms. We show the impact of our implementations on random, scalefree, and real-life graphs of up to millions of vertices on highend and low-end GPUs. The availability and spread of GPUs to desktops and laptops make them ideal candidates to accelerate graph operations over the CPU-only implementations. Practical implementations of basic operations go a long way in realizing their potential.

Index Terms—Graph Algorithms, GPU, CUDA.

#### I. INTRODUCTION

Modern Graphics Processing Units (GPUs) provide high computation power at low costs and have been described as desktop supercomputers. Several high-performance, general data processing algorithms such as sorting, matrix multiplication, etc., have been developed for the GPUs. We present a set of general graph algorithms on the GPU using the CUDA programming model. Graphs are popular data representations in many computing, engineering, and scientific areas. Fundamental graph operations such as breadth first search, stconnectivity, shortest paths, etc., are building blocks to many applications. Implementations of serial fundamental graph algorithms exist [1], [2] with computing time of the order of vertices and edges. Such implementations become impractical on very large graphs involving millions of vertices and edges, common in many domains like VLSI layout, phylogeny reconstruction, network analysis, etc. Parallel processing is essential to apply graph algorithms on large datasets. Parallel implementations of some graph algorithms on supercomputers are reported, but are accessible only to a few owing to the high hardware costs [3], [4], [5]. CPU clusters have been used for distributed implementations. Synchronization however becomes a bottleneck for them. All graph algorithms cannot scale to parallel hardware models. For example, there does not exist an efficient PRAM solution to the DFS problem. A suitable mix of parallel and serial hardware is required for efficient implementation in such cases.

The GPUs expose a general, data-parallel programming model today in the form of CUDA and CAL. The recently adopted OpenCL standard [6] provides a common computing model to all GPUs and also to other platforms like multicore, manycore, and Cell/B.E. The Compute Unified Device Architecture (CUDA) from Nvidia presents a heterogeneous programming model where the parallel hardware can be used in conjunction with the CPU. CUDA can be used to imitate a parallel random access machine (PRAM) if global memory alone is used. In conjunction with a CPU, it can be used as a bulk synchronous parallel (BSP) hardware with the CPU deciding the barrier for synchronization.

CUDA presents the GPU as a massively threaded parallel architecture, allowing up to millions of threads to run in parallel over its processors, with each having access to a common global memory. Such a tight architecture is a departure from supercomputers, which typically have a small number of powerful cores. The parallelizing approach there is that of divide-and-conquer, where individual processing nodes solve smaller sub-problems followed by a combining step. The massively multithreaded model presented by the GPU makes it possible to adopt the data-parallel approach even on irregular algorithms, using O(V) or O(E) simultaneous threads, breaking down and working at the problem at its smallest constituent.

In this paper, we present a set of general graph algorithms on the GPU, using the CUDA programming model. We adopt two data parallel approaches in this paper: the iterative mask based approach and the divide and conquer approach to solve irregular graph algorithms. Specifically, we present implementations of breadth first search (BFS), st-connectivity (STCON), single source shortest path (SSSP) and maximum flow (MF) using the iterative mask based approach. And the implementation of minimum spanning tree (MST) using the divide-and-conquer approach. We compare various approaches to solve the all pairs shortest path (APSP) problem including iterative, recursive and a matrix multiplication approach. Our implementations exhibit high performance, especially on large graphs. We show experiments on random, scale-free, and real-life graphs of up to millions of vertices. Using a single graphics card, we perform BFS in about half a second on a 10M vertex graph with 120M edges, and SSSP on it in 1.5 seconds. On the DIMACS USA graph of 24M vertices and 58M edges it takes less than 9 seconds for our implementation to compute the minimum spanning tree. We study different approaches to APSP and show a speed up by a factor of 2-4times over Katz and Kider [7]. Compared to the CPU a speed up of nearly 10 - 15 times over the Boost Graph Library is achieved for all algorithms reported in this paper.

The prevalence of GPUs on desktops and laptops today make them feasible accelerators for a wide variety of applications including common graph algorithms. Comparison of timing with the CPU implementations gives an indication of the accelerated performance one can get using low-end and high-end GPUs. Our BFS and SSSP code is already being used by different users and has been included in the Rodinia benchmark [8]. We will make all code available to whoever is interested in using them.

#### II. COMPUTE UNIFIED DEVICE ARCHITECTURE

In this section we present a small overview of the CUDA programming and hardware models. Please see [9] for more details about CUDA programming. Figure 1 depicts the CUDA programming model, mapping a software CUDA block to a hardware CUDA multiprocessor. A number of blocks can be assigned to a multiprocessor and they are time-shared internally by the CUDA programming environment. Each multiprocessors consists of a series processors which run the threads present inside a block in a time-shared fashion based on the warp size of the CUDA device. Each multiprocessor further contains a small shared memory, a set of 32-bit registers, texture, and constant memory caches common to all processors inside it. Processors in the multiprocessor executes the same instruction on different data, which makes CUDA an SIMD model. Communication between multiprocessors is through the device global memory which is accessible to all processors within a multiprocessor.

The CUDA API provides a set of library functions which can be coded as an extension of the C language. A compiler generates executable code for the CUDA device. The code executes as threads running in parallel in batches of warp size, time-shared on the CUDA processors. Each thread can use a number of private registers for its computation. Threads of each block have access to a small amount of common shared memory. Synchronization barriers are also available for all threads of a block. The available shared memory and registers are split equally amongst all blocks that timeshare a multiprocessor. An execution on a device generates a number of blocks, collectively known as a *grid* (Figure 1).

Each thread executes a single instruction set called the *kernel*. Threads and blocks are given a unique ID that can

be accessed within the thread during its execution. These can be used by a thread to perform the kernel task on its part of the data resulting in an SIMD execution. Algorithms may use multiple kernels, which share data through the global memory and synchronize their execution either at the end of each kernel or forcefully using barriers.

## III. Representation and Programming Methodology

We adopt two data parallel programming approaches in our implementations.

- The iterative mask based approach, in which a set of vertices take part in execution at each iteration. We process each vertex in the mask in parallel. Synchronization occurs after execution of all vertices at every iteration. We use this approach in implementing BFS, STCON, SSSP and Maximum Flow.
- The divide-and-conquer approach. Here we divide the problem into its simplest constituent and process each constituent in parallel while merging them recursively as we move up the hierarchy. We give one thread to each constituent and process them in parallel. This approach is used in the implementation of the Minimum Spanning Tree.

In implementing all pairs shortest paths we compare implementations using both approaches, iterative from our group and recursive from Buluc et al. [10], along-with another matrix multiplication approach. In all implementations we map the problem to a data parallel scenario. We assume there can exist a thread for each vertex/edge in the graph. This assumption is in contrast with previous supercomputing approaches, where the problem is mapped onto a fixed set of processes. A bulk synchronous parallel programming model is used in implementing all algorithms.

#### A. Graph Representation

Efficient data structures for graph representation have been studied in depth. Complex data structures like hash tables [11] have been used for efficiency on the CPU. Creating an efficient data structure on the GPU memory model, however, is a challenging problem [12], [13].

Adjacency matrix representation is not suitable for large sparse graphs because of its  $O(V^2)$  space requirements, restricting the size of graphs that can be handled by the GPU. Adjacency list is a more practical representation for large sparse graphs requiring O(V + E) space. We represent graphs using a compact adjacency list representation with each vertex pointing to its starting edge list in a packed adjacency list of edges (Figure 2). CUDA model treats memory as general arrays and can support such representation efficiently. We assume the GPU can hold entire data into memory using this representation.

Table I states the variables used for representing graph in adjacency list format. The vertex list  $V_a$  points to its starting index in the edge list  $E_a$ . Each entry in the edge list  $E_a$  points to a vertex in vertex list  $V_a$ .  $W_a$  holds the edge weight for each edge. We deal with undirected graphs resulting in each edge



Fig. 1. The CUDA hardware model (top) and programming model (bottom), showing the block to multiprocessor mapping.

| TABLE I                                                |
|--------------------------------------------------------|
| GENERAL VARIABLES USED IN GRAPH REPRESENTATION AND THE |
| CPU SKELETON CODE                                      |

| Variable  | Purpose                                     |
|-----------|---------------------------------------------|
| $V_a$     | Holds starting index of edge list in $E_a$  |
| $E_a$     | Holds vertex id of outgoing vertex          |
| $W_a$     | Holds the weight of every edge              |
| Terminate | Global variable written over by all threads |
|           | to achieve consensus using logical OR       |

having one entry for each of its end vertices. Cache efficiency is hard to achieve using this representation as the edge list can point to any vertex in  $V_a$  and can cause random jumps in memory. The problem of laying out data in memory for efficient cache usage is similar to the BFS problem itself.



Fig. 2. Graph representation is in terms of a vertex list that points to a packed edge list.

This representation is used for all algorithms reported in this paper except in all pairs shortest paths matrix multiplication method (explained in section VII). A block-divided adjacency matrix representation is used to exploit better cache efficiency there. We do not assume the entire matrix can be held in the GPU memory. We stream parts of the matrix from the CPU to GPU memory. APSP output requires  $O(V^2)$  space and thus adjacency matrix proves a more suitable representation.

## B. Algorithm Outline on CUDA

The CUDA hardware can be seen as a multicore/manycore co-processor in a bulk synchronous parallel mode when used in conjunction with the CPU. Synchronization of CUDA threads can be achieved with the CPU deciding the barrier for synchronization. Broadly a bulk synchronous parallel machine follows three steps: (a) *Concurrent computation*: Asynchronous computation takes place on each processing element (PE). (b) *Communication*: PEs exchange data between each other. (c) *Barrier Synchronization*: Each PE waits for all PEs to finish their task. Concurrent computation takes place at the CUDA device in the form of program kernels with communication through the global memory. Synchronization is achieved only at the end of each kernel. Algorithm 1 outlines the CPU code in this scenario. The skeleton code runs on the CPU while the kernels run on a CUDA device.

## Algorithm 1 CPU\_SKELETON

- 1: Create and initialize working arrays on CUDA device.
- 2: while NOT Terminate do
- 3:  $Terminate \leftarrow true$
- 4: For each vertex/edge/color in parallel:
- 5: Invoke Kernel1
- 6: Synchronize
- 7: For each vertex/edge/color in parallel:
- 8: Invoke Kernel2
- 9: Synchronize
- 10: etc...
- 11: For each vertex/edge/color in parallel:
- 12: Invoke Kernel*n* and modify *Terminate*
- 13: Synchronize
- 14: Copy *Terminate* from GPU to CPU
- 15: end while

The termination of an operation depends on a consensus between threads. A logical OR operation needs to be performed over all active threads for termination. We use a single boolean variable (initially set to true) that is written over by all threads independently, typically by the last kernel during execution. Each non-terminating thread writes a false to this location in global memory with conflicts. If no thread modifies this value, the loop terminates. The variable needs to be copied from GPU to CPU in each iteration to check for termination (Algorithm 1 line 2).

Algorithms presented in this paper differ from each other in the kernel code and the data structure requirements but the CPU skeleton pseudo-code given in Algorithm 1 applies to all algorithms reported in this paper.

## C. Vertex List Compaction

We assign threads to an attribute of the graph (vertex, color etc.) in all implementations to exploit maximum dataparallelism. This leads to an execution of maximum O(V) parallel threads, though they are time-shared by the CUDA environment. The number of active vertices, however, varies in each iteration of execution. Active vertices are indicated in an *activity mask*, which holds a 1 for each active vertex. Each vertex thread confirms its status from the activity mask and continues execution if active. This can lead to poor load balancing on the GPU as CUDA blocks have to be scheduled even when all vertices of the block are inactive, leading to an unbalanced SIMD execution. Performance improves if we deploy only as many threads as the active vertices, reducing the number of blocks and thus time sharing on the CUDA device.



Fig. 3. Vertex compaction is used to reduce the number of threads needed when not all vertices are active.

A scan operation [14] on the activity mask determines the number of active vertices as well as gives an ordinal number to each. This establishes a mapping between the original vertex index and a new index amongst the currently active vertices. We compact all entries in the activity mask to an *active mask* (Figure 3) creating the mapping of *new* thread IDs to *old* vertex IDs. Each thread can now find its vertex id by looking at its active mask, and thereafter can execute normally.

There exists a trade-off between time taken by parallel thread execution and time taken by scan and compacting. For graphs where parallelism expands slowly, compaction makes most sense, as many threads are inactive in a single grid execution. For faster expanding graphs, however, compacting becomes an overhead. We report experiments where vertex compaction gives better performance than the non compacted version.

## IV. BREADTH FIRST SEARCH (BFS)

The BFS problem is to find the minimum number of edges needed to reach every vertex in graph G from a source vertex s. BFS is well studied in serial setting with best time complexity reported as O(V + E). Parallel versions of BFS algorithm also exist. A study of the BFS algorithm on Cell/B.E. processor using the bulk synchronous parallel model appeared in [15]. Zhang et al. [16] gave a heuristic search for BFS using level synchronization. Bader et al.[3] implement BFS for the CRAY MTA-2 supercomputer and Yoo et al. [5] on the BlueGene/L.

We treat the GPU as a bulk synchronous device and use level synchronization to implement BFS. BFS traverses the graph in levels, once a level is visited it is not visited again during execution. We use this as our barrier and synchronize threads at each level. A BFS *frontier* corresponds to all vertices at the current level, see Figure 4. Concurrent computation takes place at the BFS frontier where each vertex updates the cost of its neighboring vertices by assigning cost values to their respective indices in the global memory.

We assign one thread to every vertex, eliminating the need for queues in our implementation. This decision further eliminates the need to change grid configuration and reassigning indices in the global memory with every kernel execution, which incurs additional overheads and slows down the execution.



Fig. 4. Parallel BFS: Vertices in the *frontier* list execute in parallel in each iteration. Execution stops when the *frontier* is empty.

#### **GPU** Implementation

Table II states the variables used in BFS implementation. We keep two boolean arrays  $F_a$  and  $X_a$  of size |V| for the frontier and visited vertices respectively. Initially,  $X_a$  is set to false and  $F_a$  contains the source vertex. In the first kernel (Algorithm 2), each thread looks at its entry in the frontier array  $F_a$ , if present, it updates the cost of its unvisited neighbors by writing its own cost plus one to its neighbor's index in the global cost array  $C_a$ .\*

TABLE II VARIABLES AND THEIR USE IN BFS IMPLEMENTATION

| Variable | Purpose                                           |
|----------|---------------------------------------------------|
| $F_a$    | Holds active vertices in each iteration.          |
| $X_a$    | Holds the visited state for each vertex.          |
| $C_a$    | Holds the BFS cost per vertex.                    |
| $F_{ua}$ | Used to resolve read after write inconsistencies. |

Algorithm 2 KERNEL1\_BFS

1:  $tid \leftarrow getThreadID$ 2: if  $F_a[tid]$  then  $F_a[tid] \leftarrow false$ 3: for all neighbors nid of tid do 4: if NOT  $X_a[nid]$  then 5:  $C_a[nid] \leftarrow C_a[tid] + 1$ 6:  $F_{ua}[nid] \leftarrow true$ 7: end if 8. end for 9: 10: end if

Each thread removes its vertex from the frontier array  $F_a$ and adds its neighbors to an alternate updating frontier array  $F_{ua}$ . This is needed as there is no synchronization possible between all CUDA threads. Modifying the frontier at the time of updation may result in read after write inconsistencies. A second kernel (Algorithm 3) copies the updated frontier  $F_{ua}$  to the actual frontier  $F_a$ . It adds the vertex in  $F_{ua}$  to the visited vertex array  $X_a$ . The vertex thread sets the termination flag to false if the vertex is added to the frontier array,  $F_a$ .

| Algorithm 3 KERNEL2_BFS              |
|--------------------------------------|
| 1: $tid \leftarrow getThreadID$      |
| 2: if $F_{ua}[tid]$ then             |
| 3: $F_a[tid] \leftarrow \text{true}$ |
| 4: $X_a[tid] \leftarrow \text{true}$ |
| 5: $F_{ua}[tid] \leftarrow false$    |
| 6: $Terminate \leftarrow false$      |
| 7: <b>end if</b>                     |
|                                      |

The process is repeated until the frontier array is empty and the while loop in Algorithm 1 line 2 terminates. In the worst case, the algorithm needs the order of the diameter of the graph G(V, E) iterations.

## V. ST-CONNECTIVITY (STCON)

The st-Connectivity problem resembles the BFS problem closely. Given an unweighted graph G(V, E) and two vertices, s and t, find a path from s to t assuming one exists. Bader et al. [3] implement STCON by extending their BFS implementation; they find the smallest distance between s and

t by keeping track of all expanded frontier vertices. We also modify BFS to find the smallest number of edges needed to reach t from s for undirected graphs.

Our approach starts BFS concurrently from s and t with Red and Green colors assigned respectively to them. In each iteration, colors are propagated to neighbors along with the BFS cost. Termination occurs when both colors meet. Evidently, both BFS frontiers hold the smallest distance to the current processing vertex from their respective source vertices. The smallest path from s to t is reached when frontiers come in contact with each other. Figure 5 depicts two termination conditions due to merging of frontiers, either at a vertex or an edge. We set the *Terminate* variable to false in this implementation and each thread writes a true in this variable if termination condition is reached.



Fig. 5. Parallel st-connectivity with colors expanding from s and t vertices.

## GPU Implementation

Along with  $V_a$ ,  $E_a$ ,  $F_a$ , and  $C_a$  we keep two boolean arrays  $R_a$  and  $G_a$ , for red and green colors, of size |V| as the vertices visited by s and t frontiers respectively, see Table III. Initially

TABLE III VARIABLES AND THEIR USE IN STCON IMPLEMENTATION

| Variable                 | Purpose                                   |
|--------------------------|-------------------------------------------|
| $F_a$                    | Holds active vertices in each iteration   |
| $C_a$                    | Holds the total length per vertex         |
| $R_a$                    | Vertices visited by Red frontier          |
| $G_a$                    | Vertices visited by Green frontier        |
| $R_f$                    | Holds the current Red frontier value      |
| $G_{f}$                  | Holds the current Green frontier value    |
| $F_{ua}, G_{ua}, R_{ua}$ | Resolves read after write inconsistencies |

 $R_a$  and  $G_a$  are set to false and  $F_a$  contains the source and target vertices. To keep the state of variables intact and avoid read after write inconsistencies, alternate updating arrays  $R_{ua}$ ,  $G_{ua}$  and  $F_{ua}$  of size |V| are also used in each iteration. Variables  $R_f$  and  $G_f$  keep track of the Red and Green frontier lengths at current execution.

Each vertex, if present in  $F_a$ , reads its color in both  $R_a$  and  $G_a$  and sets its own color to one of the two. This is exclusive

<sup>\*</sup>It is possible for many vertices to write a value at one location concurrently while executing this step, leading to clashes in the global memory. We do not lock memory for concurrent write operations because all frontier vertices write the same value at their neighbor's index location in  $C_a$ . CUDA guarantees at least one of them will succeed which is sufficient for our BFS cost propagation.

as a vertex can only exist in one of the two arrays, an overlap is a termination condition for the algorithm. Each vertex updates the cost of its unvisited neighbors by adding 1 to its own cost and writing it to the neighbor's index in  $C_a$ . Based on its color, the vertex also adds its neighbors to its own color's visited vertices by adding them to either  $R_{ua}$  or  $G_{ua}$ . The algorithm terminates if any unvisited neighbor of the vertex is of the opposite color. We need not update both frontiers for termination at an edge, only the Red frontier is updated in this case as shown in Algorithm 4, line 7. The vertex removes itself from the frontier array  $F_a$  and adds its neighbors to the updating frontier array  $F_{ua}$ . Kernel1 (Algorithm 4) depicts these steps.

| Alge | prithm 4 KERNEL1_STCON                           |
|------|--------------------------------------------------|
| 1:   | $tid \leftarrow getThreadID$                     |
| 2:   | if $F_a[tid]$ then                               |
| 3:   | $F_a[tid] \leftarrow false$                      |
| 4:   | for all neighbors nid of tid do                  |
| 5:   | if $(G_a[nid] \mid R_a[nid])$ then               |
| 6:   | if $(R_a[tid]\&G_a[nid])$ then                   |
| 7:   | $R_f \leftarrow C_a[tid] + 1$                    |
| 8:   | $Terminate \leftarrow true$                      |
| 9:   | end if                                           |
| 10:  | if $(G_a[tid]\&R_a[nid])$ then                   |
| 11:  | $Terminate \leftarrow true$                      |
| 12:  | end if                                           |
| 13:  | else                                             |
| 14:  | if $G_a[tid]$ then $G_{ua}[nid] \leftarrow$ true |
| 15:  | if $R_a[tid]$ then $R_{ua}[nid] \leftarrow$ true |
| 16:  | $F_{ua}[nid] \leftarrow \text{true}$             |
| 17:  | $C_a[nid] \leftarrow C_a[tid] + 1$               |
| 18:  | end if                                           |
| 19:  | end for                                          |
| 20:  | end if                                           |
|      |                                                  |

## Algorithm 5 KERNEL2\_STCON

| 1: $tid \leftarrow getThreadID$                                                     |
|-------------------------------------------------------------------------------------|
| 2: if $F_{ua}[tid]$ then                                                            |
| 3: $F_a[tid] \leftarrow \text{true}$                                                |
| 4: <b>if</b> $R_{ua}[tid]$ <b>then</b>                                              |
| 5: $R_a[tid] \leftarrow \text{true}$                                                |
| 6: $R_f \leftarrow C_a[tid]$                                                        |
| 7: <b>end if</b>                                                                    |
| 8: <b>if</b> $G_{ua}[tid]$ <b>then</b>                                              |
| 9: $G_a[tid] \leftarrow \text{true}$                                                |
| 10: $G_f \leftarrow C_a[tid]$                                                       |
| 11: <b>end if</b>                                                                   |
| 12: $F_{ua}[tid] \leftarrow false$                                                  |
| 13: $R_{ua}[tid] \leftarrow false$                                                  |
| 14: $G_{ua}[tid] \leftarrow \text{false}$                                           |
| 15: <b>if</b> $G_{ua}[tid]$ & $R_{ua}[tid]$ <b>then</b> $Terminate \leftarrow$ true |
| 16: end if                                                                          |
|                                                                                     |

The second Kernel (Algorithm 5) copies the updating arrays  $F_{ua}$ ,  $R_{ua}$ ,  $G_{ua}$  to actual arrays  $F_a$ ,  $R_a$  and  $G_a$  for all newly

visited vertices. It also checks the termination condition due to merging of frontiers and terminates the algorithm if frontiers meet at any vertex. Variables  $R_f$  and  $G_f$  are updated to reflect the current frontier lengths. The length of the path between s and t can be obtained by adding  $R_f$  and  $G_f$ . The algorithm needs a maximum of the radius of the Graph G iterations to terminate.

#### VI. SINGLE SOURCE SHORTEST PATH (SSSP)

The sequential solution to single source shortest path problem comes from Dijkstra [17]. Originally the algorithm required  $O(V^2)$  time but was later improved using Fibonacci heap to  $O(V \log V + E)$ . A parallel version of Dijkstra's algorithm on a PRAM given in [18] introduces a  $O(V^{1/3} \log V)$ algorithm requiring  $O(V \log V)$  work. Nepomniaschaya et al. [19] parallelized Dijkstra's algorithm for associative parallel processors. Narayanan [20] solves the SSSP problem for processor arrays. Although parallel implementations of the Dijkstra's SSSP algorithm are reported [21], an efficient PRAM algorithm does not exist [22].

Single source shortest path does not traverse a graph in levels, as cost of a visited vertex may change due to a low cost path being discovered later in the execution. In our implementation simultaneous updates are triggered by vertices undergoing a change in cost values. These vertices constitute an *execution mask*. Termination condition is reached with equilibrium when there is no change in cost for any vertex.

We assign one thread to every vertex. Threads in the execution mask execute in parallel. Each vertex updates the cost of its neighbors and removes itself from the execution mask. Any vertex whose cost is updated is put into the execution mask for next iteration of execution. This process is repeated until there is no change in cost for any vertex. Figure 6 shows the execution mask (shown as colors) and cost states for a simple case, costs are updated in each iteration, with vertices undergoing re-execution if their cost changes.



Change in cost with each iteration

Execution terminates when mask is empty

Fig. 6. SSSP execution: In each iteration, vertices in the mask update costs of their neighbors. A vertex whose cost changes is put in the mask for execution in the next iteration.

## **GPU** Implementation

For our implementation (Algorithm 6 and Algorithm 7) we keep a boolean mask  $M_a$  and cost array  $C_a$  of size |V|.  $W_a$  holds the weights of edges and an updating cost array  $C_{ua}$  is used for intermediate cost values. Table IV states the variables

TABLE IV VARIABLES AND THEIR USE IN SSSP IMPLEMENTATION

| Variable | Purpose                                   |
|----------|-------------------------------------------|
| $M_a$    | Holds active vertices in each iteration   |
| $C_a$    | Holds the current cost per vertex         |
| $C_{ua}$ | Resolves read after write inconsistencies |

and their usage. Initially the mask  $M_a$  contains the source vertex. Each vertex looks at its entry in the mask  $M_a$ . If true, it updates the cost of its neighbors if greater than its own cost plus the edge weight to the corresponding neighbor in an alternate updating cost array  $C_{ua}$ . The alternate cost array  $C_{ua}$  is used to resolve read after write inconsistencies in the global memory. Updates in  $C_{ua}$  need to lock the memory location before modifying the cost value, as many threads may write different values at the same location concurrently. We use the *atomicMin* function supported on CUDA 1.1 hardware (lines 5-9, Algorithm 6) to resolve this.

| A | lgorithm | 6 | KERNEL1 | _SSSP |
|---|----------|---|---------|-------|
|---|----------|---|---------|-------|

| 1:  | $tid \leftarrow getThreadID$                 |
|-----|----------------------------------------------|
|     | if $M_a[tid]$ then                           |
| 3:  | $M_a \ [tid] \leftarrow false$               |
| 4:  | for all neighbors nid of tid do              |
| 5:  | Begin Atomic                                 |
| 6:  | if $C_{ua}[nid] > C_a[tid] + W_a[nid]$ then  |
| 7:  | $C_{ua}[nid] \leftarrow C_a[tid] + W_a[nid]$ |
| 8:  | end if                                       |
| 9:  | End Atomic                                   |
| 10: | end for                                      |
| 11: | end if                                       |

| Algorithm | 7 | KERNEL2_ | SSSP |
|-----------|---|----------|------|
|           |   |          |      |

| 1: $tid \leftarrow getThreadID$      |
|--------------------------------------|
| 2: if $C_a[tid] > C_{ua}[tid]$ then  |
| 3: $C_a[tid] \leftarrow C_{ua}[tid]$ |
| 4: $M_a[tid] \leftarrow \text{true}$ |
| 5: $Terminate \leftarrow false$      |
| 6: end if                            |
| 7: $C_{ua}[tid] \leftarrow C_a[tid]$ |

Atomic functions resolve concurrent writes by assigning exclusive rights to one thread at a time. The clashes are thus serialized in an unspecified order. The function compares the existing  $C_{ua}(v)$  cost with  $C_a(u)+W_a(u,v)$  and updates the value if necessary. A second kernel (Algorithm 7) is used to reflect updating cost  $C_{ua}$  to the cost array  $C_a$ . If  $C_a$  is greater than  $C_{ua}$  for any vertex, it is set for execution in the mask  $M_a$  and the termination flag is toggled to continue execution. This process is repeated until the mask is empty. The algorithms takes the order of diameter of the graph to converge to equilibrium.

## VII. ALL PAIRS SHORTEST PATHS (APSP)

Warshall defined boolean transitive closure for matrices that was later used to develop the Floyd Warshall algorithm for the APSP problem. The algorithm had  $O(V^2)$  space complexity and  $O(V^3)$  time complexity. Numerous parallel versions for the APSP problem have been developed to date [23], [24], [25]. Micikevicius [26] reported a GPGPU implementation for the same, but due to  $O(V^2)$  space requirements he reported results on small graphs.

The Floyd Warshall parallel CREW PRAM algorithm (Algorithm 8) can be easily extended to CUDA if the graph is represented as an adjacency matrix. The kernel implements line 4 of Algorithm 8 while the rest of the code runs on the CPU. This approach however requires entire matrix to be present on the CUDA device. In practice this approach performs slower as compared to approaches outlined below. Please see [27] for a comparative study.

| 1: Create adjacency Matrix A from $G(V, E, W)$      |
|-----------------------------------------------------|
| (v, L, w)                                           |
| 2: for $k$ from 1 to $V$ do                         |
| 3: for all Elements of A, in parallel do            |
| 4: $A[i,j] \leftarrow min(A[i,j], A[i,k] + A[k,j])$ |
| 5: end for                                          |
| 6: end for                                          |
|                                                     |

## A. APSP using SSSP

Reducing space requirements on the CUDA device directly translates to handle larger graphs. A simple space conserving solution to the APSP problem is to run SSSP from each vertex iteratively using the graph representation given in Figure 2. This implementation requires O(V + E) space on the GPU with a vector of O(V) copied back to the CPU memory in each iteration. However for dense graphs this approach proves inefficient. We implemented this approach for general graphs and found it to be a scalable solution for low degree graphs. See the results in Figure 11(d).

## B. APSP as Matrix Multiplication

Katz and Kider [7] formulate a CUDA implementation for APSP on large graphs using a matrix block approach. They implement the Floyd Warshall algorithm based on transitive closure with a cache efficient blocking technique (extension of method proposed by Venkataraman [28]), in which the adjacency matrix (broken into blocks) present in the global memory is brought into the multiprocessor shared memory intelligently. They handle larger graphs using multiple CUDA devices by partitioning the problem across the number of devices. We take a different approach and use streaming of data from the CPU to GPU memory for handling larger matrices. Our implementation uses a modified parallel matrix multiplication with blocking approach. Our times are slightly slower as compared to Katz and Kider for fully connected small graphs. For general large graphs however we gain 2-4times speed over the method proposed by Katz and Kider.

A simple modification to the matrix multiplication algorithm yields an APSP solution (Algorithm 9). Lines 4 - 11 is the general matrix multiplication algorithm with the multiplication

and addition operations replaced by addition and minimum operations respectively, line 7. The outer loop (line 3) utilizes the transitive property of matrix multiplication and runs  $\log V$  times.

| Algorithm 9 MATRIX_APSP                                             |  |  |  |  |
|---------------------------------------------------------------------|--|--|--|--|
| $1: D^1 \leftarrow A$                                               |  |  |  |  |
| 2: for $m \leq \log V$ do                                           |  |  |  |  |
| 3: for $i \leftarrow 1$ to V do                                     |  |  |  |  |
| 4: for $j \leftarrow 1$ to $V$ do                                   |  |  |  |  |
| 5: $D_{i,j}^m \leftarrow \infty$                                    |  |  |  |  |
| 6: <b>for</b> $k \leftarrow 1$ to $V$ <b>do</b>                     |  |  |  |  |
| 7: $D_{i,j}^m \leftarrow min(D_{i,j}^m, D_{i,j}^{(m-1)} + A_{k,j})$ |  |  |  |  |
| 8: end for                                                          |  |  |  |  |
| 9: end for                                                          |  |  |  |  |
| 10: <b>end for</b>                                                  |  |  |  |  |
| 11: end for                                                         |  |  |  |  |

We modify the parallel version of matrix multiplication proposed by Volkov and Demmel [29] for our APSP solution. We replace the multiplication and addition operations in Volkov and Demmel kernel to addition and min operations. The kernel is looped over  $\log V$  times using an outer loop to solve the APSP problem.



Fig. 7. Blocks for matrix multiplication by Volkov and Demmel [29] modified to stream from CPU to GPU.

*Cache Efficient Graph Representation:* For matrix multiplication based APSP, we use an adjacency matrix to represent graph. Figure 7 depicts an extension of the cache efficient, conflict free, blocking scheme used for matrix multiplication by Volkov and Demmel. We present two new ideas over the basic matrix multiplication scheme. The first is the modification to handle graphs larger than the device memory by streaming data as required from the CPU. The second is the lazy evaluation of the minimum finding which

results in a boost in performance.

Streaming Blocks: To handle large graphs, the adjacency matrix present in the CPU memory is divided into rectangular row and column sub-matrices. These are streamed into the device global memory and a matrix-block  $D^m$  based on their values is computed. Let R be the row and C the column sub-matrices of the original matrix present in the device memory. For every row sub-matrix R we iterate through all column sub-matrices C of the original matrix. We assume CPU memory is large enough to hold the adjacency matrix, though our method can be easily extended to secondary storage with slight modification.

Let the size of available device memory be  $GPU_{mem}$ . We divide the adjacency matrix into rows R and column C submatrices of size  $(B \times V)$  and  $(V \times B)$  respectively such that

size 
$$(R_{B\times V} + C_{V\times B} + D^m_{B\times B}) \leq \text{GPU}_{mem},$$

where B is the block size. A total of

$$\log V\left(\frac{V^3}{B} + V^2\right) \equiv O\left(\log V\left(\frac{V^3}{B}\right)\right)$$

elements are transferred between CPU and GPU for a  $V \times V$  adjacency matrix for our APSP computation, with  $V^3 \log V/B$  reads and  $V^2 \log V$  writes. Time taken for this data transfer is negligible compared to the computation time, and can be easily hidden using asynchronous read and write operations supported on current generation CUDA hardware as will be shown in Section X.

For example, for a  $18K \times 18K$  matrix with integer entries and 1GB device memory, a block size  $B \simeq 6K$  can be used. At the PCI-e×16 practical transfer rate of 3.3 GB/s, data transfer takes nearly 16 seconds. This time is negligible as compared to  $\simeq 800$  seconds of computation time taken on Tesla for a  $18K \times 18K$  matrix without streaming (result taken from Table XI).

Lazy Minimum Evaluation: The basic step of Floyd's algorithm is similar to matrix multiplication with multiplication replaced by addition and addition by minimum finding. However, for sparse-degree graphs, the connections are few and the remaining entries of the adjacency matrix are infinity. With entries involving infinity, additions and subsequent minimum finding can be skipped altogether without affecting correctness. We, therefore, evaluate the addition and the minimum in a lazy manner, skipping all paths involving a non-existent edge. This results in a speedup of 2 to 3 times over complete evaluation on most graphs, however, making the running time degree-dependent.

*GPU Implementation:* Let R be the row and C be the column sub-matrices of the adjacency matrix. Let  $D^i$  denote a temporary matrix variable of size  $B \times B$  used to hold intermediate values. In each iteration of outer loop (Algorithm 9, line 2)  $D^i$  is modified using C and R. Lines 3-10 of Algorithm 9 are executed on the CUDA device using modified Volkov and Demmel kernel, while the rest of the code executes on the CPU. Shared memory is used as a user managed cache to

improve the performance and translates directly from Volkov and Demmel kernel. They bring sections of matrices R, C and  $D^i$  into shared memory in blocks: R is brought in  $64 \times 4$  sized blocks, C in  $16 \times 16$  sized blocks and  $D^i$  in  $64 \times 16$  sized blocks. These values are selected to maximize throughput of the CUDA device.

## C. Gaussian Elimination Based APSP

In a parallel work , Buluc et al. [10] formulate a fast recursive APSP algorithm based on Gaussian elimination. They cleverly extend the R-Kleene [30] algorithm for in place APSP computation on global memory. They split each APSP step recursively into 2 APSPs involving graphs of half the size, 6 matrix multiplications and 2 matrix additions. The base-case is when there are 16 or fewer vertices; Floyd's algorithm is applied in that case by modifying the CUDA matrix multiplication kernel proposed by Volkov and Demmel [29]. They also use the fast matrix multiplication for other steps. Their implementation is degree independent and fast; they achieve a speed up of 5 - 10 times over the APSP implementation presented above.

While the approach of Buluc et al. is the fastest APSP implementation on the GPU so far, our key ideas can extend it further. Our APSP specific optimizations can improve performance over their native implementation, for example, we incorporated the lazy minimum evaluation into the Volkov and Demmel kernel used their approach and obtained a speed up of more than 2 over their native code. Their approach is memory heavy and is best suited when the adjacency matrix can fit completely in the GPU device memory. The approach involves several matrix multiplications and additions. Extending which to stream the data from CPU to the GPU for matrix operations in terms of blocks that fit in the device memory will involve many more communications and computations. The CPU to GPU communication bandwidth has not at all kept pace with the increase in the number of cores or computation power of the GPU. Thus, our non-matrix approach is likely to scale better to arbitrarily high graphs than the Gaussian Elimination based approach by Buluc et al.

Comparison of the matrix multiplication approach with APSP using SSSP and Gaussian elimination approach is summarized in Figure 11(d). Comparison of matrix multiplication approach with Katz and Kider is given in Figure 11(e). Behavior of the matrix approach with varying degree is reported in Table VII.

## VIII. MINIMUM SPANNING TREE (MST)

Best time complexity for a serial solution to the MST problem, proposed by Bernard Chazelle [31], is  $O(E\alpha(E, V))$ , where  $\alpha$  is the functional inverse of Ackermann's function. Borůvka's algorithm [32] is a popular solution to the MST problem. In a serial setting it takes  $O(E \log V)$  time. Numerous parallel variations of this algorithm also exist [33]. Chong et al. [34] report a EREW PRAM algorithm requiring  $O(\log V)$  time and  $O(V \log V)$  work. Bader et al. [35] design a fast algorithm for symmetric multiprocessors with O((V + E)/p) lookups and local operations for a p processor machine. Chung et al. [36] efficiently implement Borůvka's algorithm on a asynchronous distributed memory machine by reducing communication costs. Dehne and Götz implement three variations of Borůvka's algorithm using the BSP model [37].

We implement a modified parallel Borůvka algorithm on CUDA using the divide-and-conquer approach similar to the algorithm reported by Johnson and Metaxas in [38]. We initiate colored trees from all vertices. Grow individual trees by adding the minimum weighted edge to the minimum outgoing vertex and merge colors when trees come in contact with each other. Cycles are removed explicitly in each iteration. Connected components are found via color propagation, an approach similar to our SSSP implementation (section VI).

We represent each supervertex in Borůvka's algorithm as a color. Each supervertex finds the minimum weighted edge to another supervertex and adds it to the output MST array. Each newly added edge in the MST edge list updates the colors of both its supervertices until there is no change in color values for all supervertices. Cycles are removed from the newly created graph and each vertex in a supervertex updates its color to the new color of the supervertex. This processes is repeated and the number of supervertices keep on decreasing. The algorithm terminates when exactly one supervertex remains.



Fig. 8. Parallel minimum spanning tree.

#### GPU Implementation

We use colors array  $C_a$ , color index array  $Ci_a$  (per vertex color index to which the vertex belongs to), active colors array  $Ac_a$  and newly added MST edges  $NMst_a$  of size |V|. Output is a set of edges present in the MST, let  $Mst_a$  of size |E|denote this. Further, we keep degrees array  $Deg_a$  and cycle edges array  $Cy_a$  of size |V| for cycle finding and elimination. Arrays  $V_a$ ,  $E_a$  and  $W_a$  retain their previous meanings. Initially,  $C_a$  holds the vertex id as color and each vertex points to this color in  $Ci_a$ ,  $Ac_a$  and  $Mst_a$  are set to false and  $NMst_a$  holds null. We assign one color to each vertex in the graph initially, eliminating uncolored vertices and thus race conditions due to them. An overview of the algorithm using steps presented in following sections is given in Algorithm 10. The variables and their purpose is stated in Table V.

## Algorithm 10 Minimum Spanning Tree

- 1: Create  $V_a$ ,  $E_a$ ,  $W_a$  from G(V, E, W)
- 2: Initialize  $C_a$  and  $Ci_a$  to vertex id.
- 3: Initialize  $Mst_a$  to false
- 4: while More than 1 supervertex remains do
- 5: Clear *NMst*<sub>a</sub>,  $Ac_a$ ,  $Deg_a$  and  $Cy_a$
- 6: **Kernel**1 for each vertex: Finds the minimum weighted outgoing edge from each supervertex to the lowest outgoing color by working at each vertex of the supervertex, sets the edge in  $NMst_a$ .
- 7: **Kernel**<sup>2</sup> for each supervertex: Each supervertex sets its added edge in  $NMst_a$  as part of output MST,  $Mst_a$ .
- 8: Kernel3 for each supervertex: Each added edge, in  $NMst_a$ , increments the degrees of both its supervertices in  $Deg_a$  using color as index. Old colors are saved in  $PrevC_a$ .
- 9: while no change in color values  $C_a$  do
- 10: **Kernel**4 for each supervertex: Each edge in  $NMst_a$  updates colors of supervertices by propagating the lower color to the higher.
- 11: end while
- 12: while 1 degree supervertex remains do
- 13: **Kernel**<sup>5</sup> for each supervertex: All 1 degree supervertices nullify their edge in  $NMst_a$ , and decrement their own degree and the degree of its outgoing supervertex using old colors from  $PrevC_a$ .
- 14: end while
- 15: **Kernel**6 for each supervertex: Each remaining edge in  $NMst_a$  adds itself to  $Cy_a$  using new colors from  $C_a$ .
- 16: **Kernel**7 for each supervertex: Each entry in  $Cy_a$  is removed from the output MST,  $Mst_a$ , resulting in cycle free MST.
- 17: **Kernel**8 for each vertex: Each vertex updates its own colorindex to the new color of its new supervertex.
- 18: end while
- 19: Copy  $Mst_a$  to CPU memory as output.

 TABLE V

 VARIABLES AND THEIR USE IN MST IMPLEMENTATION

|                                                                    | Variable | Purpose                                          |  |  |
|--------------------------------------------------------------------|----------|--------------------------------------------------|--|--|
|                                                                    | $C_a$    | Holds color values                               |  |  |
|                                                                    | $Ci_a$   | Holds index of color for every vertex            |  |  |
|                                                                    | $Ac_a$   | Holds active colors out of $C_a$                 |  |  |
|                                                                    | $NMst_a$ | Holds newly selected edges per iteration         |  |  |
|                                                                    | $Mst_a$  | Edges selected in MST upto current iteration     |  |  |
|                                                                    | $Deg_a$  | Degree of every supervertex in current iteration |  |  |
|                                                                    | $Cy_a$   | Used to eliminate cycle making edges             |  |  |
| <i>PrevC<sub>a</sub></i> Stores previous state of $C_a$ in each it |          |                                                  |  |  |

#### A. Finding Minimum Weighted Edge

Each vertex finds its minimum weighted outgoing edge using edge weights  $W_a$ . The index of this edge is written atomically to the color index of the supervertex in global memory. Multiple edges in a supervertex can have minimum weights, the one with minimum outgoing color is selected. Algorithm 11 finds the minimum weighted edge for each supervertex. Please note lines 10 - 14 in the pseudo code (Algorithm 11) are implemented as multiple atomic operations in practice.

Algorithm 12 adds the minimum weighed edge from each supervertex to the final MST output array  $Mst_a$ . This kernel is important as we cannot add an edge to  $Mst_a$  until all vertices belonging to the supervertex have voted for their lowest weighted edge. This Kernel executes for all supervertices (or active colors) after KERNEL1\_MST executes for every vertex of the graph.

## Algorithm 11 KERNEL1\_MST

- 1:  $tid \leftarrow getThreadID$ 2:  $cid \leftarrow Ci_a[tid]$ 3:  $col \leftarrow C_a[cid]$ 4: for all edges eid of tid do  $col2 \leftarrow C_a[Ci_a[E_a[eid]]]$ 5: if NOT  $Mst_a[eid] \& col \neq col2$  then 6: 7:  $Ieid \leftarrow Index(min(W_a[eid] \& col2))$ 8: end if 9: end for 10: Begin Atomic 11: if  $W_a[Ieid] > W_a[NMst_a[col]]$  then  $NMst_a[col] \leftarrow Ieid$ 12: 13: end if 14: End Atomic
  - 14: Enu Atomic
  - 15:  $Ac_a[col] \leftarrow true$

## Algorithm 12 KERNEL2\_MST

- 1:  $col \leftarrow getThreadID$
- 2: if  $Ac_a[col]$  then
- 3:  $Mst_a[NMst_a[col]] \leftarrow$  true
- 4: **end if**

#### B. Finding and Removing Cycles

As C edges are added for C colors, at least one cycle is expected to be formed in the new graph of supervertices. Multiple cycles can also form for disjoint components of supervertices. Figure 9 shows such a case. It is easy to see that each such component can have at most one cycle consisting of exactly 2 supervertices with both edges in the cycle having equal weights. Identifying these edges and removing one edge per cycle is crucial for correct output.

In order to find these edges, we assign degrees to supervertices using newly added edges  $NMst_a$ . We then remove all 1-degree supervertices iteratively until there is no 1-degree supervertex left, resulting in supervertices whose edges form cycles.

Each added edge increments the degree of both its supervertices using color of the supervertex as its index in  $Deg_a$  (Algorithm 13). After color propagation, i.e., merger of supervertices (Section VIII-C), all 1-degree supervertices nullify their added edge in  $NMst_a$ . They also decrement their own degree and the degree of their added edge's outgoing supervertex in  $Deg_a$ 



Fig. 9. For C colors, C edges are added, resulting in multiple cycles. One edge per cycle must be removed.

(Algorithm 15). This process is repeated until there is no 1degree supervertex left, resulting in supervertices whose edges form a cycle.

Incrementing the degree array needs to be done before propagating colors, as the *old* color is used as index in  $Deg_a$  for each supervertex. Old colors are also needed after color propagation to identify supervertices while decrementing the degrees. We preserve old colors before propagating *new* colors in an alternate color array  $PrevC_a$  (Algorithm 13).

After removing 1-degree supervertex edges, resulting supervertices write their edge from  $NMst_a$  to their new color location in  $Cy_a$  (Algorithm 16), after new colors have been assigned to supervertices of each disjoint component using Algorithm 14. One edge of the two, per disjoint component cycle, survives this step. Since both edges have equal weights, no preference is given over edges. Edges in  $Cy_a$  are then removed from the output MST array  $Mst_a$  (Algorithm 17) resulting in cycle free set of MST edges.

## Algorithm 13 KERNEL3\_MST

| 1: $col \leftarrow getThreadID$                  |
|--------------------------------------------------|
| 2: if $Ac_a[col]$ then                           |
| 3: $col2 \leftarrow C_a[Ci_a[E_a[NMst_a[col]]]]$ |
| 4: Begin Atomic                                  |
| 5: $Deg_a[col] \leftarrow Deg_a[col] + 1$        |
| 6: $Deg_a[col2] \leftarrow Deg_a[col2] + 1$      |
| 7: End Atomic                                    |
| 8: end if                                        |
| 9: $PrevC_a[col] \leftarrow C_a[col]$            |

## C. Merging Supervertices

Each added edge merges two supervertices. Lesser color of the two is propagated by assigning it to the higher colored supervertex. This process is repeated until there is no change in color values for any supervertex. Color propagation mechanism is similar to our SSSP step. Kernel4 (Algorithm 14) executes for each added edge and updates the colors of both the vertices to the lower one. As in the SSSP implementation, we use an alternate color array to store intermediate values and to resolve read after write inconsistencies (not shown in Algorithm 14).

| Algorithm 14 KERNEL4_MST |  |
|--------------------------|--|
|--------------------------|--|

| $l \leftarrow \text{getThreadID} \\ \leftarrow C_a[cid]$ |
|----------------------------------------------------------|
|                                                          |
|                                                          |
| $Ac_a[col]$ then                                         |
| $cid2 \leftarrow Ci_a[E_a[NMst_a[col]]]$                 |
| Begin Atomic                                             |
| if $C_a[cid] > C_a[cid2]$ then                           |
| $C_a[cid] \leftarrow C_a[cid2]$                          |
| end if                                                   |
| if $C_a[cid2] > C_a[cid]$ then                           |
| $C_a[cid2] \leftarrow C_a[cid]$                          |
| end if                                                   |
| End Atomic                                               |
| d if                                                     |
|                                                          |

#### Algorithm 15 KERNEL5\_MST

- 1:  $cid \leftarrow getThreadID$
- 2:  $col \leftarrow PrevC_a[cid]$
- 3: if  $Ac_a[col]$  &  $Deg_a[col] = 1$  then
- 4:  $col2 \leftarrow PrevC_a[Ci_a[E_a[NMst_a[cid]]]]$
- 5: Begin Atomic
- 6:  $Deg_a[col] \leftarrow Deg_a[col] 1$
- 7:  $Deg_a[col2] \leftarrow Deg_a[col2] 1$
- 8: End Atomic
- 9:  $NMst_a[col] \leftarrow \phi$
- 10: end if

## Algorithm 16 KERNEL6\_MST

- 1:  $cid \leftarrow getThreadID$
- 2:  $col \leftarrow PrevC_a[cid]$
- 3: if  $Ac_a[col] \& NMst_a[col] \neq \phi$  then
- 4:  $newcol \leftarrow C_a[Ci_a[E_a[NMst_a[col]]]]$
- 5:  $Cy_a[newcol] \leftarrow NMst_a[col]$
- 6: end if

## Algorithm 17 KERNEL7\_MST

- 1:  $col \leftarrow$  getThreadID 2: **if**  $Cy_a[col] \neq \phi$  **then**
- 3:  $Mst_a[Cy_a[col]] \leftarrow false$
- 4: end if

#### D. Assigning Colors to Vertices

Each vertex in a supervertex must know its color; merging of colors in the previous step does not necessarily end with all vertices in a component being assigned the minimum color of that component. Rather, a link in color values is established during the previous step. This link must be traversed by each vertex to find the lowest color it should point to. The colors are set same as the index initially, leading to same color and index

## Algorithm 18 KERNEL8\_MST

1:  $tid \leftarrow getThreadID$ 2:  $cid \leftarrow Ci_a[tid]$ 3:  $col \leftarrow C_a[cid]$ 4: while  $col \neq cid$  do 5:  $col \leftarrow C_a[cid]$ 6:  $cid \leftarrow C_a[col]$ 7: end while 8:  $Ci_a[tid] \leftarrow cid$ 9: if  $col \neq 0$  then 10:  $Terminate \leftarrow false$ 11: end if

for all active colors. This property is exploited while updating colors for each vertex. Each vertex in Kernel8 (Algorithm 18) finds its colorindex *cid* and traverses the colors array  $C_a$  until coloridnex is not equal to color, converging at the lowest active color of its supervertex. The entire process is repeated until a single supervertex remains. A total of  $\sqrt{\log V}$  iterations are needed for the algorithm to terminate [38].

## E. Primitive based MST

Another variation of the MST algorithm in a recursive framework using primitives such as scan, segmented-scan and split is developed from our group [39]. Though the algorithm reported in [39] is 2-3 times faster than the implementation stated above, it is heavy on memory requirements and cannot handle graphs larger than 6M and weights larger than 1Kbecause of the 32-bit restriction of the segmented-scan operation and O(E) sized scan and split operations used in the implementation. Please see [39] for comparison of the above mentioned and recursive MST implementations on smaller sized graphs than reported here.

## IX. MAXIMUM FLOW (MF)/MIN CUT

Maxflow tries to find the minimum weighed cut that separates a graph into two disjoint sets of vertices, containing the source s and the target t vertices. The fastest serial solution due to Goldberg and Rao takes  $O(Emin(V^{2/3}, \sqrt{E}) \log(V^2/E) \log(U))$  time [40], where U is the maximum capacity of the graph.

Popular serial solutions to the max flow problem include Ford-Fulkerson's algorithm [41], later improved by Edmond and Karp [42], and the Push-Relabel algorithm [43] by Goldberg and Tarjan. Edmond-Karp's algorithm repeatedly computes augmenting paths from s to t using BFS, through which flows are pushed, until no augmented paths exist. The Push-Relabel algorithm, however, works by pushing flow from s to t by increasing heights of nodes farther away from t. Rather than examining the entire residual network to find an augmenting path, it works locally, looking at each vertex's neighbors in the residual graph.

Anderson and Setubal [44] first gave a parallel version of the Push-Relabel algorithm. Bader and Sachdeva implemented parallel cache efficient variation of the push-relabel algorithm using an SMP [45]. Alizadeh and Goldberg [46] implemented the same on a massively parallel Connection Machine CM-2. GPU implementations of the push-relabel algorithm are also reported [47]. A CUDA implementation for grid graphs specific to vision applications is reported in [48]. We implement the parallel push-relabel algorithm using CUDA for general graphs.

## The Push-Relabel Algorithm

The push-relabel algorithm constructs and maintains a residual graph at all times. The residual graph  $G_f$  of the graph Ghas the same topology, but consists of the edges which can admit more flow,  $E_f$ . Each edge has a current capacity in  $G_f$ , called its residual capacity which is the amount of flow that it can admit currently. Each vertex in the graph maintains a reservoir of flow (excess flow) and a height. Based on its height and excess flow either push or relabel operations are undertaken at each vertex. Initially height of s is set to |V| and height of t to 0. Height at all times is a conservative estimate of the vertex's distance from the source.

- **Push**: The push operation is applied at a vertex if its height is one more than any of its neighbor and it has excess flow in its reservoir. The result of push is either saturation of an edge in  $E_f$  or saturation of vertex, i.e., empty reservoir.
- **Relabel**: Relabel operation is applied to change the heights. Any vertex having excess flow, which cannot flow due to height mismatch undergoes relabeling. The relabel operation ensures the height of the vertex to be one more than the minimum height of its neighbor.

Better estimates of height values can be obtained using global or gap relabeling [45]. Global relabeling uses BFS to correctly assign distances from the target whereas gap relabeling finds gaps using height mismatches in the entire graph. However, both are expensive operations, especially when executed on parallel hardware. The algorithm terminates when neither push nor relabeling can be applied. The excess flows in the nodes are then pushed back to the source and the saturated nodes of the final residual graph gives the maximum flow/minimal cut.



Fig. 10. Parallel maxflow, showing push and relabel operations based on  $e_a$  and  $h_a$ 

## **GPU** Implementation

We keep  $e_a$  and  $h_a$  arrays representing excess flow and height per vertex. An activity mask  $M_a$  holds three distinct sates per vertex, 0 corresponding to the relabeling state  $(e_a(u)>0, h_a(v) \ge h_a(u) \forall$  neighbors  $v \in G_f$ ).  $M_a$  is set to 1 for the push state  $(e_a(u) > 0 \text{ and } h_a(u) = h_a(v)+1$  for any neighbor  $v \in G_f$ ) else the mask is set as 2 for saturation (Table VI). Based on these values the push and relabel

TABLE VI VARIABLES AND THEIR USE IN MAX FLOW IMPLEMENTATION

| Variable | Purpose                                |
|----------|----------------------------------------|
| $M_a$    | Holds activity state per vertex        |
| $e_a$    | Holds excess flow per vertex           |
| $h_a$    | Holds height at every vertex           |
| $W_a$    | Holds residual capacity for every edge |

operations are undertaken. Initially activity mask is set to 0 for all vertices. A backward BFS from the sink node is used for global relabeling. Global relabeling is used heuristically in our implementation. We apply multiple pushes before applying the relabel operation. Multiple local relabels are applied before applying a single global relabel step (Algorithm 19).

| Algorithm 19 Max Flow Step |  |  |  |  |  |
|----------------------------|--|--|--|--|--|
| 1: for 1 to $k$ times do   |  |  |  |  |  |
| 2: Apply $m$ pushes        |  |  |  |  |  |
| 3: Apply Local Relabel     |  |  |  |  |  |
| 4: <b>end for</b>          |  |  |  |  |  |
| 5: Apply Global Relabel    |  |  |  |  |  |
|                            |  |  |  |  |  |

**Relabel:** Relabels are applied as given in Algorithm 19. Local relabel operation is applied at a vertex if it has positive excess flow but no push is possible to any neighbor due to height mismatch. The height of vertex is increased by setting it to one more than the minimum height of its neighboring nodes. Kernel1 (Algorithm 20) explains this operation. Global relabeling uses backward BFS from sink, which propagates the height values to each vertex in the residual graph based on its actual distance from sink.

| Algorithm 20 KERNEL1_MAXFLOW                              |  |  |  |  |  |
|-----------------------------------------------------------|--|--|--|--|--|
| 1: $tid \leftarrow getThreadID$                           |  |  |  |  |  |
| 2: if $M_a[tid] = 0$ then                                 |  |  |  |  |  |
| 3: for all neighbors <i>nid</i> of <i>tid</i> do          |  |  |  |  |  |
| 4: <b>if</b> $nid \in G_f \& minh > h_a[nid]$ <b>then</b> |  |  |  |  |  |
| 5: $minh \leftarrow h_a[nid]$                             |  |  |  |  |  |
| 6: end if                                                 |  |  |  |  |  |
| 7: end for                                                |  |  |  |  |  |
| 8: $h_a[tid] \leftarrow minh + 1$                         |  |  |  |  |  |
| 9: $M_a[tid] \leftarrow 1$                                |  |  |  |  |  |
| 10: end if                                                |  |  |  |  |  |

**Push:** Each vertex looks at its activity mask  $M_a$ , if 1 it pushes the excess flow along the edges present in residual graph. It atomically subtracts the flow from its own reservoir

and adds it to the neighbor's reservoir. For every edge (u, v) of u in residual graph it atomically subtracts the flow from the residual capacity of (u, v) and adds (atomically) it to the residual capacity of (v, u). Kernel2 (Algorithm 21) performs the push operation.

| Algorithm 21 KERNEL2_MAXFLOW                                      |  |  |  |  |  |  |
|-------------------------------------------------------------------|--|--|--|--|--|--|
| 1: $tid \leftarrow getThreadID$                                   |  |  |  |  |  |  |
| 2: if $M_a[tid] = 1$ then                                         |  |  |  |  |  |  |
| 3: for all neighbors <i>nid</i> of <i>tid</i> do                  |  |  |  |  |  |  |
| 4: <b>if</b> $nid \in G_f \& h_a[tid] = h_a[nid] + 1$ <b>then</b> |  |  |  |  |  |  |
| 5: $minflow \leftarrow min(e_a[tid], W_a[nid])$                   |  |  |  |  |  |  |
| 6: Begin Atomic                                                   |  |  |  |  |  |  |
| 7: $e_a[tid] \leftarrow e_a[tid] - minflow$                       |  |  |  |  |  |  |
| 8: $e_a[nid] \leftarrow e_a[nid] + minflow$                       |  |  |  |  |  |  |
| 9: $W_a(tid,nid) \leftarrow W_a(tid,nid) - minflow$               |  |  |  |  |  |  |
| 10: $W_a(nid,tid) \leftarrow W_a(nid,tid) + minflow$              |  |  |  |  |  |  |
| 11: End Atomic                                                    |  |  |  |  |  |  |
| 12: <b>end if</b>                                                 |  |  |  |  |  |  |
| 13: end for                                                       |  |  |  |  |  |  |
| 14: end if                                                        |  |  |  |  |  |  |

Algorithm 22 changes the state of each vertex. The activity mask is set to either 0, 1 or 2 states reflecting relabel, push and saturation states based on the excess flow, residual edge capacities and height mismatches at each vertex. Each vertex sets the termination flag to false if its state undergoes a change.

| Algorithm 22 KERNEL3_MAXFLOW                                         |  |  |  |  |
|----------------------------------------------------------------------|--|--|--|--|
| 1: $tid \leftarrow getThreadID$                                      |  |  |  |  |
| 2: for all neighbors nid of tid do                                   |  |  |  |  |
| 3: <b>if</b> $e_a[tid] \leq 0$ OR $W_a(tid, nid) \leq 0$ <b>then</b> |  |  |  |  |
| 4: $state \leftarrow 2$                                              |  |  |  |  |
| 5: else                                                              |  |  |  |  |
| 6: <b>if</b> $e_a[tid] > 0$ <b>then</b>                              |  |  |  |  |
| 7: <b>if</b> $h_a[tid] = h_a[nid] + 1$ <b>then</b>                   |  |  |  |  |
| 8: $state \leftarrow 1$                                              |  |  |  |  |
| 9: <b>else</b>                                                       |  |  |  |  |
| 10: $state \leftarrow 0$                                             |  |  |  |  |
| 11: <b>end if</b>                                                    |  |  |  |  |
| 12: end if                                                           |  |  |  |  |
| 13: <b>end if</b>                                                    |  |  |  |  |
| 14: <b>end for</b>                                                   |  |  |  |  |
| 15: if $M_a[tid] \neq state$ then                                    |  |  |  |  |
| 16: $Terminate \leftarrow false$                                     |  |  |  |  |
| 17: $M_a[tid] \leftarrow state$                                      |  |  |  |  |
| 18: end if                                                           |  |  |  |  |

The operations terminates when there is no change in the activity mask. This does not necessarily occur when all nodes have saturated. Due to saturation of edges, unsaturated nodes may get cutoff from sink. Such nodes are not actively taking part in the process, consequently their state does not change. Termination hence cannot be based on saturation of nodes. Results of this implementation are given in Figure 11(g). Figure 11(h) and Figure 11(i) shows the behavior of our implementation with varying m and k in accordance to Algorithm 19.

## X. PERFORMANCE ANALYSIS

We choose graphs representatives of real world problems. Our graph sizes vary from 1M to 10M vertices for all algorithms except APSP. Scarpazza et al. [15] focus on improving the throughput of the Cell/B.E. for BFS. Bader and Madduri [3], [4], [35] use CRAY MTA-2 for BFS, STCON and MST implementations. Dehne and Götz [37] use CC-48 to perform MST. Edmonds et al. [49] use Parallel Boost graph library and Crobak et al. [50] use CRAY MTA-2 for their SSSP implementations. Yoo et al. [5] use the BlueGene/L for a BFS implementation. Though our input sizes are not comparable with the ones used in these implementations, of orders of billions of vertices and edges, we show implementations on a hardware several orders less expensive. Because of the large difference in input sizes, we do not compare our results with these implementations directly. We show a comparison of our APSP approach with Katz and Kider [7] and Buluc et al. [10] on similar graph sizes as the implementations are directly comparable.

The focus of the performance analysis is on what GPUs can deliver on the seemingly irregular problems involving graphs. The low costs of the GPUs make them highly available to a wide audience and are good candidates for accelerating different types of tasks. We compare the GPU performance with other GPU implementations when available. We also compare the performance on a standard CPU using standard implementations as an indication of the practical acceleration that the GPU can provide. To this end, we show performance of all our implementations on a high-end GPU. We also show the performance on low-end and medium-end GPUs on feasible graph sizes. Comparison with the CPU is not otherwise meaningful as the two devices are radically different.

## A. Types of Graphs

We tested our algorithms on various types of synthetic and real world large graphs including graphs from the ninth DIMACS challenge [51]. Primarily, three generative models were used for performance analysis, using the Georgia Tech. graph generators [52].

- Random Graphs: Random graphs have a short band of degree where all vertices lie, with a large number of vertices having similar degrees. A slight variation from the average degree results in a drastic decrease in number of such vertices in the graph.
- R-MAT [53]/Scale Free/Power law: A large number of vertices have small degree with a few vertices having large degree. This model best approximates large graphs found in real world. Practical large graphs models including, Erdös-Rényi, power-law and its derivations follow this generative model. Due to its small degree distribution over most vertices and uneven degree distribution these graphs expand slowly in each iteration and exhibit uneven load balancing. These graphs therefore are a worst case scenario for our algorithms as verified empirically.
- SSCA#2 [54]: These graphs are made up of random sized *cliques* of vertices with a hierarchical distribution of edges between cliques based on a distance metric.

These models approximate real world datasets and are good representatives for graphs commonly used in real world domains. We assume all graphs to be connected with positive weights.

#### B. Experimental Setup

Our testbed consisted of a single Nvidia GTX 280 graphics adapter with 1024MB memory controlled by a Quad Core Intel processor (Q6600 @ 2.4GHz) with 4GB RAM running Fedora Core 9. For CPU comparison we use the C++ Boost graph library (BGL), with the exception of BFS, compiled using *gcc* at optimization setting -O4 on the Intel Quad Core Q6600, 2.4GHz processor. We use our own BFS implementation on CPU as it proved faster than Boost. BFS was implemented using STL and C++, compiled with *gcc* using -O4 optimization. A quarter Tesla S1070 1U was used for graphs larger than 6M in most cases, it has a similar GPU as GTX 280 with 4096MB of memory clocked at a slightly lower frequency. We also show scalability of our algorithms on low end GPUs including 8600GT (32 stream processors and 256MB RAM) and 8800GT (112 stream processors and 512MB RAM).

We are aware there may exist more optimized implementations of algorithms reported than BGL. Our aim to to show data parallel approaches presented to be applicable on readily available hardware, with better scalability and performance than CPU. Detailed timings of the plots given are listed in Table X and Table XI in the Appendix.

## C. Iterative Mask Based Approach

We implement BFS, STCON, SSSP and Maxflow using the iterative mask based approach. A speedup of nearly 15 - 20 times over BGL is observed in these implementations on Random and SSCA#2 graphs as shown in Figures 11(a), 11(b), 11(c) and 11(g). Data parallelism is exploited in 80-90% of the total time taken for execution. For Random and SSCA#2 graphs 7 - 8% time is taken to reach full parallelism (number of threads  $\geq$  number of processors), while R-MAT graphs take nearly 20% of the total time. Low degree and linear graphs exhibit lower performace for the iterative mask based implementations emperically, as seen in Table IX. This behavior is not surprising, since this approach cannot exploit parallelism under such a senario. A maximum of two vertices can be processed in parallel in each iteration for a linear graph using the iterative mask based approach. Vertex list compaction, however, helps reduce running time by 25-40% in such cases. Large variation in degree also slows down the execution on an SIMD machine owing to uneven load per thread. This behavior is seen in all algorithms on R-MAT graphs (Figures 11(a), 11(b), 11(c) and 11(g)).

BFS (Figure 11(a)), STCON (Figure 11(b)) and SSSP (Figure 11(c)) show similar behavior on all graph models. We use vertex list compaction in BFS and SSSP leading to 40% reduction in time in case of R-MAT graphs. Running times for 100 iterations of randomly selected *s* and *t* are reported for STCON. The implementations are highly scalable and exhibit an almost linear respose in timings as the size of the graph is increased. R-MAT graphs, even after compaction,



(a) Breadth first search on varying number of vertices for synthetic graphs.



(b) st-Connectivity for varying number of vertices for synthetic graph models.



(c) Single source shortest path on varying number of vertices for synthetic graph models.



(d) Comparing APSP using SSSP, APSP using matrix multiplications and APSP using Gaussian elimination [10] approaches on a GTX280 and Tesla

2 3 4 5 6 7 4 Number of Vertices in millions, average degree 12, weights in range 1–100

(g) Maxflow results for varying number of

vertices for synthetic graph models

10

Milliseconds

Log plot of Time



(e) Comparing our matrix-based APSP approach with Katz and Kider [7] on a Quadro FX5600



(h) Maxflow behavior on Random and SSCA2 on a 1M vertex graph



(I) Minimum spanning tree results for varying number of vertices for synthetic graph models



(i) Maxflow behavior on RMAT on a 1M vertex graph

Fig. 11. Experiments on varying sizes and varying degree for three types of graph

- Random GP - RMAT GPU

SSCA2 GPU Random CPI RMAT CPU

SSCA2 CPL

perform badly on the GPU as compared to other graph models, conversely, however, they perform better on the CPU. Larger fanouts and low degrees lead to load imbalance and slow frontier expansion for R-MAT graphs.

SSSP converges much later than BFS because of its uneven frontier configuration. Atomic clash serialization, however, does not slow down SSSP implementation. This is because a maximum of warp size clashes can occur at a vertex using atomic operations, which is small and is fixed for a CUDA

## device.

Maxflow timings for various graphs are shown in Figure 11(g). We average timings over 5 iterations for randomly selected source s and sink t vertices. R-MAT GPU times out shoots the CPU times because of their low degree nature and slow convergence of local relabel thereof. The behavior of our max flow implementation for varying m and k, controlling the periodicity of the local and global relabeling steps (Algorithm 19), is given in Figure 11(h) and Figure 11(i)

for a 1M vertex graph. Random and SSCA#2 graphs show a similar behavior with time increasing with number of pushes for low or no local relabels. Time decreases as we apply more local relabels. We found for m = 3 and k = 7 the timing were optimal for Random and SSCA#2 graphs. R-MAT graphs however exhibit different behavior for varying m and k. For low local relabels the time increases with increasing number of pushes similar to Random and SSCA#2. However as local relabels are increased we see an increase in timings. This further reinforces the fact that low degree poses slow convergence of local relabels.

A speed up of nearly 5-7 times in case of R-MAT graphs and 15-20 times in case of Random and SSCA#2 graphs is observed over BGL for these implementations. Larger degree graphs benefit more using iterative mask based implementations as the expansion per iteration is more, resulting in better expansion of data and thus better performance.

#### D. The Divide-and-Conquer Approach

The MST algorithm is implemented using a divide-andconquer approach. Timings for minimum spanning tree implementation are summarized in Figure 11(f) for synthetic graphs. The divide-and-conquer approach is not affected by linearity of the graph, as each supervertex is processed in parallel independent to other supervertices and there is no frontier expansion. However, for R-MAT graphs we see a slowdown due to uneven loops over vertices with high degree, which prove inefficient on an SIMD model.

## E. All Pairs Shortest Paths (APSP)

The SSSP, matrix multiplication and Gaussian elimination APSP implementations are compared in Figure 11(d) on a GTX 280 and Tesla. The SSSP based solution uses iterative mask based approach where as Gaussian Elimination based approach due to Buluc et al. [10] is recursive divide-andconquer. The matrix multiplication based APSP uses graph representation outlined in Section VII. We stream data from CPU to GPU for graphs larger than 18K for this approach. As seen from the experiments, APSP using SSSP performs badly on all types of graph, but is a scalable solution for large, low-degree graphs. For smaller graphs, matrix approach proves much faster. We do not use lazy min for fully connected graphs as it becomes an overhead for them. We are able to process a fully connected 25K graph using streaming of matrix blocks in nearly 75 minutes on a single unit of Tesla S1070, which has similar compute power to that of GTX280, but with 4 times the memory. The Gaussian Elimination based APSP by Buluc et al. [10] is the fastest among the approaches. However, introducing the lazy minimum evaluation to their approach provides a further speed up of 2-3 as can be seen from Figure 11(d) and Table XI. For direct comparison with Katz and Kider [7], we also show results on Quadro FX 5600. Figure 11(e) summarizes the results of these experiments. In case of fully connected graphs we are 1.5 times slower than Katz and Kider up to the 10K graph. We achieve a speed up of 2-4 times over Katz and Kider for larger general graphs.

#### F. Scalability

Behavior of our implementations with varying degrees are summarized in Table VII. We show results for a 100Kvertex graph with varying degree. For APSP matrix based approach results for a 4K graph are shown. The running time increase with increasing degree in all cases, however, GPU implementations scale better than their CPU counterparts for all implementations. A sub-linear reduction in time is observed for GPU in contrast to a linear slow down on the CPU. The behavior can be explained based on the work done in these implementations, that is distributed over parallel threads resulting in a lower increase in time as compared to CPU. Table VIII shows results for the BFS, SSSP and MST implementations on low end graphics processors, the 8600GT with 32 stream processors and 256MB RAM and the 8800GT with 112 stream processors and 512MB of RAM. Inferring from experiments we can see both iterative mask and divideand-conquer approaches scale linearly with the number of stream processors on the CUDA device. We see a speedup of 2-3 times on 8600GT and 7-8 times for 8800GT over CPU for these approaches, which are entry level CUDA devices. Nvidia has integrated GPUs on motherboards which support CUDA processing, the 9400 series motherboards come with a low end GPU. Performance over CPU can be gained by porting algorithms to CUDA on such devices.

Results on the ninth DIMACS challenge [51] dataset are summarized in Table IX. GPU performs worse than CPU in most implementations for these inputs expect in case of minimum spanning tree. The divide and conquer approach is inert to linearity of the graph and thus performs better than the iterative mask based implementations. Average degree is 2-3makes these graphs almost linear and expand data minimally in each iteration the frontier based implementations.

#### XI. CONCLUSIONS

In this paper, we presented massively multithreaded algorithms on large graphs for the GPU using the CUDA model. Each operation was typically broken down using a BSP-like model into several kernels executing on the GPU, synchronized by the CPU. We showed results on random graphs and scale-free graphs that are inspired by real-life problems. The high performance demonstrated makes the GPUs attractive as co-processors to the CPU for several scientific and engineering tasks that are modeled as graph algorithms. The wide availability of the GPUs puts it within the reach of every user who needs high performance computing. Practical implementations that deliver superior performance will ease the adoption of GPUs by a wide range of users. The source code for our implementations will be made available on the web. This is likely to facilitate their adoption by other users, going by our experience on BFS and SSSP. In addition to the higher performance, we believe the approaches presented will be applicable to the multicore and manycore architectures that are in the pipeline from different manufacturers for graph algorithms.

TABLE VII

Scalability with varying degree on a 100K vertex graph, 4K for APSP, weights in range 1 - 100. Times in Milliseconds.

| Degree | gree BFS GPU/CPU |           |           | STCON GPU/CPU    |            |            | SSSP GPU/CPU        |            |            |
|--------|------------------|-----------|-----------|------------------|------------|------------|---------------------|------------|------------|
|        | Random           | R-MAT     | SSCA#2    | Random           | R-MAT      | SSCA#2     | Random              | R-MAT      | SSCA#2     |
| 100    | 15/420           | 91/280    | 7/160     | 0.8/1.1          | 1.47/14.3  | 1.04/5.8   | 169/260             | 305/190    | 120/170    |
| 200    | 48/800           | 122/460   | 13/290    | 1.5/1.1          | 2.36/18.9  | 1.11/8.7   | 375/380             | 400/250    | 237/220    |
| 400    | 125/1520         | 163/770   | 24/510    | 2.8/1.2          | 3.93/27.9  | 1.53/8.9   | 898/710             | 504/360    | 474/320    |
| 600    | 177/2300         | 182/1050  | 38/730    | 4.1/1.3          | 5.26/41.9  | 2.81/9.1   | 1449/-              | 587/430    | 683/410    |
| 800    | 253/3060         | 210/1280  | 67/980    | 5.5/1.5          | 6.85/55.4  | 2.589/9.8  | 1909/-              | 691/540    | 1042/520   |
| 1000   | 364/-            | -/-       | -/-       | 6.8/-            | -/-        | -/-        | 2157/-              | -/-        | -/-        |
| Degree | ree MST GPU/CPU  |           |           | Max Flow GPU/CPU |            |            | APSP Matrix GPU/CPU |            |            |
|        | Random           | R-MAT     | SSCA#2    | Random§          | R-MAT      | SSCA#2§    | Random              | R-MAT      | SSCA#2     |
| 100    | 302/12150        | 461/10290 | 122/7470  | 808/6751         | 3637/4950  | 345/3750   | 6111/30880          | 5450/19470 | 3400/16390 |
| 200    | 369/25960        | 638/22180 | 218/16700 | 2976/13430       | 6308/9230  | 615/7670   | 8100/53480          | 5875/27370 | 5253/25860 |
| 400    | 1149/-           | 849/-     | 347/-     | 10842/32900      | 8502/16570 | 1267/17360 | 9034/92700          | 6202/38070 | 7078/41580 |
| 600    | 1908/-           | 1103/-    | 499/-     | 14722/-          | 11238/-    | 6018/-     | 9123/102383         | 6317/41273 | 7483/48729 |
| 800    | 2484/-           | 1178/-    | 883/-     | 22489/-          | 14598/-    | 8033/-     | 9231/126830         | 6391/45950 | 7888/57460 |
| 1000   | 3338/-           | -/-       | -/-       | 32748/-          | -/-        | -/-        | 9613/167630         | 6608/54540 | 8309/68580 |

TABLE VIII

SCALABILITY OF BFS, SSSP AND MST ON LOWER END GPUS. TIMES IN MILLISECONDS. AVERAGE DEGREE 12.

| Number   | Time 8600GT/8800GT |          |         |          |          |           |        |           |           |  |  |
|----------|--------------------|----------|---------|----------|----------|-----------|--------|-----------|-----------|--|--|
| of       |                    | BFS      |         |          | SSSP     |           | MST    |           |           |  |  |
| vertices | Random             | RMAT     | SSCA#2  | Random   | RMAT     | SSCA#2    | Random | RMAT      | SSCA#2    |  |  |
| 1M       | 161/68             | 620/339  | 133/55  | 1733/919 | 1862/994 | 1879/1038 | -/1532 | 5682/3814 | 2461/1310 |  |  |
| 2M       | -/143              | 1237/642 | 270/114 | -/2016   | -/2527   | -/1803    | -/2820 | -/8002    | -/2731    |  |  |
| 3M       | -/223              | -/1235   | -/170   | -/-      | -/3880   | -/2690    | -/-    | -/-       | -/4029    |  |  |
| 4M       | -/299              | -/1546   | -/229   | -/-      | -/-      | -/3900    | -/-    | -/-       | -/-       |  |  |
| 5M       | -/370              | -/2139   | -/287   | -/-      | -/-      | -/-       | -/-    | -/-       | -/-       |  |  |

TABLE IX

Results on the ninth DIMACS challenge [51] graphs, weights in range 1 - 300K. Times in Milliseconds.

| Graphs with           | Vertices | Edges    | Time GPU/CPU |            |              |            |                       |  |
|-----------------------|----------|----------|--------------|------------|--------------|------------|-----------------------|--|
| distances as weights  |          |          | BFS          | STCON      | SSSP         | MST        | Max Flow <sup>§</sup> |  |
| New York              | 264346   | 733846   | 147/20       | 1.25/8.8   | 448/190      | 76/780     | 657/420               |  |
| San Fransisco Bay     | 321270   | 800172   | 199/20       | 2.2/11.3   | 623/230      | 85/870     | 1941/740              |  |
| Colorado              | 435666   | 1057066  | 414/30       | 2.36/15.9  | 1738/340     | 116/1280   | 3021/2770             |  |
| Florida               | 1070376  | 2712798  | 1241/80      | 5.02/37.7  | 4805/810     | 261/3840   | 6415/2810             |  |
| Northwest USA         | 1207945  | 2840208  | 1588/100     | 7.8/48.3   | 8071/1030    | 299/4290   | 11018/3720            |  |
| Northeast USA         | 1524453  | 3897636  | 2077/140     | 8.8/66.5   | 8563/1560    | 383/6050   | 18722/4100            |  |
| California and Nevada | 1890815  | 4657742  | 2762/180     | 9.4/100    | 11664/1770   | 435/7750   | 19327/4270            |  |
| Great Lakes           | 2758119  | 6885658  | 5704/240     | 19.8/114.7 | 32905/2730   | 671/12300  | 21915/6360            |  |
| Eastern USA           | 3598623  | 8778114  | 7666/400     | 24.4/183.8 | 41315/4140   | 1222/16280 | 70147/16920           |  |
| Western USA           | 6262104  | 15248146 | 14065/800    | 58/379.8   | 82247/8500   | 1178/32050 | 184477/25360          |  |
| Central USA           | 14081816 | 34292496 | 37936/3580   | 200/1691   | 215087/34560 | 3768/-     | 238151/-              |  |
| Full USA <sup>‡</sup> | 23947347 | 58333344 | 102302/-     | 860/-      | 672542/-     | 8348/-     | -/-                   |  |

<sup>‡</sup>Results taken on Tesla

<sup>§</sup>Max Flow results at m = 3 and k = 7

## ACKNOWLEDGMENT

We would like to thank the Nvidia Corporation for their generous support, especially for providing hardware used in this work. We would also like to acknowledge Georgia Tech Institute for their graph generating software. We would like to thank Yokesh Kumar for his help and discussions for various algorithms given as part of this work.

## APPENDIX

## TABLE X

## SUMMARY OF RESULTS FOR SYNTHETIC GRAPHS. TIMES IN MILLISECONDS

| Algo  | Graph                   |       | Nı    | umber of ' | Vertices, av | erage degree        | e 12, weight       | s varying f        | rom 1 to 1         | 00                |                   |
|-------|-------------------------|-------|-------|------------|--------------|---------------------|--------------------|--------------------|--------------------|-------------------|-------------------|
|       | Туре                    | 1M    | 2M    | 3M         | 4M           | 5M                  | 6M                 | 7M                 | 8M                 | 9M                | 10M               |
| BFS*  | Random GPU <sup>†</sup> | 38    | 82    | 132        | 184          | 251                 | 338                | 416                | 541                | 635 <sup>‡</sup>  | 678‡              |
|       | Random CPU              | 530   | 1230  | 2000       | 2710         | 3480                | 4290               | 5040               | 5800               | -                 | -                 |
|       | R-MAT $GPU^{\dagger}$   | 244   | 433   | 778        | 944          | 1429                | 1526               | 1969               | 2194               | 2339 <sup>‡</sup> | 3349 <sup>‡</sup> |
| БГЗ   | R-MAT CPU               | 340   | 760   | 1230       | 1680         | 2270                | 2760               | 3220               | 3620               | -                 | -                 |
|       | SSCA#2 GPU <sup>†</sup> | 30    | 62    | 95         | 142          | 178                 | 233                | 294                | 360                | 433 <sup>‡</sup>  | 564 <sup>‡</sup>  |
|       | SSCA#2 CPU              | 420   | 930   | 1460       | 2010         | 2550                | 3150               | 3710               | 4310               | -                 | -                 |
|       | Random GPU              | 1.42  | 3.06  | 4.28       | 5.34         | 6.62                | 7.37               | 9.96               | 10.8               | 11.15             | -                 |
|       | Random CPU              | 68    | 164   | 286        | 310          | 416                 | 536                | 692                | -                  | -                 | -                 |
| STCON | R-MAT GPU               | 19.2  | 32.37 | 172.1      | 347.4        | 408.3               | 579.1              | 626                | 1029               | -                 | -                 |
| SICON | R-MAT CPU               | 160   | 358   | 501        | 638          | 926                 | 1055               | 1288               | -                  | -                 | -                 |
|       | SSCA#2 GPU              | 1.96  | 3.76  | 5.33       | 5.44         | 7.23                | 8.08               | 9.1                | 12.33              | -                 | -                 |
|       | SSCA#2 CPU              | 78    | 176   | 286        | 422          | 552                 | 595                | 665                | -                  | -                 | -                 |
|       | Random GPU              | 116   | 247   | 393        | 547          | 698                 | 920                | 947                | 1140               | 1247              | 1535              |
|       | Random CPU              | 2330  | 5430  | 10420      | 18130        | -                   | -                  | -                  | -                  | -                 | -                 |
| SSSP  | R-MAT GPU <sup>†</sup>  | 576   | 1025  | 1584       | 1842         | 2561                | 3575               | 11334              | -                  | -                 | -                 |
| 5551  | R-MAT CPU               | 1950  | 4200  | 6700       | 11680        | -                   | -                  | -                  | -                  | -                 | -                 |
|       | SSCA#2 GPU              | 145   | 295   | 488        | 632          | 701                 | 980                | 1187               | 1282               | 1583              | 2198‡             |
|       | SSCA#2 CPU              | 2110  | 4490  | 6970       | 9550         | -                   | -                  | -                  | -                  | -                 | -                 |
|       | Random GPU <sup>†</sup> | 770   | 1526  | 2452       | 3498         | 4654                | 6424 <sup>‡</sup>  | 8670 <sup>‡</sup>  | 11125 <sup>‡</sup> | -                 | -                 |
|       | Random CPU              | 12160 | 26040 | -          | -            | -                   | -                  | -                  | -                  | -                 | -                 |
| MST   | R-MAT GPU <sup>†</sup>  | 2076  | 4391  | 5995       | 9102         | 10875               | 12852              | 15619 <sup>‡</sup> | 21278 <sup>‡</sup> | -                 | -                 |
| 14151 | R-MAT CPU               | 10230 | 22340 | -          | -            | -                   | -                  | -                  | -                  | -                 | -                 |
|       | SSCA#2 GPU <sup>†</sup> | 551   | 1174  | 1772       | 2970         | 4173                | 4879               | 7806 <sup>‡</sup>  | 9993‡              | -                 | -                 |
|       | SSCA#2 CPU              | 7540  | 15980 | 25230      | -            | -                   | -                  | -                  | -                  | -                 | -                 |
|       | Random GPU <sup>§</sup> | 598   | 3013  | 5083       | 7179         | 7323 <sup>‡</sup>   | 16871 <sup>‡</sup> | 30201 <sup>‡</sup> | 34253‡             | -                 | -                 |
|       | Random CPU              | 15390 | 33290 | -          | -            | -                   | -                  | -                  | -                  | -                 | -                 |
| MF    | R-MAT GPU               | 30743 | 55514 | 74767      | 148627       | 232789 <sup>‡</sup> | 311267‡            | -                  | -                  | -                 | -                 |
| 1911  | R-MAT CPU               | 8560  | 18770 | -          | -            | -                   | -                  | -                  | -                  | -                 | -                 |
|       | SSCA#2 GPU§             | 459   | 2548  | 2943       | 7388         | 8606 <sup>‡</sup>   | 12742 <sup>‡</sup> | -                  | -                  | -                 | -                 |
|       | SSCA#2 CPU              | 9760  | 20960 | -          | -            | -                   | -                  | -                  | -                  | -                 | -                 |

TABLE XI SUMMARY OF RESULTS FOR SYNTHETIC GRAPHS APSP APPROACHES. TIMES IN MILLISECONDS

| APSP       | Graph       |      |       | N     | umber o | of Vertic | es, averaș | ge degree | 12, weigl           | nts in range        | 1 - 100                |                        |
|------------|-------------|------|-------|-------|---------|-----------|------------|-----------|---------------------|---------------------|------------------------|------------------------|
|            | Туре        | 256  | 512   | 1024  | 2048    | 4096      | 9216       | 10240     | 11264               | 18K                 | 25K                    | 30K                    |
| Using      | Random      | 499  | 1277  | 3239  | 7851    | 18420     | 56713      | 65375     | 77265               | 160608              | 316078                 | 556313                 |
| SSSP       | R-MAT       | 489  | 1531  | 4145  | 12442   | 38812     | 143991     | 170121    | 211277              | 465037              | 1028275                | 1362119                |
| GTX 280    | SSCA#2      | 469  | 1300  | 3893  | 7677    | 17450     | 50498      | 58980     | 67794               | 163081              | 353166                 | 461901                 |
|            | Random      | 2.77 | 11.3  | 55.7  | 330.4   | 2240.8    | 41150      | 58889     | 72881 <sup>‡</sup>  | 244264 <sup>‡</sup> | 1724970 <sup>‡</sup> ¶ | 3072443 <sup>‡</sup> ¶ |
| Matrix     | R-MAT       | 2.54 | 10.9  | 66.2  | 478     | 3756      | 32263      | 56906     | 71035 <sup>‡</sup>  | 339188 <sup>‡</sup> | 1152989 <sup>‡</sup> ¶ | 4032675 <sup>‡</sup> ¶ |
| GTX 280    | SSCA#2      | 2.55 | 8.6   | 42.9  | 263.6   | 2063.7    | 43045      | 62517     | 76399‡              | 220868‡             | 1360469‡ ¶             | 1872394‡¶              |
|            | Fully Conn. | 2.9  | 15.2  | 112   | 959     | 8363      | 110658     | 151820    | 202118 <sup>‡</sup> | 799035 <sup>‡</sup> | 4467455 <sup>‡</sup> ¶ | -                      |
|            | Random      | 3.25 | 21.3  | 136.3 | 827.9   | 5548      | 65552      | 87043     | 113993              | 1048598¶            | -                      | -                      |
| Matrix     | R-MAT       | 2.99 | 22    | 174.8 | 1307.6  | 10534     | 94373      | 115294    | 137854              | 1487025¶            | -                      | -                      |
| FX 5600    | SSCA#2      | 2.91 | 15.78 | 103.2 | 635.3   | 4751      | 62555      | 82708     | 109744              | 1001212¶            | -                      | -                      |
|            | Fully Conn. | 2.9  | 25.1  | 221.5 | 1941    | 16904     | 268757     | 368397    | 490157              | 4300447¶            | -                      | -                      |
| GE Based   | Random      | 1.8  | 4.4   | 12.1  | 44.9    | 230       | 1505       | -         | -                   | -                   | -                      | -                      |
| Lazy Min   | R-MAT       | 1.8  | 4.5   | 12.1  | 45      | 230       | 1505       | -         | -                   | -                   | -                      | -                      |
| GTX 280    | SSCA#2      | 1.8  | 4.5   | 12.1  | 44      | 230       | 1497       | -         | -                   | -                   | -                      | -                      |
| GE Based   | Random      | 2.18 | 6     | 19.64 | 96.5    | 639       | 3959       | -         | -                   | -                   | -                      | -                      |
| Buluc [10] | R-MAT       | 1.86 | 5.9   | 19.1  | 96      | 638       | 3959       | -         | -                   | -                   | -                      | -                      |
| GTX 280    | SSCA#2      | 2.18 | 5.9   | 19.67 | 96      | 638       | 3965       | -         | -                   | -                   | -                      | -                      |
| Katz [7]   | -           | 7.7  | 34.9  | 230.1 | 1735.6  | 13720     | 158690     | 216400    | 1015700             | -                   | -                      | -                      |

\*CPU implementation is ours <sup>†</sup>Using Compaction process <sup>‡</sup>Results taken on a Tesla S1070 <sup>§</sup>Max Flow results at m = 3 and k = 7<sup>¶</sup>Results using streaming from CPU to GPU memory

#### REFERENCES

- T. Lengauer and R. E. Tarjan, "A fast algorithm for finding dominators in a flowgraph," ACM Trans. Program. Lang. Syst., vol. 1, no. 1, pp. 121–141, 1979.
- [2] J. D. Cho, S. Raje, and M. Sarrafzadeh, "Fast Approximation Algorithms on Maxcut, k-Coloring, and k-Color Ordering for VLSI Applications," *IEEE Transactions on Computers*, vol. 47, no. 11, pp. 1253–1266, 1998.
- [3] D. A. Bader and K. Madduri, "Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2." in *ICPP*, 2006, pp. 523–530.
- [4] D. A. Bader and K. Madduri, "Parallel Algorithms for Evaluating Centrality Indices in Real-world Networks," in *ICPP '06: Proceedings* of the 2006 International Conference on Parallel Processing, 2006, pp. 539–550.
- [5] A. Yoo, E. Chow, K. Henderson, W. McLendon, B. Hendrickson, and U. Catalyurek, "A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L," in SC '05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing, 2005, p. 25.
- [6] A. Munshi, "OpenCL: Parallel computing o the GPU and CPU," in SIGGRAPH, Tutorial, 2008.
- [7] G. J. Katz and J. T. Kider, Jr, "All-pairs shortest-paths for large graphs on the GPU," in GH '08: Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware, 2008, pp. 47–55.
- [8] "Rodinia: Accelerating Compute-Intensive Applications with Accelerators." [Online]. Available: www.cs.virginia.edu/~skadron/wiki/rodinia
- [9] Nvidia, "Nvidia CUDA Programming Guide 2.0." [Online]. Available: http://www.nvidia.com/object/cuda\_develop.html
- [10] A. Buluc, J. R. Gilbert, and C. Budak, "Gaussian Elimination Based Algorithms on the GPU," Tech. Rep., November 2008.
- [11] J. Hyvonen, J. Saramaki, and K. Kaski, "Efficient data structures for sparse network representation," *Int. J. Comput. Math.*, vol. 85, no. 8, pp. 1219–1233, 2008.
- [12] A. Lefohn, J. M. Kniss, R. Strzodka, S. Sengupta, and O. J. D., "Glift: Generic, Efficient, Random-Access GPU Data Structures," ACM Transactions on Graphics, vol. 25, no. 1, pp. 60–99, Jan 2006.
- [13] N. K. Govindaraju, S. Larsen, J. Gray, and D. Manocha, "Memory—A memory model for scientific algorithms on graphics processors," in SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, 2006, p. 89.
- [14] S. Sengupta, M. Harris, Y. Zhang, and O. J. D., "Scan primitives for GPU computing," in GH 07: Proceedings of the 22nd ACM SIG-GRAPH/EUROGRAPHICS symposium on Graphics hardware, 2007, pp. 97–106.
- [15] D. P. Scarpazza, O. Villa, and F. Petrini, "Efficient Breadth-First Search on the Cell/BE Processor," *IEEE Trans. Parallel Distrib. Syst.*, vol. 19, no. 10, pp. 1381–1395, 2008.
- [16] Y. Zhang and E. Hansen, "Parallel Breadth-First Heuristic Search on a Shared-Memory Architecture," in AAAI-06 Workshop on Heuristic Search, Memory-Based Heuristics and Their Applications, 2006.
- [17] E. W. Dijkstra, "A note on two problems in connexion with graphs." *Numerische Mathematik*, vol. 1, pp. 269–271, 1959.
- [18] A. Crauser, K. Mehlhorn, and U. Meyer, "A parallelization of dijkstra's shortest path algorithm," in *In Proc. 23rd MFCS'98, Lecture Notes in Computer Science*, 1998, pp. 722–731.
- [19] A. S. Nepomniaschaya and M. A. Dvoskina, "A simple implementation of Dijkstra's shortest path algorithm on associative parallel processors," *Fundam. Inf.*, vol. 43, no. 1-4, pp. 227–243, 2000.
- [20] P. J. Narayanan, "Single Source Shortest Path Problem on Processor Arrays," in Proceedings of the Fourth IEEE Symposium on the Frontiers of Massively Parallel Computing, 1992, pp. 553–556.
- [21] A. Crauser, K. Mehlhorn, and U. Meyer, "A parallelization of Dijkstra's shortest path algorithm," in *In Proc. 23rd MFCS'98, Lecture Notes in Computer Science.* Springer, 1998, pp. 722–731.
- [22] U. Meyer and P. Sanders, "Δ-Stepping : A Parallel Single Source Shortest Path Algorithm," in *In ESA 98: Proceedings of the 6th Annual European Symposium on Algorithms*. Springer-Verlag, 1998, pp. 393– 404.
- [23] T. Takaoka, "An Efficient Parallel Algorithm for the All Pairs Shortest Path Problem," in WG '88: Proceedings of the 14th International Workshop on Graph-Theoretic Concepts in Computer Science, 1989, pp. 276–287.
- [24] M. Pal and G. P. Bhattacharjee, "An optimal parallel algorithm for all-pairs shortest paths on unweighted interval graphs," *Nordic J. of Computing*, vol. 4, no. 4, pp. 342–356, 1997.

- [25] Y. Han, V. Y. Pan, and J. H. Reif, "Efficient Parallel Algorithms for Computing all Pair Shortest Paths in Directed Graphs," in ACM Symposium on Parallel Algorithms and Architectures, 1992, pp. 353– 362.
- [26] P. Micikevicius, "General Parallel Computation on Commodity Graphics Hardware: Case Study with the All-Pairs Shortest Paths Problem." in *International Conference on Parallel and Distributed Processing Techniques and Applications*, 2004, pp. 1359–1365.
- [27] P. Harish and P. J. Narayanan, "Accelerating Large Graph Algorithms on the GPU Using CUDA." in *HiPC*, ser. Lecture Notes in Computer Science, vol. 4873, 2007, pp. 197–208.
- [28] G. Venkataraman, S. Sahni, and S. Mukhopadhyaya, "A Blocked All-Pairs Shortest-Paths Algorithm," *Journal of Experimental Algorithmics*, vol. 8, p. 2003, 2003.
- [29] V. Volkov and J. Demmel, "LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs," EECS Department, University of California, Berkeley, Tech. Rep., May 2008.
- [30] P. D'Alberto and A. Nicolau, "R-kleene: A high-performance divideand-conquer algorithm for the all-pair shortest path for densely connected networks," *Algorithmica*, vol. 47, no. 2, pp. 203–213, 2007.
- [31] B. Chazelle, "A minimum spanning tree algorithm with inverse-Ackermann type complexity," J. ACM, vol. 47, no. 6, pp. 1028–1047, 2000.
- [32] O. Boruvka, "O Jistém Problému Minimálním (About a Certain Minimal Problem) (in Czech, German summary)," *Práce Mor. Prírodoved. Spol. v Brne III*, vol. 3, 1926.
- [33] V. King, C. K. Poon, V. Ramachandran, and S. Sinha, "An optimal EREW PRAM algorithm for minimum spanning tree verification," *Inf. Process. Lett.*, vol. 62, no. 3, pp. 153–159, 1997.
- [34] K. W. Chong, Y. Han, and T. W. Lam, "Concurrent threads and optimal parallel minimum spanning trees algorithm," *J. ACM*, vol. 48, no. 2, pp. 297–323, 2001.
- [35] D. A. Bader and G. Cong, "A fast, parallel spanning tree algorithm for symmetric multiprocessors (SMPs)," J. Parallel Distrib. Comput., vol. 65, no. 9, pp. 994–1006, 2005.
- [36] S. Chung and A. Condon, "Parallel Implementation of Borvka's Minimum Spanning Tree Algorithm," in *IPPS '96: Proceedings of the 10th International Parallel Processing Symposium*, 1996, pp. 302–308.
- [37] F. Dehne and S. Götz, "Practical Parallel Algorithms for Minimum Spanning Trees," in SRDS '98: Proceedings of the The 17th IEEE Symposium on Reliable Distributed Systems, 1998, p. 366.
- [38] D. B. Johnson and P. Metaxas, "A parallel algorithm for computing minimum spanning trees," J. Algorithms, vol. 19, no. 3, pp. 383–401, 1995.
- [39] V. Vineet, P. Harish, S. Patidar, and P. J. Narayanan, "Fast minimum spanning tree for large graphs on the GPU," in *HPG '09: Proceedings* of the Conference on High Performance Graphics 2009, 2009, pp. 167– 171.
- [40] A. V. Goldberg and S. Rao, "Beyond the flow decomposition barrier," J. ACM, vol. 45, no. 5, pp. 783–797, 1998.
- [41] L. R. Ford and D. R. Fulkerson, "Maximal Flow through a Network," *Canadian Journal of Mathematics*, vol. 8, no. 3, pp. 399–404, 1956.
- [42] J. Edmonds and R. M. Karp, "Theoretical Improvements in Algorithmic Efficiency for Network Flow Problems," J. ACM, vol. 19, no. 2, pp. 248–264, 1972.
- [43] A. V. Goldberg and R. E. Tarjan, "A new approach to the maximum flow problem," in STOC '86: Proceedings of the eighteenth annual ACM symposium on Theory of computing, 1986, pp. 136–146.
- [44] R. J. Anderson and J. C. Setubal, "On the parallel implementation of Goldberg's maximum flow algorithm," in SPAA '92: Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures, 1992, pp. 168–177.
- [45] D. A. Bader and V. Sachdeva, "A Cache-Aware Parallel Implementation of the Push-Relabel Network Flow Algorithm and Experimental Evaluation of the Gap Relabeling Heuristic," in *ISCA PDCS*, 2005, pp. 41–48.
- [46] F. Alizadeh and A. V. Goldberg, "Experiments with the push-relabel method for the maximum flow problem on a connection machine," in DIMACS Implementation Challenge Workshop: Network Flows and Matching, Tech. Rep., vol. 92, no. 4, 1991, pp. 56–71.
- [47] M. Hussein, A. Varshney, and L. Davis, "On Implementing Graph Cuts on CUDA," in *First Workshop on General Purpose Processing* on Graphics Processing Units, 2007.
- [48] V. Vineet and P. Narayanan, "CudaCuts: Fast graph cuts on the GPU," in CVGPU08, 2008, pp. 1–8.

- [49] N. Edmonds, A. Breuer, D. Gregor, and A. Lumsdaine, "Single-Source Shortest Paths with the Parallel Boost Graph Library," in *The Ninth DIMACS Implementation Challenge: The Shortest Path Problem*, November 2006.
- [50] J. R. Crobak, J. W. Berry, K. Madduri, and D. A. Bader, "Advanced shortest paths algorithms on a massively-multithreaded architecture," *Parallel and Distributed Processing Symposium, International*, vol. 0, p. 497, 2007.
- [51] "The Ninth DIMACS implementation challange on shortest paths." [Online]. Available: http://www.dis.uniroma1.it/~challenge9/
- [52] D. A. Bader and K. Madduri, "GTgraph: A Synthetic Graph Generator Suite," Tech. Rep., 2006.
- [53] D. Chakrabarti, Y. Zhan, and C. Faloutsos, "R-MAT: A recursive model for graph mining," in *In SIAM International Conference on Data Mining*, 2004.
- [54] D. A. Bader and K. Madduri, "Design and Implementation of the HPCS Graph Analysis Benchmark on Symmetric Multiprocessors." in *HiPC*, ser. Lecture Notes in Computer Science, vol. 3769, 2005, pp. 465–476.