Parallel Computing using CPU and GPU
Introduction
Commodity graphics hardware has become a costeffective parallel platform to solve many general problems; owing to its low cost to performance ratio (under $0.5 per GFLOP). GPUs are optimized for graphics operations and their programming model is highly restrictive. All algorithms are disguised as graphics rendering passes with the programmable shaders interpreting the data. This was the situation until the latest model of GPUs following the Shader Model 4.0 were released late in 2006. These GPUs follow a unified architecture and can be used in more flexible ways than their predecessors. Starting from the G80 series of GPUs, Nvidia offers an alternate programming model called Compute Unified Device Architecture (CUDA) to the underlying parallel processor. CUDA is highly suited for general purpose programming on the GPUs and provides a model close to the PRAM model. There has been a tremendous amount of growth in both the programmability and efficiency of GPUs over the years, leading to many GPU based solutions for general purpose problems.
Projects involving GPU's and CUDA
1. String Sort on GPUs
String sorting or variablelength key sorting has lagged in performance on the GPU even as the fixedlength key sorting has improved dramatically. Radix sorting is the fastest on the GPUs. In this work, we present a fast and efficient string sort on the GPU that is built on the available radix sort. Over 70% of the string sort time is spent on standard Thrust primitives for sort, scatter and scan. This provides high performance along with high adaptability to future GPUs. Our string sort algorithm is 10x faster compared to current GPU methods.
2. MixedResolution Patch Matching
Matching patches of a source image with patches of itself or a target image is a first step for many operations. Finding the optimum nearestneighbors of each patch using a global search of the image is expensive. Optimality is often sacrificed for speed as a result. In this work, we developed the MixedResolution PatchMatching (MRPM) algorithm that uses a pyramid representation to perform fast global search. We compare mixedresolution patches at coarser pyramid levels to alleviate the effects of smoothing. We store more matches at coarser resolutions to ensure wider search ranges and better accuracy at finer levels. Our method achieves near optimality in terms of exhaustive search. Our simple approach enables fast parallel implementations on the GPU, yielding upto 70x speedup compared to other iterative approaches.
3. Scalable KMeans Clustering
KMeans is a popular clustering algorithm with wide applications in Computer Vision, Data mining, Data Visualization, etc. Clustering large numbers of highdimensional vectors is very computation intensive. In this work, we present the design and implementation of the KMeans clustering algorithm on the modern GPU. A load balanced multinode, multiGPU implementation which can handle up to 6 million, 128dimensional vectors was also developed. Our implementation scales linearly or nearlinearly with different problem parameters. We achieve up to 2 times increase in speed compared to the best GPU implementation for KMeans on a single GPU.
4. Error Diffusion Dithering
Many image filtering operations provide ample parallelism, but progressive nonlinear processing of images is among the hardest to parallelize due to long, sequential, and nonlinear data dependency. A typical example of such an operation is error diffusion dithering, exemplified by the FloydSteinberg algorithm. In this work, we present its parallelization on multicore CPUs using a blockbased approach and on the GPU using a pixelbased approach. We also develop a hybrid approach in which the CPU and the GPU operate in parallel during the computation. Our implementation can dither an 8K x 8K image on an offtheshelf laptop with Nvidia 8600M GPU in about 400 milliseconds when the sequential implementation on its CPU took about 4 seconds.
5. Graph Algorithms on CUDA
We have developed several basic graph algorithms on the CUDA architecture including BFS, Single Source Shortest Path(SSSP), AllPairs Shortest Path(APSP), and Minimum Spanning Tree computation for large graphs consisting of millions of vertices and edges. We show results on random, scale free and almost linear graphs. Our approaches are 1050 times faster than their CPU counterparts, on random graphs with an average degree of 6 per vertex.
We have also developed efficient parallel implementations for performing GraphCuts on grid graphs in CUDA. Our original work, titled CUDACuts is based on the pushrelabel operation for computing the maxflow and hence the mincut of a graph. We have also developed a multiresolution framework for maxflow computation which depends on shrinkexpand steps in order to solve the graph fully at lower resolution levels, and uses these results to initialize higher resolution graphs closer to convergence. This improves the overall convergence time for the operation.
Related Publications
Aditya Deshpande and P. J. Narayanan  Can GPUs Sort Strings Efficiently ? Proceedings of the IEEE Conference on High Performance Computing, 1821 Dec. 2013, Bangalore, India. [PDF]
Harshit Surekha and P J Narayanan  MixedResolution PatchMatching Proceedings of 12th European Conference on Computer Vision, 713 Oct. 2012, Vol. ECCV 2012, PartVI, LNCS 7577, pp. 187198, Firenze, Italy. [PDF]
Parikshit Sakurikar and P J Narayanan  Fast Graph Cuts using ShrinkExpand Reparameterization Proceedings of IEEE Workshop on Applications of Computer Vision 911 Jan. 2012, ISSN 15505790 EISBN 9781467302326, Print ISBN 9781467302333, pp. 6571, Breckenridge, CO, USA. [PDF]
Aditya Deshpande, Ishan Misra and P J Narayanan  Hybrid Implementation of Error Diffusion Dithering Proceedings of 18th International Conference on High Performance Computing 1821 Dec. 2011, EISBN 9781457719493, Print ISBN 9781457719516, pp. 110, Bangalore, India. [PDF]
K. Wasif Mohiuddin and P.J. Narayanan  Scalable Clustering Using Multiple GPUs Proceedings of International Conference on High Performance Computing 1821 Dec. 2011, EISBN 9781457719493, Print ISBN 9781457719516, Bangalore, India. [PDF]
Vibhav Vineet and P. J. Narayanan  CUDA Cuts: Fast Graph Cuts on the GPU Proceedings of CVPR Workshop on Visual Computer Vision on GPUs, 2328th june, Anchorage, Alaska, USA. IEEE Computer Society 2008 [PDF]
Pawan Harish and P.J. Narayanan  Accelerating Large Graph Algorithms on the GPU using CUDA Proc of IEEE International Conference on High Performance Computing (HiPC 2007) Goa, December, 2007. [PDF]
Associated People


