Combining Data Parallelism and Task Parallelism for Efficient Performance on Hybrid CPU and GPU Systems
Aditya Deshpande (homepage)
In earlier times, computer systems had only a single core or processor. In these computers, the number of transistors on-chip (i.e. on the processor) doubled every two years and all applications enjoyed free speedup. Subsequently, with more and more transistors being packed on-chip, power consumption became an issue, frequency scaling reached its limits and industry leaders eventually adopted the paradigm of multi-core processors. Computing platforms of today have multiple cores and are parallel. CPUs have multiple identical cores. A GPU with dozens to hundreds of simpler cores is present on many systems. In future, other multiple core accelerators may also be used.
With the advent of multiple core processors, the responsibility of extracting high performance from these parallel platforms shifted from computer architects to application developers and parallel algorithmists. Tuned parallel implementations of several mathematical operations, algorithms on graphs or matrices on multi-core CPUs and on many-core accelerators like the GPU and CellBE, and their combinations were developed. Parallel algorithms developed for multi-core CPUs primarily focussed on decomposing the problem into a few independent chunks and using the cache efficiently. As an alternative to CPUs, Graphics Processing Units (GPUs) were the other most cost-effective and massively parallel platforms, that were widely available. Frequently used algorithmic primitives such as sort, scan, sparse matrix vector multiplication, graph traversals, image processing operations etc. among others were efficiently implemented on GPU using CUDA. These parallel algorithms on the GPU decomposed the problem into a sequence of many independent steps operating on different data elements and used shared memory effectively.
But the above operations -- statistical, or on graphs, matrices and list etc. -- constitute only portions of an end-to-end application and in most cases these operations also provide some inherent parallelism (task or data parallelism). The problems which lack such task or data parallelism are still difficult to map to any parallel platform, either CPU or GPU. In this thesis, we consider a few such difficult problems -- like Floyd-Steinberg Dithering (FSD) and String Sorting -- that do not have trivial data parallelism and exhibit strong sequential dependence or irregularity. We show that with appropriate design principles we can find data parallelism or fine-grained parallelism even for these tough problems. Our techniques to break sequentiality and addressing irregularity can be extended to solve other difficult data parallel problems in the future. On the problem of FSD, our data parallel approach achieves a speedup of 10X on high-end GPUs and a speedup of about 3-4X on low-end GPUs, whereas previous work by Zhang et al. dismiss the same algorithm as lacking enough parallelism for GPUs. On string sorting, we achieve a speedup of around 10-19X as compared to state-of-the-art GPU merge sort based methods and our code will be available as part of standard GPU Library (CUDPP).
It is not enough to have a truly fine-grained parallel alogrithm for only a few operations. Any end-to-end application consists of many operations, some of which are difficult to execute on a fine-grained parallel platform like GPU. At the same time, computing platforms consist of CPU and a GPU which have complementary attributes. CPUs are suitable for some heavy processing by only a few threads i.e. they prefer task parallelism. GPUs is more suited for applications where large amount of data parallel operations are performed. Applications can achieve optimal performance by combining data parallelism on GPU with task parallelism on CPU. In this thesis, we examine two methods of combining data parallelism and task parallelism on a hybrid CPU and GPU computer system: (i) pipelining and (ii) work sharing. For pipelining, we study the Burrows Wheeler Compression (BWC) implementation in Bzip2 and show that best performance can be achieved by pipelining its different stages effectively. In contrast, a previous GPU implementation of BWC by Patel et al. performed all the tasks (BWT, MTF and Huffman encoding) on the GPU and it was 2.78X slower than CPU. Our hybrid BWC pipeline performs about 2.9X better than CPU BWC and thus, about 8X faster than Patel et al. For work sharing, we use FSD as an example and split the data parallel step between CPU and GPU. The Handover and Hybrid FSD algorithms, which use work sharing to exploit computation resources on both CPU and GPU, are faster than the CPU alone and GPU alone parallel algorithms.
In conclusion, we develop data parallel algorithms on the GPU for difficult problems of Floyd-Steinberg Dithering, String Sorting and Burrows Wheeler Transform. In earlier literature, simpler problems which provided some degree of data parallelism were adapted to the GPUs. The problems we solve on GPU involve challenging sequential dependency and/or irregularity. We show that in addition to developing fast data parallel algorithms on GPU, application developers should also use the CPU to execute tasks in parallel with GPU. This allows an application to fully utilize all resources of an end-user's system and provides them with maximum performance. With computing platforms poised to be predominantly hetergoneous, the use of our design principles will prove critical in obtaining good application level performance on these platforms. (more...)
|Year of completion:||July 2014|
|Advisor :||Prof. P. J. Narayanan|
Aditya Deshpande, Ishan Misra and P J Narayanan - Hybrid Implementation of Error Diffusion Dithering Proceedings of 18th International Conference on High Performance Computing 18-21 Dec. 2011, E-ISBN 978-1-4577-1949-3, Print ISBN 978-1-4577-1951-6, pp. 1-10, Bangalore, India. [PDF]
- Aditya Deshpande and P. J. Narayanan - Fast Burrows Wheeler Compression Using CPU and GPU (Under Review, ACM TOPC).