Heterogeneous parallel 3D image deconvolution on a cluster of GPUs and CPUs

L. Domanski*, T. Bednarz, P. Vallotton, J. Taylor

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

2 Citations (Scopus)

Abstract

This paper presents a heterogeneous computing algorithm for 3D Richardson-Lucy image deconvolution that scales from single heterogeneous workstations up to large distributed-memory clusters consisting of many heterogeneous nodes. We demonstrate our solution on a cluster of nodes containing multiple CPU cores and GPUs. The algorithm combines message passing and massively multicore programming technologies to achieve nested levels of parallelism, ranging from coarse-grained domain decomposition across worker processes to finer-grained parallelism within worker processes that utilise GPUs. The work distribution and worker framework are abstracted from the type of processor architecture that each worker process uses for the core algorithm calculation. Allocation of computational resources (different processors or cores) to workers is handled collaboratively by the worker processes on each cluster node using efficient operating-system-level counting semaphores, avoiding the need to manage computational resources centrally on the cluster. The tested implementation utilises MPI (Message Passing Interface) for parallelisation across the cluster, CUFFT and custom-written kernels for parallelisation of algorithm components on the GPU, and the highly tuned MKL math library for computations on the CPU. Results show that utilising a collection of different processor types on available nodes can provide performance benefits over the use of a single type alone. It is common to find heterogeneous workstations with fewer high-performance accelerator processors than general-purpose processor cores. In these cases, when considering the number of cluster nodes utilised versus performance, using all available processors on a node generally either provides a performance gain for the same number of nodes or achieves similar performance using fewer nodes. We discuss situations where using multiple processor types at once can inhibit performance, and make recommendations on when such an approach would or would not be advantageous.
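
The decentralised resource allocation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses Python `threading` semaphores within one process as a stand-in for operating-system-level counting semaphores shared between worker processes, and the names `NodeResources`, `claim`, and `run_workers`, as well as the "try a GPU first, otherwise take a CPU core" policy, are assumptions made for illustration.

```python
import threading


class NodeResources:
    """Hypothetical per-node pool: one counting semaphore per processor type."""

    def __init__(self, n_gpus, n_cpu_cores):
        self.gpu_slots = threading.Semaphore(n_gpus)
        self.cpu_slots = threading.Semaphore(n_cpu_cores)

    def claim(self):
        """Take a GPU if one is free, otherwise wait for a free CPU core."""
        # Assumed policy: prefer the accelerator, fall back to a CPU core.
        if self.gpu_slots.acquire(blocking=False):
            return "gpu"
        self.cpu_slots.acquire()
        return "cpu"

    def release(self, kind):
        """Return a previously claimed processor to the pool."""
        slots = self.gpu_slots if kind == "gpu" else self.cpu_slots
        slots.release()


def run_workers(n_workers, n_gpus, n_cpu_cores):
    """Run workers that each claim one processor; return the sorted claims."""
    if n_workers > n_gpus + n_cpu_cores:
        raise ValueError("more workers than processors would block forever")

    pool = NodeResources(n_gpus, n_cpu_cores)
    claimed = []
    claimed_lock = threading.Lock()
    everyone_claimed = threading.Barrier(n_workers)

    def worker():
        kind = pool.claim()
        with claimed_lock:
            claimed.append(kind)
        # Hold the processor until every worker has claimed one, so the
        # outcome does not depend on thread scheduling.
        everyone_claimed.wait()
        pool.release(kind)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(claimed)


if __name__ == "__main__":
    # Six workers on a node with 2 GPUs and 4 CPU cores.
    print(run_workers(6, 2, 4))
```

No worker consults a central scheduler: each one simply decrements whichever semaphore still has a free slot. In a multi-process setting, the same pattern would need semaphores visible to all worker processes on the node, for example named POSIX semaphores.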

Original language: English
Title of host publication: MODSIM 2011 - 19th International Congress on Modelling and Simulation - Sustaining Our Future
Subtitle of host publication: Understanding and Living with Uncertainty
Pages: 613-619
Number of pages: 7
Publication status: Published - 2011
Externally published: Yes
Event: 19th International Congress on Modelling and Simulation - Sustaining Our Future: Understanding and Living with Uncertainty, MODSIM2011 - Perth, WA, Australia
Duration: 12 Dec 2011 - 16 Dec 2011

Publication series

Name: MODSIM 2011 - 19th International Congress on Modelling and Simulation - Sustaining Our Future: Understanding and Living with Uncertainty

Conference

Conference: 19th International Congress on Modelling and Simulation - Sustaining Our Future: Understanding and Living with Uncertainty, MODSIM2011
Country/Territory: Australia
City: Perth, WA
Period: 12/12/11 - 16/12/11
