TY - GEN
T1 - Heterogeneous parallel 3D image deconvolution on a cluster of GPUs and CPUs
AU - Domanski, L.
AU - Bednarz, T.
AU - Vallotton, P.
AU - Taylor, J.
PY - 2011
Y1 - 2011
N2 - This paper presents a heterogeneous computing algorithm for 3D Richardson-Lucy image deconvolution applicable for use on single heterogeneous workstations, all the way up to large distributed memory clusters consisting of many heterogeneous nodes. We demonstrate our solution on a cluster of nodes containing multiple CPU cores and GPUs. The algorithm uses a combination of message passing and massively-multicore programming technologies to achieve nested levels of parallelism, ranging from coarse-grained domain decomposition across worker processes to more fine-grained parallelism within worker processes utilising GPUs. The work distribution and worker framework is abstracted from the type of processor architecture used for core algorithm calculation by different worker processes. Allocation of computational resources (different processors or cores) to workers is handled collaboratively by the worker processes on each cluster node using efficient operating-system-level counting semaphores, avoiding the need to manage computational resources centrally on the cluster. The tested implementation utilises MPI (Message Passing Interface) for parallelisation across the cluster, CUFFT and custom-written kernels for parallelisation of algorithm components on the GPU, and the highly tuned MKL math library for computations on the CPU. Results show that utilising a collection of different processor types on available nodes can provide performance benefits over the use of a single type alone. It is common to find heterogeneous workstations with a smaller number of high-performance accelerator processors than general-purpose processor cores. In these cases, when considering the number of cluster nodes utilised versus performance, using all available processors on a node generally provides a performance gain whilst using the same number of nodes, or allows us to achieve similar performance using fewer nodes.
We discuss situations where using multiple processor types at once can inhibit performance, and make recommendations on when such an approach would or would not be advantageous.
AB - This paper presents a heterogeneous computing algorithm for 3D Richardson-Lucy image deconvolution applicable for use on single heterogeneous workstations, all the way up to large distributed memory clusters consisting of many heterogeneous nodes. We demonstrate our solution on a cluster of nodes containing multiple CPU cores and GPUs. The algorithm uses a combination of message passing and massively-multicore programming technologies to achieve nested levels of parallelism, ranging from coarse-grained domain decomposition across worker processes to more fine-grained parallelism within worker processes utilising GPUs. The work distribution and worker framework is abstracted from the type of processor architecture used for core algorithm calculation by different worker processes. Allocation of computational resources (different processors or cores) to workers is handled collaboratively by the worker processes on each cluster node using efficient operating-system-level counting semaphores, avoiding the need to manage computational resources centrally on the cluster. The tested implementation utilises MPI (Message Passing Interface) for parallelisation across the cluster, CUFFT and custom-written kernels for parallelisation of algorithm components on the GPU, and the highly tuned MKL math library for computations on the CPU. Results show that utilising a collection of different processor types on available nodes can provide performance benefits over the use of a single type alone. It is common to find heterogeneous workstations with a smaller number of high-performance accelerator processors than general-purpose processor cores. In these cases, when considering the number of cluster nodes utilised versus performance, using all available processors on a node generally provides a performance gain whilst using the same number of nodes, or allows us to achieve similar performance using fewer nodes.
We discuss situations where using multiple processor types at once can inhibit performance, and make recommendations on when such an approach would or would not be advantageous.
KW - GPU
KW - Heterogeneous computing
KW - High performance computing
KW - Image deconvolution
KW - Image restoration
KW - Parallel programming
UR - http://www.scopus.com/inward/record.url?scp=84858802846&partnerID=8YFLogxK
M3 - Conference contribution
SN - 9780987214317
T3 - MODSIM 2011 - 19th International Congress on Modelling and Simulation - Sustaining Our Future: Understanding and Living with Uncertainty
SP - 613
EP - 619
BT - MODSIM 2011 - 19th International Congress on Modelling and Simulation - Sustaining Our Future
T2 - 19th International Congress on Modelling and Simulation - Sustaining Our Future: Understanding and Living with Uncertainty, MODSIM2011
Y2 - 12 December 2011 through 16 December 2011
ER -