Acceleration of a Python-based tsunami modelling application via CUDA and OpenHMPP

Zhe Weng, Peter E. Strazdins

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

    4 Citations (Scopus)

    Abstract

    Modern graphics processing units (GPUs) have become powerful and cost-effective computing platforms. Parallel programming standards (e.g. CUDA) and directive-based programming standards (such as OpenHMPP and OpenACC) are available to harness this tremendous computing power for large-scale modelling and simulation in scientific areas. ANUGA is a tsunami modelling application based on unstructured triangular meshes and implemented in Python/C. This paper explores issues in porting and optimizing a Python/C-based unstructured mesh application to GPUs. Two paradigms are compared: CUDA via the PyCUDA API, which involves writing GPU kernels, and OpenHMPP, which involves adding directives to C code. In either case, the 'naive' approach of transferring unstructured mesh data to the GPU for each kernel resulted in an actual slowdown relative to single-core performance on a CPU. Profiling confirmed that this is due to the data transfer times between host and device, even though every individual kernel achieved a good speedup. This necessitated an advanced approach in which all key data structures are mirrored on the host and the device. For both paradigms, this in turn required converting all code that updates these data structures to CUDA (or, in the case of OpenHMPP, to directive-augmented C). Furthermore, in the case of CUDA, the porting can no longer be done incrementally: all changes must be made in a single step, which makes identifying which kernel(s) introduced bugs very difficult. To alleviate this, we adapted the relative debugging technique to the host-device context: in debugging mode, the mirrored data structures are updated at each step on both the host (using the original serial code) and the device, and any discrepancy is detected immediately. We present a generic Python-based implementation of this technique. With this approach, the CUDA version achieved a 28× speedup and the OpenHMPP version a 16× speedup. The main optimization, rearranging the unstructured mesh to achieve coalesced memory access patterns, contributed 10% of the former. In terms of productivity, however, OpenHMPP achieved significantly better speedup per hour of programming effort.
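The relative debugging scheme described in the abstract can be sketched in plain Python. This is a hypothetical illustration, not the paper's implementation: the real version compares a CUDA kernel against the original serial code on mirrored host/device arrays, whereas here the "device" step is stood in for by a second CPU function so the sketch runs anywhere.

```python
import numpy as np

def serial_step(heights):
    # Reference (host) update: a simple smoothing stencil as a placeholder
    # for one timestep of the original serial code.
    out = heights.copy()
    out[1:-1] = (heights[:-2] + heights[1:-1] + heights[2:]) / 3.0
    return out

def device_step(heights):
    # Stand-in for the ported GPU kernel; in the real setting this runs
    # on the device against the mirrored copy of the data.
    out = heights.copy()
    out[1:-1] = (heights[:-2] + heights[1:-1] + heights[2:]) / 3.0
    return out

def debug_step(heights, step_id, rtol=1e-10):
    """Run one timestep on both the host and device copies and flag
    the first step whose results diverge, so the offending kernel is
    caught immediately rather than many timesteps later."""
    host_result = serial_step(heights)
    dev_result = device_step(heights)
    if not np.allclose(host_result, dev_result, rtol=rtol):
        bad = np.flatnonzero(~np.isclose(host_result, dev_result, rtol=rtol))
        raise RuntimeError(
            f"step {step_id}: host/device mismatch at indices {bad[:10]}")
    return dev_result  # continue the run from the verified device result

h = np.linspace(0.0, 1.0, 16)
for step in range(3):
    h = debug_step(h, step)
```

The point of checking after every step is that a divergence is pinned to the single kernel that introduced it, which is exactly what the all-at-once CUDA port otherwise makes hard to do.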
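The coalescing optimization mentioned above can also be illustrated with a small NumPy sketch. The abstract does not give the paper's exact rearrangement, so this is an assumed example of the general idea: converting interleaved per-triangle records (array of structures) into one contiguous array per field (structure of arrays), so that consecutive GPU threads read consecutive addresses. The field names `stage`, `xmom`, and `ymom` are borrowed from ANUGA's quantity names for illustration.

```python
import numpy as np

# Array of structures: triangle i stores [stage, xmom, ymom] interleaved.
# Thread i reading one field then touches address 3*i + k, so adjacent
# threads access memory with a stride of 3 elements (uncoalesced).
n = 8
aos = np.arange(3 * n, dtype=np.float64).reshape(n, 3)

# Structure of arrays: each field becomes its own contiguous array, so
# threads i and i+1 read adjacent addresses (coalesced on the GPU).
stage, xmom, ymom = (np.ascontiguousarray(aos[:, k]) for k in range(3))

# The copy makes the per-field stride one element (8 bytes for float64),
# versus 24 bytes when slicing the interleaved layout in place.
print(aos[:, 0].strides, stage.strides)
```

The same content is stored either way; only the layout changes, which is why this kind of rearrangement can be done once at start-up and then benefits every kernel launch.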

    Original language: English
    Title of host publication: Proceedings - IEEE 28th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014
    Publisher: IEEE Computer Society
    Pages: 1275-1284
    Number of pages: 10
    ISBN (Electronic): 9780769552088
    DOIs
    Publication status: Published - 27 Nov 2014
    Event: 28th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014 - Phoenix, United States
    Duration: 19 May 2014 – 23 May 2014

    Publication series

    Name: Proceedings - IEEE 28th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014

    Conference

    Conference: 28th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014
    Country/Territory: United States
    City: Phoenix
    Period: 19/05/14 – 23/05/14
