Application fault tolerance for shrinking resources via the sparse grid combination technique

Peter E. Strazdins, Md Mohsin Ali, Bert Debusschere

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

    2 Citations (Scopus)

    Abstract

    The need to make large-scale scientific simulations resilient to the shrinking and growing of compute resources arises from exascale computing and adverse operating conditions (fault tolerance). It can also arise in the cloud computing context, where the cost of these resources can fluctuate. In this paper, we describe how the Sparse Grid Combination Technique can make such applications resilient to shrinking compute resources. We describe solutions to the non-trivial issues of data redistribution and on-the-fly malleability of process grid information and ULFM MPI communicators. Results on a 2D advection solver indicate that process recovery time is significantly reduced compared with the alternative strategy in which failed resources are replaced, that overall execution time is improved relative to both that strategy and checkpointing, and that the execution error remains small, even when multiple failures occur.
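
For readers unfamiliar with the technique named in the abstract, the following is a minimal sketch of how combination coefficients can be recomputed when a component grid is lost. It assumes the standard inclusion-exclusion formula for a downward-closed 2D index set; it is not taken from the paper and may differ from the authors' recovery scheme. The function names `classical_index_set` and `combination_coefficients` are hypothetical.

```python
from itertools import product


def classical_index_set(n):
    """Grid levels (i, j) with i, j >= 1 and i + j <= n + 1 (assumed convention)."""
    return {(i, j)
            for i in range(1, n + 1)
            for j in range(1, n + 1)
            if i + j <= n + 1}


def combination_coefficients(index_set):
    """Inclusion-exclusion coefficients for a downward-closed 2D index set.

    c(i, j) = sum over z in {0,1}^2 of (-1)^(z1 + z2) * [(i + z1, j + z2) in set].
    Grids whose coefficient is zero are omitted from the result.
    """
    coeffs = {}
    for (i, j) in index_set:
        c = 0
        for zi, zj in product((0, 1), repeat=2):
            if (i + zi, j + zj) in index_set:
                c += (-1) ** (zi + zj)
        if c != 0:
            coeffs[(i, j)] = c
    return coeffs


# Classical combination for n = 3: +1 on the diagonal i + j = 4, -1 on i + j = 3.
levels = classical_index_set(3)
full = combination_coefficients(levels)

# Suppose the grid at level (2, 2) is lost. Dropping a maximal level keeps the
# set downward closed, so the same formula yields a valid reduced combination.
degraded = combination_coefficients(levels - {(2, 2)})
```

In this sketch the surviving grids are recombined with new weights rather than recomputing the lost grid, which is the general idea behind using the combination technique to tolerate a reduced set of resources. The coefficients sum to 1 in both the full and the degraded case.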

    Original language: English
    Title of host publication: Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
    Publisher: Institute of Electrical and Electronics Engineers Inc.
    Pages: 1232-1238
    Number of pages: 7
    ISBN (Electronic): 9781509021406
    Publication status: Published - 18 Jul 2016
    Event: 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016 - Chicago, United States
    Duration: 23 May 2016 to 27 May 2016

    Publication series

    Name: Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

    Conference

    Conference: 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016
    Country/Territory: United States
    City: Chicago
    Period: 23/05/16 to 27/05/16
