A fault-tolerant gyrokinetic plasma application using the sparse grid combination technique

Md Mohsin Ali, Peter E. Strazdins, Brendan Harding, Markus Hegland, Jay W. Larson

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

13 Citations (Scopus)

Abstract

Applications performing ultra-large scale simulations via solving PDEs require very large computational systems for their timely solution. Studies have shown the rate of failure grows with the system size and these trends are likely to worsen in future machines as less reliable components are used to reduce the energy cost. Thus, as systems, and the problems solved on them, continue to grow, the ability to survive failures is becoming a critical aspect of algorithm development. The sparse grid combination technique (SGCT) is a cost-effective method for solving time-evolving PDEs, especially for higher-dimensional problems. It can also be easily modified to provide algorithm-based fault tolerance for these problems. In this paper, we show how the SGCT can produce a fault-tolerant version of the GENE gyrokinetic plasma application, which evolves a 5D complex density field over time. We use an alternate component grid combination formula to recover data from lost processes. User Level Failure Mitigation (ULFM) MPI is used to recover the processes, and our implementation is robust over multiple failures and recovery for both process and node failures. An acceptable degree of modification of the application is required. Results using the SGCT on two of the fields' dimensions show competitive execution times with acceptable error (within 0.1%), compared to the same simulation with a single full resolution grid. The benefits improve when the SGCT is used over three dimensions. Our experiments show that the GENE application can successfully recover from multiple process failures, and applying the SGCT the corresponding number of times minimizes the error for the lost sub-grids. Application recovery overhead via ULFM MPI increases from ∼1.5s at 64 cores to ∼5s at 2048 cores for a one-off failure. 
This compares favourably to using GENE's in-built checkpointing with job restart in conjunction with the classical SGCT on failure, which has overheads four times as large for a single failure, excluding the backtrack overhead. An analysis for a long-running application, taking into account checkpoint backtrack times, indicates a reduction in overhead of over an order of magnitude.
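The abstract does not spell out the alternate combination formula, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the standard inclusion-exclusion rule for combination coefficients over a downward-closed set of component-grid levels, applied in 2D. The function names (`classical_index_set`, `combination_coefficients`) and the level convention (levels start at 1) are hypothetical choices made for illustration. When a component grid on the finest diagonal is lost, it is dropped from the set and the coefficients are recomputed over the surviving grids.

```python
from itertools import product


def classical_index_set(n, dim=2, min_level=1):
    """Level indices of the classical combination technique (assumed convention):
    all multi-indices with entries >= min_level and sum <= n + dim - 1."""
    return {
        idx
        for idx in product(range(min_level, n + 1), repeat=dim)
        if sum(idx) <= n + dim - 1
    }


def combination_coefficients(index_set):
    """Inclusion-exclusion coefficients for a downward-closed index set.

    c_i = sum over z in {0,1}^dim of (-1)^|z| * [i + z is in the set].
    Only non-zero coefficients are returned.
    """
    index_set = set(index_set)
    dim = len(next(iter(index_set)))
    coeffs = {}
    for idx in index_set:
        c = 0
        for z in product((0, 1), repeat=dim):
            neighbour = tuple(i + dz for i, dz in zip(idx, z))
            if neighbour in index_set:
                c += (-1) ** sum(z)
        if c != 0:
            coeffs[idx] = c
    return coeffs


# Classical 2D combination at level n = 4: +1 on the finest diagonal,
# -1 on the next coarser diagonal.
full = classical_index_set(4)
print(sorted(combination_coefficients(full).items()))

# Hypothetical failure: the process group holding grid (2, 3) is lost.
# Removing a finest-diagonal grid keeps the set downward closed, so the
# same rule yields coefficients that use only the surviving grids.
survivors = full - {(2, 3)}
print(sorted(combination_coefficients(survivors).items()))
```

In this sketch the recomputed coefficients still sum to 1, and the grid at level (1, 2), which has coefficient zero in the classical combination, receives a non-zero coefficient after the loss. A recovery scheme of this kind therefore needs some coarser grids to be available in addition to the classical ones.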

Original language: English
Title of host publication: Proceedings of the 2015 International Conference on High Performance Computing and Simulation, HPCS 2015
Editors: Waleed W. Smari, Vesna Zeljkovic
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 499-507
Number of pages: 9
ISBN (Electronic): 9781467378123
Publication status: Published - 2 Sept 2015
Event: 13th International Conference on High Performance Computing and Simulation, HPCS 2015 - Amsterdam, Netherlands
Duration: 20 Jul 2015 to 24 Jul 2015

Publication series

Name: Proceedings of the 2015 International Conference on High Performance Computing and Simulation, HPCS 2015

Conference

Conference: 13th International Conference on High Performance Computing and Simulation, HPCS 2015
Country/Territory: Netherlands
City: Amsterdam
Period: 20/07/15 to 24/07/15
