Application level fault recovery: Using fault-tolerant open MPI in a PDE solver

Md Mohsin Ali, James Southern, Peter Strazdins, Brendan Harding

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

24 Citations (Scopus)

Abstract

A fault-tolerant version of Open Message Passing Interface (Open MPI), based on the draft User Level Failure Mitigation (ULFM) proposal of the MPI Forum's Fault Tolerance Working Group, is used to create fault-tolerant applications. This allows applications and libraries to design their own recovery methods and control them at the user level. However, only a limited amount of research work on user level failure recovery (including the implementation and performance evaluation of this prototype) has been carried out. This paper contributes a fault-tolerant implementation of an application solving 2D partial differential equations (PDEs) by means of a sparse grid combination technique which is capable of surviving multiple process failures caused by the faults. Our fault recovery involves reconstructing the faulty communicators without shrinking the global size by re-spawning failed MPI processes on the same physical processors where they were before the failure (for load balancing). It also involves restoring lost data from either exact check pointed data on disk, approximated data in memory (via an alternate sparse grid combination technique) or a near-exact copy of replicated data in memory. The experimental results show that the faulty communicator reconstruction time is currently large in the draft ULFM, especially for multiple process failures. They also show that the alternate combination technique has the lowest data recovery overhead, except on a system with very low disk write latency for which checkpointing has the lowest overhead. Furthermore, the errors due to the recovery of approximated data are within a factor of 10 in all cases, with the surprising result that the alternate combination technique being more accurate than the near-exact replication method. The contributed implementation details, including the analysis of the experimental results, of this paper will help application developers to resolve different issues of design and implementation of fault-tolerant applications by means of the Open MPI ULFM standard.

Original languageEnglish
Title of host publicationProceedings - IEEE 28th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014
PublisherIEEE Computer Society
Pages1169-1178
Number of pages10
ISBN (Electronic)9780769552088
DOIs
Publication statusPublished - 27 Nov 2014
Event28th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014 - Phoenix, United States
Duration: 19 May 201423 May 2014

Publication series

NameProceedings - IEEE 28th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014

Conference

Conference28th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014
Country/TerritoryUnited States
CityPhoenix
Period19/05/1423/05/14

Fingerprint

Dive into the research topics of 'Application level fault recovery: Using fault-tolerant open MPI in a PDE solver'. Together they form a unique fingerprint.

Cite this