TY - GEN
T1 - Application level fault recovery
T2 - 28th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014
AU - Ali, Md Mohsin
AU - Southern, James
AU - Strazdins, Peter
AU - Harding, Brendan
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2014/11/27
Y1 - 2014/11/27
AB - A fault-tolerant version of Open Message Passing Interface (Open MPI), based on the draft User Level Failure Mitigation (ULFM) proposal of the MPI Forum's Fault Tolerance Working Group, is used to create fault-tolerant applications. This allows applications and libraries to design their own recovery methods and control them at the user level. However, only a limited amount of research on user-level failure recovery (including the implementation and performance evaluation of this prototype) has been carried out. This paper contributes a fault-tolerant implementation of an application that solves 2D partial differential equations (PDEs) by means of a sparse grid combination technique and is capable of surviving multiple process failures caused by faults. Our fault recovery involves reconstructing the faulty communicators without shrinking the global size, by re-spawning failed MPI processes on the same physical processors where they ran before the failure (for load balancing). It also involves restoring lost data from either exact checkpointed data on disk, approximated data in memory (via an alternate sparse grid combination technique) or a near-exact copy of replicated data in memory. The experimental results show that the faulty communicator reconstruction time is currently large in the draft ULFM, especially for multiple process failures. They also show that the alternate combination technique has the lowest data recovery overhead, except on a system with very low disk write latency, for which checkpointing has the lowest overhead. Furthermore, the errors due to the recovery of approximated data are within a factor of 10 in all cases, with the surprising result that the alternate combination technique is more accurate than the near-exact replication method. The implementation details and the analysis of the experimental results contributed by this paper will help application developers resolve issues in the design and implementation of fault-tolerant applications using the Open MPI ULFM standard.
KW - Approximation error
KW - Fault tolerance
KW - PDE solver
KW - Process failure recovery
KW - Sparse grid combination
KW - ULFM
UR - http://www.scopus.com/inward/record.url?scp=84918798613&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW.2014.132
DO - 10.1109/IPDPSW.2014.132
M3 - Conference contribution
T3 - Proceedings - IEEE 28th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014
SP - 1169
EP - 1178
BT - Proceedings - IEEE 28th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014
PB - IEEE Computer Society
Y2 - 19 May 2014 through 23 May 2014
ER -