Fault-tolerant grid-based solvers: Combining concepts from sparse grids and MapReduce

J. W. Larson*, M. Hegland, B. Harding, S. Roberts, L. Stals, A. P. Rendell, P. Strazdins, M. M. Ali, C. Kowitz, R. Nobes, J. Southern, N. Wilson, M. Li, Y. Oishi

*Corresponding author for this work

Research output: Contribution to journalConference articlepeer-review

14 Citations (Scopus)

Abstract

A key issue confronting petascale and exascale computing is the growth in probability of soft and hard faults with increasing system size. A promising approach to this problem is the use of algorithms that are inherently fault tolerant. We introduce such an algorithm for the solution of partial differential equations, based on the sparse grid approach. Here, the solution of multiple component grids are efficiently combined to achieve a solution on a full grid. The technique also lends itself to a (modified) MapReduce framework on a cluster of processors, with the map stage corresponding to allocating each component grid for solution over a subset of the processors, and the reduce stage corresponding to their combination. We describe how the sparse grid combination method can be modified to robustly solve partial differential equations in the presence of faults. This is based on a modified combination formula that can accommodate the loss of one or two component grids. We also discuss accuracy issues associated with this formula. We give details of a prototype implementation within a MapReduce framework using the dynamic process features and asynchronous message passing facilities of MPI. Results on a two-dimensional advection problem show that the errors after the loss of one or two sub-grids are within a factor of 3 of the sparse grid solution in the presence of no faults. They also indicate that the sparse grid technique with four times the resolution has approximately the same error as a full grid, while requiring (for a sufficiently high resolution) much lower computation and memory requirements. We finally outline a MapReduce variant capable of responding to faults in ways other than re-scheduling of failed tasks. We discuss the likely software requirements for such a flexible MapReduce framework, the requirements it will impose on users' legacy codes, and the system's runtime behavior.

Original languageEnglish
Pages (from-to)130-139
Number of pages10
JournalProcedia Computer Science
Volume18
DOIs
Publication statusPublished - 2013
Event13th Annual International Conference on Computational Science, ICCS 2013 - Barcelona, Spain
Duration: 5 Jun 20137 Jun 2013

Fingerprint

Dive into the research topics of 'Fault-tolerant grid-based solvers: Combining concepts from sparse grids and MapReduce'. Together they form a unique fingerprint.

Cite this