Fault tolerant computation with the sparse grid combination technique

Brendan Harding, Markus Hegland, Jay Larson, James Southern

    Research output: Contribution to journal, Article, peer-review

    12 Citations (Scopus)

    Abstract

    This paper continues to develop a fault tolerant extension of the sparse grid combination technique recently proposed in [B. Harding and M. Hegland, ANZIAM J. Electron. Suppl., 54 (2013), pp. C394-C411]. This approach to fault tolerance is novel for two reasons: first, the combination technique adds an additional level of parallelism, and second, it provides algorithm-based fault tolerance so that solutions can still be recovered if failures occur during computation. Previous work indicates how the combination technique may be adapted for a low number of faults. In this paper we develop a generalization of the combination technique for which arbitrary collections of coarse approximations may be combined to obtain an accurate approximation. A general fault tolerant combination technique for large numbers of faults is a natural consequence of this work. Using a renewal model for the time between faults on each node of a high performance computer, we also provide bounds on the expected error for interpolation with this algorithm in the presence of faults. Numerical experiments solving the scalar advection PDE demonstrate that the algorithm is resilient to faults on a real application. It is observed that the time to solution is not significantly affected by the presence of (simulated) faults. Additionally, the expected error increases with the number of faults but is relatively small even for high fault rates. A comparison with traditional checkpoint-restart methods applied to the combination technique shows that our approach is highly scalable with respect to the number of faults.
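The idea of recombining whatever coarse approximations survive a fault can be illustrated with a minimal sketch. It is not the algorithm of the paper: it assumes the surviving grids are indexed by a downward-closed set of level multi-indices and uses the standard inclusion-exclusion formula for combination coefficients, c_i = sum over z in {0,1}^d of (-1)^|z| whenever i + z lies in the set. The helper names `classical_index_set`, `combination_coefficients`, and `drop_failed` are hypothetical and chosen only for this illustration.

```python
from itertools import product


def classical_index_set(n, dim=2):
    """Level multi-indices i with |i|_1 <= n (classical combination technique)."""
    return {i for i in product(range(n + 1), repeat=dim) if sum(i) <= n}


def combination_coefficients(index_set):
    """Inclusion-exclusion coefficients for a downward-closed index set.

    Assumption: index_set is downward closed. Indices whose coefficient
    is zero are omitted from the result.
    """
    index_set = set(index_set)
    dim = len(next(iter(index_set)))
    coeffs = {}
    for i in index_set:
        c = 0
        for z in product((0, 1), repeat=dim):
            neighbour = tuple(a + b for a, b in zip(i, z))
            if neighbour in index_set:
                c += (-1) ** sum(z)
        if c != 0:
            coeffs[i] = c
    return coeffs


def drop_failed(index_set, failed):
    """Remove a failed grid and every grid at least as fine in all directions,
    so that the remaining set stays downward closed."""
    return {i for i in index_set
            if not all(a >= b for a, b in zip(i, failed))}


full = classical_index_set(3)
before = combination_coefficients(full)
after = combination_coefficients(drop_failed(full, (1, 2)))
```

In this two-dimensional example with n = 3, losing grid (1, 2) shifts the weights onto the remaining grids, and the coefficients still sum to one. The recombination relies on grid (0, 1), which had zero weight before the fault, so a scheme of this kind needs such coarser grids to be available.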

    Original language: English
    Pages (from-to): C331-C353
    Journal: SIAM Journal on Scientific Computing
    Volume: 37
    Issue number: 3
    Publication status: Published - 2015
