Resilient X10 over MPI User Level Failure Mitigation

Sara S. Hamouda, Benjamin Herta, Josh Milthorpe, David Grove, Olivier Tardieu

    Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

    7 Citations (Scopus)

    Abstract

    Many PGAS languages and libraries rely on high-performance transport layers such as GASNet and MPI to achieve low communication latency, portability and scalability. As systems increase in scale, failures are expected to become normal events rather than exceptions. Unfortunately, GASNet and standard MPI do not provide fault tolerance capabilities. This limitation hinders PGAS languages and other high-level programming models from supporting resilience at scale. For this reason, Resilient X10 has previously been supported over sockets only, not over MPI. This paper describes the use of a fault-tolerant MPI implementation, called ULFM (User Level Failure Mitigation), as a transport layer for Resilient X10. By providing fault-tolerant collective and agreement algorithms, on-demand failure propagation, and support for InfiniBand, ULFM provides the required infrastructure to create a high-performance transport layer for Resilient X10. We show that replacing X10's emulated collectives with ULFM's blocking collectives results in significant performance improvements. For three iterative SPMD-style applications running on 1000 X10 places, the improvement ranged between 30% and 51%. The per-step overhead for resilience was less than 9%. A proposal for adding ULFM to the coming MPI-4 standard is currently under assessment by the MPI Forum. Our results show that adding user-level fault tolerance support in MPI makes it a suitable base for resilience in high-level programming models.

    Original language: English
    Title of host publication: X10 2016 - Proceedings of the 6th ACM SIGPLAN Workshop on X10, Co-located with PLDI 2016
    Editors: Claudia Fohry, Olivier Tardieu
    Publisher: Association for Computing Machinery, Inc
    Pages: 18-23
    Number of pages: 6
    ISBN (Electronic): 9781450343862
    Publication status: Published - 14 Jun 2016
    Event: 6th ACM SIGPLAN Workshop on X10, X10 2016 - Santa Barbara, United States
    Duration: 14 Jun 2016 → …

    Publication series

    Name: X10 2016 - Proceedings of the 6th ACM SIGPLAN Workshop on X10, Co-located with PLDI 2016

    Conference

    Conference: 6th ACM SIGPLAN Workshop on X10, X10 2016
    Country/Territory: United States
    City: Santa Barbara
    Period: 14/06/16 → …
