A Resilient Framework for Iterative Linear Algebra Applications in X10

Sara S. Hamouda, Josh Milthorpe, Peter E. Strazdins, Vijay Saraswat

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

    6 Citations (Scopus)

    Abstract

    The Global Matrix Library (GML) is a distributed matrix library in the X10 language. GML is designed to simplify the development of scalable linear algebra applications. By hiding the details of communication and parallelism, GML lets programs be written in a sequential style that non-expert programmers can easily use and understand. Resilience is becoming a major challenge for HPC applications as the number of components in a typical system continues to increase. To address this challenge, we improved GML's adaptability to process failure and provided a mechanism for automatic data recovery. As iterative algorithms are commonly used in linear algebra applications, we also created a checkpoint/restore framework for developing resilient iterative applications using GML. Using three example machine learning applications, we demonstrate that this framework supports resilient application development with minimal additional code compared to a non-resilient implementation. Performance measurements in a typical cluster environment show that the major cost of resilient execution is due to resilient X10 itself, and that the additional cost due to our framework is acceptable.

    Original language: English
    Title of host publication: Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2015
    Publisher: Institute of Electrical and Electronics Engineers Inc.
    Pages: 970-979
    Number of pages: 10
    ISBN (Electronic): 0769555101, 9780769555102
    Publication status: Published - 29 Sept 2015
    Event: 29th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2015 - Hyderabad, India
    Duration: 25 May 2015 to 29 May 2015

    Publication series

    Name: Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2015

    Conference

    Conference: 29th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2015
    Country/Territory: India
    City: Hyderabad
    Period: 25/05/15 to 29/05/15
