TY - GEN
T1 - Resilient X10 over MPI User Level Failure Mitigation
AU - Hamouda, Sara S.
AU - Herta, Benjamin
AU - Milthorpe, Josh
AU - Grove, David
AU - Tardieu, Olivier
N1 - Publisher Copyright: © 2016 ACM.
PY - 2016/6/14
Y1 - 2016/6/14
N2 - Many PGAS languages and libraries rely on high performance transport layers such as GASNet and MPI to achieve low communication latency, portability and scalability. As systems increase in scale, failures are expected to become normal events rather than exceptions. Unfortunately, GASNet and standard MPI do not provide fault tolerance capabilities. This limitation hinders PGAS languages and other high-level programming models from supporting resilience at scale. For this reason, Resilient X10 has previously been supported over sockets only, not over MPI. This paper describes the use of a fault tolerant MPI implementation, called ULFM (User Level Failure Mitigation), as a transport layer for Resilient X10. By providing fault tolerant collective and agreement algorithms, on demand failure propagation, and support for InfiniBand, ULFM provides the required infrastructure to create a high performance transport layer for Resilient X10. We show that replacing X10's emulated collectives with ULFM's blocking collectives results in significant performance improvements. For three iterative SPMD-style applications running on 1000 X10 places, the improvement ranged between 30% and 51%. The per-step overhead for resilience was less than 9%. A proposal for adding ULFM to the coming MPI-4 standard is currently under assessment by the MPI Forum. Our results show that adding user-level fault tolerance support in MPI makes it a suitable base for resilience in high-level programming models.
KW - MPI
KW - Resilience
KW - User Level Failure Mitigation
KW - X10
UR - http://www.scopus.com/inward/record.url?scp=84978540265&partnerID=8YFLogxK
U2 - 10.1145/2931028.2931030
DO - 10.1145/2931028.2931030
M3 - Conference contribution
T3 - X10 2016 - Proceedings of the 6th ACM SIGPLAN Workshop on X10, Co-located with PLDI 2016
SP - 18
EP - 23
BT - X10 2016 - Proceedings of the 6th ACM SIGPLAN Workshop on X10, Co-located with PLDI 2016
A2 - Fohry, Claudia
A2 - Tardieu, Olivier
PB - Association for Computing Machinery, Inc
T2 - 6th ACM SIGPLAN Workshop on X10, X10 2016
Y2 - 14 June 2016
ER -