TY - GEN
T1 - The analysis and optimization of collective communications on a Beowulf cluster
AU - Tan, Wi Bing
AU - Strazdins, Peter
N1 - Publisher Copyright: © 2002 IEEE.
PY - 2002
Y1 - 2002
AB - This paper gives a performance analysis of the all-gather, all-reduce and reduce-scatter collective communication operations on a Beowulf cluster. This cluster has a contention-free switch-based network with multiple network interface cards per node, permitting overlapping of message transmission under certain circumstances. As well as considering traditional algorithms developed previously for parallel computers with vendor-specific networks, we also examine simpler algorithms made up of repeated sub-operations, such as broadcasts. We find that for the kind of network on the Beowulf cluster, a somewhat different performance modelling of the algorithms is required, and that some simple simulation tools had to be developed in order to fully understand some of the algorithms' performance. Our results indicate that the LAM MPI implementations for these operations may be significantly improved, and the algorithms with data exchange and potential contention perform well on the cluster. Furthermore, they indicate that algorithms permitting message overlap are slightly favoured, with a new and simple algorithm which modestly out-performs the best traditional algorithms in the case of Reduce-Scatter. With the exception that the degree of overlapping proved difficult to estimate, our performance models fitted closely with the results, and together with the simulation tools, permit a detailed understanding of the cluster's communication pattern performance.
KW - Broadcasting
KW - Clustering algorithms
KW - Communication networks
KW - Communication switching
KW - Computer networks
KW - Computer science
KW - Concurrent computing
KW - Network interfaces
KW - Pattern analysis
KW - Performance analysis
UR - http://www.scopus.com/inward/record.url?scp=20444460051&partnerID=8YFLogxK
U2 - 10.1109/ICPADS.2002.1183497
DO - 10.1109/ICPADS.2002.1183497
M3 - Conference contribution
T3 - Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
SP - 659
EP - 666
BT - Proceedings - 9th International Conference on Parallel and Distributed Systems, ICPADS 2002
PB - IEEE Computer Society
T2 - 9th International Conference on Parallel and Distributed Systems, ICPADS 2002
Y2 - 17 December 2002 through 20 December 2002
ER -