TY - GEN
T1 - Scaling the Hartree-Fock matrix build on Summit
AU - Barca, Giuseppe M.J.
AU - Poole, David L.
AU - Vallejo, Jorge L. Galvez
AU - Alkan, Melisa
AU - Bertoni, Colleen
AU - Rendell, Alistair P.
AU - Gordon, Mark S.
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/11
Y1 - 2020/11
N2 - Usage of Graphics Processing Units (GPUs) has become strategic for simulating the chemistry of large molecular systems, with the majority of top supercomputers utilizing GPUs as their main source of computational horsepower. In this paper, a new fragmentation-based Hartree-Fock matrix build algorithm designed for scaling on many-GPU architectures is presented. The new algorithm uses a novel dynamic load balancing scheme based on a binned shell-pair container to distribute batches of significant shell quartets with the same code path to different GPUs. This maximizes computational throughput and load balancing, and eliminates GPU thread divergence due to integral screening. Additionally, the code uses a novel Fock digestion algorithm to contract electron repulsion integrals into the Fock matrix, which exploits all forms of permutational symmetry and eliminates thread synchronization requirements. The implementation demonstrates excellent scalability on the Summit computer, achieving good strong scaling performance up to 4096 nodes, and linear weak scaling up to 612 nodes.
KW - GPU
KW - Hartree-Fock
KW - Summit
UR - http://www.scopus.com/inward/record.url?scp=85102340598&partnerID=8YFLogxK
U2 - 10.1109/SC41405.2020.00085
DO - 10.1109/SC41405.2020.00085
M3 - Conference contribution
AN - SCOPUS:85102340598
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2020
PB - IEEE Computer Society
T2 - 2020 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020
Y2 - 9 November 2020 through 19 November 2020
ER -