PyTorch Distributed Data Parallel Performance on GPUs with MVAPICH-Plus

Machine Specifications: OLCF Frontier

CPU Model: AMD EPYC 7A53 CPU
CPU Core Info: 1x64@2GHz
Memory: 512 GB DDR4
GPU Model: AMD MI250X (4/Node)
GPU Memory: 128 GB HBM2e
Interconnects: HPE Slingshot (200 Gb/s)

Model: GPT-2
Batch Size: 12
Block Size: 1024
Benchmark: NanoGPT
Dataset: OpenWebText
DL Framework: PyTorch 2.8.0

Figure: GPT-2 results on OLCF Frontier (1)
Figure: GPT-2 results on OLCF Frontier (2)

Machine Specifications: SDSC Cosmos

CPU Model: x86_64
CPU Core Info: 1x96@3.7GHz
Memory: 512 GB HBM3 unified memory per node
GPU Model: Integrated CDNA3 per APU
GPU Memory: 128 GB HBM3 unified memory per APU
Interconnects: HPE Cray Slingshot-11 (200 Gb/s)

Model: GPT-2
Batch Size: 32
Block Size: 1024
Benchmark: NanoGPT
Dataset: OpenWebText
DL Framework: PyTorch 2.8.0

Figure: GPT-2 results on SDSC Cosmos (2)

Machine Specifications: TACC Vista

CPU Model: NVIDIA Grace CPU
CPU Core Info: 1x72@3.1 GHz
Memory: 116 GB DDR5
GPU Model: NVIDIA H200 GPU (1/Node)
GPU Memory: 96 GB HBM3
Interconnects: Mellanox NDR (400 Gb/s)

Model: GPT-2
Batch Size: 24
Block Size: 1024
Benchmark: NanoGPT
Dataset: OpenWebText
DL Framework: PyTorch 2.8.0

Figure: NanoGPT results on TACC Vista (1)
Figure: NanoGPT results on TACC Vista (2)
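
The batch size and block size listed for each system determine how many tokens one training step processes, which helps when comparing throughput across machines. The sketch below works through that arithmetic under two assumptions that the page does not state: that the batch size is per process (one process per GPU, as in NanoGPT's usual DDP setup), and that tokens per step scale linearly with the number of processes and gradient accumulation steps. The function name `tokens_per_step` and its parameters `world_size` and `grad_accum_steps` are hypothetical, chosen for illustration only.

```python
def tokens_per_step(batch_size, block_size, world_size=1, grad_accum_steps=1):
    """Tokens consumed per optimizer step.

    Assumption: batch_size is the per-process micro-batch size, so the
    total scales with world_size and grad_accum_steps.
    """
    return batch_size * block_size * world_size * grad_accum_steps


# (batch size, block size) as listed for each system above.
configs = {
    "OLCF Frontier": (12, 1024),
    "SDSC Cosmos": (32, 1024),
    "TACC Vista": (24, 1024),
}

for system, (batch_size, block_size) in configs.items():
    per_process = tokens_per_step(batch_size, block_size)
    print(f"{system}: {per_process} tokens per process per micro-step")
```

With the defaults, this gives 12288 tokens on Frontier, 32768 on Cosmos, and 24576 on Vista per process per micro-step. Passing the actual process count as `world_size` yields the per-step total for a given run.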