PyTorch Distributed Data Parallel Performance on GPUs with MVAPICH-Plus
Machine Specifications: OLCF Frontier
| CPU Model | CPU Core Info | Memory | GPU Model | GPU Memory | Interconnects |
|---|---|---|---|---|---|
| AMD EPYC 7A53 CPU | 1x64@2GHz | 512 GB DDR4 | AMD MI250X (4/Node) | 128 GB HBM2e | HPE Slingshot (200 Gb/s) |

| Model | Batch Size | Block Size | Benchmark | Dataset | DL Framework |
|---|---|---|---|---|---|
| GPT-2 | 12 | 1024 | NanoGPT | OpenWebText | PyTorch 2.8.0 |
Machine Specifications: SDSC Cosmos
| CPU Model | CPU Core Info | Memory | GPU Model | GPU Memory | Interconnects |
|---|---|---|---|---|---|
| x86_64 | 1x96@3.7GHz | 512 GB HBM3 unified memory per node | Integrated CDNA3 per APU | 128 GB HBM3 unified memory per APU | HPE Cray Slingshot-11 (200 Gb/s) |

| Model | Batch Size | Block Size | Benchmark | Dataset | DL Framework |
|---|---|---|---|---|---|
| GPT-2 | 32 | 1024 | NanoGPT | OpenWebText | PyTorch 2.8.0 |
Machine Specifications: TACC Vista
| CPU Model | CPU Core Info | Memory | GPU Model | GPU Memory | Interconnects |
|---|---|---|---|---|---|
| NVIDIA Grace CPU | 1x72@3.1GHz | 116 GB DDR5 | NVIDIA H200 GPU (1/Node) | 96 GB HBM3 | Mellanox NDR (400 Gb/s) |

| Model | Batch Size | Block Size | Benchmark | Dataset | DL Framework |
|---|---|---|---|---|---|
| GPT-2 | 24 | 1024 | NanoGPT | OpenWebText | PyTorch 2.8.0 |
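
The batch and block sizes in the three configurations above differ per system, so the amount of data processed per training step differs too. The sketch below computes tokens per optimizer step, assuming that "Batch Size" is the per-GPU micro-batch size (as with NanoGPT's `batch_size` setting) and that "Block Size" is the sequence length in tokens. The `world_size` and `grad_accum_steps` parameters are hypothetical: the tables do not state the GPU count or gradient accumulation used, so both default to 1.

```python
# Per-GPU settings copied from the benchmark tables above.
CONFIGS = {
    "OLCF Frontier": {"batch_size": 12, "block_size": 1024},
    "SDSC Cosmos": {"batch_size": 32, "block_size": 1024},
    "TACC Vista": {"batch_size": 24, "block_size": 1024},
}


def tokens_per_iteration(batch_size, block_size, world_size=1, grad_accum_steps=1):
    """Tokens consumed per optimizer step under data parallelism.

    world_size and grad_accum_steps are illustrative assumptions;
    the source tables do not specify either value.
    """
    return batch_size * block_size * world_size * grad_accum_steps


if __name__ == "__main__":
    for system, cfg in CONFIGS.items():
        # With the defaults this is the per-GPU micro-batch token count.
        print(f"{system}: {tokens_per_iteration(**cfg)} tokens per GPU micro-batch")
```

To compare throughput across systems, pass the actual number of data-parallel ranks as `world_size`, since each rank processes its own micro-batch in Distributed Data Parallel training.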