1. Overview
Distributed Deep Learning has become the default approach to train Deep Neural Networks (DNNs) on large datasets such as ImageNet. Broadly, distributed training can be categorized into three strategies: 1) Data Parallelism, 2) Model Parallelism, and 3) Hybrid Parallelism. In data parallelism (DP), the DNN is replicated across multiple Processing Elements (PEs) such as GPUs, and allreduce operations keep the replicas synchronized: gradients are reduced (summed) across all replicas and the result is distributed back to each one.
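For illustration, the sketch below shows what allreduce-based gradient synchronization amounts to in PyTorch; the helper name synchronize_gradients is ours, and DDP (next paragraph) automates this step for you.

# Sketch: what data-parallel gradient synchronization does conceptually.
# Each replica computes gradients on its local shard of the batch; an allreduce
# then sums the gradients across all PEs and every replica averages the result.
# (Assumes torch.distributed has already been initialized by the training job.)
import torch
import torch.distributed as dist

def synchronize_gradients(model):
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum over all replicas
            param.grad /= world_size                           # average the summed gradient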
PyTorch's native Distributed Data Parallel (DDP) has emerged as the leading framework for scalable deep learning training, providing efficient gradient synchronization through optimized collective communication. MVAPICH-Plus provides vendor-neutral, optimized Allreduce operations that accelerate DNN training on large numbers of PEs/GPUs while delivering competitive performance against vendor-specific libraries (NCCL, RCCL). HiDL with PyTorch native DDP and MVAPICH-Plus delivers scalable, portable distributed DNN training solutions across diverse HPC ecosystems.
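As a minimal sketch of how these pieces fit together (placeholder model, data, and device mapping; adjust to your own training script), a DDP job using the MPI backend looks roughly like the following, with the allreduce calls issued during backward() carried out by the MPI library, here MVAPICH-Plus:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Rank and world size come from the MPI launcher, not environment variables.
dist.init_process_group(backend="mpi")
rank = dist.get_rank()
local_gpu = rank % torch.cuda.device_count()   # assumes ranks are packed per node
torch.cuda.set_device(local_gpu)

model = torch.nn.Linear(1024, 1024).cuda(local_gpu)   # placeholder model
ddp_model = DDP(model, device_ids=[local_gpu])
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

inputs = torch.randn(32, 1024, device=local_gpu)      # placeholder batch
loss = ddp_model(inputs).sum()
loss.backward()        # DDP allreduces gradients here via the MPI backend
optimizer.step()
dist.destroy_process_group()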
2. System Requirements
MVAPICH-Plus is the preferred MPI runtime for DDP training with PyTorch 2.7.1 on both GPUs and CPUs.
Please download MVAPICH-Plus from the following page: https://mvapich.cse.ohio-state.edu/downloads/
Follow the user guide to set up your MVAPICH-Plus installation:
https://mvapich-docs.readthedocs.io/en/mvapich-plus/
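Once MVAPICH-Plus is installed, it can help to confirm that its compiler wrappers are the ones on your PATH before building PyTorch against it. The snippet below is only a convenience sketch; mpicc -show is the standard MPICH-style option for printing a wrapper's underlying compile line.

import shutil, subprocess

for tool in ("mpicc", "mpicxx", "mpirun"):
    path = shutil.which(tool)
    print(f"{tool}: {path or 'NOT FOUND -- check your MVAPICH-Plus PATH'}")

# The wrapper's compile line should point at your MVAPICH-Plus install directory.
if shutil.which("mpicc"):
    print(subprocess.run(["mpicc", "-show"], capture_output=True, text=True).stdout.strip())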
3. Install PyTorch 2.7.1 with enhanced GPU-Aware MPI support
We direct users to our open-source PyTorch branch: https://github.com/OSU-Nowlab/pytorch/tree/HiDL-2.0-torch2.7.1
Our current branch only supports building from source; please allocate additional time and resources for this process.
For detailed instructions, we refer to PyTorch's official guide: https://github.com/pytorch/pytorch?tab=readme-ov-file#installation.
Additionally, to enable building with GPU-Aware MPI, add USE_CUDA_MPI=1 to your setup command.
3.1 Example install script on TACC Vista:
(git clean -fdx ;\
 git submodule sync ;\
 git submodule update --init --recursive ;\
 make clean ;\
 python setup.py clean ;\
 export _GLIBCXX_USE_CXX11_ABI=1 ;\
 export CMAKE_CXX_COMPILER=g++ ;\
 export CMAKE_C_COMPILER=gcc ;\
 export MPI_C_COMPILER=mpicc ;\
 export MPI_CXX_COMPILER=mpicxx ;\
 export MPI_HOME=${MPI_HOME} ;\
 export CMAKE_PREFIX_PATH="${MPI_HOME}:$LD_LIBRARY_PATH:$PATH:$CPATH" ;\
 MAX_JOBS=16 \
 USE_CUFILE=0 \
 USE_XNNPACK=0 \
 USE_CUDA_MPI=1 \
 USE_MPI=1 \
 USE_DISTRIBUTED=1 \
 BUILD_TEST=0 \
 BUILD_MOBILE_BENCHMARK=0 \
 BUILD_MOBILE_TEST=0 \
 PYTORCH_CUDA_ARCH_LIST="9.0" \
 USE_NCCL=1 \
 USE_SYSTEM_NCCL=1 \
 NCCL_INCLUDE_DIR=$TACC_NCCL_DIR/include \
 NCCL_LIB_DIR=$TACC_NCCL_DIR/lib \
 python setup.py develop) > $INSTALL_OUTFILE 2>&1
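After the build finishes, a quick check such as the following sketch confirms that the resulting PyTorch picked up both CUDA and the MPI backend:

import torch
import torch.distributed as dist

print("PyTorch:", torch.__version__)                      # built from the HiDL-2.0-torch2.7.1 branch
print("CUDA available:", torch.cuda.is_available())
print("MPI backend available:", dist.is_mpi_available())  # should be True for this build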
3.2 Install on OLCF Frontier:
3.2.1 With prebuilt wheel:
To load the MVAPICH-Plus ums module on Frontier, the following is the recommended setup:
module reset
module load ums ums038 PrgEnv-amd
module load gcc-native/12.3
module load rocm
module load craype-accel-amd-gfx90a
3.2.2 Manually:
You may install it manually with the script below:
(git clean -fdx ;\
 git submodule sync ;\
 git submodule update --init --recursive ;\
 make clean ;\
 python setup.py clean ;\
 python tools/amd_build/build_amd.py ;\
 export CMAKE_CXX_COMPILER=/opt/cray/pe/gcc-native/13/bin/g++ ;\
 export CMAKE_C_COMPILER=/opt/cray/pe/gcc-native/13/bin/gcc ;\
 export MPI_C_COMPILER=/lustre/orion/csc549/scratch/langx/project/hidl-rccl/install/mvp4.1-hip-srun/bin/mpicc ;\
 export MPI_CXX_COMPILER=/lustre/orion/csc549/scratch/langx/project/hidl-rccl/install/mvp4.1-hip-srun/bin/mpicxx ;\
 export MPI_HOME=/lustre/orion/csc549/scratch/langx/project/hidl-rccl/install/mvp4.1-hip-srun ;\
 export CMAKE_PREFIX_PATH="${MPI_HOME}:$LD_LIBRARY_PATH:$CPATH:$PATH:${CMAKE_PREFIX_PATH}" ;\
 MAX_JOBS=16 \
 USE_XNNPACK=0 \
 USE_CUDA_MPI=1 \
 USE_ROCM=1 \
 USE_MPI=1 \
 USE_DISTRIBUTED=1 \
 BUILD_TEST=0 \
 BUILD_MOBILE_BENCHMARK=0 \
 BUILD_MOBILE_TEST=0 \
 PYTORCH_ROCM_ARCH="gfx90a" \
 python setup.py install) > $INSTALL_OUTFILE 2>&1
Please adjust the paths and variables according to your environment.
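Whichever route you take on Frontier, a small smoke test like the sketch below (launched with your usual srun settings; the exact flags depend on your allocation and are not prescribed here) exercises the GPU-Aware MPI path by allreducing a GPU-resident tensor through the MPI backend:

import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")
rank = dist.get_rank()
device = rank % torch.cuda.device_count()   # torch.cuda maps to HIP devices on ROCm builds
torch.cuda.set_device(device)

t = torch.ones(1024, device=device)
dist.all_reduce(t, op=dist.ReduceOp.SUM)    # runs through MVAPICH-Plus on GPU buffers
print(f"rank {rank}: allreduce gave {t[0].item()}, expected {dist.get_world_size()}")
dist.destroy_process_group()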
3.3 Install prebuilt wheel on OSC Cardinal:
module reset
module load gcc/12.3.0
module load cuda/12.4.1
module unload mvapich/3.0
export CPATH=$CUDA_HOME/include:$CPATH
export CPATH=/fs/ess/PZS0622/mvp4.1_pytorch/mvapich-plus/mvp-plus4.1_cuda12.4.1/install/include:$CPATH
export PATH=/fs/ess/PZS0622/mvp4.1_pytorch/mvapich-plus/mvp-plus4.1_cuda12.4.1/install/bin:$PATH
export LD_LIBRARY_PATH=/fs/ess/PZS0622/mvp4.1_pytorch/mvapich-plus/mvp-plus4.1_cuda12.4.1/install/lib:$LD_LIBRARY_PATH
export LD_PRELOAD=/fs/ess/PZS0622/mvp4.1_pytorch/mvapich-plus/mvp-plus4.1_cuda12.4.1/install/lib/libmpi.so