1. Overview

Distributed Deep Learning has become the default approach for training Deep Neural Networks (DNNs) on large datasets like ImageNet. Broadly, distributed training can be categorized into three strategies: 1) Data Parallelism, 2) Model Parallelism, and 3) Hybrid Parallelism. In data parallelism (DP), the DNN is replicated across multiple Processing Elements (PEs) such as GPUs, and allreduce operations keep the replicas' weights synchronized by reducing the gradients computed on every replica and distributing the result back to all of them.
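To make the synchronization step concrete, the sketch below shows what a by-hand gradient allreduce with torch.distributed could look like (illustrative only; DDP performs this reduction automatically):

import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module) -> None:
    # After each replica finishes its backward pass on its own shard of data,
    # sum the gradients across all ranks and average them so that every
    # replica applies the identical weight update.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

In practice, DDP fuses these per-parameter reductions into larger buckets and overlaps them with the backward pass, which is why the performance of the underlying Allreduce matters so much at scale.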

PyTorch's native Distributed Data Parallel (DDP) has emerged as the leading framework for scalable deep learning training, providing efficient gradient synchronization through optimized collective communication. MVAPICH-Plus provides vendor-neutral, optimized Allreduce operations that accelerate DNN training on large numbers of PEs/GPUs while delivering competitive performance against vendor-specific libraries (NCCL, RCCL). HiDL with PyTorch native DDP and MVAPICH-Plus delivers scalable, portable distributed DNN training solutions across diverse HPC ecosystems.

2. System Requirements

MVAPICH-Plus is the preferred MPI runtime for DDP training with PyTorch 2.0 on GPUs and CPUs.
Please download MVAPICH-Plus at the following page: https://mvapich.cse.ohio-state.edu/downloads/
Follow the userguide to set up your MVAPICH-Plus installation here: https://mvapich-docs.readthedocs.io/en/mvapich-plus/

3. Install PyTorch 2.0 with enhanced GPU-Aware MPI support

We direct users to our open-source PyTorch branch: https://github.com/OSU-Nowlab/pytorch/tree/hidl-2.0

Our current branch only supports building from source; please allocate additional time and resources for this process.

For detailed instructions, we refer to PyTorch's official guide: https://github.com/pytorch/pytorch?tab=readme-ov-file#installation.

Additionally, to enable building with GPU-Aware MPI, append USE_CUDA_MPI=1 to your setup command, as shown in the example scripts below.

3.1 Example install script on TACC Vista:

(git clean -fdx ;\
git submodule sync ;\
git submodule update --init --recursive ;\
make clean ;\
python setup.py clean ;\
export _GLIBCXX_USE_CXX11_ABI=1 ;\
export CMAKE_CXX_COMPILER=g++ ;\
export CMAKE_C_COMPILER=gcc ;\
export MPI_C_COMPILER=mpicc ;\
export MPI_CXX_COMPILER=mpicxx ;\
export MPI_HOME=${MPI_HOME} ;\
export CMAKE_PREFIX_PATH="${MPI_HOME}:$LD_LIBRARY_PATH:$PATH:$CPATH" ;\
MAX_JOBS=16 \
USE_CUFILE=0 \
USE_XNNPACK=0 \
USE_CUDA_MPI=1 \
USE_MPI=1 \
USE_DISTRIBUTED=1 \
BUILD_TEST=0 \
BUILD_MOBILE_BENCHMARK=0 \
BUILD_MOBILE_TEST=0 \
PYTORCH_CUDA_ARCH_LIST="9.0" \
USE_NCCL=1 \
USE_SYSTEM_NCCL=1 \
NCCL_INCLUDE_DIR=$TACC_NCCL_DIR/include \
NCCL_LIB_DIR=$TACC_NCCL_DIR/lib \
python setup.py develop) > $INSTALL_OUTFILE 2>&1

3.2 Example install script on OLCF Frontier:

(git clean -fdx ;\
git submodule sync ;\
git submodule update --init --recursive ;\
make clean ;\
python setup.py clean ;\
python tools/amd_build/build_amd.py ;\
export CMAKE_CXX_COMPILER=/opt/cray/pe/gcc-native/13/bin/g++ ;\
export CMAKE_C_COMPILER=/opt/cray/pe/gcc-native/13/bin/gcc ;\
export MPI_C_COMPILER=/lustre/orion/csc549/scratch/langx/project/hidl-rccl/install/mvp4.1-hip-srun/bin/mpicc ;\
export MPI_CXX_COMPILER=/lustre/orion/csc549/scratch/langx/project/hidl-rccl/install/mvp4.1-hip-srun/bin/mpicxx ;\
export MPI_HOME=/lustre/orion/csc549/scratch/langx/project/hidl-rccl/install/mvp4.1-hip-srun ;\
export CMAKE_PREFIX_PATH="${MPI_HOME}:$LD_LIBRARY_PATH:$CPATH:$PATH:${CMAKE_PREFIX_PATH}" ;\
MAX_JOBS=16 \
USE_XNNPACK=0 \
USE_CUDA_MPI=1 \
USE_ROCM=1 \
USE_MPI=1 \
USE_DISTRIBUTED=1 \
BUILD_TEST=0 \
BUILD_MOBILE_BENCHMARK=0 \
BUILD_MOBILE_TEST=0 \
PYTORCH_ROCM_ARCH="gfx90a" \
python setup.py install) > $INSTALL_OUTFILE 2>&1

Please adjust the paths and environment variables above to match your own system.
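As a quick check that the finished build picked up MPI and GPU support, something along the following lines can be run inside the build environment (a minimal sketch; ROCm builds also report their devices through the torch.cuda API):

import torch
import torch.distributed as dist

# Confirm that this PyTorch build exposes the MPI backend and sees the GPUs.
print("Torch version :", torch.__version__)
print("MPI available :", dist.is_mpi_available())
print("GPU available :", torch.cuda.is_available())
print("Device count  :", torch.cuda.device_count())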

4. Running PyTorch DDP with MVAPICH-Plus using NanoGPT

In this section, we demonstrate how to launch DDP training with MVAPICH-Plus using nanoGPT.
The main training file is here: https://github.com/karpathy/nanoGPT/blob/master/train.py

Once PyTorch is built with GPU-Aware MPI support, either mpirun or srun (depending on the system) can be used to launch DDP training.

Important: Please make sure the MVAPICH-Plus binaries and libraries are on your PATH and LD_LIBRARY_PATH before launching.
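Note that nanoGPT's train.py expects torchrun-style environment variables (RANK, LOCAL_RANK, WORLD_SIZE) to set up DDP. When launching with mpirun or srun instead, one option is to initialize the process group directly on the MPI backend, which obtains rank and world size from the MPI runtime. The sketch below illustrates that pattern (the placeholder model and the rank-to-GPU mapping are assumptions; adapt them to your script and scheduler):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# The MPI backend obtains rank and world size from the MPI runtime, so no
# MASTER_ADDR/RANK environment variables are needed.
dist.init_process_group(backend="mpi")
rank = dist.get_rank()

# Assumption: ranks are placed consecutively on each node, so the local GPU
# can be derived from the global rank and the number of visible devices.
local_rank = rank % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).to("cuda")   # placeholder model
model = DDP(model, device_ids=[local_rank])

With the GPU-Aware MPI build from Section 3, DDP's gradient allreduce on GPU tensors is then carried out by MVAPICH-Plus directly on GPU buffers.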

4.1 Example DDP training launch using mpirun on TACC Vista:

np=$(($GPUS_PER_NODE*$SLURM_NNODES))

# MVAPICH-Plus parameters for best performance
mvp_env="-genv MVP_ENABLE_GPU_PTRCACHE=0 \
-genv MPICH_ENABLE_GPU_PTRCACHE=0 \
-genv MPIR_CVAR_ENABLE_GPU_PTRCACHE=0 \
-genv MPIR_CVAR_ENABLE_GDRCOPY=0 \
-genv UCX_TLS=^dc \
-genv UCX_IB_PCI_RELAXED_ORDERING=on \
-genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM=osu_gpu_direct \
-genv MPIR_CVAR_ALLREDUCE_THROTTLE=4 \
-genv MPIR_CVAR_ALLREDUCE_COMPOSITION=2 \
-genv MPIR_CVAR_ALLREDUCE_IPC_MSG_SIZE_THRESHOLD=65536"

mpirun -np $np -ppn $GPUS_PER_NODE -hostfile $HOSTFILE $mvp_env python train.py

4.2 Example DDP training launch using srun on OLCF Frontier

np=$(($GPUS_PER_NODE*$SLURM_NNODES))

# MVAPICH-Plus parameters for best performance
mvp_env="MPIR_CVAR_ENABLE_GPU=1,\
MPIR_CVAR_PMI_VERSION=2,\
MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1,\
MPIR_CVAR_CH4_OFI_ENABLE_MR_HMEM=1,\
MPIR_CVAR_ALLREDUCE_COMPOSITION=8,\
MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM=osu_gpu_rsa,\
MPIR_CVAR_ALLREDUCE_GPU_RING_THRESHOLD=$((16 * 1024 * 1024))"

srun -n $np -N $SLURM_NNODES --gpus-per-node=8 --export=ALL,$mvp_env python train.py
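Before committing to a long training run, it can be worth sanity-checking GPU-aware allreduce under the same launcher and MVAPICH-Plus settings. A minimal sketch of such a check (substitute this script for train.py in the commands above):

import torch
import torch.distributed as dist

# Every rank fills a GPU tensor with its rank, allreduces it, and verifies
# the sum matches the closed-form expectation 0 + 1 + ... + (world - 1).
dist.init_process_group(backend="mpi")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

x = torch.full((1024,), float(rank), device="cuda")
dist.all_reduce(x, op=dist.ReduceOp.SUM)

expected = world * (world - 1) / 2
assert torch.allclose(x, torch.full_like(x, expected)), "GPU allreduce mismatch"
if rank == 0:
    print(f"GPU allreduce OK across {world} ranks")
dist.destroy_process_group()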