1. Overview
Distributed Deep Learning has become the default approach to train Deep Neural Networks (DNNs) on large datasets like ImageNet. Broadly, Distributed training can be categorized into three strategies: 1) Data Parallelism, 2) Model Parallelism, and 3) Hybrid Parallelism. In data parallelism (DP), DNN is replicated across multiple Processing Elements (PEs) like GPUs. An allreduce operation is used to synchronize the DNN's weights across multiple replicas by reducing the gradients from all replicas and pushing the result to all the replicas.
MVAPICH2 provides an optimized Allreduce operation to accelerate DNN training on a large number of PEs/GPUs. Horovod is a distributed deep learning training framework, which supports popular deep learning frameworks like TensorFlow, Keras, and PyTorch. Horovod with MVAPICH2 provides scalable distributed DNN training solutions for both CPUs and GPUs
2. Recommended System Features
MVAPICH2-GDR is the preferred MPI runtime for distributed DNN training with Horovod on GPUs.
Please download MVAPICH2-GDR at the following page: Download MVAPICH2-GDR
Follow the userguide to setup your MVAPICH2-GDR installation here: MVAPICH2-GDR Userguide
MVAPICH2 is the preferred MPI runtime for distributed DNN training with Horovod on CPUs.
Please download MVAPICH2 at the following page: Download MVAPICH2
Follow the userguide to setup your MVAPICH2 installation here: MVAPICH2 Userguide
3. Installing Horovod with MVAPICH2 or MVAPICH2-GDR
The Following instructions can be used to install Horovod with either MVAPICH2 or MVAPICH2-GDR.
3.1. Installing Miniconda
We recommend using the Conda/Miniconda package management system to create environments. However, Horovod can also be installed without Conda. Please skip this section if you want to install Horovod in default python.
Download appropriate miniconda installer from https://docs.conda.io/en/latest/miniconda.html
$ bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda $ source $HOME/miniconda/bin/activate $ conda create -n horovod_mv2 python=3.6.5 $ conda activate horovod_mv2 $ export PYTHONNOUSERSITE=true
3.2. Installing TensorFlow/PyTorch/MXNet
$ pip install tensorflow
$ pip install tensorflow-rocm
$ pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
$ pip install torch==1.12.1+rocm5.1.1 torchvision==0.13.1+rocm5.1.1 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/rocm5.1.1
3.3. Installing Horovod with MVAPICH2
Make sure the appropriate MVAPICH2 paths are in $PATH and $LD_LIBRARY_PATH. We recommend using GCC version 8.5.0 or 10.3.0 to build Horovod. Tested TensorFlow, PyTorch, and Horovod versions can be found in Section 5.
Note: Horovod will link to whichever MPI library is in your path, so ensure you only have a single library loaded before the pip install command. Example commands to check your environment: echo $PATH, which mpicc, module list
If MVAPICH2 is installed using rpm2cpio, then please update the prefix, exec_prefix, sysconfdir, includedir, and libdir paths in $MV2_HOME/bin/mpicc and $MV2_HOME/bin/mpicxx.
$ pip install --no-cache-dir horovod
Pass the path containing CUDA include and lib directories to HOROVOD_CUDA_HOME. Please see The Horovod GPU Userguide for the latest commands
$ HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_CUDA_HOME=/opt/cuda/11.3 HOROVOD_WITH_MPI=1 pip install --no-cache-dir horovod
Pass the path containing CUDA include and lib directories to HOROVOD_CUDA_HOME. Please see The Horovod GPU Userguide for the latest commands
$ HOROVOD_GPU=ROCM HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_ROCM_HOME=/opt/rocm/5.1.1 HOROVOD_WITH_MPI=1 pip install –-no-cache-dir horovod
4. Running Horovod with MVAPICH2
Here are some examples of running Horovod. We will use example scripts from https://github.com/horovod/horovod/tree/master/examples
4.1. Example running TensorFlow
Note: To run with AMD GPUs, replace MV2_USE_CUDA=1 with MV2_USE_ROCM=1.
1: $ export MV2_PATH=/opt/mvapich2/gdr/2.3.5/gnu 2: $ export MV2_USE_CUDA=1 3: $ export MV2_SUPPORT_DL=1 4: 5: $ $MV2_PATH/bin/mpirun_rsh --export-all -np 2 hostA hostB \ 6: python horovod/examples/tensorflow/tensorflow_synthetic_benchmark.py \ 7: --batch-size=64 --model=ResNet50
4.2. Example running PyTorch
Note: PyTorch is able to take advantage of MVAPICH2-GDR's advanced designs. Therefore, we recommend removing MV2_SUPPORT_DL and adding LD_PRELOAD. For more details on MV2_SUPPORT_DL, LD_PRELOAD, and MVAPICH2-GDR, please see The MVAPICH2-GDR Userguide
1: $ export MV2_PATH=/opt/mvapich2/gdr/2.3.5/gnu 2: $ export MV2_USE_CUDA=1 3: $ export MV2_USE_GDRCOPY=1 4: $ export MV2_GPUDIRECT_GDRCOPY_LIB=/path/to/GDRCOPY/install/lib64/libgdrapi.so 5: 6: $ $MV2_PATH/bin/mpirun_rsh --export-all -np 2 hostA hostB \ 7: LD_PRELOAD=$MV2_PATH/lib/libmpi.so \ 8: python horovod/examples/pytorch/pytorch_synthetic_benchmark.py \ 9: --batch-size=64 --model=resnet50
Note: We recommend running PyTorch's dataloader with pin_memory and persistent_workers. See the following example:
train_loader = torch.utils.data.DataLoader(
train_dataset, batch_size=args.batch_size,
sampler=train_sampler, pin_memory=True,
persistent_workers=True)
5. Tested TensorFlow versions
We have tested following DL Framework versions with Horovod. Please note that this is not a comprehensive list.
PyTorch | Horovod | CUDA | cuDNN | Python | GCC |
---|---|---|---|---|---|
v1.12.1 | v0.24.0/v0.25.0 | 10.2/11.1/11.3 | 7.6.5/8.0.5/8.4.0 | 3.7.10 | 8.5.0 |
v1.10.1 | v0.23.0/0.24.0/0.25.0 | 10.2/11.1/11.3 | 7.6.5/8.0.5/8.4.0 | 3.7.10 | 8.5.0 |
v1.8.0 | v0.23.0/0.24.0/0.25.0 | 10.2/11.1 | 7.6.5/8.0.5/8.4.0 | 3.7.10 | 8.5.0 |
TensorFlow | Horovod | CUDA | cuDNN | Python | GCC |
v2.9.1 | v0.24.0 / v0.25.0 | 11.2 | 8.2.1 | 3.10.4 | 8.5.0 |
v2.8.0 | v0.23.0 / v0.24.0 | 11.2 | 8.2.1 | 3.10.4 | 8.5.0 |
v2.4.0 | v0.24.0 | 11.0 | 8.0.5 | 3.8.0 | 8.5.0 |
PyTorch | Horovod | ROCm | Python | GCC |
---|---|---|---|---|
v1.12.1 | v0.26.1/v0.25.0 | 5.1.1/5.1.3 | 3.9.16 | 10.3.0 |
v1.11.0 | v0.26.1 /0.25.0 | 5.1.1/5.1.3 | 3.9.16 | 10.3.0 |
TensorFlow | Horovod | ROCm | Python | GCC |
v2.11.0 | v0.26.1 / v0.25.0 | 5.1.1/5.1.3 | 3.9.16 | 10.3.0 |
v2.10.1 | v0.26.1 / v0.25.0 | 5.1.1/5.1.3 | 3.9.16 | 10.3.0 |
v2.4.0 | v0.24.0 | 11.0 | 3.8.0 | 8.5.0 |