1. Overview

Distributed Deep Learning has become the default approach to train Deep Neural Networks (DNNs) on large datasets like ImageNet. Broadly, Distributed training can be categorized into three strategies: 1) Data Parallelism, 2) Model Parallelism, and 3) Hybrid Parallelism. In data parallelism (DP), DNN is replicated across multiple Processing Elements (PEs) like GPUs. An allreduce operation is used to synchronize the DNN's weights across multiple replicas by reducing the gradients from all replicas and pushing the result to all the replicas.

MVAPICH2 provides an optimized Allreduce operation to accelerate DNN training on a large number of PEs/GPUs. Horovod is a distributed deep learning training framework, which supports popular deep learning frameworks like TensorFlow, Keras, and PyTorch. Horovod with MVAPICH2 provides scalable distributed DNN training solutions for both CPUs and GPUs

MVAPICH2-GDR is the preferred MPI runtime for distributed DNN training with Horovod on GPUs.

Please download MVAPICH2-GDR at the following page: Download MVAPICH2-GDR

Follow the userguide to setup your MVAPICH2-GDR installation here: MVAPICH2-GDR Userguide


MVAPICH2 is the preferred MPI runtime for distributed DNN training with Horovod on CPUs.

Please download MVAPICH2 at the following page: Download MVAPICH2

Follow the userguide to setup your MVAPICH2 installation here: MVAPICH2 Userguide

3. Installing Horovod with MVAPICH2 or MVAPICH2-GDR

The Following instructions can be used to install Horovod with either MVAPICH2 or MVAPICH2-GDR.

3.1. Installing Miniconda

We recommend using the Conda/Miniconda package management system to create environments. However, Horovod can also be installed without Conda. Please skip this section if you want to install Horovod in default python.

Download appropriate miniconda installer from https://docs.conda.io/en/latest/miniconda.html

Install Miniconda on Linux
$ bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda
$ source $HOME/miniconda/bin/activate
$ conda create -n horovod_mv2 python=3.6.5
$ conda activate horovod_mv2
$ export PYTHONNOUSERSITE=true

3.2. Installing TensorFlow/PyTorch/MXNet

Detailed instructions to install TensorFlow can found here and PyTorch here

Install TensorFlow against CUDA 11.2 runtime
$ pip install tensorflow
Install TensorFlow against ROCm 5.1.1 runtime
$ pip install tensorflow-rocm
Install PyTorch against CUDA 11.3 runtime
$ pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113  
Install PyTorch against ROCm 5.1.1 runtime
$ pip install torch==1.12.1+rocm5.1.1 torchvision==0.13.1+rocm5.1.1 torchaudio==0.12.1 --extra-index-url  https://download.pytorch.org/whl/rocm5.1.1 

3.3. Installing Horovod with MVAPICH2

Make sure the appropriate MVAPICH2 paths are in $PATH and $LD_LIBRARY_PATH. We recommend using GCC version 8.5.0 or 10.3.0 to build Horovod. Tested TensorFlow, PyTorch, and Horovod versions can be found in Section 5.

Note: Horovod will link to whichever MPI library is in your path, so ensure you only have a single library loaded before the pip install command. Example commands to check your environment: echo $PATH, which mpicc, module list

If MVAPICH2 is installed using rpm2cpio, then please update the prefix, exec_prefix, sysconfdir, includedir, and libdir paths in $MV2_HOME/bin/mpicc and $MV2_HOME/bin/mpicxx.

Install Horovod on CPU
$ pip install --no-cache-dir horovod
Install Horovod on NVIDIA GPUs

Pass the path containing CUDA include and lib directories to HOROVOD_CUDA_HOME. Please see The Horovod GPU Userguide for the latest commands

$ HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_CUDA_HOME=/opt/cuda/11.3 HOROVOD_WITH_MPI=1 pip install --no-cache-dir horovod
Install Horovod on AMD GPUs

Pass the path containing CUDA include and lib directories to HOROVOD_CUDA_HOME. Please see The Horovod GPU Userguide for the latest commands

$ HOROVOD_GPU=ROCM HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_ROCM_HOME=/opt/rocm/5.1.1 HOROVOD_WITH_MPI=1 pip install –-no-cache-dir horovod 

4. Running Horovod with MVAPICH2

Here are some examples of running Horovod. We will use example scripts from https://github.com/horovod/horovod/tree/master/examples

4.1. Example running TensorFlow

Distributed DNN training using TensorFlow and Horovod

Note: To run with AMD GPUs, replace MV2_USE_CUDA=1 with MV2_USE_ROCM=1.

    1: $ export MV2_PATH=/opt/mvapich2/gdr/2.3.5/gnu
    2: $ export MV2_USE_CUDA=1
    3: $ export MV2_SUPPORT_DL=1
    4:
    5: $ $MV2_PATH/bin/mpirun_rsh --export-all -np 2 hostA hostB \
    6:         python horovod/examples/tensorflow/tensorflow_synthetic_benchmark.py \
    7:         --batch-size=64 --model=ResNet50

4.2. Example running PyTorch

Distributed DNN training using PyTorch and Horovod

Note: PyTorch is able to take advantage of MVAPICH2-GDR's advanced designs. Therefore, we recommend removing MV2_SUPPORT_DL and adding LD_PRELOAD. For more details on MV2_SUPPORT_DL, LD_PRELOAD, and MVAPICH2-GDR, please see The MVAPICH2-GDR Userguide

    1: $ export MV2_PATH=/opt/mvapich2/gdr/2.3.5/gnu
    2: $ export MV2_USE_CUDA=1
    3: $ export MV2_USE_GDRCOPY=1
    4: $ export MV2_GPUDIRECT_GDRCOPY_LIB=/path/to/GDRCOPY/install/lib64/libgdrapi.so
    5:
    6: $ $MV2_PATH/bin/mpirun_rsh --export-all -np 2 hostA hostB \
    7:         LD_PRELOAD=$MV2_PATH/lib/libmpi.so \
    8:         python horovod/examples/pytorch/pytorch_synthetic_benchmark.py \
    9:         --batch-size=64 --model=resnet50

Note: We recommend running PyTorch's dataloader with pin_memory and persistent_workers. See the following example:

    train_loader = torch.utils.data.DataLoader(
            train_dataset, batch_size=args.batch_size,
            sampler=train_sampler, pin_memory=True, 
	    persistent_workers=True)

5. Tested TensorFlow versions

We have tested following DL Framework versions with Horovod. Please note that this is not a comprehensive list.

NVIDIA
PyTorch Horovod CUDA cuDNN Python GCC
v1.12.1 v0.24.0/v0.25.0 10.2/11.1/11.3 7.6.5/8.0.5/8.4.0 3.7.10 8.5.0
v1.10.1 v0.23.0/0.24.0/0.25.0 10.2/11.1/11.3 7.6.5/8.0.5/8.4.0 3.7.10 8.5.0
v1.8.0 v0.23.0/0.24.0/0.25.0 10.2/11.1 7.6.5/8.0.5/8.4.0 3.7.10 8.5.0
TensorFlow Horovod CUDA cuDNN Python GCC
v2.9.1 v0.24.0 / v0.25.0 11.2 8.2.1 3.10.4 8.5.0
v2.8.0 v0.23.0 / v0.24.0 11.2 8.2.1 3.10.4 8.5.0
v2.4.0 v0.24.0 11.0 8.0.5 3.8.0 8.5.0
AMD
PyTorch Horovod ROCm Python GCC
v1.12.1 v0.26.1/v0.25.0 5.1.1/5.1.3 3.9.16 10.3.0
v1.11.0 v0.26.1 /0.25.0 5.1.1/5.1.3 3.9.16 10.3.0
TensorFlow Horovod ROCm Python GCC
v2.11.0 v0.26.1 / v0.25.0 5.1.1/5.1.3 3.9.16 10.3.0
v2.10.1 v0.26.1 / v0.25.0 5.1.1/5.1.3 3.9.16 10.3.0
v2.4.0 v0.24.0 11.0 3.8.0 8.5.0