1. Overview

Distributed Deep Learning has become the default approach to train Deep Neural Networks (DNNs) on large datasets like ImageNet. Broadly, Distributed training can be categorized into three strategies: 1) Data Parallelism, 2) Model Parallelism, and 3) Hybrid Parallelism. In data parallelism (DP), DNN is replicated across multiple Processing Elements (PEs) like GPUs. An allreduce operation is used to synchronize the DNN's weights across multiple replicas by reducing the gradients from all replicas and pushing the result to all the replicas.

MVAPICH2 provides an optimized Allreduce operation to accelerate DNN training on a large number of PEs/GPUs. Horovod is a distributed deep learning training framework, which supports popular deep learning frameworks like TensorFlow, Keras, PyTorch, and Apache MXNet. Horovod with MVAPICH2 provides scalable distributed DNN training solutions for both CPUs and GPUs.

MVAPICH2-GDR is the preferred MPI runtime for distributed DNN training with Horovod on GPUs.

Please download MVAPICH2-GDR at the following page: Download MVAPICH2-GDR

Follow the userguide to setup your MVAPICH2-GDR installation here: MVAPICH2-GDR Userguide

MVAPICH2-X is the preferred MPI runtime for distributed DNN training with Horovod on CPUs.

Please download MVAPICH2-X at the following page: Download MVAPICH2-X

Follow the userguide to setup your MVAPICH2-X installation here: MVAPICH2-X Userguide

3. Installing Horovod with MVAPICH2-X or MVAPICH2-GDR

The Following instructions can be used to install Horovod with either MVAPICH2-X or MVAPICH2-GDR.

3.1. Installing Miniconda

We recommend using the Conda/Miniconda package management system to create environments. However, Horovod can also be installed without Conda. Please skip this section if you want to install Horovod in default python.

Download appropriate miniconda installer from https://docs.conda.io/en/latest/miniconda.html

Install Miniconda on Linux
$ bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda
$ source $HOME/miniconda/bin/activate
$ conda create -n horovod_mv2 python=3.6.5
$ conda activate horovod_mv2

3.2. Installing TensorFlow/PyTorch/MXNet

Detailed instructions to install TensorFlow can found here and PyTorch here

Install TensorFlow against CUDA 9.2 runtime
$ pip install tensorflow-gpu==1.13.1
Install MXNet against CUDA 10.2 runtime (with MKL support)
$ pip install mxnet-cu102mkl
Install PyTorch against CUDA 9.2 runtime
$ pip install torch==1.5.0+cu92 torchvision==0.6.0+cu92 -f https://download.pytorch.org/whl/torch_stable.html

3.3. Installing Horovod with MVAPICH2

Make sure the appropriate MVAPICH2 paths are in $PATH and $LD_LIBRARY_PATH. We recommend using GCC version 7.3.X to build horovod. Tested TensorFlow and Horovod versions can be found in Section 5.

Note: Horovod will link to whichever MPI library is in your path, so ensure you only have a single library loaded before the pip install command. Example commands to check your environment: echo $PATH, which mpicc, module list

If MVAPICH2 is installed using rpm2cpio, then please update the prefix, exec_prefix, sysconfdir, includedir, and libdir paths in $MV2_HOME/bin/mpicc and $MV2_HOME/bin/mpicxx.

Install Horovod on CPU
$ pip install --no-cache-dir horovod
Install Horovod on GPU

Pass the path containing CUDA include and lib directories to HOROVOD_CUDA_HOME. Please see The Horovod GPU Userguide for the latest commands

$ HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_CUDA_HOME=/opt/cuda/10.1 HOROVOD_WITH_MPI=1 pip install --no-cache-dir horovod

4. Running Horovod with MVAPICH2

Here are some examples of running Horovod. We will use example scripts from https://github.com/horovod/horovod/tree/master/examples

4.1. Example running TensorFlow

Distributed DNN training using TensorFlow and Horovod
    1: $ export MV2_PATH=/opt/mvapich2/gdr/2.3.5/gnu
    2: $ export MV2_USE_CUDA=1
    3: $ export MV2_SUPPORT_DL=1
    5: $ $MV2_PATH/bin/mpirun_rsh --export-all -np 2 hostA hostB \
    6:         python horovod/examples/tensorflow/tensorflow_synthetic_benchmark.py \
    7:         --batch-size=64 --model=ResNet50

4.2. Example running PyTorch

Distributed DNN training using PyTorch and Horovod

Note: PyTorch is able to take advantage of our caching and GDRCOPY designs. Therefore, we recommend removing MV2_SUPPORT_DL and adding LD_PRELOAD. For more details on LD_PRELOAD and GDRCOPY, please see The MVAPICH2-GDR Userguide

    1: $ export MV2_PATH=/opt/mvapich2/gdr/2.3.5/gnu
    2: $ export MV2_USE_CUDA=1
    3: $ export MV2_USE_GDRCOPY=1
    4: $ export MV2_GPUDIRECT_GDRCOPY_LIB=/path/to/GDRCOPY/install/lib64/libgdrapi.so
    6: $ $MV2_PATH/bin/mpirun_rsh --export-all -np 2 hostA hostB \
    7:         LD_PRELOAD=$MV2_PATH/lib/libmpi.so \
    8:         python horovod/examples/pytorch/pytorch_synthetic_benchmark.py \
    9:         --batch-size=64 --model=resnet50

4.3. Example running MXNet

Distributed DNN training using MXNet and Horovod
    1: $ export MV2_PATH=/opt/mvapich2/gdr/2.3.3/gnu
    2: $ export MV2_USE_CUDA=1
    3: $ export MV2_SUPPORT_DL=1
    5: $ $MV2_PATH/bin/mpirun_rsh --export-all -np 2 hostA hostB \
    6:         python mxnet_imagenet_resnet50.py --model=resnet50_v1 \
    7:         --batch-size=64 --warmup-epochs=1 --num-epochs=1 --log-interval=10

5. Tested TensorFlow versions

We have tested following DL Framework versions with Horovod. Please note that this is not a comprehensive list.

PyTorch Horovod CUDA cuDNN GCC
v1.7.0 v0.20.0/v0.21.0 10.1/10.2 7.6.5 7.3.0/8.3.0
v1.6.0 v0.19.4 10.1 7.6.5 7.3.0/8.3.0
MXNet Horovod CUDA cuDNN GCC
v1.7.0 v0.20.3 10.2 7.6.5 8.4.0
TensorFlow Horovod CUDA cuDNN GCC
v1.13.1 v0.16.4 / v0.18.0 / v0.19.4 10.0 7.4 7.3.0
v1.14.0 v0.18.0 / v0.19.4 10.0 7.4 7.3.0
v1.15.3 v0.18.0 / v0.19.4 10.0 7.6.4 8.3.0
v2.0.0 / v2.1.0 / v2.2.0 v0.18.0 / v0.19.4 10.0 7.6.5 8.3.0
v2.0.0 / v2.1.0 / v2.2.0 v0.20.0 10.0 7.6.5 7.3.0