1. Overview

The availability of large data sets such as ImageNet and the massively parallel computation support of modern HPC devices such as NVIDIA GPUs have fueled a renewed interest in Deep Learning (DL) algorithms. This has triggered the development of DL frameworks like Caffe, Torch, TensorFlow, and CNTK. However, most DL frameworks have been limited to a single node. To scale out DL frameworks and bring HPC capabilities to the DL arena, we propose OSU-Caffe, a scalable and distributed adaptation of Caffe for modern multi-GPU clusters.

OSU-Caffe takes a co-design approach for the popular Caffe framework and the widely used MVAPICH2-GDR MPI runtime. The co-design methodology involves re-designing Caffe’s workflow to maximize the overlap of computation and communication with multi-stage data propagation and gradient aggregation schemes. It also brings DL-Awareness to the MPI runtime by designing efficient CUDA-Aware collective operations for very large messages.

The OSU-Caffe implementation is based on NVIDIA’s fork of Caffe, which supports GPU-specific optimizations like cuDNN and CUB. The co-designed MPI runtime is MVAPICH2-GDR 2.2, an efficient CUDA-Aware MPI runtime with GPUDirect RDMA and DL-Aware optimizations.

2. System Requirements

OSU-Caffe 0.9 binary release requires the following software to be installed on your system:

MVAPICH2-GDR is the preferred MPI runtime for OSU-Caffe, but it can work with all CUDA-Aware MPI runtimes.

Please download and set up MVAPICH2-GDR by following the user guide available at: http://mvapich.cse.ohio-state.edu/userguide/gdr/

4. Installing OSU-Caffe

To install OSU-Caffe you simply need to select the correct version of the tarball for your system and install it using one of the following methods.

4.1. Installing OSU-Caffe with root access

If you have root access to your machine, you can use the following method to install everything in the /opt directory.

Install OSU-Caffe built using GNU compilers against CUDA 7.5 runtime and MOFED 3.2
$ curl -O http://hidl.cse.ohio-state.edu/download/hidl/osu-caffe/0.9/mofed-3.2/osu-caffe-0.9-cuda7.5-gnu-el7.centos.tgz
$ tar -xf osu-caffe-0.9-cuda7.5-gnu-el7.centos.tgz
$ cd osu-caffe-0.9-cuda7.5-gnu-el7.centos/
$ ./install-root.sh

To install to a custom prefix instead, edit install-custom-prefix.sh and replace </custom/install/prefix> with the desired installation directory.
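The edit can also be scripted. The following is a minimal sketch, assuming the placeholder appears in install-custom-prefix.sh literally as </custom/install/prefix>. The set_prefix helper and the target directory are hypothetical and not part of the release.

```shell
# Hypothetical target directory; choose any location you can write to.
PREFIX="$HOME/osu-caffe-0.9"

# set_prefix FILE DIR
# Replaces every literal </custom/install/prefix> in FILE with DIR.
# DIR must not contain the characters '|' or '&', which are special to sed here.
set_prefix() {
    sed "s|</custom/install/prefix>|$2|g" "$1" > "$1.tmp" &&
        cat "$1.tmp" > "$1" &&      # write back in place, keeping file permissions
        rm -f "$1.tmp"
}

# Usage, from inside the extracted tarball directory:
#   set_prefix install-custom-prefix.sh "$PREFIX"
#   ./install-custom-prefix.sh
```

Check the edited script before running it, since the exact layout of the placeholder inside the script is an assumption here.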

4.2. Installing OSU-Caffe for non-root users

If you do not have root permission, you can use the included rpm2cpio-based script. Run ./install-nonroot.sh and everything will be installed in ./opt under the current working directory.

Install OSU-Caffe built using GNU compilers against CUDA 7.5 runtime and MOFED 3.2
$ curl -O http://hidl.cse.ohio-state.edu/download/hidl/osu-caffe/0.9/mofed-3.2/osu-caffe-0.9-cuda7.5-gnu-el7.centos.tgz
$ tar -xf osu-caffe-0.9-cuda7.5-gnu-el7.centos.tgz
$ cd osu-caffe-0.9-cuda7.5-gnu-el7.centos/
$ ./install-nonroot.sh
Tip
If you are using a Debian-based system such as Ubuntu, you can convert the RPMs to deb packages using a tool such as alien, or follow the rpm2cpio instructions above.
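For readers unfamiliar with rpm2cpio, the general technique is sketched below. This is not the content of install-nonroot.sh, which is not reproduced in this guide. The extract_rpms helper and its arguments are hypothetical. rpm2cpio converts an RPM to a cpio archive on standard output, and cpio -idm unpacks it, creating directories as needed.

```shell
# extract_rpms SRC_DIR DEST_DIR
# Unpacks every .rpm found in SRC_DIR below DEST_DIR without root access.
extract_rpms() {
    src=$(cd "$1" && pwd) || return 1      # absolute path to the RPM directory
    mkdir -p "$2" || return 1
    for rpm in "$src"/*.rpm; do
        [ -e "$rpm" ] || continue          # the glob matched nothing
        # Files are unpacked relative to DEST_DIR (e.g. DEST_DIR/opt/...).
        (cd "$2" && rpm2cpio "$rpm" | cpio -idm) || return 1
    done
}

# Usage, from inside the extracted tarball directory:
#   extract_rpms . "$PWD"
```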

5. Running OSU-Caffe

Here are some examples of OSU-Caffe with different datasets and models.

The examples below assume that $CAFFE_PATH is the root of your OSU-Caffe installation. Download and set up the "data" and "examples" folders from the tutorials under $CAFFE_PATH.

5.1. Example running MNIST

To run OSU-Caffe with the MNIST dataset, first download and set up the dataset using Caffe’s MNIST tutorial available at http://caffe.berkeleyvision.org/gathered/examples/mnist.html

    $ export CAFFE_PATH=/opt/osu-caffe/osu-caffe-gnu/0.9
    $ export MV2_PATH=/opt/mvapich2/gdr/2.2/gnu
    $ export MV2_GPUDIRECT_GDRCOPY_LIB=/path/to/GDRCOPY/libgdrapi.so
    $ export MV2_USE_CUDA=1
    $ export LD_PRELOAD=$MV2_PATH/lib/libmpi.so

    $ $MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
            $CAFFE_PATH/build/tools/caffe.bin train -solver \
            $CAFFE_PATH/examples/mnist/lenet_solver.prototxt
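The launch line follows one pattern across all examples in this guide: -np is followed by the process count and then one host name per process. The sketch below builds that line for an arbitrary host list. The caffe_launch_cmd helper is hypothetical and not part of OSU-Caffe. It only prints the command, so it can be inspected before it is run.

```shell
export CAFFE_PATH=/opt/osu-caffe/osu-caffe-gnu/0.9
export MV2_PATH=/opt/mvapich2/gdr/2.2/gnu

# caffe_launch_cmd SOLVER HOST...
# Prints (does not run) the mpirun_rsh line for SOLVER on the given hosts.
caffe_launch_cmd() {
    solver=$1
    shift                      # the remaining arguments are host names
    echo "$MV2_PATH/bin/mpirun_rsh -export -np $# $*" \
         "$CAFFE_PATH/build/tools/caffe.bin train -solver $solver"
}

caffe_launch_cmd "$CAFFE_PATH/examples/mnist/lenet_solver.prototxt" hostA hostB
```

With hostA and hostB this prints the same command as the MNIST example, joined onto a single line.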

5.2. Example running CIFAR10

To run OSU-Caffe with the CIFAR10 dataset, first download and set up the dataset using Caffe’s CIFAR10 tutorial available at http://caffe.berkeleyvision.org/gathered/examples/cifar10.html

    $ export CAFFE_PATH=/opt/osu-caffe/osu-caffe-gnu/0.9
    $ export MV2_PATH=/opt/mvapich2/gdr/2.2/gnu
    $ export MV2_GPUDIRECT_GDRCOPY_LIB=/path/to/GDRCOPY/libgdrapi.so
    $ export MV2_USE_CUDA=1
    $ export LD_PRELOAD=$MV2_PATH/lib/libmpi.so

    $ $MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
            $CAFFE_PATH/build/tools/caffe.bin train -solver \
            $CAFFE_PATH/examples/cifar10/cifar10_quick_solver.prototxt

5.3. Example running ImageNet dataset and AlexNet network

To run OSU-Caffe with the ImageNet dataset, first download and set up the dataset using Caffe’s ImageNet tutorial available at http://caffe.berkeleyvision.org/gathered/examples/imagenet.html

    $ export CAFFE_PATH=/opt/osu-caffe/osu-caffe-gnu/0.9
    $ export MV2_PATH=/opt/mvapich2/gdr/2.2/gnu
    $ export MV2_GPUDIRECT_GDRCOPY_LIB=/path/to/GDRCOPY/libgdrapi.so
    $ export MV2_USE_CUDA=1
    $ export LD_PRELOAD=$MV2_PATH/lib/libmpi.so

    $ $MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
            $CAFFE_PATH/build/tools/caffe.bin train -solver \
            $CAFFE_PATH/models/bvlc_alexnet/solver.prototxt

6. Tuning and Advanced Usage

OSU-Caffe has been tested with the default options and everything should work right out of the box. However, we do provide certain advanced usage parameters to tune for better performance.

6.1. Running OSU-Caffe with Weak Scaling

Strong scaling is the default option for OSU-Caffe, but users can explicitly specify the scaling mode using the command-line flag -scal. The example below selects weak scaling.

    $ export CAFFE_PATH=/opt/osu-caffe/osu-caffe-gnu/0.9
    $ export MV2_PATH=/opt/mvapich2/gdr/2.2/gnu
    $ export MV2_GPUDIRECT_GDRCOPY_LIB=/path/to/GDRCOPY/libgdrapi.so
    $ export MV2_USE_CUDA=1
    $ export LD_PRELOAD=$MV2_PATH/lib/libmpi.so

    $ $MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
            $CAFFE_PATH/build/tools/caffe.bin train -solver \
            $CAFFE_PATH/models/bvlc_alexnet/solver.prototxt \
            -scal weak
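This guide does not define the two scaling modes. Under the usual HPC meaning of the terms (an assumption here, not a statement about OSU-Caffe internals), strong scaling keeps the global batch size fixed and divides it among the GPUs, while weak scaling keeps the per-GPU batch size fixed so that the effective global batch grows with the number of GPUs. The arithmetic can be sketched as follows, with hypothetical values for the batch size and GPU count.

```shell
batch=256   # hypothetical batch size taken from the network definition
gpus=4      # hypothetical number of MPI processes, one GPU each

strong_per_gpu=$((batch / gpus))   # strong: global batch fixed, split across GPUs
weak_global=$((batch * gpus))      # weak: per-GPU batch fixed, global batch grows

echo "strong scaling: $strong_per_gpu samples per GPU, $batch in total"
echo "weak scaling:   $batch samples per GPU, $weak_global in total"
```

If this interpretation holds, weak scaling changes the effective batch size, so solver hyperparameters such as the learning rate may need to be revisited.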

6.2. Running OSU-Caffe with Performance Optimizations

We have advanced optimizations for OSU-Caffe that can be triggered using command-line flags -amode and -pmode. We have configured OSU-Caffe so that the users do not need to worry about setting the right mode but we still allow advanced users to explicitly test performance using the following options.

  • -amode

    • 1, 2, and 3 are valid values

  • -pmode

    • 1, 2, and 3 are valid values

The example below sets -amode to 3. Command-line flags can be combined; here -scal and -amode are used together.

    $ export CAFFE_PATH=/opt/osu-caffe/osu-caffe-gnu/0.9
    $ export MV2_PATH=/opt/mvapich2/gdr/2.2/gnu
    $ export MV2_GPUDIRECT_GDRCOPY_LIB=/path/to/GDRCOPY/libgdrapi.so
    $ export MV2_USE_CUDA=1
    $ export LD_PRELOAD=$MV2_PATH/lib/libmpi.so

    $ $MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB \
            $CAFFE_PATH/build/tools/caffe.bin train -solver \
            $CAFFE_PATH/models/bvlc_alexnet/solver.prototxt \
            -scal weak -amode 3
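Because this guide does not describe what each mode does, one practical way to choose is to time a run with each value. The sketch below prints one launch line per valid value of a flag. The sweep_cmds helper is hypothetical, and the host names and solver path are copied from the example above.

```shell
export CAFFE_PATH=/opt/osu-caffe/osu-caffe-gnu/0.9
export MV2_PATH=/opt/mvapich2/gdr/2.2/gnu

# sweep_cmds FLAG
# Prints (does not run) one launch line per valid mode value (1, 2, 3).
# FLAG is expected to be -amode or -pmode.
sweep_cmds() {
    for mode in 1 2 3; do
        echo "$MV2_PATH/bin/mpirun_rsh -export -np 2 hostA hostB" \
             "$CAFFE_PATH/build/tools/caffe.bin train -solver" \
             "$CAFFE_PATH/models/bvlc_alexnet/solver.prototxt $1 $mode"
    done
}

sweep_cmds -amode
```

Each printed line can then be run under time, or the output can be saved to a file and executed as a batch.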

6.3. Running with Large DL Models

For large DL models like AlexNet, the number of CUDA events needs to be increased explicitly.

  • MV2_CUDA_NUM_EVENTS

    • Default: 64

    • To allow larger DL models to execute without running out of CUDA events, use a higher value like 256
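The parameter is set as an environment variable in the same way as the other MV2_ settings in this guide. A minimal sketch, using the value 256 suggested above:

```shell
# Raise the CUDA event pool from the default of 64 before launching.
export MV2_CUDA_NUM_EVENTS=256

# Confirm the value that the launch command will see.
echo "MV2_CUDA_NUM_EVENTS=$MV2_CUDA_NUM_EVENTS"
```

Then launch with mpirun_rsh as in the earlier examples. To our understanding, the -export flag used there is what passes the current environment, including this variable, to the MPI processes.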