Overview

Welcome to the High-Performance Deep Learning (HiDL) project created by the Network-Based Computing Laboratory of The Ohio State University. The availability of large data sets (e.g. ImageNet, PASCAL VOC 2012) coupled with massively parallel processors in modern HPC systems (e.g. NVIDIA GPUs) has fueled a renewed interest in Deep Learning (DL) algorithms. In addition to massively parallel DL accelerators such as GPUs, the availability and memory abundance of modern CPUs make them a viable alternative for DL training. This resurgence of DL applications has triggered the development of DL frameworks like Caffe, PyTorch, TensorFlow, Apache MXNet, and CNTK. While most DL frameworks provide experimental support for multi-node training, their distributed implementations are often suboptimal. The objective of the HiDL project is to exploit modern HPC technologies and solutions to scale out and accelerate DL frameworks.


MPI-Driven DL Training (TensorFlow, PyTorch, MXNet) with Horovod and MVAPICH2

Horovod is a distributed deep learning training framework with support for popular deep learning frameworks like TensorFlow, Keras, PyTorch, and Apache MXNet. MVAPICH2, MVAPICH2-X, and MVAPICH2-GDR provide many features to augment data parallel distributed training with Horovod on both CPUs and GPUs.

  • Builds with Python 2.x or 3.x and CUDA 9.x, 10.x, or 11.x
  • Full support for TensorFlow, PyTorch, Keras, and Apache MXNet
  • Optimized support at MPI-level for deep learning workloads
    • Efficient large-message collectives (e.g. Allreduce) on CPUs and GPUs
    • GPU-Direct Algorithms for all collective operations (including those commonly used for model-parallelism, e.g. Allgather and Alltoall)
    • Support for fork safety
  • Exploits efficient large-message collectives in MVAPICH2, MVAPICH2-X, and MVAPICH2-GDR
  • Tested with
    • Mellanox InfiniBand adapters (e.g., EDR, FDR, HDR)
    • NVIDIA GPUs: K80, P100, V100, Quadro RTX 5000, and A100
    • CUDA [9.x, 10.x, 11.x] and cuDNN [7.5.x, 7.6.x, 8.0.x]
    • TensorFlow [1.x, 2.x], PyTorch 1.x, and Apache MXNet 1.x
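
As a minimal illustration of the data-parallel training described above, the sketch below uses Horovod's standard TensorFlow/Keras API; it assumes Horovod has been built against an MVAPICH2 (or MVAPICH2-GDR) installation and is launched with that MPI library's launcher (e.g., mpirun -np 4 python train.py). The model, dataset, and script name are placeholders for illustration only.

```python
# Data-parallel training sketch: Horovod on top of an MPI runtime such as
# MVAPICH2/MVAPICH2-GDR. Gradients are averaged across ranks with MPI
# Allreduce under the hood.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # initialize Horovod on top of the underlying MPI runtime

# Pin each rank to one local GPU (skip for CPU-only training)
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Placeholder dataset: each rank trains on its own shard
(x, y), _ = tf.keras.datasets.mnist.load_data()
dataset = (tf.data.Dataset.from_tensor_slices((x[..., None] / 255.0, y))
           .shard(hvd.size(), hvd.rank())
           .shuffle(10000)
           .batch(64))

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10),
])

# Scale the learning rate by the number of ranks and wrap the optimizer so
# that gradient averaging is performed with MPI Allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# Broadcast the initial weights from rank 0 so all ranks start identically
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(dataset, epochs=2, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

The same pattern applies to the PyTorch and Apache MXNet bindings (horovod.torch and horovod.mxnet); only the framework-specific wrapper calls change, while the underlying Allreduce traffic is carried by the MPI library.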

Horovod Performance on MVAPICH2-X and MVAPICH2-GDR

For instructions on building Horovod with MVAPICH2-X or MVAPICH2-GDR, please refer to the Horovod Userguide.

MPI-Driven ML Training with MPI4cuML

cuML is a machine learning framework with a focus on GPU acceleration and distributed computing. MVAPICH2-GDR provides many features to augment distributed training with cuML on GPUs.

  • Builds with Python 3.7 and CUDA 10.1, 10.2, or 11.0
  • Optimized support at MPI-level for machine learning workloads
    • Efficient large-message and small-message collectives (e.g. Allreduce and Bcast) on GPUs
    • GPU-Direct Algorithms for all collective operations (including those commonly used for model-parallelism, e.g. Allgather and Alltoall)
    • Support for fork safety
  • Exploits efficient large-message and small-message collectives in MVAPICH2-GDR
  • Tested with
    • Mellanox InfiniBand adapters (FDR and HDR)
    • NVIDIA P100 and V100 GPUs
    • CUDA [10.1, 10.2, 11.0]
    • Various x86-based multi-core platforms (AMD and Intel)
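
MPI4cuML is built on top of mpi4py over MVAPICH2-GDR. Purely as a sketch of that communication layer (not the MPI4cuML API itself), the example below passes CuPy device buffers directly to MPI collectives, which a CUDA-aware MPI such as MVAPICH2-GDR can move with GPU-Direct; it assumes a recent mpi4py with GPU-buffer support, and the script name, buffer sizes, and GPU mapping are illustrative.

```python
# Sketch of CUDA-aware MPI collectives on GPU buffers (the kind of traffic
# MPI4cuML generates for cuML algorithms). Launch with the MPI library's
# launcher, e.g.: mpirun -np 2 python gpu_collectives.py
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# One GPU per rank (assumes ranks are mapped round-robin to the node's GPUs)
cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()

# Large-message Allreduce on device memory (e.g., partial model updates)
send = cp.ones(1 << 20, dtype=cp.float32)
recv = cp.empty_like(send)
cp.cuda.get_current_stream().synchronize()  # buffers must be ready before MPI uses them
comm.Allreduce(send, recv, op=MPI.SUM)

# Small-message Bcast on device memory (e.g., KMeans centroids from rank 0)
centroids = cp.arange(8, dtype=cp.float32) if rank == 0 else cp.empty(8, dtype=cp.float32)
cp.cuda.get_current_stream().synchronize()
comm.Bcast(centroids, root=0)

if rank == 0:
    print("Allreduce element:", float(recv[0]), "Bcast:", cp.asnumpy(centroids))
```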

cuML Performance on MVAPICH2-GDR

For instructions on building cuML with MVAPICH2-GDR, please refer to the Userguide for MPI4cuML 0.1.

OSU-Caffe

The OSU-Caffe library is a scalable, distributed adaptation of Caffe for modern multi-GPU clusters. It is built through a co-design of the Caffe framework and the widely used MVAPICH2-GDR MPI runtime. The co-design methodology re-designs Caffe's workflow to maximize the overlap of computation and communication, and brings DL-awareness to the MPI runtime through efficient CUDA-Aware collective operations for very large messages. Major features of OSU-Caffe 0.9 are given below.

  • Based on NVIDIA's Caffe fork (caffe-0.14)
  • MPI-based distributed training support
  • Efficient scale-out support for multi-GPU node systems
  • New workflow to overlap computation in the compute layers with communication
  • Efficient parallel file readers to optimize I/O and data movement
    • Takes advantage of Lustre Parallel File System
  • Exploits efficient large-message collectives in MVAPICH2-GDR 2.2
  • Tested with
    • Various CUDA-aware MPI libraries
    • CUDA 7.5
    • Various HPC Clusters with K80 GPUs, varying number of GPUs/node, and InfiniBand (FDR and EDR) adapters
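
OSU-Caffe itself is a C++/CUDA code base, so the following is only a conceptual sketch (in Python with mpi4py) of the compute/communication overlap described above: a non-blocking Allreduce is posted for one layer's gradients while computation for another layer proceeds. The buffer contents and the "compute" step are placeholders.

```python
# Conceptual sketch (not OSU-Caffe code) of overlapping gradient
# communication with computation via a non-blocking MPI Allreduce.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

# Gradients of "layer 1", ready for reduction (placeholder data)
grads_l1 = np.random.rand(1 << 20).astype(np.float32)
reduced_l1 = np.empty_like(grads_l1)

# Post a non-blocking Allreduce for layer 1's gradients ...
req = comm.Iallreduce(grads_l1, reduced_l1, op=MPI.SUM)

# ... and overlap it with computation for "layer 2" (placeholder work)
grads_l2 = np.tanh(np.random.rand(1 << 20).astype(np.float32))

# Complete layer 1's communication before its averaged gradients are used
req.Wait()
reduced_l1 /= comm.Get_size()

if comm.Get_rank() == 0:
    print("layer-1 gradients reduced while layer-2 computation proceeded")
```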

RDMA-TensorFlow

RDMA-TensorFlow is a derivative of Google's popular deep learning framework TensorFlow. It exploits RDMA-enabled interconnects in modern clusters to accelerate distributed deep learning. Major features of RDMA-TensorFlow 0.9.1 are given below.

  • Based on Google TensorFlow 1.3.0
  • Builds with Python 2.7, CUDA 8.0, cuDNN 5.0, gcc 4.8.5, and glibc 2.17
  • Compliant with TensorFlow 1.3.0 APIs and applications
  • High-performance design with native InfiniBand support at the verbs level for gRPC Runtime (AR-gRPC) and TensorFlow
    • RDMA-based data communication
    • Adaptive communication protocols
    • Dynamic message chunking and accumulation
    • Support for RDMA device selection
  • Easily configurable for native InfiniBand and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
  • Tested with
    • Mellanox InfiniBand adapters (e.g., EDR)
    • NVIDIA K80 GPUs
    • CUDA 8.0 and cuDNN 5.0
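
For context, the sketch below shows the stock distributed TensorFlow 1.x (gRPC-based) cluster setup whose transport the AR-gRPC design accelerates; it uses only standard TensorFlow 1.3 APIs. Host names, ports, and the job layout are placeholders, and enabling the RDMA-based channel itself is configured as described in the RDMA-TensorFlow userguide rather than in this snippet.

```python
# Distributed TensorFlow 1.x skeleton (between-graph replication over gRPC).
# RDMA-TensorFlow replaces the gRPC transport between these servers with a
# verbs/RDMA-based design (AR-gRPC).
import tensorflow as tf  # TensorFlow 1.3.x APIs

cluster = tf.train.ClusterSpec({
    "ps":     ["node0:2222"],                  # parameter server(s)
    "worker": ["node1:2222", "node2:2222"],    # workers
})

# Each process starts one server for its (job_name, task_index)
job_name, task_index = "worker", 0             # set per process, e.g. from flags
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()                              # parameter servers only serve variables
else:
    # Variables live on the PS; compute is placed on this worker
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        w = tf.Variable(tf.zeros([10]))        # toy model parameters
        loss = tf.reduce_mean(tf.square(w - tf.random_normal([10])))
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(task_index == 0)) as sess:
        for _ in range(100):
            sess.run(train_op)
```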

Announcements


(NEW) MPI4cuML 0.1 (based on cuML 0.15) is available, with support for C++ and Python APIs, built on top of mpi4py over the MVAPICH2-GDR library, and handles that let Python cuML applications (KMeans, PCA, tSVD, RF, and LinearModels) use the MVAPICH2-GDR backend. [more]

(NEW) Join us for the upcoming tutorials on High-Performance Deep Learning at PPoPP '21 and ASPLOS '21.

MVAPICH2-GDR 2.3.5 GA (based on MVAPICH2 2.3.5 GA) is available, with support for AMD GPUs via the Radeon Open Compute (ROCm) platform, including ROCm PeerDirect, ROCm IPC, and unified-memory-based device-to-device communication; enhanced designs for GPU-aware MPI_Alltoall and MPI_Allgather; enhanced MPI derived-datatype processing via kernel fusion; architecture-specific flags to improve the performance of CUDA operations; support for the Apache MXNet deep learning framework; testing with the PyTorch and DeepSpeed frameworks for distributed deep learning; and multiple bug fixes. [more]

RDMA-TensorFlow 0.9.1 (based on Google TensorFlow 1.3.0) is available, featuring a high-performance design with native InfiniBand support at the verbs level for the gRPC runtime (AR-gRPC) and TensorFlow. It includes advanced features such as RDMA-based data communication, adaptive communication protocols, dynamic message chunking and accumulation, and support for RDMA device selection. [more]

OSU-Caffe 0.9 (based on NVIDIA's Caffe fork, caffe-0.14) is available, with support for MPI-based distributed training, efficient scale-out on multi-GPU nodes, a new workflow to overlap the compute layers with communication, parallel file readers that optimize I/O and data movement and take advantage of the Lustre parallel file system, and efficient large-message collectives in the MVAPICH2-GDR 2.2 library. [more]

HiDL in the News