HiDL team members participated in multiple events at SC'19.
The OSU booth (2094) featured leading speakers from academia and industry.


Welcome to the High-Performance Deep Learning project created by the Network-Based Computing Laboratory of The Ohio State University. Availability of large data sets like ImageNet and massively parallel computation support in modern HPC devices like NVIDIA GPUs have fueled a renewed interest in Deep Learning (DL) algorithms. This has triggered the development of DL frameworks like Caffe, Torch, TensorFlow, and CNTK. However, most DL frameworks have been limited to a single node. The objective of the HiDL project is to exploit modern HPC technologies and solutions to scale out and accelerate DL frameworks.


The OSU-Caffe library is a scalable, distributed adaptation of Caffe for modern multi-GPU clusters. It is built by co-designing the Caffe framework with the widely used MVAPICH2-GDR MPI runtime. The co-design methodology re-designs Caffe’s workflow to maximize the overlap of computation and communication, and it brings DL-awareness to the MPI runtime through efficient CUDA-aware collective operations for very large messages. Major features of OSU-Caffe 0.9 are given below.

  • Based on NVIDIA's Caffe fork (caffe-0.14)
  • MPI-based distributed training support
  • Efficient scale-out support for systems with multi-GPU nodes
  • New workflow to overlap the compute layers and the communication
  • Efficient parallel file readers to optimize I/O and data movement
    • Takes advantage of the Lustre parallel file system
  • Exploits efficient large message collectives in MVAPICH2-GDR 2.2
  • Tested with
    • Various CUDA-aware MPI libraries
    • CUDA 7.5
    • Various HPC Clusters with K80 GPUs, varying number of GPUs/node, and InfiniBand (FDR and EDR) adapters
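
The idea of overlapping compute layers with communication can be illustrated with a small simulation. The sketch below is not OSU-Caffe code and does not use MPI: `allreduce_mean` is a stand-in for a collective operation, `backward_layer` is a hypothetical toy gradient function, and a thread pool plays the role of non-blocking communication. It only shows the scheduling pattern, in which the reduction for one layer is started while the gradients of the next layer are still being computed.

```python
from concurrent.futures import ThreadPoolExecutor


def allreduce_mean(per_worker_grads):
    """Stand-in for an allreduce: average gradients element-wise across workers."""
    n = len(per_worker_grads)
    return [sum(vals) / n for vals in zip(*per_worker_grads)]


def backward_layer(layer, worker):
    """Hypothetical gradient computation that returns a toy gradient vector."""
    return [float(layer + worker + i) for i in range(4)]


def backward_with_overlap(num_layers, num_workers):
    """Run a simulated backward pass, reducing each layer's gradients in the background."""
    pending = {}
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The backward pass visits layers from last to first.
        for layer in reversed(range(num_layers)):
            grads = [backward_layer(layer, w) for w in range(num_workers)]
            # Start the reduction for this layer, then continue with the
            # next layer's computation without waiting for it to finish.
            pending[layer] = pool.submit(allreduce_mean, grads)
        # Collect all reduced gradients before the (simulated) weight update.
        return {layer: fut.result() for layer, fut in pending.items()}


if __name__ == "__main__":
    for layer, grad in sorted(backward_with_overlap(3, 2).items()):
        print(layer, grad)
```

In a real MPI setting the background task would typically be a non-blocking collective on GPU buffers; the thread pool here is only an assumption made to keep the example self-contained.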


RDMA-TensorFlow is a derivative of Google’s popular deep learning framework TensorFlow. The package is designed to improve distributed deep learning performance on modern clusters with RDMA-enabled interconnects. Major features of RDMA-TensorFlow 0.9.1 are given below.

  • Based on Google TensorFlow 1.3.0
  • Built with Python 2.7, CUDA 8.0, cuDNN 5.0, gcc 4.8.5, and glibc 2.17
  • Compliant with TensorFlow 1.3.0 APIs and applications
  • High-performance design with native InfiniBand support at the verbs level for gRPC Runtime (AR-gRPC) and TensorFlow
    • RDMA-based data communication
    • Adaptive communication protocols
    • Dynamic message chunking and accumulation
    • Support for RDMA device selection
  • Easily configurable for native InfiniBand or traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
  • Tested with
    • Mellanox InfiniBand adapters (e.g., EDR)
    • CUDA 8.0 and cuDNN 5.0
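
As a rough illustration of message chunking and accumulation, the sketch below splits a payload into bounded chunks and reassembles them on the receiving side. It is not taken from RDMA-TensorFlow or AR-gRPC: the function names `chunk_message` and `accumulate`, the tuple layout, and the reading of "accumulation" as receiver-side reassembly are all assumptions made for the example.

```python
def chunk_message(payload: bytes, max_chunk: int):
    """Split payload into (sequence_number, total_chunks, data) tuples."""
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    # Ceiling division; an empty payload still produces one (empty) chunk.
    total = max(1, -(-len(payload) // max_chunk))
    return [
        (seq, total, payload[seq * max_chunk:(seq + 1) * max_chunk])
        for seq in range(total)
    ]


def accumulate(chunks):
    """Reassemble chunks into the original payload, tolerating any arrival order."""
    chunks = list(chunks)
    if not chunks:
        raise ValueError("no chunks received")
    total = chunks[0][1]
    parts = {seq: data for seq, _, data in chunks}
    if len(parts) != total:
        raise ValueError("missing chunks")
    return b"".join(parts[i] for i in range(total))


if __name__ == "__main__":
    message = bytes(range(10))
    pieces = chunk_message(message, max_chunk=4)
    # Deliver the pieces in reverse order to show order-independent reassembly.
    assert accumulate(reversed(pieces)) == message
```

A "dynamic" scheme would presumably choose `max_chunk` per message, for example from the message size or the available buffer space; that policy is left out here because the source does not describe it.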


The MVAPICH team is now on Twitter! Follow us for up-to-date information on our events and tutorials: #MVAPICH.

RDMA-TensorFlow 0.9.1 (based on Google TensorFlow 1.3.0) features a high-performance design with native InfiniBand support at the verbs level for the gRPC runtime (AR-gRPC) and TensorFlow. Its advanced features include RDMA-based data communication, adaptive communication protocols, dynamic message chunking and accumulation, and RDMA device selection.

Tutorial: High Performance Distributed Deep Learning - A Beginner’s Guide, presented at PPoPP 2018 and Hot Interconnects 2017.

OSU-Caffe 0.9 (based on NVIDIA's Caffe fork, caffe-0.14) is available. It supports MPI-based distributed training, efficient scale-out on multi-GPU nodes, a new workflow that overlaps the compute layers with communication, parallel file readers that optimize I/O and data movement by taking advantage of Lustre, and the large-message collectives in the MVAPICH2-GDR 2.2 library.

HiDL in the News