In recent years, the training requirements of many state-of-the-art Deep Learning (DL) models have scaled beyond the compute and memory capabilities of a single processor, and necessitated distribution among processors. Training such massive models necessitates advanced parallelism strategies to maintain efficiency. However, such distributed DL parallelism strategies require a varied mixture of collective and point-to-point communication operations across a broad range of message sizes and scales. MCR-DL is a thin interface between the DL framework (PyTorch) and the target communication backend such as MPI or NCCL.

List of features of MCR-DL can be found here.


In a rapidly advancing technological landscape, the demand for high-performance parallel inference techniques is critical to meet the real-time processing needs of modern deep learning applications. To achieve high-performance parallel inference, three pivotal elements are essential: highly performant GPU kernels that maximize computational throughput, intelligent scheduling strategies that ensure optimal load balancing across resources, and sophisticated distributed communication libraries that facilitate large-scale inference by enabling seamless data exchange and coordination among distributed systems. ParaInfer-X is a collection of parallel inference techniques that can facilitate the deployment of emerging AI models on edge devices and HPC clusters.

List of features of ParaInfer-X can be found here.


Deep Neural Network (DNN) training on very high-resolution images for real-world applications, such as medical and satellite images, comes with multiple challenges due to memory limitations. Parallelism strategies such as data parallelism cannot be employed for out-of-core models, layer Parallelism leads to GPU underutilization as only one GPU is utilized, and pipeline parallelism is only feasible when the model can be trained with multiple samples at a time. MPI4DL is a distributed, accelerated, and memory efficient training framework for very high-resolution images that integrates Spatial Parallelism, Bidirectional Parallelism, Layer Parallelism, and Pipeline Parallelism.

List of features of MPI4DL can be found here.

Horovod with MVAPICH2

With the advent of Deep Learning (DL) frameworks and the rise of distributed DL training, the need for a unified data parallel (DP) training framework was met by Horovod. Horovod is a distributed DL training framework with support for Tensorflow, Keras, PyTorch, and Apache MXNet. With support for both CPU and GPU training, Horovod makes distributed training generalizable, simple, and intuitive. We provide support for many MPI-level enhancements to enable efficient training of DL models with Horovod, including large-message collective designs on CPU and GPU HPC systems, pointer cache enhancements for device buffers, and fork safety support.

List of features of Horovod can be found here.


Given the rapid growth of high-volume and high-velocity datasets, the need for fast and distributed Machine Learning (ML) models has become a bottleneck for data scientists. To mitigate this shortcoming, RAPIDS cuML was created to provide a suite of GPU-accelerated and distributed machine learning algorithms. We provide support for many CUDA-Aware MPI-level enhancements to enable efficient ML training at scale.

List of features of MPI4cuML can be found here.

OSU-Caffe : Scalable Deep Learning on Modern GPU Clusters

Availability of large data sets like ImageNet and massively parallel computation support in modern HPC devices like NVIDIA GPUs have fueled a renewed interest in Deep Learning (DL) algorithms. This has triggered the development of DL frameworks like Caffe, Torch, TensorFlow, and CNTK. However, most DL frameworks have been limited to a single node. In order to scale out DL frameworks and bring HPC capabilities to the DL arena, we propose, OSU-Caffe; a scalable and distributed Caffe adaptation for modern multi-GPU clusters.

OSU-Caffe is a co-design approach of the Caffe framework and a widely used MVAPICH2-GDR, MPI runtime. The co-design methodology involves re-designing Caffe’s workflow to maximize the overlap of computation and communication. It also brings DL-Awareness to the MPI runtime by designing efficient CUDA-Aware collective operations for very large messages.

The OSU-Caffe implementation is based on the NVIDIA's fork of Caffe, which supports CUDNN optimizations. The Co-designed MPI runtime is MVAPICH2-GDR 2.2 , which is an efficient CUDA-Aware MPI runtime that provides support for GPUDirect RDMA and DL-Aware optimizations.

List of features of OSU-Caffe can be found here.


With the ubiquity of massive computational power Deep Learning is finally getting its momentum. The application of Deep Learning is wide - starting from self driving car to assist doctors to find out early stages of cancer. At present Google’s TensorFlow is one of the most popular Deep Learning framework in the community. Even though distributed TensorFlow can be deployed on modern HPC systems seamlessly, the performance of TensorFlow depends heavily on the efficacy of its communication engine and the underlying network. As the DL network grows in depth and in terms of number of parameters the variable updates between different nodes become a critical bottleneck. gRPC, a widely used Remote Procedure Call framework that enables client and server applications to communicate transparently, is the main communication engine of TensorFlow. After a comprehensive analysis of TensorFlow, we develop an adaptive RDMA based gRPC (i.e., AR-gRPC) that is solely capable of accelerating TensorFlow. RDMA-TenorFlow leverages AR-gRPC to accelerate Deep Learning in modern HPC clusters.

List of features of RDMA-TensorFlow can be found here.