MCR-DL
In recent years, the training requirements of many state-of-the-art Deep Learning (DL) models have scaled beyond the compute and memory capabilities of a single processor, requiring distribution across many processors. Training such massive models demands advanced parallelism strategies to maintain efficiency. However, these distributed DL parallelism strategies require a varied mixture of collective and point-to-point communication operations across a broad range of message sizes and scales. MCR-DL is a thin interface between the DL framework (PyTorch) and the target communication backend, such as MPI or NCCL.
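A minimal sketch of what such a thin dispatch layer can look like is shown below; the function names and flow are purely illustrative (built on the standard torch.distributed API) and are not MCR-DL's actual interface.

```python
# Illustrative sketch only: a thin layer that routes a PyTorch tensor
# all-reduce to either an MPI or an NCCL backend. Function names are
# hypothetical and do NOT represent MCR-DL's real API.
import torch
import torch.distributed as dist

def init_backend(backend: str = "nccl"):
    # Initialize the requested communication backend ("mpi" or "nccl").
    dist.init_process_group(backend=backend)

def allreduce(tensor: torch.Tensor) -> torch.Tensor:
    # Sum-reduce a tensor across all ranks using the active backend.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return tensor

if __name__ == "__main__":
    init_backend("nccl")       # or "mpi" with a CUDA-Aware MPI build
    t = torch.ones(4).cuda()   # per-rank buffer, for illustration
    allreduce(t)               # every rank now holds the global sum
```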
ParaInfer-X
In a rapidly advancing technological landscape, high-performance parallel inference techniques are critical for meeting the real-time processing needs of modern deep learning applications. Achieving high-performance parallel inference requires three pivotal elements: highly performant GPU kernels that maximize computational throughput, intelligent scheduling strategies that ensure optimal load balancing across resources, and sophisticated distributed communication libraries that enable large-scale inference through seamless data exchange and coordination among distributed systems. ParaInfer-X is a collection of parallel inference techniques that facilitates the deployment of emerging AI models on edge devices and HPC clusters.
MPI4DL
Deep Neural Network (DNN) training on very high-resolution images for real-world applications, such as medical and satellite imaging, faces multiple challenges due to memory limitations. Existing parallelism strategies address these only partially: data parallelism cannot be employed for out-of-core models, layer parallelism leads to GPU underutilization because only one GPU is active at a time, and pipeline parallelism is only feasible when the model can be trained with multiple samples at a time. MPI4DL is a distributed, accelerated, and memory-efficient training framework for very high-resolution images that integrates Spatial Parallelism, Bidirectional Parallelism, Layer Parallelism, and Pipeline Parallelism.
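The spatial-parallelism component can be illustrated with a small sketch: the input image is partitioned along its height and each partition is processed on a different GPU. This is only a conceptual example; the halo (boundary) exchange between neighboring partitions that a real framework must perform is omitted, and the GPU count is an assumption.

```python
# Conceptual sketch of spatial parallelism: one very high-resolution sample
# is split along the height dimension and convolved on separate GPUs.
# Halo exchange between partitions (required for correctness at the
# boundaries) is intentionally omitted for brevity.
import torch
import torch.nn as nn

num_gpus = 2  # assumed number of available GPUs
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)

# A single sample that may exceed one GPU's memory.
image = torch.randn(1, 3, 8192, 8192)

# Partition along height and place each piece on its own device.
parts = torch.chunk(image, num_gpus, dim=2)
outputs = []
for rank, part in enumerate(parts):
    device = torch.device(f"cuda:{rank}")
    outputs.append(conv.to(device)(part.to(device)))

# Gather partial results (here simply concatenated on GPU 0).
result = torch.cat([o.to("cuda:0") for o in outputs], dim=2)
```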
Horovod with MVAPICH2
With the advent of Deep Learning (DL) frameworks and the rise of distributed DL training, the need for a unified data-parallel (DP) training framework was met by Horovod. Horovod is a distributed DL training framework with support for TensorFlow, Keras, PyTorch, and Apache MXNet. With support for both CPU and GPU training, Horovod makes distributed training generalizable, simple, and intuitive. We provide many MPI-level enhancements to enable efficient training of DL models with Horovod, including large-message collective designs on CPU and GPU HPC systems, pointer-cache enhancements for device buffers, and fork-safety support.
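The sketch below shows the standard Horovod data-parallel training pattern with PyTorch, using Horovod's documented API; the model, learning rate, and launch command (e.g., mpirun -np 4 python train.py) are placeholder choices.

```python
# Standard Horovod data-parallel training loop with PyTorch.
import torch
import torch.nn.functional as F
import horovod.torch as hvd

hvd.init()                               # initialize Horovod (MPI underneath)
torch.cuda.set_device(hvd.local_rank())  # pin each rank to one local GPU

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across ranks via allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start all ranks from identical model and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(10):
    data = torch.randn(32, 1024).cuda()
    target = torch.randint(0, 10, (32,)).cuda()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(data), target)
    loss.backward()
    optimizer.step()
```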
MPI4cuML
Given the rapid growth of high-volume and high-velocity datasets, the lack of fast, distributed Machine Learning (ML) models has become a bottleneck for data scientists. To mitigate this shortcoming, RAPIDS cuML was created to provide a suite of GPU-accelerated and distributed machine learning algorithms. We provide many CUDA-Aware MPI-level enhancements to enable efficient ML training at scale.
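For reference, the snippet below shows a plain single-GPU cuML computation using cuML's documented API; the distributed, CUDA-Aware-MPI-backed setup that MPI4cuML provides is not shown here, and the dataset is synthetic.

```python
# Single-GPU RAPIDS cuML example; MPI4cuML scales algorithms like this
# across GPUs and nodes with CUDA-Aware MPI underneath.
import cupy as cp
from cuml.cluster import KMeans

# 100k synthetic samples with 32 features, generated directly on the GPU.
X = cp.random.random((100_000, 32), dtype=cp.float32)

kmeans = KMeans(n_clusters=8, random_state=0)
kmeans.fit(X)                          # clustering runs entirely on the GPU
labels = kmeans.predict(X)             # per-sample cluster assignments
print(kmeans.cluster_centers_.shape)   # (8, 32)
```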
OSU-Caffe : Scalable Deep Learning on Modern GPU Clusters
The availability of large data sets like ImageNet and massively parallel computation support in modern HPC devices like NVIDIA GPUs have fueled a renewed interest in Deep Learning (DL) algorithms. This has triggered the development of DL frameworks like Caffe, Torch, TensorFlow, and CNTK. However, most DL frameworks have been limited to a single node. In order to scale out DL frameworks and bring HPC capabilities to the DL arena, we propose OSU-Caffe, a scalable and distributed Caffe adaptation for modern multi-GPU clusters.
OSU-Caffe is a co-design of the Caffe framework and the widely used MVAPICH2-GDR MPI runtime. The co-design methodology involves re-designing Caffe's workflow to maximize the overlap of computation and communication. It also brings DL-awareness to the MPI runtime by designing efficient CUDA-Aware collective operations for very large messages.
The OSU-Caffe implementation is based on NVIDIA's fork of Caffe, which supports cuDNN optimizations. The co-designed MPI runtime is MVAPICH2-GDR 2.2, an efficient CUDA-Aware MPI runtime that provides support for GPUDirect RDMA and DL-aware optimizations.
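The snippet below illustrates the kind of CUDA-Aware collective on device buffers that MVAPICH2-GDR accelerates; it is not OSU-Caffe code, and it assumes mpi4py and CuPy running on top of a CUDA-Aware MPI build such as MVAPICH2-GDR.

```python
# CUDA-Aware MPI allreduce on GPU-resident buffers (mpi4py + CuPy).
# With a CUDA-Aware MPI such as MVAPICH2-GDR, device pointers are passed
# to MPI directly and GPUDirect RDMA avoids staging through host memory.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank holds a "gradient" buffer resident in GPU memory.
grad = cp.full(1 << 20, rank, dtype=cp.float32)
summed = cp.empty_like(grad)

# Large-message allreduce of the kind used for gradient aggregation.
comm.Allreduce(grad, summed, op=MPI.SUM)

if rank == 0:
    print("first element of reduced gradient:", float(summed[0]))
```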
RDMA-TensorFlow
With the ubiquity of massive computational power, Deep Learning is finally gaining momentum. Its applications are wide-ranging, from self-driving cars to helping doctors detect early stages of cancer. At present, Google's TensorFlow is one of the most popular Deep Learning frameworks in the community. Even though distributed TensorFlow can be deployed on modern HPC systems seamlessly, its performance depends heavily on the efficacy of its communication engine and the underlying network. As DL networks grow in depth and in number of parameters, the variable updates between nodes become a critical bottleneck. gRPC, a widely used Remote Procedure Call framework that enables client and server applications to communicate transparently, is the main communication engine of TensorFlow. After a comprehensive analysis of TensorFlow, we developed an adaptive RDMA-based gRPC (i.e., AR-gRPC) that is by itself capable of accelerating TensorFlow. RDMA-TensorFlow leverages AR-gRPC to accelerate Deep Learning on modern HPC clusters.
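For context, the sketch below shows the stock distributed TensorFlow 1.x setup whose default gRPC transport is what AR-gRPC replaces; hostnames, ports, and task indices are placeholders, and one such process would be launched per task.

```python
# Stock distributed TensorFlow 1.x (parameter server + workers). Variable
# updates between workers and the parameter server travel over the server's
# RPC layer: gRPC by default, AR-gRPC in RDMA-TensorFlow builds.
import tensorflow as tf  # TensorFlow 1.x API

cluster = tf.train.ClusterSpec({
    "ps":     ["node0:2222"],                  # parameter server task
    "worker": ["node1:2222", "node2:2222"],    # training worker tasks
})

# Each process starts a server for its own job name and task index.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device("/job:ps/task:0"):
    w = tf.Variable(tf.zeros([1024, 10]))      # lives on the parameter server

with tf.device("/job:worker/task:0"):
    x = tf.random_normal([32, 1024])
    logits = tf.matmul(x, w)                   # computed on the worker

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(logits)                           # pulls w over the RPC layer
```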