Overview
Welcome to the High-Performance Deep Learning (HiDL) project created by the Network-Based Computing Laboratory of The Ohio State University. The availability of large datasets (e.g., ImageNet, PASCAL VOC 2012), coupled with massively parallel processors in modern HPC systems (e.g., NVIDIA GPUs), has fueled a renewed interest in Deep Learning (DL) algorithms. In addition to the popularity of massively parallel DL accelerators such as GPUs, the availability and memory abundance of modern CPUs make them a viable alternative for DL training. This resurgence of DL applications has triggered the development of DL frameworks like Caffe, PyTorch, TensorFlow, Apache MXNet, and CNTK. While most DL frameworks provide experimental support for multi-node training, their distributed implementations are often suboptimal. The objective of the HiDL project is to exploit modern HPC technologies and solutions to scale out and accelerate DL frameworks. The HiDL packages are being used by more than 85 organizations in 21 countries worldwide (Current Users) to accelerate Deep Learning and Machine Learning applications. As of Jun '25, more than 5,300 downloads have taken place from this project's site. The HiDL project contains the following packages.
MPI-Driven DL Training with Native PyTorch 2.0 and MVAPICH-Plus
The HiDL software suite version 2.0 is a high-performance deep learning stack designed for native PyTorch 2.x distributed training, built on the MVAPICH-Plus high-performance CUDA-aware communication backend. HiDL provides optimized MPI communication support for large-scale PyTorch Distributed Data Parallel (DDP) training, targeting modern HPC clusters built with CPUs, dense GPUs, and high-performance interconnects. A minimal DDP-over-MPI launch sketch follows the feature list below.
The 2.0 release of the HiDL stack introduces the following key features:
- Support for PyTorch 2.0 and later versions
- Full support for PyTorch Native Distributed Data Parallel (DDP) training
- Optimized support for MPI communication backend in model training workloads
- Efficient large-message collectives (e.g., Allreduce) on various CPUs and GPUs
- GPU-Direct Ring and Two-level multi-leader algorithms for Allreduce operations
- Support for fork safety in distributed training environments
- Exploits efficient large message collectives in MVAPICH-Plus 4.0 and later
- Open-source PyTorch version with advanced MPI backend support
- Vendor-neutral stack with performance and throughput competitive with GPU-based collective libraries (e.g., NCCL, RCCL)
- Battle-tested on modern HPC clusters (e.g., OLCF Frontier, TACC Vista) with up-to-date accelerator generations (e.g., AMD, NVIDIA)
- Compatible with
- InfiniBand Networks: Mellanox InfiniBand adapters (EDR, FDR, HDR, NDR)
- Slingshot Networks: HPE Slingshot
- GPU&CPU Support:
- NVIDIA GPU A100, H100, GH200
- AMD MI200 series GPUs
- Software Stack:
- CUDA [12.x] and latest cuDNN
- ROCm [6.x]
- (NEW)PyTorch [2.x]
- (NEW)Python [3.x]
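The sketch below shows what native PyTorch DDP training over the MPI backend looks like with this stack. It is a minimal illustration, assuming an MPI-enabled PyTorch build launched with mpirun; the local-rank environment variable name (MV2_COMM_WORLD_LOCAL_RANK) is an assumption and may differ by MPI library.

```python
# Minimal sketch: native PyTorch DDP over the MPI backend.
# Assumes an MPI-enabled PyTorch build (e.g., the HiDL-provided one) launched via mpirun.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # With the MPI backend, rank and world size come from the MPI launcher.
    dist.init_process_group(backend="mpi")
    rank = dist.get_rank()

    # Map each rank to a local GPU. The environment variable name below is an
    # assumption; other launchers use, e.g., OMPI_COMM_WORLD_LOCAL_RANK.
    local_rank = int(os.environ.get("MV2_COMM_WORLD_LOCAL_RANK",
                                    rank % torch.cuda.device_count()))
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 1024).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])  # gradient sync via MPI Allreduce
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(10):
        optimizer.zero_grad()
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = ddp_model(x).sum()
        loss.backward()   # large-message Allreduce handled by MVAPICH-Plus
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A launch along the lines of `mpirun -np 8 python ddp_mpi_sketch.py` (script name hypothetical) then runs one rank per GPU, with MVAPICH-Plus providing the CUDA-aware collectives underneath.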
PyTorch Native DDP Performance on MVAPICH-Plus
MCR-DL v0.1
MCR-DL v0.1 is an interface between PyTorch and communication backends such as MPI and NCCL that enables high-performance communication, simple communication extensibility, and a suite of PyTorch communication benchmarks. This helps a user achieve the best performance and scalability for distributed DL training by mixing and matching communication backends (an illustrative benchmark sketch follows the list below).
- (NEW)Support for several communication backends, enabling MPI communication without PyTorch source builds
- (NEW)PyTorch distributed module
- (NEW)Pure MPI
- (NEW)Pure NCCL
- (NEW)PyTorch communication benchmarking suite
- (NEW)Testing suite
- Tested with
- NVIDIA GPU A100 and H100
- CUDA [11.7, 12.1]
- Python >= 3.8
- PyTorch [1.13.1 , 2.0.1]
- MVAPICH2-GDR = 2.3.7
- MVAPICH-PLUS = 3.0b
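MCR-DL's own Python API is not reproduced here. As an illustration of the kind of measurement its benchmarking suite automates, the sketch below times all_reduce with plain torch.distributed, where the backend string ("mpi" or "nccl") can be swapped to compare communication backends.

```python
# Illustrative all_reduce latency loop using plain torch.distributed.
# This is not the MCR-DL API; it only shows the style of measurement the
# MCR-DL benchmarking suite provides across backends.
import os
import time
import torch
import torch.distributed as dist

def benchmark_allreduce(backend="nccl", sizes=(2**16, 2**20, 2**24), iters=50):
    dist.init_process_group(backend=backend)
    if backend == "nccl":
        # torchrun exports LOCAL_RANK; other launchers may differ (assumption).
        torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
        device = "cuda"
    else:
        device = "cpu"  # host buffers for a plain MPI build

    for n in sizes:
        t = torch.zeros(n, device=device)
        for _ in range(5):               # warm-up
            dist.all_reduce(t)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(t)
        if device == "cuda":
            torch.cuda.synchronize()
        if dist.get_rank() == 0:
            ms = (time.perf_counter() - start) / iters * 1e3
            print(f"{n * 4 / 2**20:.2f} MB: {ms:.3f} ms per all_reduce")

    dist.destroy_process_group()

if __name__ == "__main__":
    benchmark_allreduce()
```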
ParaInfer-X v1.0
ParaInfer-X is a collection of parallel inference techniques that can facilitate the deployment of emerging AI models on edge devices and HPC clusters. It leverages highly performant GPU kernels that maximize computational throughput, intelligent scheduling strategies that ensure optimal load balancing across resources, and sophisticated distributed communication libraries that facilitate large-scale inference by enabling seamless data exchange and coordination among distributed systems.
Large language model serving is a highly concurrent workload: it must handle many user requests simultaneously while delivering low latency and high throughput. ParaInfer-X v1.0 provides a temporal fusion framework, named Flover, that intelligently batches multiple requests during LLM generation, a technique also known as temporal fusion or in-flight batching (a conceptual sketch follows the feature list below).
- Based on Faster Transformer
- (NEW)Support for inference of various large language models:
- (NEW)GPT-J 6B
- (NEW)LLaMA 7B
- (NEW)LLaMA 13B
- (NEW)LLaMA 33B
- (NEW)LLaMA 65B
- (NEW)Support for persistent model inference stream
- (NEW)Support for temporal fusion/in-flight batching of multiple requests
- (NEW)Support for multiple GPU tensor parallelism
- (NEW)Support for asynchronous memory reordering for evicting finished requests
- (NEW)Support for float32, float16, bfloat16 for model inference
- Compatible with
- (NEW)NVIDIA GPU A100 and V100
- (NEW)CUDA [11.2, 11.3, 11.4, 11.6]
- (NEW)GCC >= 8.5.0
- (NEW)CMAKE >= 3.18
- (NEW)Intel oneTBB >= v2020.0
- (NEW)Customized CUDA kernels
- (NEW)Support for visualization output of inference progress
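As a conceptual illustration of temporal fusion/in-flight batching (not the ParaInfer-X or Flover API; all names below are hypothetical), the sketch shows requests joining a running batch as soon as they arrive and being evicted as soon as they finish.

```python
# Conceptual sketch of temporal fusion / in-flight batching.
# Request, decode_step, and serve are illustrative names, not ParaInfer-X APIs.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_ids: list
    max_new_tokens: int
    generated: list = field(default_factory=list)

    @property
    def finished(self) -> bool:
        return len(self.generated) >= self.max_new_tokens

def decode_step(batch):
    # Placeholder for one fused decoding step over all in-flight requests;
    # a real engine runs the model once and appends one token per request.
    for r in batch:
        r.generated.append(0)

def serve(arrivals: deque):
    in_flight = []
    while arrivals or in_flight:
        # Temporal fusion: new requests join the running batch immediately,
        # without waiting for the current batch to drain.
        while arrivals:
            in_flight.append(arrivals.popleft())
        decode_step(in_flight)
        # Evict finished requests so their slots can be reused; ParaInfer-X
        # additionally reorders memory asynchronously when doing this.
        in_flight = [r for r in in_flight if not r.finished]

if __name__ == "__main__":
    serve(deque([Request([1, 2, 3], max_new_tokens=4),
                 Request([4, 5], max_new_tokens=2)]))
```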
MPI4DL v0.6
MPI4DL v0.6 is a distributed and accelerated training framework for very high-resolution images that integrates Spatial Parallelism, Layer Parallelism, and Pipeline Parallelism. A conceptual layer-parallelism sketch follows the feature list below.
- Based on PyTorch
- (NEW)Support for training very high-resolution images
- Distributed training support for:
- Layer Parallelism (LP)
- Pipeline Parallelism (PP)
- Spatial Parallelism (SP)
- Spatial and Layer Parallelism (SP+LP)
- Spatial and Pipeline Parallelism (SP+PP)
- (NEW)Bidirectional and Layer Parallelism (GEMS+LP)
- (NEW)Bidirectional and Pipeline Parallelism (GEMS+PP)
- (NEW)Spatial, Bidirectional and Layer Parallelism (SP+GEMS+LP)
- (NEW)Spatial, Bidirectional and Pipeline Parallelism (SP+GEMS+PP)
- (NEW)Support for AmoebaNet and ResNet models
- (NEW)Support for different image sizes and custom datasets
- Exploits collective features of MVAPICH2-GDR
- Compatible with
- NVIDIA GPU A100 and V100
- CUDA [11.6, 11.7]
- Python >= 3.8
- PyTorch [1.12.1 , 1.13.1]
- MVAPICH2-GDR = 2.3.7
- MVAPICH-PLUS = 3.0b
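As a conceptual illustration of layer parallelism (not the MPI4DL API), the sketch below partitions a model's layers across two local GPUs and passes activations between the partitions; MPI4DL distributes such partitions across ranks and combines them with spatial, pipeline, and bidirectional (GEMS) parallelism.

```python
# Conceptual layer-parallelism sketch on two local GPUs (assumes two visible GPUs).
# MPI4DL replaces the device-to-device copies with point-to-point communication
# between ranks and adds spatial/pipeline/bidirectional parallelism on top.
import torch
import torch.nn as nn

class TwoPartitionModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Partition 0 (early layers) placed on GPU 0
        self.part0 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU()).to("cuda:0")
        # Partition 1 (classifier head) placed on GPU 1
        self.part1 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                   nn.Linear(64, 10)).to("cuda:1")

    def forward(self, x):
        a = self.part0(x.to("cuda:0"))
        # Activation transfer between partitions (inter-rank send/recv in MPI4DL)
        return self.part1(a.to("cuda:1"))

if __name__ == "__main__":
    model = TwoPartitionModel()
    out = model(torch.randn(2, 3, 512, 512))  # high-resolution input
    print(out.shape)
```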
MPI-Driven ML Training with MPI4cuML
cuML is a machine learning training framework with a focus on GPU acceleration and distributed computing. MVAPICH2-GDR provides many features to augment distributed cuML training on GPUs (a minimal sketch of the mpi4py layer it builds on follows the list below).
- Based on cuML 22.02.00
- Includes ready-to-use examples for KMeans, Linear Regression, Nearest Neighbors, and tSVD
- MVAPICH2 support for RAFT 22.02.00
- Enabled cuML’s communication engine, RAFT, to use MVAPICH2-GDR backend for Python and C++ cuML applications
- KMeans, PCA, tSVD, RF, LinearModels
- Added switch between available communication backends (MVAPICH2 and NCCL)
- Built on top of mpi4py over the MVAPICH2-GDR library
- Tested with
- Mellanox InfiniBand adapters (FDR and HDR)
- NVIDIA GPU A100, V100, and P100
- Various x86-based multi-core platforms (AMD and Intel)
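As a minimal sketch of the layer MPI4cuML builds on (mpi4py over a CUDA-aware MPI such as MVAPICH2-GDR), the example below reduces a GPU-resident CuPy buffer directly between ranks; it is not the MPI4cuML handle API itself.

```python
# Minimal sketch: mpi4py Allreduce on GPU buffers over a CUDA-aware MPI.
# CuPy is used only to allocate device arrays; this is not the MPI4cuML API.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Pin each rank to a GPU (assumes one process per GPU on the node).
cp.cuda.Device(rank % cp.cuda.runtime.getDeviceCount()).use()

send = cp.full(1 << 20, rank, dtype=cp.float32)  # device buffer
recv = cp.empty_like(send)

# With a CUDA-aware MPI such as MVAPICH2-GDR, the device pointers are passed
# directly to the library; no host staging copy is required.
comm.Allreduce(send, recv, op=MPI.SUM)

if rank == 0:
    print("sum of ranks:", float(recv[0]))
```

Launched with, e.g., `mpirun -np 2 python gpu_allreduce_sketch.py` (script name hypothetical), each rank contributes its GPU buffer to the reduction without staging through host memory.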
cuML Performance on MVAPICH2-GDR
Announcements
The 13th annual MVAPICH User Group (MUG) Conference will be held during August 18-20, 2025 in Columbus, Ohio, USA. Click here for details.
The 12th Annual MVAPICH User Group (MUG) Conference was held successfully in a hybrid manner on August 19-21, 2024 with more than 220 attendees. Slides and videos of the presentations are available from here.
ParaInfer-X v1.0 with MPI- and NCCL-based support for fast parallel inference of various large language models (GPT-J and LLaMA), persistent model inference streams, temporal fusion/in-flight batching of multiple requests, multi-GPU tensor parallelism, asynchronous memory reordering for evicting finished requests, and float32, float16, and bfloat16 model inference is available. [more]
MPI4DL 0.6, a distributed and accelerated training framework for very high-resolution images that integrates Spatial Parallelism, Layer Parallelism, and Pipeline Parallelism, is available. [more]
HiDL 1.0 (based on Horovod) with support for TensorFlow, PyTorch, Keras, and MXNet, built on top of MVAPICH2-GDR and MVAPICH2-X, providing large-scale distributed deep learning support for clusters with NVIDIA and AMD GPUs, is available. [more]
MPI4cuML 0.5 (based on cuML 22.02.00) with support for RAFT 22.02.00, C++ and Python APIs, built on top of mpi4py over the MVAPICH2-GDR library, and handles that let Python cuML applications (KMeans, PCA, tSVD, RF, and LinearModels) use the MVAPICH2-GDR backend is available. [more]
Partnership and contribution to the NSF-Awarded $20M AI-Institute on Intelligent CyberInfrastructure (ICICLE). Details.