MCR-DL Features
MCR-DL v0.1 is an interface between PyTorch and communication backends such as MPI and NCCL. It enables high-performance communication and simple communication extensibility, and it provides a suite of PyTorch communication benchmarks; a brief usage sketch follows the feature list below.
- (NEW) Support for several communication backends, enabling MPI communication without PyTorch source builds
  - (NEW) PyTorch distributed module
  - (NEW) Pure MPI
  - (NEW) Pure NCCL
- (NEW) PyTorch communication benchmarking suite
- (NEW) Testing suite
- Tested with
  - NVIDIA GPU A100 and H100
  - CUDA [11.7, 12.1]
  - Python >= 3.8
  - PyTorch >= 1.12.1
  - MVAPICH2-GDR = 2.3.7
  - MVAPICH-Plus = 3.0b
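As a rough usage sketch of the backend-agnostic pattern MCR-DL targets: pick a communication backend at runtime, then issue familiar torch.distributed-style collectives. The entry-point names below (`init_processes`, `all_reduce`) and the backend strings are assumptions for illustration, not the verified MCR-DL API.

```python
# Hypothetical sketch only: init_processes/all_reduce and the backend
# strings are assumed names, not the verified MCR-DL API. The calling
# pattern mirrors torch.distributed.
import torch
import mcr_dl  # assumed module name

# Select a communication backend at runtime, e.g. "mpi" or "nccl" (assumed values).
mcr_dl.init_processes(backend="mpi")

tensor = torch.ones(1024, device="cuda")
mcr_dl.all_reduce(tensor)  # assumed signature: sums the tensor across all ranks
```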
ParaInfer-X Features
- Based on FasterTransformer
- (NEW) Support for inference of various large language models:
  - (NEW) GPT-J 6B
  - (NEW) LLaMA 7B
  - (NEW) LLaMA 13B
  - (NEW) LLaMA 33B
  - (NEW) LLaMA 65B
- (NEW) Support for a persistent model inference stream
- (NEW) Support for temporal fusion/in-flight batching of multiple requests (illustrated in the sketch below)
- (NEW) Support for multi-GPU tensor parallelism
- (NEW) Support for asynchronous memory reordering to evict finished requests
- (NEW) Support for float32, float16, and bfloat16 model inference
- Compatible with
  - (NEW) NVIDIA GPU A100 and V100
  - (NEW) CUDA [11.2, 11.3, 11.4, 11.6]
  - (NEW) GCC >= 8.5.0
  - (NEW) CMake >= 3.18
  - (NEW) Intel oneTBB >= v2020.0
- (NEW) Customized CUDA kernels
- (NEW) Support for visualization output of inference progress
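To illustrate the in-flight batching idea listed above independently of the FasterTransformer internals, the toy scheduler below fuses newly arrived requests into the running batch between decoding steps and evicts finished requests immediately. All names here (`Request`, `decode_step`, `serve`) are invented for this sketch and are not part of ParaInfer-X.

```python
# Toy illustration of in-flight batching: requests join and leave the
# running batch between decoding steps. Everything here is a placeholder,
# not ParaInfer-X code.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    tokens: list = field(default_factory=list)

def decode_step(batch):
    """Stand-in for one fused decoding step over the whole batch."""
    for req in batch:
        req.tokens.append("<tok>")  # placeholder for a generated token

def serve(incoming: deque, max_batch: int = 8):
    running = []
    while incoming or running:
        # Temporal fusion: pull newly arrived requests into the running batch.
        while incoming and len(running) < max_batch:
            running.append(incoming.popleft())
        decode_step(running)
        # Evict finished requests so their slots can be reused right away.
        running = [r for r in running if len(r.tokens) < r.max_new_tokens]

if __name__ == "__main__":
    queue = deque(Request(f"prompt {i}", max_new_tokens=4) for i in range(10))
    serve(queue)
```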
MPI4DL Features
- Based on PyTorch
- (NEW) Support for training on very high-resolution images
- Distributed training support for:
  - Layer Parallelism (LP)
  - Pipeline Parallelism (PP)
  - Spatial Parallelism (SP) (illustrated in the sketch below)
  - Spatial and Layer Parallelism (SP+LP)
  - Spatial and Pipeline Parallelism (SP+PP)
  - (NEW) Bidirectional and Layer Parallelism (GEMS+LP)
  - (NEW) Bidirectional and Pipeline Parallelism (GEMS+PP)
  - (NEW) Spatial, Bidirectional, and Layer Parallelism (SP+GEMS+LP)
  - (NEW) Spatial, Bidirectional, and Pipeline Parallelism (SP+GEMS+PP)
- (NEW) Support for AmoebaNet and ResNet models
- (NEW) Support for different image sizes and custom datasets
- Exploits collective features of MVAPICH2-GDR
- Compatible with
  - NVIDIA GPU A100 and V100
  - CUDA [11.6, 11.7]
  - Python >= 3.8
  - PyTorch [1.12.1, 1.13.1]
  - MVAPICH2-GDR = 2.3.7
  - MVAPICH-Plus = 3.0b
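The sketch below illustrates only the core idea behind Spatial Parallelism: partitioning a very high-resolution input across ranks so that each GPU processes one tile. It is not the MPI4DL API, and it omits the halo exchange between neighboring partitions that real spatially parallel convolutions require.

```python
# Conceptual sketch of spatial partitioning only; not MPI4DL code.
import torch

def spatial_partition(image: torch.Tensor, num_parts: int, rank: int) -> torch.Tensor:
    """Split an (N, C, H, W) batch along the height dimension and
    return the tile owned by `rank`."""
    tiles = torch.chunk(image, num_parts, dim=2)
    return tiles[rank]

# Example: a 1024x1024 image split across 4 spatial ranks -> 256-row tiles.
image = torch.randn(1, 3, 1024, 1024)
tile = spatial_partition(image, num_parts=4, rank=0)
print(tile.shape)  # torch.Size([1, 3, 256, 1024])
```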
Horovod with MVAPICH2 Features
- Based on Horovod
- Full support for TensorFlow, PyTorch, Keras, and Apache MXNet
- Optimized support for MPI controller in deep learning workloads
- Efficient large-message collectives (e.g. Allreduce) on various CPUs and GPUs
- GPU-Direct Algorithms for all collective operations (including those commonly used for data and model-parallelism, e.g. Allgather and Alltoall)
- Support for fork safety
- Exploits efficient large message collectives in MVAPICH2 and MVAPICH2-GDR
- Compatible with
  - Mellanox InfiniBand adapters (e.g., EDR, FDR, HDR)
  - NVIDIA GPU K80, P100, V100, Quadro RTX 5000, A100
  - CUDA [9.x, 10.x, 11.x] and cuDNN [7.5.x, 7.6.x, 8.0.x, 8.2.x, 8.4.x]
  - (NEW) AMD MI100 GPUs
  - (NEW) ROCm [5.1.x]
  - TensorFlow [1.x, 2.x], PyTorch 1.x, Apache MXNet 1.x
  - (NEW) Horovod [0.24.0, 0.25.0, 0.26.0, 0.27.0]
  - (NEW) Python [3.x]
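Because the MVAPICH2-backed build leaves Horovod's Python API unchanged, a standard data-parallel training skeleton applies unmodified. The model, data, and hyperparameters below are placeholders; the script would typically be launched with an MVAPICH2-family mpirun/mpirun_rsh command, one process per GPU.

```python
# Minimal Horovod + PyTorch data-parallel skeleton (placeholder model/data).
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradient averaging goes through Horovod's Allreduce,
# which the MVAPICH2/MVAPICH2-GDR collectives accelerate underneath.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

for _ in range(10):
    optimizer.zero_grad()
    x = torch.randn(32, 1024).cuda()
    loss = model(x).sum()
    loss.backward()
    optimizer.step()
```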
MPI4cuML Features
- Based on cuML 22.02.00
- Includes ready-to-use examples for KMeans, Linear Regression, Nearest Neighbors, and tSVD
- MVAPICH2 support for RAFT 22.02.00
- Enabled cuML’s communication engine, RAFT, to use the MVAPICH2-GDR backend for Python and C++ cuML applications
  - KMeans, PCA, tSVD, RF, LinearModels
- Added a switch between the available communication backends (MVAPICH2 and NCCL)
- Built on top of mpi4py over the MVAPICH2-GDR library
- Tested with
  - Mellanox InfiniBand adapters (FDR and HDR)
  - Various x86-based multi-core platforms (AMD and Intel)
  - NVIDIA GPU A100, V100, and P100
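As a conceptual sketch only, the example below runs cuML KMeans on a per-rank shard under mpi4py. It does not show how MPI4cuML injects the MVAPICH2-GDR communicator into cuML's RAFT handle (see the ready-to-use examples above for that), and the data here are synthetic.

```python
# Conceptual sketch: each MPI rank fits cuML KMeans on its own local shard.
# The MVAPICH2-GDR/RAFT wiring that MPI4cuML performs is not shown here.
import numpy as np
from mpi4py import MPI
from cuml.cluster import KMeans

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Per-rank partition of the dataset (placeholder for real distributed input).
rng = np.random.default_rng(seed=rank)
local_data = rng.standard_normal((10_000, 16), dtype=np.float32)

kmeans = KMeans(n_clusters=8)
kmeans.fit(local_data)
print(f"rank {rank}: cluster centers shape = {kmeans.cluster_centers_.shape}")
```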
OSU-Caffe 0.9 Features
OSU-Caffe derives from Caffe, a deep learning framework that provides the flexibility to design and enhance DL models. All features available in NVIDIA's fork of BVLC Caffe are available in this release. OSU-Caffe adds features and mechanisms that take advantage of HPC resources: it is an MPI-based distributed version that scales out across multi-GPU nodes and uses optimized CUDA-aware MPI to boost performance on GPU clusters. OSU-Caffe redesigns the DL workflow to overlap computation and communication (a conceptual sketch of this overlap follows the feature list below). It further exploits efficient large-message MPI collective operations on GPU buffers that leverage GPUDirect RDMA, CUDA IPC, CUDA kernels, and Core-Direct features.
The features for supporting distributed and large-scale DL training include:
- Based on NVIDIA's Caffe fork (caffe-0.14)
- MPI-based distributed training support
- Efficient scale-out support for multi-GPU node systems
- New workflow that overlaps layer computation with communication
- Efficient parallel file readers to optimize I/O and data movement
  - Takes advantage of the Lustre parallel file system
- Exploits efficient large message collectives in MVAPICH2-GDR 2.2
- Tested with
  - Various CUDA-aware MPI libraries
  - CUDA 7.5
  - Various HPC clusters with K80 GPUs, varying numbers of GPUs per node, and InfiniBand (FDR and EDR) adapters
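The computation/communication overlap described above can be illustrated, independently of Caffe itself, with a non-blocking MPI Allreduce that is posted as soon as a layer's gradient is ready while the backward pass continues. The sketch below uses mpi4py and placeholder gradients; it is a conceptual illustration, not OSU-Caffe code.

```python
# Conceptual illustration of overlapping gradient reduction with the
# remaining backward pass, using mpi4py's non-blocking Allreduce.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def backward_layer(layer_id):
    """Placeholder backward pass producing this layer's gradient."""
    return np.random.rand(1 << 20).astype(np.float32)

grads, reduced, requests = [], [], []
for layer in reversed(range(4)):
    g = backward_layer(layer)
    r = np.empty_like(g)
    # Start reducing this layer's gradient immediately...
    requests.append(comm.Iallreduce(g, r, op=MPI.SUM))
    grads.append(g)      # keep the send buffer alive until Waitall()
    reduced.append(r)
    # ...while the loop proceeds to the previous layer's backward pass.

MPI.Request.Waitall(requests)  # all layers' gradients are now summed across ranks
```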
RDMA-TensorFlow 0.9.1 Features
- Based on Google TensorFlow 1.3.0
- Built with Python 2.7, CUDA 8.0, cuDNN 5.0, GCC 4.8.5, and glibc 2.17
- Compliant with TensorFlow 1.3.0 APIs and applications
- High-performance design with native InfiniBand support at the verbs level for gRPC Runtime (AR-gRPC) and TensorFlow
  - RDMA-based data communication
  - Adaptive communication protocols
  - Dynamic message chunking and accumulation
  - Support for RDMA device selection
- Easily configurable for native InfiniBand and traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
- Tested with
  - Mellanox InfiniBand adapters (e.g., EDR)
  - NVIDIA GPGPU K80
  - CUDA 8.0 and cuDNN 5.0
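Because AR-gRPC accelerates the gRPC runtime underneath TensorFlow, a stock TensorFlow 1.3-style distributed program should apply largely unchanged; whether any additional configuration is required is not covered here. The cluster addresses, job names, and toy model below are placeholders.

```python
# Standard TensorFlow 1.3-style distributed setup (worker 0 shown);
# the parameter-server process would instead call server.join().
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["node0:2222"],
    "worker": ["node1:2222", "node2:2222"],
})

# Each process starts a server for its own job/task.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    w = tf.get_variable("w", shape=[1024, 10])
    loss = tf.reduce_sum(tf.matmul(tf.random_normal([32, 1024]), w))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.train.MonitoredTrainingSession(master=server.target, is_chief=True) as sess:
    for _ in range(10):
        sess.run(train_op)
```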