The HiDL team members participated in multiple events at SC'18!!
The OSU booth features leading speakers from academia and industry!!
Click here for more information!!
Welcome to the High-Performance Deep Learning project created by the Network-Based Computing Laboratory of The Ohio State University. Availability of large data sets like ImageNet and massively parallel computation support in modern HPC devices like NVIDIA GPUs have fueled a renewed interest in Deep Learning (DL) algorithms. This has triggered the development of DL frameworks like Caffe, Torch, TensorFlow, and CNTK. However, most DL frameworks have been limited to a single node. The objective of the HiDL project is to exploit modern HPC technologies and solutions to scale out and accelerate DL frameworks.
OSU-Caffe library is a scalable and distributed Caffe adaptation for modern multi-GPU clusters. This is designed using a co-design approach of the Caffe framework and the widely used MVAPICH2-GDR, MPI runtime. The co-design methodology involves re-designing Caffe’s workflow to maximize the overlap of computation and communication. It brings DL-Awareness to the MPI runtime by designing efficient CUDA-Aware collective operations for very large messages. Major features for OSU-Caffe 0.9 are given below.
- Based on Nvidia's Caffe fork (caffe-0.14)
- MPI-based distributed training support
- Efficient scale-out support for multi-GPU nodes systems
- New workflow to overlap the compute layers and the communication
- Efficient parallel file readers to optimize I/O and data movement
- Takes advantage of Lustre Parallel File System
- Exploits efficient large message collectives in MVAPICH2-GDR 2.2
- Tested with
- Various CUDA-aware MPI libraries
- CUDA 7.5
- Various HPC Clusters with K80 GPUs, varying number of GPUs/node, and InfiniBand (FDR and EDR) adapters
The RDMA-TensorFlow is a derivative of Google’s popular deep learning framework TensorFlow. This package can be used to exploit performance on modern clusters with RDMA-enabled interconnects for distributed deep learning. Major features of RDMA-TensorFlow 0.9.1 are given below.
- Based on Google TensorFlow 1.3.0
- Build with Python 2.7, Cuda 8.0, CUDNN 5.0, gcc 4.8.5, and glibc 2.17
- Compliant with TensorFlow 1.3.0 APIs and applications
- High-performance design with native InfiniBand support at the verbs level for gRPC Runtime (AR-gRPC) and TensorFlow
- RDMA-based data communication
- Adaptive communication protocols
- Dynamic message chunking and accumulation
- Support for RDMA device selection
- Easily configurable for native InfiniBand and the traditional sockets based support (Ethernet and InfiniBand with IPoIB)
- Tested with
- Mellanox InfiniBand adapters (e.g., EDR)
- NVIDIA GPGPU K80
- Tested with CUDA 8.0 and CUDNN 5.0
RDMA-TensorFlow 0.9.1 (Based on Google TensorFlow 1.3.0) with support for high-performance design with native InfiniBand support at the verbs level for gRPC Runtime (AR-gRPC) and TensorFlow. It has advanced features such as RDMA-based data communication, adaptive communication protocols, dynamic message chunking and accumulation, support for RDMA device selection, and so on. [more]
OSU-Caffe 0.9 (based on Nvidia's Caffe fork, caffe-0.14) with support for MPI-based distributed training, efficient scale-out on multi-GPU nodes, new workflow to overlap the compute layers and communication, optimizing I/O and data movement with parallel file readers, taking advantage of Luster, and exploiting large message collectives in MVAPICH2-GDR 2.2 library is available. [more]