The HiDL team members participated in multiple events during SC'17!
Please refer here for the presentation slides!
Welcome to the High-Performance Deep Learning (HiDL) project created by the Network-Based Computing Laboratory of The Ohio State University. The availability of large data sets such as ImageNet, together with the massively parallel computation offered by modern HPC devices such as NVIDIA GPUs, has fueled a renewed interest in Deep Learning (DL) algorithms. This has triggered the development of DL frameworks such as Caffe, Torch, TensorFlow, and CNTK. However, most DL frameworks have been limited to a single node. The objective of the HiDL project is to exploit modern HPC technologies and solutions to scale out and accelerate DL frameworks.
The OSU-Caffe library is a scalable, distributed adaptation of Caffe for modern multi-GPU clusters. It is built by co-designing the Caffe framework with the widely used MVAPICH2-GDR MPI runtime. On the framework side, the co-design re-designs Caffe's workflow to maximize the overlap of computation and communication. On the runtime side, it brings DL-awareness to MPI through efficient CUDA-aware collective operations for very large messages. The major features of OSU-Caffe 0.9 are given below.
- Based on NVIDIA's Caffe fork (caffe-0.14)
- MPI-based distributed training support
- Efficient scale-out support for systems with multi-GPU nodes
- New workflow to overlap the compute layers and the communication
- Efficient parallel file readers to optimize I/O and data movement
- Takes advantage of the Lustre parallel file system
- Exploits efficient large message collectives in MVAPICH2-GDR 2.2
- Tested with
  - Various CUDA-aware MPI libraries
  - CUDA 7.5
  - Various HPC clusters with K80 GPUs, varying numbers of GPUs per node, and InfiniBand (FDR and EDR) adapters
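The idea of overlapping the compute layers with communication can be illustrated with a small, self-contained Python sketch. It is not OSU-Caffe code and does not use MPI: the helpers `compute_layer_gradients`, `allreduce_average`, and `backward_with_overlap` are hypothetical names, the "workers" are simulated inside one process, and a background thread pool stands in for a non-blocking collective. The sketch only shows the scheduling pattern, in which the gradient reduction for one layer is launched while the backward pass moves on to the next layer.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_layer_gradients(layer, num_workers):
    """Hypothetical stand-in for one layer's backward pass on every worker.

    Worker w produces four gradient values equal to (layer + 1) * (w + 1).
    """
    return [[float((layer + 1) * (w + 1))] * 4 for w in range(num_workers)]


def allreduce_average(per_worker_grads):
    """Simulated all-reduce: element-wise average across all workers."""
    n = len(per_worker_grads)
    return [sum(values) / n for values in zip(*per_worker_grads)]


def backward_with_overlap(num_layers, num_workers):
    """Launch each layer's reduction in the background and keep computing."""
    pending = {}
    with ThreadPoolExecutor(max_workers=2) as comm:
        # The backward pass visits layers from the last to the first.
        for layer in reversed(range(num_layers)):
            grads = compute_layer_gradients(layer, num_workers)
            # Start the reduction and return immediately, so the next
            # layer's gradients are computed while this one is reduced.
            pending[layer] = comm.submit(allreduce_average, grads)
        # Wait for every reduction before the weights would be updated.
        return {layer: future.result() for layer, future in pending.items()}


reduced = backward_with_overlap(num_layers=3, num_workers=4)
for layer in sorted(reduced):
    print(layer, reduced[layer])
```

Because of Python's global interpreter lock, this sketch does not run the two activities truly in parallel; it only mirrors the order of operations. In an MPI program, a non-blocking collective such as `MPI_Iallreduce` could play a comparable role, although the calls OSU-Caffe actually uses are not stated here.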
Upcoming Tutorial: High Performance Distributed Deep Learning for Dummies at Hot Interconnects 2017.
OSU-Caffe 0.9 (based on NVIDIA's Caffe fork, caffe-0.14) is available. It supports MPI-based distributed training and efficient scale-out on multi-GPU nodes, introduces a new workflow that overlaps the compute layers with communication, optimizes I/O and data movement with parallel file readers, takes advantage of Lustre, and exploits the large-message collectives in the MVAPICH2-GDR 2.2 library.