OSU-Caffe : Scalable Deep Learning on Modern GPU Clusters
Availability of large data sets like ImageNet and massively parallel computation support in modern HPC devices like NVIDIA GPUs have fueled a renewed interest in Deep Learning (DL) algorithms. This has triggered the development of DL frameworks like Caffe, Torch, TensorFlow, and CNTK. However, most DL frameworks have been limited to a single node. In order to scale out DL frameworks and bring HPC capabilities to the DL arena, we propose, OSU-Caffe; a scalable and distributed Caffe adaptation for modern multi-GPU clusters.
OSU-Caffe is a co-design approach of the Caffe framework and a widely used MVAPICH2-GDR, MPI runtime. The co-design methodology involves re-designing Caffe’s workflow to maximize the overlap of computation and communication. It also brings DL-Awareness to the MPI runtime by designing efficient CUDA-Aware collective operations for very large messages.
The OSU-Caffe implementation is based on the NVIDIA's fork of Caffe, which supports CUDNN optimizations. The Co-designed MPI runtime is MVAPICH2-GDR 2.2 , which is an efficient CUDA-Aware MPI runtime that provides support for GPUDirect RDMA and DL-Aware optimizations.