OSU-Caffe : Scalable Deep Learning on Modern GPU Clusters
Availability of large data sets like ImageNet and massively parallel computation support in modern HPC devices like NVIDIA GPUs have fueled a renewed interest in Deep Learning (DL) algorithms. This has triggered the development of DL frameworks like Caffe, Torch, TensorFlow, and CNTK. However, most DL frameworks have been limited to a single node. In order to scale out DL frameworks and bring HPC capabilities to the DL arena, we propose, OSU-Caffe; a scalable and distributed Caffe adaptation for modern multi-GPU clusters.
OSU-Caffe is a co-design approach of the Caffe framework and a widely used MVAPICH2-GDR, MPI runtime. The co-design methodology involves re-designing Caffe’s workflow to maximize the overlap of computation and communication. It also brings DL-Awareness to the MPI runtime by designing efficient CUDA-Aware collective operations for very large messages.
The OSU-Caffe implementation is based on the NVIDIA's fork of Caffe, which supports CUDNN optimizations. The Co-designed MPI runtime is MVAPICH2-GDR 2.2 , which is an efficient CUDA-Aware MPI runtime that provides support for GPUDirect RDMA and DL-Aware optimizations.
With the ubiquity of massive computational power Deep Learning is finally getting its momentum. The application of Deep Learning is wide - starting from self driving car to assist doctors to find out early stages of cancer. At present Google’s TensorFlow is one of the most popular Deep Learning framework in the community. Even though distributed TensorFlow can be deployed on modern HPC systems seamlessly, the performance of TensorFlow depends heavily on the efficacy of its communication engine and the underlying network. As the DL network grows in depth and in terms of number of parameters the variable updates between different nodes become a critical bottleneck. gRPC, a widely used Remote Procedure Call framework that enables client and server applications to communicate transparently, is the main communication engine of TensorFlow. After a comprehensive analysis of TensorFlow, we develop an adaptive RDMA based gRPC (i.e., AR-gRPC) that is solely capable of accelerating TensorFlow. RDMA-TenorFlow leverages AR-gRPC to accelerate Deep Learning in modern HPC clusters.