1. Overview

With the ubiquity of massive computational power, Deep Learning is finally gaining momentum. Its applications are wide-ranging, from self-driving cars to assisting doctors in detecting early stages of cancer. Google's TensorFlow is currently one of the most popular Deep Learning frameworks in the community. Even though distributed TensorFlow can be deployed on modern HPC systems seamlessly, its performance depends heavily on the efficacy of its communication engine and the underlying network. As DL networks grow in depth and in number of parameters, variable updates between nodes become a critical bottleneck. gRPC, a widely used Remote Procedure Call framework that enables client and server applications to communicate transparently, is the main communication engine of TensorFlow. After a comprehensive analysis of TensorFlow, we developed an adaptive RDMA-based gRPC (i.e., AR-gRPC) designed specifically to accelerate TensorFlow. RDMA-TensorFlow leverages AR-gRPC to accelerate Deep Learning on modern HPC clusters.

2. System Requirements

The RDMA-TensorFlow release requires the following software to be installed on your system:

  • A 64-bit Linux system (the release wheel is built for linux_x86_64)

  • Python 2.7 and pip (the release wheel is built for CPython 2.7)

  • An RDMA-capable network adapter (e.g., a Mellanox InfiniBand HCA such as mlx4 or mlx5) with the corresponding drivers installed

3. Installing RDMA-TensorFlow

To install RDMA-TensorFlow, download the Python wheel (whl) package and run the following command:

pip install rdma_tensorflow-0.9.1-cp27-none-linux_x86_64.whl
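
To verify the installation, a quick sanity check such as the one below can be used. This assumes the wheel installs under the standard tensorflow package name, as stock TensorFlow wheels do.

# Print the TensorFlow version provided by the RDMA-TensorFlow wheel
python -c "import tensorflow as tf; print(tf.__version__)"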

4. Running RDMA-TensorFlow

In this section, we show examples of running the TensorFlow CNN benchmarks using RDMA-TensorFlow.

We assume that you have installed RDMA-TensorFlow and cloned the TensorFlow CNN benchmarks (TF_CNN) git repository (please use commit id 203f73adf18dc324188a276bc0f3f7c78a6f48bf when cloning this repo); a sketch of the clone steps is shown below.
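
If you have not cloned the repository yet, the steps below show one way to do it. The URL assumes the upstream tensorflow/benchmarks repository on GitHub; adjust it if you use a mirror.

git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks
git checkout 203f73adf18dc324188a276bc0f3f7c78a6f48bf
cd ..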

This example shows how to run the resnet50 DNN with 1 parameter server and 2 workers. Change the --model parameter to run other DNNs available in the TensorFlow CNN benchmarks.

Start the Parameter Server:

python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server --job_name=ps --ps_hosts=10.3.1.1:50000 --worker_hosts=10.3.1.2:50001,10.3.1.9:50002 --task_index=0 --server_protocol=grpc

Start Worker 1:

python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server --job_name=worker --ps_hosts=10.3.1.1:50000 --worker_hosts=10.3.1.2:50001,10.3.1.9:50002 --task_index=0 --server_protocol=grpc

Start Worker 2:

python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server --job_name=worker --ps_hosts=10.3.1.1:50000 --worker_hosts=10.3.1.2:50001,10.3.1.9:50002 --task_index=1 --server_protocol=grpc
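
For convenience, the three commands above can be wrapped in a small launcher script. The sketch below is a hypothetical example: it assumes passwordless ssh to the hosts 10.3.1.1, 10.3.1.2, and 10.3.1.9 used above, and that RDMA-TensorFlow and the benchmarks repository are set up identically (in the home directory) on every node. Adapt the host list and paths to your cluster.

#!/bin/bash
# Hypothetical launcher for the 1 parameter server + 2 worker example above.
PS_HOSTS=10.3.1.1:50000
WORKER_HOSTS=10.3.1.2:50001,10.3.1.9:50002
CMD="python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
     --local_parameter_device=cpu --num_gpus=1 --batch_size=32 --model=resnet50 \
     --variable_update=parameter_server --ps_hosts=$PS_HOSTS \
     --worker_hosts=$WORKER_HOSTS --server_protocol=grpc"

# Start the parameter server and the two workers on their respective nodes.
ssh 10.3.1.1 "$CMD --job_name=ps     --task_index=0" &
ssh 10.3.1.2 "$CMD --job_name=worker --task_index=0" &
ssh 10.3.1.9 "$CMD --job_name=worker --task_index=1" &
wait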

5. Configuration

RDMA-TensorFlow has been tested with the default options and everything should work right out of the box. However, we expose certain environment variables to give users flexibility. Users can export these environment variables with the permitted values described below.

5.1. Enable/Disable RDMA

By default, RDMA-TensorFlow has RDMA enabled. However, users have the flexibility to disable RDMA and fall back to the default communication engine, as shown in the example after this list.

  • TF_RDMA_ENABLED

    • Default: true

    • To disable RDMA features set this to false.

      export TF_RDMA_ENABLED=false
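
For example, to compare AR-gRPC against the default gRPC engine, disable RDMA in the shell on every node and then rerun the same parameter server and worker commands from Section 4:

      # Fall back to the default communication engine for a baseline run
      export TF_RDMA_ENABLED=false
      # ... then launch the parameter server and workers exactly as in Section 4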
      

5.2. Selecting HCA

By default, RDMA-TensorFlow selects the "mlx5_0" HCA. However, users have the option to select a different HCA device; the note after this list shows how to list the devices available on a node.

  • TF_RDMA_DEV_NAME

    • Default: mlx5_0

    • To select the intended HCA set this to the HCA name.

      export TF_RDMA_DEV_NAME=mlx4_0
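
If you are not sure which HCA names are available on a node, the libibverbs utilities (if installed) can list them; for example:

      # List the RDMA devices visible on this node (requires the libibverbs utilities)
      ibv_devices
      # Point RDMA-TensorFlow at the chosen device, e.g.:
      export TF_RDMA_DEV_NAME=mlx5_0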

5.3. Selecting RDMA Message Chunk Size

RDMA-TensorFlow chunks large tensors and sends the chunks in a non-blocking, parallel fashion to saturate the available network bandwidth. We selected the default chunk size through empirical experimentation. However, users can experiment with different chunk sizes on their cluster to tune the performance of distributed RDMA-TensorFlow training.

  • TF_RDMA_CHUNK_SIZE

    • Default: 4 MB

    • Set this to the required RDMA chunk size. The value must be given in bytes.

      # Set the chunk size to 2 MB
      export TF_RDMA_CHUNK_SIZE=2097152
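
Since the value is given in bytes, shell arithmetic helps avoid conversion mistakes; for example:

      # 8 MB chunk size, computed as 8 * 1024 * 1024 bytes (illustrative value)
      export TF_RDMA_CHUNK_SIZE=$((8*1024*1024))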
      

5.4. Selecting RDMA Message Accumulation Size

RDMA-TensorFlow coalesces small payloads and sends them together instead of sending them individually. A size smaller than 16 KB is generally recommended, as RDMA-TensorFlow uses an eager protocol to send these small messages. Users can experiment with different message accumulation sizes on their cluster to tune the performance of distributed RDMA-TensorFlow training.

  • TF_RDMA_ACC_SIZE

    • Default: 8 KB

    • Set this to the required RDMA message accumulation size. The value must be given in bytes.

      # Set the message accumulation size to 16 KB
      export TF_RDMA_ACC_SIZE=16384
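
One simple way to experiment is to sweep a few accumulation sizes and time a short training run for each. The sketch below assumes a hypothetical run_benchmark.sh wrapper that launches the Section 4 commands and exits after a fixed number of steps.

      # Try 4 KB, 8 KB (default), and 16 KB accumulation sizes
      for acc in 4096 8192 16384; do
          export TF_RDMA_ACC_SIZE=$acc
          echo "=== TF_RDMA_ACC_SIZE=$acc ==="
          ./run_benchmark.sh   # hypothetical wrapper around the Section 4 commands
      done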
      

5.5. Running with Large DL Models

For large DL models such as VGG or AlexNet, or for large-scale experiments, the size of the RDMA pinned buffers needs to be increased explicitly.

  • TF_RDMA_BUF_SIZE

    • Default: 1 GB

    • To allow larger DL models to run without failure, allocate more memory. The user-defined size must be larger than the default and a multiple of 1 GB. If you encounter an RPC message parsing error, increase the RDMA buffer size and rerun.

      # Allocate 10 GB memory 
      export TF_RDMA_BUF_SIZE=10737418240
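
Putting it together, a job script might export all of the tunables in one place before launching training. The values below are illustrative, not recommendations:

      # Illustrative RDMA-TensorFlow environment setup for a large DL model
      export TF_RDMA_ENABLED=true
      export TF_RDMA_DEV_NAME=mlx5_0
      export TF_RDMA_CHUNK_SIZE=$((4*1024*1024))      # 4 MB (default)
      export TF_RDMA_ACC_SIZE=$((8*1024))             # 8 KB (default)
      export TF_RDMA_BUF_SIZE=$((10*1024*1024*1024))  # 10 GB, a multiple of 1 GB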