RDMA-TensorFlow Performance

Machine Specifications

CPU Model CPU Core Info Memory IB Card OS OFED GPU CUDA
Intel E5-2680i v4 2x14 @ 2.4Ghz 512GB Mellanox EDR (100 Gbps) CentOS 7.2 MOFED 4.2-1.2.0 NVIDIA Tesla K80 CUDA 8.0/ CUDNN 5.0

For performance evaluation, we use TensorFlow CNN benchmark. This benchmark generates synthetic image data and measures the performance by the total number of images processed per second (higher is better). RDMA-TensorFlow uses AR-gRPC channel for the distributed training, while the default TensorFlow uses default gRPC channel over IPoIB. All the worker nodes use GPU, while the Parameter Server uses CPU. The graphs shows the batch size per GPU. The total batch size can be computed as (batch size per GPU * total number of GPUs).

Inception4 performance on 4 Nodes

Inception4 on 4 nodes

Inception4 performance on 8 Nodes

Inception4 on 8 nodes

Inception4 performance on 12 Nodes

Inception4 on 12 nodes

RDMA-TensorFlow improves DL training performance by a maximum of 29%, 80%, and 144% compared to default TensorFlow. For example, in 12 Nodes RDMA-TensorFlow improves performance of DL training by 80% (93 vs 51 images) for batch size 16/GPU (total 176) on 12 nodes.

Resnet152 performance on 4 Nodes

Resnet152 on 4 nodes

Resnet152 performance on 8 Nodes

Renset152 on 8 nodes

Resnet152 performance on 12 Nodes

Resnet152 on 12 nodes

RDMA-TensorFlow accelerates performance by 62% (batch size 8/GPU) more compared to default TensorFlow for 4 Nodes. Also, for 8 Nodes, RDMA-TensorFlow improves Resnet152 performance by 32% (batch size 32/GPU) to 147% (batch size 8/GPU). For 12 nodes, RDMA-TensorFlow incurs a maximum speedup of 3x (55 vs 18 images) compared to default TensorFlow. Even for higher batch size (12 nodes) of 32/GPU (total 352) RDMA-TensorFlow improves TensorFlow performance by 82%.

GoogleNet and AlexNet performance on 4 Nodes

googlenet and alexnet on 4 nodes

GoogleNet and AlexNet performance on 8 Nodes

googlenet and alexnet on 8 nodes

The above figures show the training performance comparison for GoogleNet and AlexNet among RDMA-TensorFlow and default TensorFlow on 4 and 8 nodes, respectively. GoogleNet has only 5 Million parameters, whereas AlexNet has about 60 Million parameters. The results shows that with increasing number of nodes RDMA-TensorFlow scales better. For example, for 4 nodes, RDMA-TensorFlow has almost identical performance to default TensorFlow for 16 and 32 batch size per GPU. However, as we go to 8 nodes the performance improvement over default TensorFlow become apparent. We have the same observation for AlexNet as well. Moreover, as shown for 8 nodes, RDMA-TensorFlow process a maximum of 128% (batch size 8/GPU) more images than default TensorFlow for GoogleNet. Although, for large batch size (32/GPU, total 224) the improvement is about 15% (597 vs 517). This is expected as higher batch size and less parameters in GoogleNet results in less network intensive gradient updates. In comparison, for the same batch size (32/GPU) RDMA-TensorFlow shows 89% (124 vs 65) performance improvement for Alexnet compared to default TensorFlow (8 Nodes).

These numbers were taken on OSU RI2 cluster.