High-Performance Deep Learning

Journals (6)
1	Q. Anthony, B. Michalowicz, J. Hatef, L. Xu, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, Understanding and Characterizing Communication Characteristics for Distributed Transformer Models, IEEE Micro, Jan 2025.
2	A. Jain, N. Alnaasan, A. Shafi, H. Subramoni, and DK Panda, Optimizing Distributed DNN Training using CPUs and BlueField-2 DPUs, IEEE Micro, doi: 10.1109/MM.2021.3139027.
3	DK Panda, H. Subramoni, C. Chu, and M. Bayatpour, The MVAPICH project: Transforming Research into High-Performance MPI Library for HPC Community, Journal of Computational Science (JOCS), Special Issue on Translational Computer Science, Oct 2020.
4	Ammar Awan, A. Jain, C. Chu, H. Subramoni, and DK Panda, Communication Profiling and Characterization of Deep Learning Workloads on Clusters with High-Performance Interconnects, IEEE Micro, vol. 40, no. 1, pp. 35-43, 1 Jan.-Feb. 2020.
5	Ammar Awan, K. Vadambacheri Manian, C. Chu, H. Subramoni, and DK Panda, Optimized Large-Message Broadcast for Deep Learning Workloads: MPI, MPI+NCCL, or NCCL2?, Parallel Computing, Volume 85, July 2019, Pages 141-152, https://doi.org/10.1016/j.parco.2019.03.005.
6	X. Lu, H. Shi, R. Biswas, M. H. Javed, and DK Panda, DLoBD: A Comprehensive Study of Deep Learning over Big Data Stacks on HPC Clusters, IEEE Transactions on Multi-Scale Computing Systems, Jun 2018.

Conferences & Workshops (54)
1	Design and Optimization of GPU-Aware MPI Allreduce Using Direct Sendrecv Communication C. Chen, J. Yao, H. Subramoni, and DK Panda, 54th International Conference on Parallel Processing, Sep 2025 [Bib - Plain]
2	Characterizing Communication Patterns in Distributed Large Language Model Inference L. Xu, K. Suresh, Q. Anthony, N. Alnaasan, and DK Panda, IEEE Hot Interconnects Symposium 2025, Aug 2025 [Bib - Plain]
3	Design and Implementation of a GPU-Aware MPI Collective Library for Intel GPUs C. Chen, G. Kuncham, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2025, Jun 2025 [Research Poster] [Bib - Plain]
4	Design and Implementation of MPI Collective Operations for Large Message Communication on AMD GPUs C. Chen, L. Xu, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2025, Jun 2025 [Research Poster] [Bib - Plain]
5	Unified Designs of Multi-rail-aware MPI Allreduce and Alltoall Operations Across Diverse GPU and Interconnect Systems C. Chen, J. Yao, L. Xu, H. Subramoni, and DK Panda, 39th IEEE International Parallel & Distributed Processing Symposium, Jun 2025 [Bib - Plain]
6	Training ultra long context language model with fully pipelined distributed transformer J. Yao, S. Jacobs, M. Tanaka, O. Ruwase, H. Subramoni, and DK Panda, The Eighth Annual Conference on Machine Learning and Systems, May 2025 [Bib - Plain]
7	HyperSack: Distributed Hyperparameter Optimization for Deep Learning using Resource-Aware Scheduling on Heterogeneous GPU Systems N. Alnaasan, B. Ramesh, J. Yao, A. Shafi, H. Subramoni, and DK Panda, 31st IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2024 [Bib - Plain]
8	Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning L. Xu, Q. Anthony, J. Hatef, A. Shafi, H. Subramoni, and DK Panda, 31st IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2024 [Bib - Plain]
9	Design and Implementation of Kernel-based MPI Reduction Operations for Intel GPUs C. Chen, G. Kuncham, H. Subramoni, and DK Panda, 31st IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2024 [Bib - Plain]
10	HARVEST-2.0: High-Performance Vision Framework for End-to-end Preprocessing, Training, Inference, and Visualization N. Alnaasan, A. Potlapally, T. Chen, M. Lieber, A. Shafi, H. Subramoni, S. Shearer, and DK Panda, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'24), Nov 2024 [Research Poster] [Bib - Plain]
11	Characterizing Communication in Distributed Parameter-Efficient Fine-Tuning for Large Language Models N. Alnaasan, H. Huang, A. Shafi, H. Subramoni, and DK Panda, IEEE Hot Interconnects Symposium 2024, Aug 2024 [Bib - Plain]
12	Demystifying the Communication Characteristics for Distributed Transformer Models Q. Anthony, B. Michalowicz, J. Hatef, L. Xu, M. Abduljabbar, A. Shafi, H. Subramoni, and DK Panda, IEEE Hot Interconnects Symposium 2024, Aug 2024 [Q. Anthony and B. Michalowicz are co-lead authors] [Bib - Plain]
13	The Case for Co-Designing Model Architectures with Hardware Q. Anthony, J. Hatef, Deepak Narayanan, Stella Biderman, Stas Bekman, Junqi Yin, A. Shafi, H. Subramoni, and DK Panda, 53rd International Conference on Parallel Processing, Aug 2024 [Bib - Plain]
14	A Novel LLM-enabled Framework for Accelerating the Creation of Knowledge Graphs for HPC P. Kousha, V. Sathu, H. M. Han, J. Jani, N. Alnaasan, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2024 [Bib - Plain]
15	Design and Implementation of an IPC-based Collective MPI Library for Intel GPUs C. Chen, G. Kuncham, P. Kousha, H. Subramoni, and DK Panda, Practice and Experience in Advanced Research Computing, Jul 2024 [Bib - Plain]
16	Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference J. Yao, Q. Anthony, A. Shafi, H. Subramoni, and DK Panda, 38th IEEE International Parallel & Distributed Processing Symposium, May 2024 [Bib - Plain]
17	High-Performance Semi-Supervised Learning with HARVEST: A Distributed Computer Vision Framework for Expert Labeling N. Alnaasan, M. Lieber, A. Shafi, H. Subramoni, S. Shearer, and DK Panda, ISC HIGH PERFORMANCE 2024, May 2024 [Research Poster] [Best Poster Award] [Bib - Plain]
18	Accelerating MPI AllReduce Communication with Efficient GPU-Based Compression Schemes on Modern GPU Clusters Q. Zhou, B. Ramesh, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2024, May 2024 [Bib - Plain]
19	Accelerating Large Language Model Training with Hybrid GPU-based Compression L. Xu, Q. Anthony, Q. Zhou, N. Alnaasan, R. Gulhane, A. Shafi, H. Subramoni, and DK Panda, IEEE/ACM International Symposium on Cluster, Cloud, and Internet Computing 2024, May 2024 [Bib - Plain]
20	Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference J. Yao, N. Alnaasan, T. Chen, A. Shafi, H. Subramoni, and DK Panda, 30th IEEE International Conference on High Performance Computing, Data, & Analytics, Dec 2023 [Bib - Plain]
21	HARVEST: High-Performance Artificial Vision Framework for Expert Labeling using Semi-Supervised Training N. Alnaasan, M. Lieber, A. Shafi, H. Subramoni, S. Shearer, and DK Panda, 2023 IEEE International Conference on Big Data, Dec 2023 [Bib - Plain]
22	MPI-xCCL: A Portable MPI Library over Collective Communication Libraries for Various Accelerators C. Chen, K. Khorassani, P. Kousha, Q. Zhou, J. Yao, H. Subramoni, and DK Panda, Sixth Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Nov 2023 [Bib - Plain]
23	MCR-DL: Mix-and-Match Communication Runtime for Deep Learning Q. Anthony, Ammar Awan, J. Rasley, Y. He, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023 [Bib - Plain]
24	Accelerating Distributed Deep Learning Training with Compression Assisted Allgather and Reduce-Scatter Communication Q. Zhou, Q. Anthony, L. Xu, A. Shafi, M. Abduljabbar, H. Subramoni, and DK Panda, 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS '23), May 2023 [Bib - Plain]
25	ScaMP: Scalable Meta-Parallelism for Deep Learning Search Q. Anthony, L. Xu, A. Shafi, H. Subramoni, and DK Panda, The 23rd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, May 2023 [Bib - Plain]
26	AccDP: Accelerated Data-Parallel Distributed DNN Training for Modern GPU-Based HPC Clusters N. Alnaasan, A. Jain, A. Shafi, H. Subramoni, and DK Panda, 29th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2022 [Bib - Plain]
27	OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems N. Alnaasan, A. Jain, A. Shafi, H. Subramoni, and DK Panda, 23rd Parallel and Distributed Scientific and Engineering Computing Workshop (PDSEC) at IPDPS22, May 2022 [Bib - Plain]
28	Hy-Fi: Hybrid Five-Dimensional Parallel DNN Training on High-Performance GPU Clusters A. Jain, A. Shafi, Q. Anthony, P. Kousha, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2022, May 2022 [Bib - Plain]
29	OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries and Machine Learning Applications on HPC Systems N. Alnaasan, A. Jain, A. Shafi, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2022, May 2022 [Research Poster] [Best Poster Award] [Bib - Plain]
30	Towards Architecture-aware Hierarchical Communication Trees on Modern HPC Systems B. Ramesh, J. Hashmi, S. Xu, A. Shafi, M. Ghazimirsaeed, M. Bayatpour, H. Subramoni, and DK Panda, 28th IEEE International Conference on High Performance Computing, Data, Analytics, and Data Science, Dec 2021 [Best Paper Finalist] [Bib - Plain]
31	Accelerating CPU-based Distributed DNN Training on Modern HPC Clusters using BlueField-2 DPUs A. Jain, N. Alnaasan, A. Shafi, H. Subramoni, and DK Panda, 28th IEEE Hot Interconnects, Aug 2021 [Bib - Plain]
32	Scaling Single-Image Super-Resolution Training on Modern HPC Clusters: Early Experiences Q. Anthony, L. Xu, H. Subramoni, and DK Panda, Scalable Deep Learning over Parallel And Distributed Infrastructures, May 2021 [Bib - Plain]
33	Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters Q. Zhou, C. Chu, N. Senthil Kumar, P. Kousha, M. Ghazimirsaeed, H. Subramoni, and DK Panda, 35th IEEE International Parallel & Distributed Processing Symposium, May 2021 [Best Paper Finalist] [Bib - Plain]
34	SUPER: SUb-Graph Parallelism for TransformERs A. Jain, T. Moon, T. Benson, H. Subramoni, S. Jacobs, DK Panda, and B. Essen, 35th IEEE International Parallel & Distributed Processing Symposium, May 2021 [Bib - Plain]
35	Blink: Towards Efficient RDMA-based Communication Coroutines for Parallel Python Applications A. Shafi, J. Hashmi, H. Subramoni, and DK Panda, 27th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2020 [Bib - Plain]
36	GEMS: GPU Enabled Memory Aware Model Parallelism System for Distributed DNN Training A. Jain, Ammar Awan, A. Aljuhani, J. Hashmi, Q. Anthony, H. Subramoni, DK Panda, R. Machiraju, and A. Parwani, SC 2020, Nov 2020 [Bib - Plain]
37	NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems C. Chu, P. Kousha, Ammar Awan, K. Khorassani, H. Subramoni, and DK Panda, The 34th ACM International Conference on Supercomputing (ICS-2020), Jun 2020 [Bib - Plain]
38	HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow Ammar Awan, A. Jain, Q. Anthony, H. Subramoni, and DK Panda, ISC HIGH PERFORMANCE 2020, Jun 2020 [Bib - Plain]
39	Efficient Training of Semantic Image Segmentation on Summit using Horovod and MVAPICH2-GDR Q. Anthony, Ammar Awan, A. Jain, H. Subramoni, and DK Panda, Scalable Deep Learning over Parallel and Distributed Infrastructures (ScaDL) at IPDPS '20, May 2020 [Bib - Plain]
40	Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera A. Jain, Ammar Awan, H. Subramoni, and DK Panda, 3rd Deep Learning on Supercomputers Workshop (DLS) at SC19, Nov 2019 [Bib - Plain]
41	Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters A. Jain, Ammar Awan, Q. Anthony, H. Subramoni, and DK Panda, 21st IEEE International Conference on Cluster Computing, Sep 2019 [Bib - Plain]
42	Communication Profiling and Characterization of Deep-Learning Workloads on Clusters With High-Performance Interconnects Ammar Awan, A. Jain, C. Chu, H. Subramoni, and DK Panda, 26th Symposium on High-Performance Interconnects (HotI '19), Aug 2019 [Bib - Plain]
43	Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation Ammar Awan, J. Bedorf, C. Chu, H. Subramoni, and DK Panda, The 19th Annual IEEE/ACM International Symposium in Cluster, Cloud, and Grid Computing (CCGRID 2019), May 2019 [Bib - Plain]
44	Accelerating TensorFlow with Adaptive RDMA-based gRPC R. Biswas, X. Lu, and DK Panda, 25th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2018 [Bib - Plain]
45	OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training Ammar Awan, C. Chu, H. Subramoni, X. Lu, and DK Panda, 25th IEEE International Conference on High Performance Computing, Data, and Analytics, Dec 2018 [Bib - Plain]
46	Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL? Ammar Awan, C. Chu, H. Subramoni, and DK Panda, The EuroMPI 2018 Conference, Sep 2018 [Bib - Plain]
47	Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences R. Biswas, X. Lu, and DK Panda, The Ninth Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, Mar 2018 [Bib - Plain]
48	An In-depth Performance Characterization of CPU- and GPU-based DNN Training on Modern Architectures Ammar Awan, H. Subramoni, and DK Panda, 3rd Workshop on Machine Learning in High Performance Computing Environments, held in conjunction with SC17, Nov 2017 [Bib - Plain]
49	Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks X. Lu, H. Shi, M. H. Javed, R. Biswas, and DK Panda, The 25th Annual Symposium on High-Performance Interconnects (HotI), Aug 2017 [Bib - Plain]
50	MPI-GDS: High Performance MPI Designs with GPUDirect-aSync for CPU-GPU Control Flow Decoupling A. Venkatesh, C. Chu, K. Hamidouche, S. Potluri, Davide Rossetti, and DK Panda, ICPP 2017 : International Conference on Parallel Processing, Aug 2017 [Bib - Plain]
51	Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning C. Chu, X. Lu, Ammar Awan, H. Subramoni, J. Hashmi, Bracy Elton, and DK Panda, ICPP 2017 : International Conference on Parallel Processing, Aug 2017 [Bib - Plain]
52	S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters Ammar Awan, K. Hamidouche, J. Hashmi, and DK Panda, 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Feb 2017 [Slides] [Bib - Plain]
53	Re-designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters D. Banerjee, K. Hamidouche, and DK Panda, 8th IEEE International Conference on Cloud Computing Technology and Science (IEEE CloudCom '16), Dec 2016 [Bib - Plain]
54	Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning Ammar Awan, K. Hamidouche, A. Venkatesh, and DK Panda, The 23rd European MPI Users' Group Meeting (EuroMPI 16), Sep 2016 [Best Paper Runner-Up] [Bib - Plain]

Ph.D. Dissertations (3)
1	C. Chu, Accelerator-enabled Communication Middleware for Large-scale Heterogeneous HPC Systems with Modern Interconnects, Jul 2020
2	J. Hashmi, Designing High Performance Shared-Address-Space and Adaptive Communication Middlewares for Next-Generation HPC Systems, Apr 2020
3	Ammar Awan, Co-designing Communication Middleware and Deep Learning Frameworks for High-Performance DNN Training on HPC Systems, Apr 2020

M.S. Theses (3)
1	S. Srivastava, MVAPICH2-AutoTune: An Automatic Collective Tuning Framework for the MVAPICH2 MPI Library, May 2021
2	N. Senthil Kumar, Designing Optimized MPI+NCCL Hybrid Collective Communication Routines for Dense Many-GPU Clusters, May 2021
3	R. Biswas, Benchmarking and Accelerating TensorFlow-based Deep Learning on Modern HPC Systems, Jul 2018

High-Performance Deep Learning (HiDL)

This page lists the group's publications on designing high-performance deep learning (DL) frameworks and on co-designing MPI runtimes to support scalable DL efficiently.
