MVAPICH-Plus Changelog
----------------------

This file briefly describes the changes to the MVAPICH-Plus software
package. The logs are arranged in the "most recent first" order.

MVAPICH-Plus 4.0
(4.0rc released 11/15/2024)
(4.0b  released 08/16/2024)
(4.0a  released 07/26/2024)

* Overall Features and Enhancements
    - Based on MPICH 4.3.0a1
    - Supports all features of the MPI 4.1 standard
    - Includes an enhanced OFI provider for IB systems, "mverbs;ofi_ucr"
    - Support for
        * Major CPUs (x86-Intel, x86-AMD, and ARM)
        * Major interconnects (IB, Slingshot, OPX, Omni-Path, RoCE, and
          Ethernet/iWARP)
        * Major GPUs (from NVIDIA, AMD, and Intel)
    - Optimized support for inter-node and intra-node pt2pt communication
    - CMA support for intra-node pt2pt operations
    - Optimized algorithms for collectives
    - CUDA-aware MPI (pt2pt and collective) support
      (a minimal point-to-point usage sketch appears at the end of this file)
    - Support for NVIDIA GDRCopy and AMD LargeBar GPU copy operations
    - Optimized IPC-based support for collectives on Intel GPUs
        - Allreduce and Reduce
    - Support for kernel-based Allreduce on NVIDIA/AMD/Intel GPUs
    - On-the-fly compression support for collectives using GPU buffers on
      NVIDIA GPUs
        - Multi-stream ZFP-based compression
        - Allgather, Alltoall, Allreduce, and Reduce_scatter
    - On-the-fly compression support for collectives using GPU buffers on
      AMD GPUs
        - ZFP-based compression
        - Allgather, Alltoall
    - On-the-fly compression support for point-to-point operations using
      GPU buffers on NVIDIA GPUs

MVAPICH-Plus 3.0
(3.0 GA Released 03/08/2024)
(3.0rc Released 12/22/2023)
(3.0b  Released 11/01/2023)
(3.0a2 Released 07/19/2023)
(3.0a  Released 11/10/2022)

* Features and Enhancements
    - Based on MVAPICH 3.0
    - Support for various high-performance communication fabrics
        - InfiniBand, Slingshot-10/11, Omni-Path, OPX, RoCE, and Ethernet
    - Support for a naive CPU staging approach for small-message collectives
    - Tuned naive staging limits for the following systems
        - Frontier@OLCF, Pitzer@OSC, Owens@OSC, Ascend@OSC, Frontera@TACC,
          Lonestar6@TACC, ThetaGPU@ALCF, Polaris@ALCF, Tioga@LLNL
    - Initial support for blocking collectives on NVIDIA and AMD GPUs
      (a GPU-buffer Allreduce sketch appears at the end of this file)
        - Allgather, Allgatherv, Allreduce, Alltoall, Alltoallv, Bcast,
          Gather, Gatherv, Reduce, Reduce_scatter, Scatter, Scatterv,
          Reduce_local, Reduce_scatter_block
    - Initial support for non-blocking collectives on NVIDIA and AMD GPUs
        - Iallgather, Iallgatherv, Iallreduce, Ialltoall, Ialltoallv,
          Ibcast, Igather, Igatherv, Ireduce, Ireduce_scatter, Iscatter,
          Iscatterv
    - Enhanced collective and pt2pt tuning for NVIDIA Grace Hopper systems
    - Enhanced collective tuning for NVIDIA V100, A100, and H100 GPUs
    - Enhanced collective tuning for AMD MI100 and MI250X GPUs
    - Enhanced support for blocking and non-blocking GPU-to-GPU
      point-to-point operations on NVIDIA and AMD GPUs, taking advantage of:
        - NVIDIA GDRCopy and AMD LargeBar support
        - CUDA and ROCm IPC support
    - Enhanced CPU tuning on various HPC systems and architectures
        - Stampede3@TACC, Frontier@OLCF, Lonestar6@TACC
        - AMD Rome, AMD Milan, Intel Sapphire Rapids
    - Tested with
        - Various HPC applications, mini-applications, and benchmarks
        - MPI4cuML (a custom cuML package with MPI support)
    - Tested with CUDA <= 12.3
    - Tested with ROCm <= 5.6.0
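
Usage sketches
--------------

The first sketch shows how an application typically exercises the
CUDA-aware point-to-point support listed in the 4.0 notes above: device
buffers allocated with cudaMalloc are handed directly to MPI_Send/MPI_Recv.
It is a minimal illustration, not an MVAPICH-Plus-specific API; it only
assumes a CUDA-aware build, the standard MPI C interface, and the CUDA
runtime. The file name, message size, and compile line are hypothetical.

    /* cuda_aware_pt2pt.c -- compile with, e.g.,
     * "mpicc cuda_aware_pt2pt.c -lcudart" (exact flags depend on the
     * installation), and run with two or more ranks. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define COUNT 1024

    int main(int argc, char **argv)
    {
        int rank, size;
        double *d_buf;              /* device buffer handed directly to MPI */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        cudaMalloc((void **)&d_buf, COUNT * sizeof(double));
        cudaMemset(d_buf, 0, COUNT * sizeof(double));

        /* With a CUDA-aware MPI, no explicit staging through host memory
         * is needed; the library moves the data from/to the GPU. */
        if (size >= 2) {
            if (rank == 0)
                MPI_Send(d_buf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(d_buf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }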
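
The second sketch illustrates a blocking collective on GPU buffers, matching
the GPU collective support listed in the 3.0 notes above. Again this is only
a hedged illustration assuming a CUDA-aware build; on AMD GPUs the analogous
code would use hipMalloc/hipMemset from the ROCm/HIP runtime. Which internal
path the library chooses (for example IPC-based, kernel-based, or CPU-staged)
is decided by its tuning and is not visible in the code.

    /* gpu_allreduce.c -- MPI_Allreduce on device-resident buffers. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define COUNT 4096

    int main(int argc, char **argv)
    {
        float *d_send, *d_recv;     /* both buffers live in GPU memory */

        MPI_Init(&argc, &argv);

        cudaMalloc((void **)&d_send, COUNT * sizeof(float));
        cudaMalloc((void **)&d_recv, COUNT * sizeof(float));
        cudaMemset(d_send, 0, COUNT * sizeof(float));

        /* Device pointers are passed to the collective as if they were
         * host pointers; the CUDA-aware library handles the data movement. */
        MPI_Allreduce(d_send, d_recv, COUNT, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);

        cudaFree(d_send);
        cudaFree(d_recv);
        MPI_Finalize();
        return 0;
    }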