MVAPICH-Plus Changelog
----------------------

This file briefly describes the changes to the MVAPICH-Plus software
package. The logs are arranged in the "most recent first" order.

MVAPICH-Plus 4.0
(4.0rc released 11/15/2024)
(4.0b  released 08/16/2024)
(4.0a  released 07/26/2024)

* Overall Features and Enhancements
    - Based on MPICH 4.3.0a1
    - Supports all features of the MPI 4.1 standard
    - Includes an enhanced OFI provider for IB systems, "mverbs;ofi_ucr"
    - Support for
        * Major CPUs (x86-Intel, x86-AMD, and ARM)
        * Major interconnects (IB, Slingshot, OPX, Omni-Path, RoCE, and
          Ethernet/iWARP)
        * Major GPUs (from NVIDIA, AMD, and Intel)
    - Optimized support for inter-node and intra-node pt2pt communication
    - CMA support for intra-node pt2pt operations
    - Optimized algorithms for collectives
    - CUDA-aware MPI (pt2pt and collective) support
      (a minimal point-to-point usage sketch appears at the end of this file)
    - Support for NVIDIA GDRCopy and AMD LargeBar GPU copy operations
    - Optimized IPC-based support for collectives on Intel GPUs
        - Allreduce and Reduce
    - Support for kernel-based Allreduce on NVIDIA/AMD/Intel GPUs
    - On-the-fly compression support for collectives using GPU buffers on
      NVIDIA GPUs
        - Multi-stream ZFP-based compression
        - Allgather, Alltoall, Allreduce, and Reduce_scatter
    - On-the-fly compression support for collectives using GPU buffers on
      AMD GPUs
        - ZFP-based compression
        - Allgather, Alltoall
    - On-the-fly compression support for point-to-point operations using
      GPU buffers on NVIDIA GPUs

MVAPICH-Plus 3.0
(3.0 GA Released 03/08/2024)
(3.0rc Released 12/22/2023)
(3.0b  Released 11/01/2023)
(3.0a2 Released 07/19/2023)
(3.0a  Released 11/10/2022)

* Features and Enhancements
    - Based on MVAPICH 3.0
    - Support for various high-performance communication fabrics
        - InfiniBand, Slingshot-10/11, Omni-Path, OPX, RoCE, and Ethernet
    - Support for a naive CPU staging approach for small-message collectives
    - Tuned naive staging limits for the following systems
        - Frontier@OLCF, Pitzer@OSC, Owens@OSC, Ascend@OSC, Frontera@TACC,
          Lonestar6@TACC, ThetaGPU@ALCF, Polaris@ALCF, Tioga@LLNL
    - Initial support for blocking collectives on NVIDIA and AMD GPUs
      (a GPU-buffer Allreduce sketch appears at the end of this file)
        - Allgather, Allgatherv, Allreduce, Alltoall, Alltoallv, Bcast,
          Gather, Gatherv, Reduce, Reduce_scatter, Scatter, Scatterv,
          Reduce_local, Reduce_scatter_block
    - Initial support for non-blocking collectives on NVIDIA and AMD GPUs
        - Iallgather, Iallgatherv, Iallreduce, Ialltoall, Ialltoallv,
          Ibcast, Igather, Igatherv, Ireduce, Ireduce_scatter, Iscatter,
          Iscatterv
    - Enhanced collective and pt2pt tuning for NVIDIA Grace Hopper systems
    - Enhanced collective tuning for NVIDIA V100, A100, and H100 GPUs
    - Enhanced collective tuning for AMD MI100 and MI250X GPUs
    - Enhanced support for blocking and non-blocking GPU-to-GPU
      point-to-point operations on NVIDIA and AMD GPUs, taking advantage of:
        - NVIDIA GDRCopy and AMD LargeBar support
        - CUDA and ROCm IPC support
    - Enhanced CPU tuning on various HPC systems and architectures
        - Stampede3@TACC, Frontier@OLCF, Lonestar6@TACC
        - AMD Rome, AMD Milan, Intel Sapphire Rapids
    - Tested with
        - Various HPC applications, mini-applications, and benchmarks
        - MPI4cuML (a custom cuML package with MPI support)
    - Tested with CUDA <= 12.3
    - Tested with ROCm <= 5.6.0
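
Usage sketches
--------------

The first sketch shows how an application typically exercises the
CUDA-aware point-to-point support listed in the 4.0 notes above: device
buffers allocated with cudaMalloc are handed directly to MPI_Send/MPI_Recv.
It is a minimal illustration, not an MVAPICH-Plus-specific API; it only
assumes a CUDA-aware build, the standard MPI C interface, and the CUDA
runtime. The file name, message size, and compile line are hypothetical.

    /* cuda_aware_pt2pt.c -- compile with, e.g.,
     * "mpicc cuda_aware_pt2pt.c -lcudart" (exact flags depend on the
     * installation), and run with two or more ranks. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define COUNT 1024

    int main(int argc, char **argv)
    {
        int rank, size;
        double *d_buf;              /* device buffer handed directly to MPI */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        cudaMalloc((void **)&d_buf, COUNT * sizeof(double));
        cudaMemset(d_buf, 0, COUNT * sizeof(double));

        /* With a CUDA-aware MPI, no explicit staging through host memory
         * is needed; the library moves the data from/to the GPU. */
        if (size >= 2) {
            if (rank == 0)
                MPI_Send(d_buf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(d_buf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }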
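
The second sketch illustrates a blocking collective on GPU buffers, matching
the GPU collective support listed in the 3.0 notes above. Again this is only
a hedged illustration assuming a CUDA-aware build; on AMD GPUs the analogous
code would use hipMalloc/hipMemset from the ROCm/HIP runtime. Which internal
path the library chooses (for example IPC-based, kernel-based, or CPU-staged)
is decided by its tuning and is not visible in the code.

    /* gpu_allreduce.c -- MPI_Allreduce on device-resident buffers. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define COUNT 4096

    int main(int argc, char **argv)
    {
        float *d_send, *d_recv;     /* both buffers live in GPU memory */

        MPI_Init(&argc, &argv);

        cudaMalloc((void **)&d_send, COUNT * sizeof(float));
        cudaMalloc((void **)&d_recv, COUNT * sizeof(float));
        cudaMemset(d_send, 0, COUNT * sizeof(float));

        /* Device pointers are passed to the collective as if they were
         * host pointers; the CUDA-aware library handles the data movement. */
        MPI_Allreduce(d_send, d_recv, COUNT, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);

        cudaFree(d_send);
        cudaFree(d_recv);
        MPI_Finalize();
        return 0;
    }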