From a8ee57a8ca6584e50f864fa97076fdd943e9e0ce Mon Sep 17 00:00:00 2001
From: Augustin Bussy <augustin.bussy@cscs.ch>
Date: Thu, 30 Apr 2026 11:55:00 +0200
Subject: [PATCH 1/2] Update GPU-aware MPI docs

---
 docs/src/usage.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/docs/src/usage.md b/docs/src/usage.md
index ee14cc31e..b4021bfc0 100644
--- a/docs/src/usage.md
+++ b/docs/src/usage.md
@@ -115,6 +115,20 @@ your ROCm-aware MPI implementation to use multiple AMD GPUs (one GPU per rank).
 If using OpenMPI, the status of ROCm support can be checked via the
 [`MPI.has_rocm()`](@ref) function.
 
+### Safe MPI communication
+When using GPU-aware MPI (CUDA or ROCm), it is recommended to synchronize the device before initiating MPI communication. Because GPU operations are asynchronous with respect to the host, GPU buffers may not be fully updated when a MPI call is issued. This can lead to race conditions and incorrect results.
+
+With CUDA:
+```
+CUDA.synchronize()
+MPI.Isend(my_CuArray, mpi_comm; dest=dest, tag=tag)
+```
+And with AMDGPU:
+```
+AMDGPU.synchronize()
+MPI.Allreduce!(my_ROCArray, +, mpi_comm)
+```
+
 ### Multiple GPUs per node
 
 In a configuration with multiple GPUs per node, mapping GPU ID to node local MPI rank can be achieved either (1) on the application side using node-local communicator (`MPI.COMM_TYPE_SHARED`) or (2) on the system side setting device visibility accordingly.

From 97e10d067d13240e524f70672d421dc980dd0929 Mon Sep 17 00:00:00 2001
From: Augustin Bussy <augustin.bussy@cscs.ch>
Date: Thu, 30 Apr 2026 14:14:53 +0200
Subject: [PATCH 2/2] Update docs/src/usage.md

Co-authored-by: Valentin Churavy <v.churavy@gmail.com>
---
 docs/src/usage.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/usage.md b/docs/src/usage.md
index b4021bfc0..3c87d1916 100644
--- a/docs/src/usage.md
+++ b/docs/src/usage.md
@@ -116,7 +116,7 @@ If using OpenMPI, the status of ROCm support can be checked via the
 [`MPI.has_rocm()`](@ref) function.
 
 ### Safe MPI communication
-When using GPU-aware MPI (CUDA or ROCm), it is recommended to synchronize the device before initiating MPI communication. Because GPU operations are asynchronous with respect to the host, GPU buffers may not be fully updated when a MPI call is issued. This can lead to race conditions and incorrect results.
+When using GPU-aware MPI (CUDA or ROCm), you must synchronize the (task-local) stream before initiating MPI communication. Because GPU operations are asynchronous with respect to the host, GPU buffers may not be fully updated when an MPI call is issued. This can lead to race conditions and incorrect results.
 
 With CUDA:
 ```