
[AUTOGENERATED] [release/2.8] ProcessGroupGloo: fix CUDA tensor stream handling with futures (#170812) #3128

Open
rocm-repo-management-api-6[bot] wants to merge 1 commit into release/2.8 from autogenerated/release/2.8_cherry-pick_pr-3073


Cherry-pick of #3073

… futures (pytorch#170812) (#3073)

Fixes pytorch#155714

There's a very subtle bug in Gloo where CUDA future streams aren't
preserved correctly, leading to silent data corruption when using Gloo
with a CUDA model that uses the DDP reducer.

Test plan:

```python
import os

RANK = int(os.environ["RANK"])
WORLD_SIZE = int(os.environ["WORLD_SIZE"])
os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["LOCAL_RANK"]

import torch
import torch.distributed as dist

torch.manual_seed(0)

dist.init_process_group("gloo")

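N = 10
# all_reduce defaults to SUM, so each element should end up as 0 + 1 + ... + (WORLD_SIZE - 1)
expected = torch.sum(torch.arange(0, WORLD_SIZE, dtype=torch.float)).item()
# every rank fills its tensors with its own rank
t = torch.full((1000000,), RANK, device="cuda", dtype=torch.float)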
tensors = [
    t.clone()
    for _ in range(N)
]

futs = []
for tensor in tensors:
    work = dist.all_reduce(tensor, async_op=True)
    futs.append(work.get_future())

# create high priority stream to do the CPU copy and preempt the default stream
stream = torch.cuda.Stream(priority=-1)

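for fut, tensor in zip(futs, tensors):
    with torch.cuda.stream(stream):
        # wait() should make the current stream wait for the all_reduce result
        fut.wait()
        # .item() copies the last element to the CPU on the current stream
        val = tensor[-1].item()
        assert val == expected, f"Expected {expected}, got {val}"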
```
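
The `expected` value in the script above follows from `all_reduce` using its default SUM reduction: each rank contributes a tensor filled with its own rank, so every element should end up as the sum of all ranks. A minimal pure-Python sketch of that arithmetic, assuming a world size of 4 (the script reads the real value from the environment); `simulate_all_reduce_sum` is a hypothetical helper for illustration, not a `torch.distributed` API:

```python
# Pure-Python model of all_reduce with the default SUM reduction.
# simulate_all_reduce_sum is a hypothetical helper, not part of torch.distributed.
def simulate_all_reduce_sum(per_rank_values):
    """Return the value each rank holds after a SUM all_reduce."""
    total = float(sum(per_rank_values))
    # After the collective, every rank holds the same reduced value.
    return [total] * len(per_rank_values)


WORLD_SIZE = 4  # assumed for illustration

# Each rank starts with a tensor filled with its own rank; model one element per rank.
inputs = [float(rank) for rank in range(WORLD_SIZE)]
outputs = simulate_all_reduce_sum(inputs)

# Same quantity as torch.sum(torch.arange(0, WORLD_SIZE)) in the test script.
expected = float(sum(range(WORLD_SIZE)))

assert all(value == expected for value in outputs)
print(expected)  # 0 + 1 + 2 + 3 = 6.0
```

With the stream bug present, a rank could read a tensor before the reduced result is visible on the reading stream, so the assertion in the real script would see a value other than this sum.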

```
torchrun --nnodes 1 --nproc_per_node=gpu ~/scripts/gloo_future_stream.py
```

```
BACKEND=gloo WORLD_SIZE=4 TEMP_DIR=/tmp/foo pytest test/distributed/test_distributed_spawn.py -v -s -x  -k 'test_ddp_apply_optim_in_backward'
```
Pull Request resolved: pytorch#170812
Approved by: https://github.com/fduwjj, https://github.com/jeffdaily

(cherry picked from commit 398d338)


Co-authored-by: Tristan Rice <rice@fn.lc>

rocm-repo-management-api bot commented Apr 2, 2026

Jenkins build for d79c7cebb01313352b1acf9e91ef6d2c886950bb commit finished as FAILURE
