[AUTOGENERATED] [release/2.8] ProcessGroupGloo: fix CUDA tensor stream handling with futures (#170812) #3128
Open
rocm-repo-management-api-6[bot] wants to merge 1 commit into release/2.8 from
… futures (pytorch#170812) (#3073)

Fixes pytorch#155714

There's a very subtle bug in Gloo where CUDA future streams aren't preserved correctly leading to silent corruption when using Gloo with a CUDA model using the DDP reducer.

Test plan:

```python
import os

RANK = int(os.environ["RANK"])
WORLD_SIZE = int(os.environ["WORLD_SIZE"])
os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["LOCAL_RANK"]

import torch
import torch.distributed as dist

torch.manual_seed(0)
dist.init_process_group("gloo")

N = 10
expected = torch.sum(torch.arange(0, WORLD_SIZE, dtype=torch.float)).item()
t = torch.full((1000000,), RANK, device="cuda", dtype=torch.float)
tensors = [
    t.clone()
    for _ in range(N)
]

futs = []
for tensor in tensors:
    work = dist.all_reduce(tensor, async_op=True)
    futs.append(work.get_future())

# create high priority stream to do the CPU copy and preempt the default stream
stream = torch.cuda.Stream(priority=-1)

for fut, tensor in zip(futs, tensors):
    with torch.cuda.stream(stream):
        fut.wait()
        val = tensor[-1].item()
        assert val == expected, f"Expected {expected}, got {val}"
```

```
torchrun --nnodes 1 --nproc_per_node=gpu ~/scripts/gloo_future_stream.py
```

```
BACKEND=gloo WORLD_SIZE=4 TEMP_DIR=/tmp/foo pytest test/distributed/test_distributed_spawn.py -v -s -x -k 'test_ddp_apply_optim_in_backward'
```

Pull Request resolved: pytorch#170812
Approved by: https://github.com/fduwjj, https://github.com/jeffdaily

(cherry picked from commit 398d338)

## Motivation

<!-- Explain the purpose of this PR and the goals it aims to achieve. -->

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Tristan Rice <rice@fn.lc>
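For readers checking the assertion in the test plan: each rank fills its tensors with its own rank id, and `dist.all_reduce` defaults to a SUM reduction, so every element should end up equal to the sum of all rank ids. The following is a minimal pure-Python sketch of that arithmetic only. The helper name `expected_allreduce_sum` is hypothetical and is not part of the PR or of PyTorch.

```python
def expected_allreduce_sum(world_size: int) -> float:
    """Value each element should hold after a SUM all_reduce
    when rank r contributes a tensor filled with r.

    Hypothetical helper for illustration; mirrors
    torch.sum(torch.arange(0, WORLD_SIZE, dtype=torch.float)).item()
    from the test plan without needing torch.
    """
    return float(sum(range(world_size)))


# The sum 0 + 1 + ... + (W - 1) has the closed form W * (W - 1) / 2.
for ws in (1, 2, 4, 8):
    assert expected_allreduce_sum(ws) == ws * (ws - 1) / 2
```

A mismatch between this value and `tensor[-1].item()` would mean the read on the high-priority stream observed the tensor before the reduced result was visible, which is the silent corruption the PR description refers to.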
Jenkins build for d79c7cebb01313352b1acf9e91ef6d2c886950bb commit finished as FAILURE
Cherry-pick of #3073