[EP] Low-latency mode batch tokens per destination by MaoZiming · Pull Request #749 · uccl-project/uccl

MaoZiming · 2026-02-23T00:46:39Z

Description

Add batched RDMA send handling and related low-latency path updates, including dispatch-side token-count signaling and receiver-side immediate data handling for batched sends.

MaoZiming · 2026-02-23T01:23:44Z

@laochonlam

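    // Lane 0 waits for every local expert of the responsible rank, then sums
    // the per-expert token counts.
    int num_tokens_to_send = 0;
    if (lane_id == 0) {
      for (int e = 0; e < num_local_experts; ++e) {
        int expert_idx = responsible_rank * num_local_experts + e;
        // Spin until this expert's finish counter reaches FINISHED_SUM_TAG * 2.
        while (ld_acquire_global(atomic_finish_counter_per_expert + expert_idx) !=
               FINISHED_SUM_TAG * 2) {
    #if defined(__HIP_PLATFORM_AMD__) || defined(__HIPCC__)
          __builtin_amdgcn_s_sleep(1);  // brief back-off on AMD GPUs
    #endif
        }
        num_tokens_to_send += atomic_counter_per_expert[expert_idx];
      }
    }
    // Broadcast lane 0's total to the other lanes of the warp.
    num_tokens_to_send = __shfl_sync(WARP_MASK, num_tokens_to_send, 0);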

The bottleneck seems to be the spin-wait above: the token count for a given rank cannot be computed until the tokens of all of that rank's local experts are ready.
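The wait-then-sum pattern in the snippet above can be modeled on the host to make the dependency explicit. This is only a minimal sketch, not the kernel's implementation: `ExpertState`, `publish`, `count_tokens_for_rank`, and `kFinished` are hypothetical names, and `std::atomic` acquire/release stands in for `ld_acquire_global` and the device-side counters.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical sentinel; stands in for FINISHED_SUM_TAG * 2 in the kernel.
constexpr int kFinished = 2;

struct ExpertState {
  std::atomic<int> finish{0};  // set to kFinished once `count` is final
  std::atomic<int> count{0};   // number of tokens routed to this expert
};

// Producer side: write the final count, then release the finish flag so a
// consumer that observes the flag also observes the count.
void publish(ExpertState& s, int tokens) {
  s.count.store(tokens, std::memory_order_relaxed);
  s.finish.store(kFinished, std::memory_order_release);
}

// Consumer side: block on each local expert of `rank` in turn, then sum.
// The total is only known after the slowest expert has finished.
int count_tokens_for_rank(const std::vector<ExpertState>& experts, int rank,
                          int num_local_experts) {
  int total = 0;
  for (int e = 0; e < num_local_experts; ++e) {
    const ExpertState& s = experts[rank * num_local_experts + e];
    while (s.finish.load(std::memory_order_acquire) != kFinished) {
      std::this_thread::yield();  // host-side analogue of the sleep/spin
    }
    total += s.count.load(std::memory_order_relaxed);
  }
  return total;
}
```

In this model, a per-destination batch cannot be sized until every expert belonging to that destination has published, which is the serialization the comment above points at.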

MaoZiming · 2026-02-23T23:15:37Z

Current performance:

[rank 5] Dispatch bandwidth: 18.57 GB/s, avg_t=404.47 us | Combine bandwidth: 41.53 GB/s, avg_t=350.07 us
[rank 4] Dispatch bandwidth: 18.51 GB/s, avg_t=405.92 us | Combine bandwidth: 41.73 GB/s, avg_t=348.38 us
[rank 0] Dispatch bandwidth: 17.17 GB/s, avg_t=437.42 us | Combine bandwidth: 45.94 GB/s, avg_t=316.45 us
[rank 3] Dispatch bandwidth: 16.43 GB/s, avg_t=457.28 us | Combine bandwidth: 48.99 GB/s, avg_t=296.73 us
[rank 1] Dispatch bandwidth: 17.26 GB/s, avg_t=435.27 us | Combine bandwidth: 45.59 GB/s, avg_t=318.87 us
[rank 2] Dispatch bandwidth: 17.77 GB/s, avg_t=422.75 us | Combine bandwidth: 43.88 GB/s, avg_t=331.29 us
[rank 7] Dispatch bandwidth: 17.42 GB/s, avg_t=431.19 us | Combine bandwidth: 45.05 GB/s, avg_t=322.71 us
[rank 6] Dispatch bandwidth: 17.18 GB/s, avg_t=437.27 us | Combine bandwidth: 45.83 GB/s, avg_t=317.15 us
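
One way to sanity-check these lines is to multiply each reported bandwidth by its average time to recover the implied bytes moved per call. The sketch below assumes 1 GB = 10^9 bytes and that the benchmark computes bandwidth as bytes divided by average time; neither is confirmed by the log, and `implied_bytes` is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>

// Implied payload in bytes for one call, given a bandwidth in GB/s and an
// average time in microseconds. Assumes 1 GB = 1e9 bytes (not confirmed).
double implied_bytes(double gb_per_s, double avg_us) {
  return gb_per_s * 1e9 * (avg_us * 1e-6);
}
```

Under those assumptions, rank 5's dispatch figures (18.57 GB/s, 404.47 us) imply roughly 7.5 MB per call, and its combine figures (41.53 GB/s, 350.07 us) imply roughly 14.5 MB per call.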

laochonlam and others added 17 commits February 22, 2026 00:56

[EP] Add batched low-latency RDMA send path

532fc52

Add batched RDMA send handling and related low-latency path updates, including dispatch-side token-count signaling and receiver-side immediate data handling for batched sends.

merge main

00aed70

Merge remote-tracking branch 'origin/main' into pr/batch_low_latency

3e3bf61

Cleanup code and comments

2cbed00

Cleanup code and comments

38a54df

Cleanup code and comments

247af56

Cleanup code and comments

45d2412

Merge branch 'main' into pr/batch_low_latency

ef7782b

seems to work for gb10

ea2ae4f

making reordering buffer on EFA work

2c90a1b

update data layout

1af1558

adding more bits to cmd.bytes

e079700

adding debugging check

70b739e

fix bug transfercmd

144a814

remove meta checks

5b8fe51

fix counter poll

edd5374

revert

623fb7b

MaoZiming force-pushed the ep-per-dst-batching branch from 77de801 to 623fb7b on February 23, 2026 01:22

MaoZiming mentioned this pull request Feb 23, 2026

[Bug] vLLM hangs when running DeepEP low-latency mode over EFA #734

Closed

MaoZiming added 4 commits February 23, 2026 23:15

improve perf

45a4e84

fix occasional torch run stuck issue

c429d3e

remove torch.cuda.set_device

77eb3a6

separate out Gloo object group

07e25e1

Base automatically changed from pr/batch_low_latency to main February 26, 2026 05:24

tmp checkpoint

2f86484

MaoZiming added the wont-merge label Feb 27, 2026

YangZhou1997 closed this May 9, 2026

MaoZiming wants to merge 22 commits into main from ep-per-dst-batching

3 participants
