[EP] Low-latency mode batch tokens per destination#749

Closed
MaoZiming wants to merge 22 commits into main from ep-per-dst-batching

Conversation

@MaoZiming
Member

Description

Please include a summary of the changes and the related issue.

Fixes # (issue)

Type of Change

  • Bug fix
  • New feature
  • Documentation update

How Has This Been Tested?

Include any tests here.

  • Unit tests
  • Integration tests
  • Manual testing

Checklist

  • My code follows the style guidelines, e.g. format.sh.
  • I have run build_and_install.sh to verify compilation.
  • I have removed redundant variables and comments.
  • I have updated the documentation.
  • I have added tests.

@MaoZiming
Member Author

@laochonlam

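    // Lane 0 waits for every local expert on responsible_rank to finish,
    // then sums the per-expert token counts.
    int num_tokens_to_send = 0;
    if (lane_id == 0) {
      for (int e = 0; e < num_local_experts; ++e) {
        int expert_idx = responsible_rank * num_local_experts + e;
        // Spin until this expert's finish counter reaches FINISHED_SUM_TAG * 2.
        while (ld_acquire_global(atomic_finish_counter_per_expert + expert_idx) !=
               FINISHED_SUM_TAG * 2)
    #if defined(__HIP_PLATFORM_AMD__) || defined(__HIPCC__)
          __builtin_amdgcn_s_sleep(1);  // back off briefly while polling on AMD
    #else
          ;  // plain busy-wait elsewhere
    #endif
        num_tokens_to_send += atomic_counter_per_expert[expert_idx];
      }
    }
    // Broadcast lane 0's total to the other lanes in the warp.
    num_tokens_to_send = __shfl_sync(WARP_MASK, num_tokens_to_send, 0);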

The bottleneck seems to be this wait: before anything is sent to a given rank, lane 0 spins until the tokens for all of that rank's local experts are ready.
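To make the cost of that wait concrete, here is a small host-side model. It is only a sketch and not code from this PR. The function names and the `ready_us` input (the time at which each local expert's tokens become ready) are hypothetical. It assumes the batched send for a destination cannot start until the slowest expert is done, which is how the loop above behaves.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical host-side model, not code from this PR.
// ready_us[e] is the time (in microseconds) at which local expert e of one
// destination rank has all of its tokens ready.

// With per-destination batching, the single send can start only once the
// slowest expert is ready, i.e. at the maximum of the ready times.
double batched_send_start_us(const std::vector<double>& ready_us) {
  assert(!ready_us.empty());
  return *std::max_element(ready_us.begin(), ready_us.end());
}

// Sum over experts of how long their already-ready tokens sit idle while
// waiting for the slowest expert of the same destination.
double total_idle_us(const std::vector<double>& ready_us) {
  const double start = batched_send_start_us(ready_us);
  double idle = 0.0;
  for (double t : ready_us) idle += start - t;
  return idle;
}
```

For example, with ready times of 10, 12 and 40 us the batched send starts at 40 us, and the two early experts accumulate 58 us of idle time between them. In this model a single slow expert delays the whole destination.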

@MaoZiming
Member Author

Current performance:

[rank 5] Dispatch bandwidth: 18.57 GB/s, avg_t=404.47 us | Combine bandwidth: 41.53 GB/s, avg_t=350.07 us
[rank 4] Dispatch bandwidth: 18.51 GB/s, avg_t=405.92 us | Combine bandwidth: 41.73 GB/s, avg_t=348.38 us
[rank 0] Dispatch bandwidth: 17.17 GB/s, avg_t=437.42 us | Combine bandwidth: 45.94 GB/s, avg_t=316.45 us
[rank 3] Dispatch bandwidth: 16.43 GB/s, avg_t=457.28 us | Combine bandwidth: 48.99 GB/s, avg_t=296.73 us
[rank 1] Dispatch bandwidth: 17.26 GB/s, avg_t=435.27 us | Combine bandwidth: 45.59 GB/s, avg_t=318.87 us
[rank 2] Dispatch bandwidth: 17.77 GB/s, avg_t=422.75 us | Combine bandwidth: 43.88 GB/s, avg_t=331.29 us
[rank 7] Dispatch bandwidth: 17.42 GB/s, avg_t=431.19 us | Combine bandwidth: 45.05 GB/s, avg_t=322.71 us
[rank 6] Dispatch bandwidth: 17.18 GB/s, avg_t=437.27 us | Combine bandwidth: 45.83 GB/s, avg_t=317.15 us
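As a sanity check on these numbers, bandwidth times average time gives the bytes moved per call. The helper below is a sketch with a hypothetical name, and it assumes GB/s means 1e9 bytes per second, which the log does not state.

```cpp
#include <cassert>

// Megabytes moved per call implied by a reported bandwidth and average time.
// Assumption: 1 GB = 1e9 bytes and 1 MB = 1e6 bytes.
double implied_megabytes(double bandwidth_gb_per_s, double avg_time_us) {
  assert(bandwidth_gb_per_s >= 0.0 && avg_time_us >= 0.0);
  // GB/s * us = (1e9 B/s) * (1e-6 s) = 1e3 B, so divide by 1e3 to get MB.
  return bandwidth_gb_per_s * avg_time_us / 1e3;
}
```

Under that assumption, rank 5's dispatch line (18.57 GB/s, 404.47 us) works out to about 7.51 MB per call and its combine line (41.53 GB/s, 350.07 us) to about 14.54 MB. The other ranks give nearly the same figures, so the spread in bandwidth across ranks tracks the spread in time rather than a difference in payload size.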

Base automatically changed from pr/batch_low_latency to main February 26, 2026 05:24

3 participants