pd nixl upgrade write mode to transfer kv #1324

Merged
hiworldwzj merged 22 commits into main from wzj_pd on Jun 7, 2026

Conversation

@hiworldwzj
Collaborator

No description provided.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces GPU timing measurements for memory page copy operations and logs the elapsed time. It also adds latency tracking for adding remote agents in the NIXL KV transporter. The feedback suggests defensively checking that both copy_start_event and copy_end_event are not None before calculating the elapsed time to prevent potential AttributeError exceptions.

Comment on lines +362 to +364
if copy_end_event is not None:
    copy_end_event.synchronize()
    read_page_gpu_time_ms = copy_start_event.elapsed_time(copy_end_event)

medium

As a defensive check, verify that both copy_end_event and copy_start_event are not None before calling elapsed_time; if copy_start_event is None, the call raises an AttributeError.

Suggested change
- if copy_end_event is not None:
-     copy_end_event.synchronize()
-     read_page_gpu_time_ms = copy_start_event.elapsed_time(copy_end_event)
+ if copy_end_event is not None and copy_start_event is not None:
+     copy_end_event.synchronize()
+     read_page_gpu_time_ms = copy_start_event.elapsed_time(copy_end_event)
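
The guarded pattern in this suggestion can also be sketched as a small helper. This is a minimal sketch, not code from the PR: `elapsed_ms_or_none` is a hypothetical name, and it assumes each argument is either None or an object exposing `synchronize()` and `elapsed_time(end)` the way a `torch.cuda.Event` created with `enable_timing=True` does.

```python
from typing import Any, Optional


def elapsed_ms_or_none(start_event: Optional[Any], end_event: Optional[Any]) -> Optional[float]:
    """Return the time in ms between two events, or None if either is missing.

    Hypothetical helper: the events are assumed to behave like
    torch.cuda.Event(enable_timing=True).
    """
    if start_event is None or end_event is None:
        return None
    # Wait on the host until the end event has completed on the device;
    # elapsed_time is only meaningful once both events have been recorded
    # and the end event has finished.
    end_event.synchronize()
    return start_event.elapsed_time(end_event)
```

Under these assumptions the call site could read `read_page_gpu_time_ms = elapsed_ms_or_none(copy_start_event, copy_end_event)`, and the logging code would skip the timing field when the result is None.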

@hiworldwzj hiworldwzj changed the title from "add log for pd nixl" to "pd nixl upgrade write mode to transfer kv" on Jun 7, 2026
@hiworldwzj
Collaborator Author

I ran the benchmark several more times. Averaged over 3 full GSM8K runs, the conclusion still holds: NIXL PD throughput is consistently higher than NCCL PD, by about 5.29% on average.

Results:

NIXL PD:
run1: 21.43 req/s
run2: 21.55 req/s
run3: 21.36 req/s
avg : 21.45 req/s
std : 0.08

NCCL PD:
run1: 20.38 req/s
run2: 20.46 req/s
run3: 20.27 req/s
avg : 20.37 req/s
std : 0.08

Comparison:

NIXL - NCCL = +1.08 req/s
Improvement relative to NCCL = +5.29%
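
The summary figures can be re-derived from the per-run numbers. The following is a minimal sketch using only the Python standard library; the variable names are illustrative, and it assumes the reported std is the population standard deviation (`statistics.pstdev`), since that is the variant that reproduces 0.08 for both configurations.

```python
from statistics import mean, pstdev

# Per-run throughput in req/s, copied from the results above.
nixl = [21.43, 21.55, 21.36]
nccl = [20.38, 20.46, 20.27]

nixl_avg, nccl_avg = mean(nixl), mean(nccl)
diff = nixl_avg - nccl_avg            # absolute gain in req/s
rel_gain_pct = diff / nccl_avg * 100  # gain relative to the NCCL baseline

print(f"NIXL avg {nixl_avg:.2f} req/s, std {pstdev(nixl):.2f}")
print(f"NCCL avg {nccl_avg:.2f} req/s, std {pstdev(nccl):.2f}")
print(f"NIXL - NCCL = {diff:+.2f} req/s ({rel_gain_pct:+.2f}%)")
```

With only three runs per configuration, the sample standard deviation (`statistics.stdev`) would come out near 0.10 for both, so the choice of estimator is worth stating alongside the numbers.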

@hiworldwzj hiworldwzj merged commit 3863844 into main Jun 7, 2026
1 check passed
@hiworldwzj hiworldwzj deleted the wzj_pd branch June 7, 2026 09:21