
[UK] p2p performance #875

Open
derekwin wants to merge 28 commits into main from uk-uccl-performance-2

Conversation

@derekwin
Collaborator

Description

In the PCIe-bottleneck scenario, I observed that the current P2P bandwidth consistently lags behind NCCL P2P. I attempted to chunk large messages (in uccl_adapter) across multiple engines for parallel transmission, but this did not raise the bandwidth ceiling. The results below compare ukernel, UCCL, and NCCL under six environment-variable settings.
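The chunking idea described above can be sketched as follows. This is a minimal illustration only: the function name `split_across_engines`, the `min_chunk` parameter, and the `(engine_id, offset, length)` layout are hypothetical and are not taken from uccl_adapter, whose actual logic is not shown in this description.

```python
from typing import List, Tuple


def split_across_engines(
    total_bytes: int, num_engines: int, min_chunk: int = 1 << 20
) -> List[Tuple[int, int, int]]:
    """Split one message into contiguous chunks, one per engine.

    Returns (engine_id, offset, length) tuples covering [0, total_bytes).
    Hypothetical helper for illustration, not the uccl_adapter code.
    """
    if total_bytes <= 0 or num_engines <= 0 or min_chunk <= 0:
        raise ValueError("total_bytes, num_engines and min_chunk must be positive")

    # Use fewer engines for small messages so no chunk drops below min_chunk.
    parts = max(1, min(num_engines, total_bytes // min_chunk))
    base, rem = divmod(total_bytes, parts)

    chunks = []
    offset = 0
    for engine_id in range(parts):
        # Spread the remainder one byte at a time over the first chunks.
        length = base + (1 if engine_id < rem else 0)
        chunks.append((engine_id, offset, length))
        offset += length
    return chunks


if __name__ == "__main__":
    for engine_id, offset, length in split_across_engines(64 << 20, 4):
        print(f"engine {engine_id}: offset={offset} length={length}")
```

If the PCIe link is the limiting resource, every chunk still crosses that same link, which would be consistent with the report that splitting across engines did not help.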

1. UCCL_CACHE_QP_MAX_MSG=0

| Size | ukernel (ms) | ukernel (GB/s) | UCCL (ms) | UCCL (GB/s) | NCCL (ms) | NCCL (GB/s) |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 1024 B | 0.068 | 0.03 | 0.157 | 0.01 | 0.078 | 0.03 |
| 4096 B | 0.095 | 0.09 | 0.137 | 0.06 | 0.079 | 0.10 |
| 16384 B | 0.189 | 0.17 | 0.156 | 0.21 | 0.087 | 0.38 |
| 65536 B | 0.237 | 0.55 | 0.213 | 0.61 | 0.152 | 0.86 |
| 262144 B | 0.411 | 1.28 | 0.435 | 1.21 | 0.369 | 1.42 |
| 1048576 B | 1.248 | 1.68 | 1.296 | 1.62 | 1.231 | 1.70 |
| 4194304 B | 5.061 | 1.66 | 5.118 | 1.64 | 4.488 | 1.87 |
| 16777216 B | 19.946 | 1.68 | 19.813 | 1.69 | 17.523 | 1.91 |
| 67108864 B | 79.261 | 1.69 | 78.974 | 1.70 | 69.650 | 1.93 |
| 268435456 B | 320.500 | 1.68 | 316.012 | 1.70 | 277.580 | 1.93 |
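As a consistency check on how the GB/s columns relate to the ms columns, the reported figures appear to match 2 × bytes / time with 1 GB = 10^9 bytes. The factor of 2 is inferred from the numbers and is not stated in this PR (it may reflect bidirectional accounting, but that is an assumption). The helper name `reported_gbps` is hypothetical.

```python
def reported_gbps(size_bytes: int, time_ms: float, factor: float = 2.0) -> float:
    """Recompute a GB/s figure from message size and elapsed time.

    factor=2.0 is an assumption inferred from the table, not documented
    behaviour of the benchmark. 1 GB is taken as 1e9 bytes.
    """
    seconds = time_ms * 1e-3
    return factor * size_bytes / seconds / 1e9


if __name__ == "__main__":
    # (size in bytes, time in ms) pairs copied from the UCCL_CACHE_QP_MAX_MSG=0 run.
    samples = {
        "ukernel 1 MiB": (1048576, 1.248),
        "ukernel 256 MiB": (268435456, 320.500),
        "NCCL 256 MiB": (268435456, 277.580),
    }
    for name, (size, ms) in samples.items():
        print(f"{name}: {reported_gbps(size, ms):.2f} GB/s")
```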

2. UCCL_CACHE_QP_MAX_MSG=65536

| Size | ukernel (ms) | ukernel (GB/s) | UCCL (ms) | UCCL (GB/s) | NCCL (ms) | NCCL (GB/s) |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 1024 B | 0.073 | 0.03 | 0.156 | 0.01 | 0.094 | 0.02 |
| 4096 B | 0.108 | 0.08 | 0.138 | 0.06 | 0.096 | 0.09 |
| 16384 B | 0.202 | 0.16 | 0.157 | 0.21 | 0.134 | 0.24 |
| 65536 B | 0.256 | 0.51 | 0.209 | 0.63 | 0.156 | 0.84 |
| 262144 B | 0.430 | 1.22 | 0.438 | 1.20 | 0.372 | 1.41 |
| 1048576 B | 1.242 | 1.69 | 1.282 | 1.64 | 1.230 | 1.70 |
| 4194304 B | 5.137 | 1.63 | 5.038 | 1.67 | 4.484 | 1.87 |
| 16777216 B | 19.873 | 1.69 | 19.871 | 1.69 | 17.501 | 1.92 |
| 67108864 B | 79.380 | 1.69 | 78.755 | 1.70 | 69.564 | 1.93 |
| 268435456 B | 317.376 | 1.69 | 316.324 | 1.70 | 276.766 | 1.94 |

3. UCCL_IB_RELAXED_ORDERING_MODE=1

| Size | ukernel (ms) | ukernel (GB/s) | UCCL (ms) | UCCL (GB/s) | NCCL (ms) | NCCL (GB/s) |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 1024 B | 0.072 | 0.03 | 0.152 | 0.01 | 0.082 | 0.03 |
| 4096 B | 0.101 | 0.08 | 0.320 | 0.03 | 0.113 | 0.07 |
| 16384 B | 0.197 | 0.17 | 0.151 | 0.22 | 0.113 | 0.29 |
| 65536 B | 0.236 | 0.56 | 0.224 | 0.59 | 0.154 | 0.85 |
| 262144 B | 0.420 | 1.25 | 0.596 | 0.88 | 0.374 | 1.40 |
| 1048576 B | 1.253 | 1.67 | 1.290 | 1.63 | 1.265 | 1.66 |
| 4194304 B | 5.253 | 1.60 | 5.115 | 1.64 | 4.429 | 1.89 |
| 16777216 B | 19.866 | 1.69 | 19.857 | 1.69 | 17.188 | 1.95 |
| 67108864 B | 79.283 | 1.69 | 78.790 | 1.70 | 68.494 | 1.96 |
| 268435456 B | 320.550 | 1.67 | 316.213 | 1.70 | 273.671 | 1.96 |

4. UCCL_IB_RELAXED_ORDERING_MODE=0

| Size | ukernel (ms) | ukernel (GB/s) | UCCL (ms) | UCCL (GB/s) | NCCL (ms) | NCCL (GB/s) |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 1024 B | 0.074 | 0.03 | 0.155 | 0.01 | 0.097 | 0.02 |
| 4096 B | 0.098 | 0.08 | 0.138 | 0.06 | 0.099 | 0.08 |
| 16384 B | 0.196 | 0.17 | 0.154 | 0.21 | 0.108 | 0.30 |
| 65536 B | 0.241 | 0.54 | 0.208 | 0.63 | 0.155 | 0.84 |
| 262144 B | 0.415 | 1.26 | 0.526 | 1.00 | 0.375 | 1.40 |
| 1048576 B | 1.257 | 1.67 | 1.613 | 1.30 | 1.230 | 1.71 |
| 4194304 B | 5.407 | 1.55 | 5.074 | 1.65 | 4.463 | 1.88 |
| 16777216 B | 19.953 | 1.68 | 20.071 | 1.67 | 17.492 | 1.92 |
| 67108864 B | 79.230 | 1.69 | 78.775 | 1.70 | 69.357 | 1.94 |
| 268435456 B | 320.293 | 1.68 | 316.145 | 1.70 | 277.250 | 1.94 |

5. UCCL_TX_INLINE_THRESHOLD=4096

| Size | ukernel (ms) | ukernel (GB/s) | UCCL (ms) | UCCL (GB/s) | NCCL (ms) | NCCL (GB/s) |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 1024 B | 0.075 | 0.03 | 0.162 | 0.01 | 0.079 | 0.03 |
| 4096 B | 0.104 | 0.08 | 0.139 | 0.06 | 0.079 | 0.10 |
| 16384 B | 0.197 | 0.17 | 0.154 | 0.21 | 0.083 | 0.40 |
| 65536 B | 0.237 | 0.55 | 0.211 | 0.62 | 0.154 | 0.85 |
| 262144 B | 0.432 | 1.21 | 0.432 | 1.21 | 0.369 | 1.42 |
| 1048576 B | 1.249 | 1.68 | 1.299 | 1.61 | 1.231 | 1.70 |
| 4194304 B | 4.970 | 1.69 | 5.046 | 1.66 | 4.473 | 1.88 |
| 16777216 B | 20.179 | 1.66 | 19.979 | 1.68 | 17.473 | 1.92 |
| 67108864 B | 79.694 | 1.68 | 78.928 | 1.70 | 69.466 | 1.93 |
| 268435456 B | 317.262 | 1.69 | 316.579 | 1.70 | 277.572 | 1.93 |

6. UCCL_TX_INLINE_THRESHOLD=8192

| Size | ukernel (ms) | ukernel (GB/s) | UCCL (ms) | UCCL (GB/s) | NCCL (ms) | NCCL (GB/s) |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 1024 B | 0.069 | 0.03 | 0.154 | 0.01 | 0.100 | 0.02 |
| 4096 B | 0.099 | 0.08 | 0.140 | 0.06 | 0.102 | 0.08 |
| 16384 B | 0.192 | 0.17 | 0.157 | 0.21 | 0.101 | 0.33 |
| 65536 B | 0.244 | 0.54 | 0.212 | 0.62 | 0.184 | 0.71 |
| 262144 B | 0.435 | 1.20 | 0.529 | 0.99 | 0.373 | 1.40 |
| 1048576 B | 1.285 | 1.63 | 1.282 | 1.64 | 1.228 | 1.71 |
| 4194304 B | 5.076 | 1.65 | 4.978 | 1.69 | 4.475 | 1.87 |
| 16777216 B | 19.977 | 1.68 | 19.856 | 1.69 | 17.463 | 1.92 |
| 67108864 B | 79.135 | 1.70 | 78.751 | 1.70 | 69.453 | 1.93 |
| 268435456 B | 320.356 | 1.68 | 316.246 | 1.70 | 276.946 | 1.94 |
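To put the gap to NCCL in one number per setting, the sketch below takes the 268435456 B row from each of the six runs above and computes how far ukernel and UCCL fall below NCCL. The GB/s values are copied from the tables; the helper name `gap_pct` is introduced here for illustration only.

```python
# (ukernel GB/s, UCCL GB/s, NCCL GB/s) at 268435456 B, copied from the tables above.
LARGEST_ROW = {
    "UCCL_CACHE_QP_MAX_MSG=0": (1.68, 1.70, 1.93),
    "UCCL_CACHE_QP_MAX_MSG=65536": (1.69, 1.70, 1.94),
    "UCCL_IB_RELAXED_ORDERING_MODE=1": (1.67, 1.70, 1.96),
    "UCCL_IB_RELAXED_ORDERING_MODE=0": (1.68, 1.70, 1.94),
    "UCCL_TX_INLINE_THRESHOLD=4096": (1.69, 1.70, 1.93),
    "UCCL_TX_INLINE_THRESHOLD=8192": (1.68, 1.70, 1.94),
}


def gap_pct(ours: float, nccl: float) -> float:
    """Percentage by which `ours` falls below the NCCL bandwidth."""
    return 100.0 * (nccl - ours) / nccl


# Map each setting to (ukernel gap %, UCCL gap %).
GAPS = {
    setting: (gap_pct(uk, nccl), gap_pct(uccl, nccl))
    for setting, (uk, uccl, nccl) in LARGEST_ROW.items()
}

if __name__ == "__main__":
    for setting, (uk_gap, uccl_gap) in GAPS.items():
        print(f"{setting}: ukernel -{uk_gap:.1f}%, UCCL -{uccl_gap:.1f}%")
```

On these numbers the gap stays in roughly the same range for every setting, which matches the observation that none of the tested environment variables closed it.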

@derekwin derekwin added the WIP Work In Progress label Apr 10, 2026
@derekwin derekwin changed the title [P2P] p2p performance [UK] p2p performance Apr 10, 2026
@derekwin derekwin force-pushed the uk-uccl-performance-2 branch 3 times, most recently from eff04b3 to 1ac24dc Compare April 11, 2026 12:54
@derekwin derekwin force-pushed the uk-uccl-performance-2 branch from 1ac24dc to d5486bf Compare April 11, 2026 12:55
@derekwin
Collaborator Author

After trying various optimization strategies without seeing any gains, I have decided to put further optimization of uccl_p2p in this PCIe-bottleneck scenario on hold for now.

@derekwin derekwin force-pushed the uk-uccl-performance-2 branch 5 times, most recently from ca5f3cd to 220a403 Compare April 13, 2026 08:26
@derekwin derekwin force-pushed the uk-uccl-performance-2 branch from 220a403 to 4856d45 Compare April 13, 2026 08:35
@derekwin derekwin force-pushed the uk-uccl-performance-2 branch 5 times, most recently from c2628c1 to 50e5221 Compare April 14, 2026 05:54
@derekwin derekwin force-pushed the uk-uccl-performance-2 branch 13 times, most recently from 3db49f9 to d5884fb Compare April 20, 2026 07:25
@derekwin derekwin force-pushed the uk-uccl-performance-2 branch 3 times, most recently from b2d7857 to 71a5272 Compare April 20, 2026 08:59
@derekwin derekwin force-pushed the uk-uccl-performance-2 branch from 71a5272 to 6501951 Compare April 20, 2026 08:59
@derekwin derekwin force-pushed the uk-uccl-performance-2 branch from 38557c3 to 59269d6 Compare April 20, 2026 11:37
@derekwin derekwin force-pushed the uk-uccl-performance-2 branch from 7507963 to 25676b1 Compare April 21, 2026 12:00
@derekwin derekwin force-pushed the uk-uccl-performance-2 branch from 886111e to f55b6fd Compare April 21, 2026 13:03
@derekwin
Collaborator Author

derekwin commented Apr 22, 2026

The changes in this PR are too extensive. I'm considering splitting it into multiple PRs to ensure maintainability.

