
[P2P] Added congestion control support (Timely, Swift)#837

Merged
YangZhou1997 merged 4 commits into uccl-project:main from andrzej-k:ak_cc_mod
May 1, 2026

Conversation

@andrzej-k
Contributor

@andrzej-k andrzej-k commented Mar 24, 2026

Description

The intention is to allow RoCE P2P to use congestion control algorithms in addition to currently supported flow control. That would be useful in environments without PFC support.

This PR adds Timely and Swift support in P2P, as part of that moved CC algorithms (Timely, Swift, EQDS) out of collectives to shared location.
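
For reference, the algorithm is selected via the UCCL_P2P_RDMA_CC environment variable shown in the runs below. A minimal sketch of what such env-based selection can look like (the CCType enum and function name are illustrative only, not the PR's actual code):

#include <cstdlib>
#include <string>

// Hypothetical enum; the PR's real type names may differ.
enum class CCType { kNone, kTimely, kSwift };

// Pick the congestion control algorithm from the environment,
// falling back to plain flow control when the variable is unset.
static CCType ccTypeFromEnv() {
  char const* v = std::getenv("UCCL_P2P_RDMA_CC");
  if (v == nullptr) return CCType::kNone;
  std::string s(v);
  if (s == "timely") return CCType::kTimely;
  if (s == "swift") return CCType::kSwift;
  return CCType::kNone;  // unknown values disable CC
}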

Test results - two node setup:

export UCCL_P2P_RDMA_CC=swift

$ torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0   --master_addr=<IP> --master_port=12355 p2p/benchmarks/benchmark_uccl.py
UCCL P2P Benchmark — mode: Standard | API: Sync | role: client
Number of key-value blocks per message: 1
Message sizes: 256 B, 1.0 KB, 4.0 KB, 16.0 KB, 64.0 KB, 256.0 KB, 1.0 MB, 10.0 MB, 16.0 MB, 100.0 MB
Device: gpu | Local GPU idx: 0 | Iterations: 10
Creating Engine with GPU index: 0
RdmaDeviceManager: Found 5 RDMA device(s)
  [0] 
  [1] 
  [2] 
  [3] 
  [4] irdma-mkp0
RdmaDeviceManager: Initialization complete
GPU 0 uses device 4 (irdma-mkp0)
System assigned port: 38717
Engine initialized for GPU 0
Endpoint initialized successfully
Attempting to connect to <IP>:0 via port 41241
Connected to <IP>:41241 (fd=63)
Accepted connection fd=64 from <IP>:57828
[Client] Connected to <IP>:41241 (GPU 0) conn_id=0
[Client]    256 B :   0.13 Gbps |   0.02 GB/s  | 0.000016 s
[Client]   1.0 KB :   0.52 Gbps |   0.07 GB/s  | 0.000016 s
[Client]   4.0 KB :   2.03 Gbps |   0.25 GB/s  | 0.000016 s
[Client]  16.0 KB :   8.03 Gbps |   1.00 GB/s  | 0.000016 s
[Client]  64.0 KB :  28.32 Gbps |   3.54 GB/s  | 0.000019 s
[Client] 256.0 KB :  76.63 Gbps |   9.58 GB/s  | 0.000027 s
[Client]   1.0 MB : 139.22 Gbps |  17.40 GB/s  | 0.000060 s
[Client]  10.0 MB : 184.99 Gbps |  23.12 GB/s  | 0.000453 s
[Client]  16.0 MB : 186.98 Gbps |  23.37 GB/s  | 0.000718 s
[Client] 100.0 MB : 186.41 Gbps |  23.30 GB/s  | 0.004500 s
[Client] Benchmark complete
Destroying Engine...
Engine destroyed

export UCCL_P2P_RDMA_CC=timely

$ torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0   --master_addr=<IP> --master_port=12355 p2p/benchmarks/benchmark_uccl.py
UCCL P2P Benchmark — mode: Standard | API: Sync | role: client
Number of key-value blocks per message: 1
Message sizes: 256 B, 1.0 KB, 4.0 KB, 16.0 KB, 64.0 KB, 256.0 KB, 1.0 MB, 10.0 MB, 16.0 MB, 100.0 MB
Device: gpu | Local GPU idx: 0 | Iterations: 10
Creating Engine with GPU index: 0
RdmaDeviceManager: Found 5 RDMA device(s)
  [0]
  [1] 
  [2] 
  [3] 
  [4] irdma-mkp0
RdmaDeviceManager: Initialization complete
GPU 0 uses device 4 (irdma-mkp0)
System assigned port: 38567
Engine initialized for GPU 0
Endpoint initialized successfully
Attempting to connect to <IP>:0 via port 34689
Connected to <IP>:34689 (fd=59)
Accepted connection fd=64 from <IP>:36800
[Client] Connected to <IP>:34689 (GPU 0) conn_id=0
[Client]    256 B :   0.13 Gbps |   0.02 GB/s  | 0.000016 s
[Client]   1.0 KB :   0.53 Gbps |   0.07 GB/s  | 0.000016 s
[Client]   4.0 KB :   2.03 Gbps |   0.25 GB/s  | 0.000016 s
[Client]  16.0 KB :   8.10 Gbps |   1.01 GB/s  | 0.000016 s
[Client]  64.0 KB :  28.11 Gbps |   3.51 GB/s  | 0.000019 s
[Client] 256.0 KB :  75.48 Gbps |   9.44 GB/s  | 0.000028 s
[Client]   1.0 MB : 140.82 Gbps |  17.60 GB/s  | 0.000060 s
[Client]  10.0 MB : 184.47 Gbps |  23.06 GB/s  | 0.000455 s
[Client]  16.0 MB : 187.26 Gbps |  23.41 GB/s  | 0.000717 s
[Client] 100.0 MB : 186.34 Gbps |  23.29 GB/s  | 0.004502 s
[Client] Benchmark complete
Destroying Engine...
Engine destroyed

unset UCCL_P2P_RDMA_CC

$ torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0   --master_addr=<IP> --master_port=12355 p2p/benchmarks/benchmark_uccl.py
UCCL P2P Benchmark — mode: Standard | API: Sync | role: client
Number of key-value blocks per message: 1
Message sizes: 256 B, 1.0 KB, 4.0 KB, 16.0 KB, 64.0 KB, 256.0 KB, 1.0 MB, 10.0 MB, 16.0 MB, 100.0 MB
Device: gpu | Local GPU idx: 0 | Iterations: 10
Creating Engine with GPU index: 0
RdmaDeviceManager: Found 5 RDMA device(s)
  [0] 
  [1]
  [2]
  [3] 
  [4] irdma-mkp0
RdmaDeviceManager: Initialization complete
GPU 0 uses device 4 (irdma-mkp0)
System assigned port: 46773
Engine initialized for GPU 0
Endpoint initialized successfully
Attempting to connect to <IP>:0 via port 36325
Connected to <IP>:36325 (fd=63)
Accepted connection fd=64 from <IP>:35360
[Client] Connected to <IP>:36325 (GPU 0) conn_id=0
[Client]    256 B :   0.13 Gbps |   0.02 GB/s  | 0.000016 s
[Client]   1.0 KB :   0.52 Gbps |   0.06 GB/s  | 0.000016 s
[Client]   4.0 KB :   2.07 Gbps |   0.26 GB/s  | 0.000016 s
[Client]  16.0 KB :   8.05 Gbps |   1.01 GB/s  | 0.000016 s
[Client]  64.0 KB :  28.73 Gbps |   3.59 GB/s  | 0.000018 s
[Client] 256.0 KB :  76.76 Gbps |   9.59 GB/s  | 0.000027 s
[Client]   1.0 MB : 141.41 Gbps |  17.68 GB/s  | 0.000059 s
[Client]  10.0 MB : 185.05 Gbps |  23.13 GB/s  | 0.000453 s
[Client]  16.0 MB : 187.25 Gbps |  23.41 GB/s  | 0.000717 s
[Client] 100.0 MB : 186.45 Gbps |  23.31 GB/s  | 0.004499 s
[Client] Benchmark complete
Destroying Engine...
Engine destroyed

Type of Change

  • Bug fix
  • New feature
  • Documentation update

How Has This Been Tested?

Include any tests here.

  • Unit tests
  • Integration tests
  • Manual testing

Checklist

  • I have run format.sh to follow the style guidelines.
  • I have run build.sh to verify compilation.
  • I have removed redundant variables and comments.
  • I have updated the documentation.
  • I have added tests.

@andrzej-k andrzej-k changed the title Allow EP and P2P to use congestion control - part 1 (extract CC algos) [P2P] Added congestion control support (Timely, Swift) Mar 30, 2026
@YangZhou1997
Member

Hi @zhongjiechen, perhaps you should review this code, as it marries EQDS into P2P. But I am not sure how we handle chunking in P2P to control the speed? cc @MaoZiming

@zhongjiechen
Member

zhongjiechen commented Apr 9, 2026

Great work! I do have the following issues (issues 2 and 3 both relate to the interaction with P2P's chunking design):

  1. Tx timestamp recording may be a bit too early? Per Swift, ideally the timestamp should be taken when the hardware actually transmits the data. If we are using a software timestamp instead, should cc_.recordSendTsc be moved closer to ibv_post_send?

  2. P2P splits messages into small chunks for a large request, but I think the current congestion window is only enforced on the entire request, not on the posted chunks?

  3. When a message is chunked into small chunks, these chunks will share the same wr_id, and the tracker records the number of chunks associated with this wr_id (see tracker_->acknowledge(cq_data.wr_id);, https://github.com/uccl-project/uccl/blob/main/p2p/rdma/seq_num.h#L243). However, it appears that cc_.onAck(cq_data.wr_id, cq_data.len); only accounts for the first completed chunk: CongestionControlState::onAck() clears the stored send timestamp after the first CQE, and later chunks are ignored by onAck() due to send_tsc = 0. (The sketch below the list illustrates this failure mode.)
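
For clarity, a schematic reconstruction of the failure mode in issue 3, assuming onAck() keys its timestamp table by wr_id and zeroes the entry on the first completion; this is an illustrative sketch, not the PR's actual CongestionControlState:

#include <cstdint>
#include <unordered_map>

struct CongestionControlStateSketch {
  std::unordered_map<uint64_t, uint64_t> send_tsc;  // wr_id -> send-time tsc

  void recordSendTsc(uint64_t wr_id, uint64_t now_tsc) {
    send_tsc[wr_id] = now_tsc;
  }

  void onAck(uint64_t wr_id, uint64_t now_tsc) {
    auto it = send_tsc.find(wr_id);
    // Chunks 2..N of the same message re-enter here with the same
    // wr_id, find send_tsc == 0, and are silently dropped.
    if (it == send_tsc.end() || it->second == 0) return;
    uint64_t rtt_tsc = now_tsc - it->second;
    it->second = 0;  // cleared after the FIRST CQE
    (void)rtt_tsc;   // a real implementation feeds this to Timely/Swift
  }
};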

@zhongjiechen
Member

My guess is that the benchmark works because it does not hit the no-argument processSendRequests() path. Instead, it goes through the processSendRequests(std::shared_ptr<RDMASendRequest> req) overload. The call chain of this benchmark is:

benchmark_uccl.py -> ep.send() -> Endpoint::send() -> uccl_send_async() -> sendWithoutInnerQueue() -> SendConnection::processSendRequests(req)

CC logic is implemented only in processSendRequests(), which is invoked only when auto_start_polling_ is set to true (false by default; see uccl/p2p/engine.cc, lines 203 to 204 at 03f1958):

ep_ = std::shared_ptr<NICEndpoint>(
    new NICEndpoint(local_gpu_idx_, INVALID_RANK_ID, 0, false));

@andrzej-k
Contributor Author

Thanks @zhongjiechen for your comments!

Yes, I agree with all observations - let me work on implementing per-chunk CC control rather than per-request control (sketched below). I would also move the tsc recording closer to the actual send time (ibv_post_send) and start the polling thread when Timely or Swift is used.
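
As a rough sketch of what per-chunk gating could look like (all names here are hypothetical; the real change will live in the polling path):

#include <cstddef>

// Illustrative per-chunk window check: instead of admitting a whole
// request at once, each chunk is posted only while the bytes in
// flight fit under the CC window.
struct ChunkPacer {
  size_t inflight_bytes = 0;

  bool canPostChunk(size_t chunk_len, size_t cwnd_bytes) const {
    return inflight_bytes + chunk_len <= cwnd_bytes;
  }

  void onChunkPosted(size_t chunk_len) { inflight_bytes += chunk_len; }
  void onChunkAcked(size_t chunk_len) { inflight_bytes -= chunk_len; }
};

The polling loop would stop posting further chunks of a request once canPostChunk() fails and resume on the next completion.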

@andrzej-k
Contributor Author

New test results:

UCCL_P2P_RDMA_CC unset

$ torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0   --master_addr=<IP> --master_port=12355 p2p/benchmarks/benchmark_uccl.py
UCCL P2P Benchmark — mode: Standard | API: Sync | role: client
Number of key-value blocks per message: 1
Message sizes: 256 B, 1.0 KB, 4.0 KB, 16.0 KB, 64.0 KB, 256.0 KB, 1.0 MB, 10.0 MB, 16.0 MB, 100.0 MB
Device: gpu | Local GPU idx: 0 | Iterations: 10
Creating Engine with GPU index: 0
RdmaDeviceManager: Found 6 RDMA device(s)
  [0] 
  [1] 
  [2] 
  [3] 
  [4] 
  [5] irdma-mkp0
RdmaDeviceManager: Initialization complete
GPU 0 uses device 5 (irdma-mkp0)
System assigned port: 44319
Engine initialized for GPU 0
Endpoint initialized successfully
Attempting to connect to <IP>:0 via port 35943
Connected to <IP>:35943 (fd=88)
Accepted connection fd=89 from <IP>:56120
[Client] Connected to <IP>:35943 (GPU 0) conn_id=0
[Client]    256 B :   0.04 Gbps |   0.01 GB/s  | 0.000047 s
[Client]   1.0 KB :   0.18 Gbps |   0.02 GB/s  | 0.000046 s
[Client]   4.0 KB :   0.70 Gbps |   0.09 GB/s  | 0.000047 s
[Client]  16.0 KB :   2.79 Gbps |   0.35 GB/s  | 0.000047 s
[Client]  64.0 KB :  10.63 Gbps |   1.33 GB/s  | 0.000049 s
[Client] 256.0 KB :  30.77 Gbps |   3.85 GB/s  | 0.000068 s
[Client]   1.0 MB :  84.69 Gbps |  10.59 GB/s  | 0.000099 s
[Client]  10.0 MB : 170.47 Gbps |  21.31 GB/s  | 0.000492 s
[Client]  16.0 MB : 177.77 Gbps |  22.22 GB/s  | 0.000755 s
[Client] 100.0 MB : 185.59 Gbps |  23.20 GB/s  | 0.004520 s
[Client] Benchmark complete
Server closed connection: <IP>:35943
Destroying Engine...
Engine destroyed


$ env | grep P2P_RDMA_CC
UCCL_P2P_RDMA_CC=timely

$ torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0   --master_addr=<IP> --master_port=12355 p2p/benchmarks/benchmark_uccl.py
UCCL P2P Benchmark — mode: Standard | API: Sync | role: client
Number of key-value blocks per message: 1
Message sizes: 256 B, 1.0 KB, 4.0 KB, 16.0 KB, 64.0 KB, 256.0 KB, 1.0 MB, 10.0 MB, 16.0 MB, 100.0 MB
Device: gpu | Local GPU idx: 0 | Iterations: 10
Creating Engine with GPU index: 0
RdmaDeviceManager: Found 6 RDMA device(s)
  [0] 
  [1] 
  [2] 
  [3] 
  [4] 
  [5] irdma-mkp0
RdmaDeviceManager: Initialization complete
GPU 0 uses device 5 (irdma-mkp0)
System assigned port: 42957
Engine initialized for GPU 0
Endpoint initialized successfully
Attempting to connect to <IP>:0 via port 46169
Connected to <IP>:46169 (fd=89)
Accepted connection fd=88 from <IP>:42114
[Client] Connected to <IP>:46169 (GPU 0) conn_id=0
[Client]    256 B :   0.05 Gbps |   0.01 GB/s  | 0.000038 s
[Client]   1.0 KB :   0.23 Gbps |   0.03 GB/s  | 0.000036 s
[Client]   4.0 KB :   0.90 Gbps |   0.11 GB/s  | 0.000036 s
[Client]  16.0 KB :   3.52 Gbps |   0.44 GB/s  | 0.000037 s
[Client]  64.0 KB :  13.46 Gbps |   1.68 GB/s  | 0.000039 s
[Client] 256.0 KB :  38.45 Gbps |   4.81 GB/s  | 0.000055 s
[Client]   1.0 MB :  98.97 Gbps |  12.37 GB/s  | 0.000085 s
[Client]  10.0 MB : 175.60 Gbps |  21.95 GB/s  | 0.000478 s
[Client]  16.0 MB : 181.03 Gbps |  22.63 GB/s  | 0.000741 s
[Client] 100.0 MB : 186.11 Gbps |  23.26 GB/s  | 0.004507 s
[Client] Benchmark complete
Destroying Engine...
Engine destroyed


$ env | grep P2P_RDMA_CC
UCCL_P2P_RDMA_CC=swift

$ torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0   --master_addr=<IP> --master_port=12355 p2p/benchmarks/benchmark_uccl.py
UCCL P2P Benchmark — mode: Standard | API: Sync | role: client
Number of key-value blocks per message: 1
Message sizes: 256 B, 1.0 KB, 4.0 KB, 16.0 KB, 64.0 KB, 256.0 KB, 1.0 MB, 10.0 MB, 16.0 MB, 100.0 MB
Device: gpu | Local GPU idx: 0 | Iterations: 10
Creating Engine with GPU index: 0
RdmaDeviceManager: Found 6 RDMA device(s)
  [0]
  [1] 
  [2]
  [3] 
  [4]
  [5] irdma-mkp0
RdmaDeviceManager: Initialization complete
GPU 0 uses device 5 (irdma-mkp0)
System assigned port: 42191
Engine initialized for GPU 0
Endpoint initialized successfully
Attempting to connect to <IP>:0 via port 33141
Connected to <IP>:33141 (fd=86)
Accepted connection fd=87 from <IP>:55612
[Client] Connected to <IP>:33141 (GPU 0) conn_id=0
[Client]    256 B :   0.05 Gbps |   0.01 GB/s  | 0.000038 s
[Client]   1.0 KB :   0.22 Gbps |   0.03 GB/s  | 0.000037 s
[Client]   4.0 KB :   0.88 Gbps |   0.11 GB/s  | 0.000037 s
[Client]  16.0 KB :   3.44 Gbps |   0.43 GB/s  | 0.000038 s
[Client]  64.0 KB :  13.16 Gbps |   1.65 GB/s  | 0.000040 s
[Client] 256.0 KB :  39.37 Gbps |   4.92 GB/s  | 0.000053 s
[Client]   1.0 MB :  98.21 Gbps |  12.28 GB/s  | 0.000085 s
[Client]  10.0 MB : 175.33 Gbps |  21.92 GB/s  | 0.000478 s
[Client]  16.0 MB : 181.75 Gbps |  22.72 GB/s  | 0.000738 s
[Client] 100.0 MB : 186.14 Gbps |  23.27 GB/s  | 0.004507 s
[Client] Benchmark complete
Destroying Engine...
Engine destroyed

@YangZhou1997
Member

@zhongjiechen, would be great if you could take a look

@zhongjiechen
Member

@zhongjiechen, would be great if you could take a look

No problem, I will take a look later


@zhongjiechen zhongjiechen left a comment


The following paragraph is generated by AI, just for reference.

Previously Raised Issues

  1. Tx timestamp too early: Fixed. cc_.recordSendTsc(tsc_id) is now right before channel->submitRequest(req) in postRequestOnChannel().

  2. CC window not enforced per-chunk: Fixed. With CC enabled, auto_start_polling_ is now true, routing through the polling thread. The polling thread checks currentInflightLimitBytes() before popping each request. The window enforcement is per-request (not per-chunk), but since all chunks within a request share a wr_id, this is the right granularity for the tracker model.

  3. Per-chunk CC acknowledgment / tsc_id reuse: Fixed. Each chunk now gets a unique tsc_id via chunk_tsc_counter_, encoded in the upper 32 bits of wr_id. Since all sends use IBV_SEND_SIGNALED, every chunk generates a CQE, and each CQE produces an independent RTT sample via cc_.onAck(tsc_id, cq_data.len). (See the pack/unpack sketch below.)
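
The pack/unpack implied by item 3 might look like this (the lower-32-bit field is an assumption; the review only states that tsc_id occupies the upper 32 bits of wr_id):

#include <cstdint>

// Per-chunk tsc_id in the upper 32 bits of wr_id; the request id in
// the lower 32 bits is inferred for illustration, not from the PR.
static inline uint64_t packWrId(uint32_t tsc_id, uint32_t req_id) {
  return (static_cast<uint64_t>(tsc_id) << 32) | req_id;
}

static inline uint32_t tscIdOf(uint64_t wr_id) {
  return static_cast<uint32_t>(wr_id >> 32);
}

static inline uint32_t reqIdOf(uint64_t wr_id) {
  return static_cast<uint32_t>(wr_id & 0xffffffffu);
}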

New Issues

  1. freq_ghz not stored as a member in CongestionControlState (cc_state.h:77)

    onAck() uses the global uccl::freq_ghz (from timer.h) instead of the freq_ghz parameter passed to the constructor. This works because timer.h defines it per-TU via static double freq_ghz = measure_rdtsc_freq(), but it is inconsistent with the constructor API that takes freq_ghz explicitly. Consider storing freq_ghz as a member:

    double freq_ghz_ = 0.0;

    and using it in onAck() (see the sketch after this list).

  2. Both timely_ and swift_ are always constructed

    The constructor initializes both TimelyCC and SwiftCC regardless of mode. Consider using std::variant or initializing only the active one, to avoid spending memory and CPU on the unused CC instance (a variant-based sketch follows the Minor note below).

  3. postWriteOrRead() and read() bypass CC entirely

    These paths call tracker_->sendPacket(req->getLocalLen()) and postChunkedRequest() directly without checking the CC window or assigning per-chunk tsc_ids. If CC should not apply to RDMA Write/Read (only to Send), a brief comment explaining this would help. Otherwise, these paths need CC integration too.

  4. Large requests can overshoot CC window

    The CC window is checked before dequeuing a request, but once a request is dequeued, all its chunks are posted at once via postChunkedRequest(). For a 100MB message with a 1MB CC window, this means 100MB of chunks are posted in one shot. The window only gates the next request. This is typical for message-level CC and probably intentional, but worth a brief comment explaining this design choice (window controls inter-message pacing, not intra-message chunking).

Minor

  • Consider adding a brief log message when CC mode is activated (e.g., in SendConnection constructor) so users can confirm their UCCL_P2P_RDMA_CC env var took effect.
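
And for new issue 2, a variant-based sketch that constructs only the active algorithm (TimelyCC/SwiftCC stand in for the real classes; their bodies and construction arguments are placeholders):

#include <variant>

struct TimelyCC { /* real class lives in the shared CC location */ };
struct SwiftCC  { /* real class lives in the shared CC location */ };

// std::monostate means congestion control is disabled; only the
// selected algorithm is ever constructed.
using ActiveCC = std::variant<std::monostate, TimelyCC, SwiftCC>;

inline ActiveCC makeCC(bool timely, bool swift) {
  if (timely) return TimelyCC{};
  if (swift) return SwiftCC{};
  return std::monostate{};
}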

@zhongjiechen
Member

zhongjiechen commented Apr 22, 2026

@andrzej-k Thanks for the great effort! After discussing with my AI and looking at the code, I've confirmed there are two issues worth addressing (new issues 3 and 4 from the last comment) before merging this PR:

  • We currently only apply CC for two-sided operations. I believe we should also apply it to WRITE (READ? not sure).
  • The CC window is enforced at the per-request level, but I think it really needs to operate at the per-chunk level to make CC work.

The other items are less important.

Thanks again for all your work on this! Feel free to let me know if you'd prefer I take these on myself as I'm quite familiar with this part of the codebase :)

cc @YangZhou1997

@YangZhou1997
Member

Agree. CC on the request level does not make sense, as a request can be arbitrarily long.

Commits:

* Moved CC algos to shared location.
* In P2P added support for Timely and Swift.

Signed-off-by: Andrzej Kuriata <andrzej.kuriata@intel.com>

* …r to the send.

Signed-off-by: Andrzej Kuriata <andrzej.kuriata@intel.com>
Signed-off-by: Andrzej Kuriata <andrzej.kuriata@intel.com>
@andrzej-k
Contributor Author

Hi @zhongjiechen thank you for review, just pushed new changes to (hopefully) address the issues.

@zhongjiechen
Member

@andrzej-k Thanks for addressing my issues!

This PR generally looks good to me. My only concern is that the current CC implementation depends on the polling thread (controlled by auto_start_polling_), which is a legacy design and may be deprecated in the future.

Given that this PR has been pending for quite a while, should we merge it first and address this in another PR?

CC @YangZhou1997

@YangZhou1997
Member

@zhongjiechen @andrzej-k let's merge it! auto_start_polling_ can be fixed later.

@YangZhou1997 YangZhou1997 merged commit fb4147a into uccl-project:main May 1, 2026
4 checks passed