
[P2P] Added congestion control support (Timely, Swift)#837

Merged
YangZhou1997 merged 4 commits into uccl-project:main from andrzej-k:ak_cc_mod
May 1, 2026

Conversation

@andrzej-k
Contributor

@andrzej-k andrzej-k commented Mar 24, 2026

Description

The intention is to allow RoCE P2P to use congestion control algorithms in addition to currently supported flow control. That would be useful in environments without PFC support.

This PR adds Timely and Swift support in P2P, as part of that moved CC algorithms (Timely, Swift, EQDS) out of collectives to shared location.
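
For reference, the algorithm is selected via the UCCL_P2P_RDMA_CC environment variable shown in the runs below. A minimal sketch of what such env-based selection can look like (the CCType enum and function name are illustrative only, not the PR's actual code):

#include <cstdlib>
#include <string>

// Hypothetical enum; the PR's real type names may differ.
enum class CCType { kNone, kTimely, kSwift };

// Pick the congestion control algorithm from the environment,
// falling back to plain flow control when the variable is unset.
static CCType ccTypeFromEnv() {
  char const* v = std::getenv("UCCL_P2P_RDMA_CC");
  if (v == nullptr) return CCType::kNone;
  std::string s(v);
  if (s == "timely") return CCType::kTimely;
  if (s == "swift") return CCType::kSwift;
  return CCType::kNone;  // unknown values disable CC
}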

Test results - two node setup:

export UCCL_P2P_RDMA_CC=swift

$ torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0   --master_addr=<IP> --master_port=12355 p2p/benchmarks/benchmark_uccl.py
UCCL P2P Benchmark — mode: Standard | API: Sync | role: client
Number of key-value blocks per message: 1
Message sizes: 256 B, 1.0 KB, 4.0 KB, 16.0 KB, 64.0 KB, 256.0 KB, 1.0 MB, 10.0 MB, 16.0 MB, 100.0 MB
Device: gpu | Local GPU idx: 0 | Iterations: 10
Creating Engine with GPU index: 0
RdmaDeviceManager: Found 5 RDMA device(s)
  [0] 
  [1] 
  [2] 
  [3] 
  [4] irdma-mkp0
RdmaDeviceManager: Initialization complete
GPU 0 uses device 4 (irdma-mkp0)
System assigned port: 38717
Engine initialized for GPU 0
Endpoint initialized successfully
Attempting to connect to <IP>:0 via port 41241
Connected to <IP>:41241 (fd=63)
Accepted connection fd=64 from <IP>:57828
[Client] Connected to <IP>:41241 (GPU 0) conn_id=0
[Client]    256 B :   0.13 Gbps |   0.02 GB/s  | 0.000016 s
[Client]   1.0 KB :   0.52 Gbps |   0.07 GB/s  | 0.000016 s
[Client]   4.0 KB :   2.03 Gbps |   0.25 GB/s  | 0.000016 s
[Client]  16.0 KB :   8.03 Gbps |   1.00 GB/s  | 0.000016 s
[Client]  64.0 KB :  28.32 Gbps |   3.54 GB/s  | 0.000019 s
[Client] 256.0 KB :  76.63 Gbps |   9.58 GB/s  | 0.000027 s
[Client]   1.0 MB : 139.22 Gbps |  17.40 GB/s  | 0.000060 s
[Client]  10.0 MB : 184.99 Gbps |  23.12 GB/s  | 0.000453 s
[Client]  16.0 MB : 186.98 Gbps |  23.37 GB/s  | 0.000718 s
[Client] 100.0 MB : 186.41 Gbps |  23.30 GB/s  | 0.004500 s
[Client] Benchmark complete
Destroying Engine...
Engine destroyed

export UCCL_P2P_RDMA_CC=timely

$ torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0   --master_addr=<IP> --master_port=12355 p2p/benchmarks/benchmark_uccl.py
UCCL P2P Benchmark — mode: Standard | API: Sync | role: client
Number of key-value blocks per message: 1
Message sizes: 256 B, 1.0 KB, 4.0 KB, 16.0 KB, 64.0 KB, 256.0 KB, 1.0 MB, 10.0 MB, 16.0 MB, 100.0 MB
Device: gpu | Local GPU idx: 0 | Iterations: 10
Creating Engine with GPU index: 0
RdmaDeviceManager: Found 5 RDMA device(s)
  [0]
  [1] 
  [2] 
  [3] 
  [4] irdma-mkp0
RdmaDeviceManager: Initialization complete
GPU 0 uses device 4 (irdma-mkp0)
System assigned port: 38567
Engine initialized for GPU 0
Endpoint initialized successfully
Attempting to connect to <IP>:0 via port 34689
Connected to <IP>:34689 (fd=59)
Accepted connection fd=64 from <IP>:36800
[Client] Connected to <IP>:34689 (GPU 0) conn_id=0
[Client]    256 B :   0.13 Gbps |   0.02 GB/s  | 0.000016 s
[Client]   1.0 KB :   0.53 Gbps |   0.07 GB/s  | 0.000016 s
[Client]   4.0 KB :   2.03 Gbps |   0.25 GB/s  | 0.000016 s
[Client]  16.0 KB :   8.10 Gbps |   1.01 GB/s  | 0.000016 s
[Client]  64.0 KB :  28.11 Gbps |   3.51 GB/s  | 0.000019 s
[Client] 256.0 KB :  75.48 Gbps |   9.44 GB/s  | 0.000028 s
[Client]   1.0 MB : 140.82 Gbps |  17.60 GB/s  | 0.000060 s
[Client]  10.0 MB : 184.47 Gbps |  23.06 GB/s  | 0.000455 s
[Client]  16.0 MB : 187.26 Gbps |  23.41 GB/s  | 0.000717 s
[Client] 100.0 MB : 186.34 Gbps |  23.29 GB/s  | 0.004502 s
[Client] Benchmark complete
Destroying Engine...
Engine destroyed

unset UCCL_P2P_RDMA_CC

$ torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0   --master_addr=<IP> --master_port=12355 p2p/benchmarks/benchmark_uccl.py
UCCL P2P Benchmark — mode: Standard | API: Sync | role: client
Number of key-value blocks per message: 1
Message sizes: 256 B, 1.0 KB, 4.0 KB, 16.0 KB, 64.0 KB, 256.0 KB, 1.0 MB, 10.0 MB, 16.0 MB, 100.0 MB
Device: gpu | Local GPU idx: 0 | Iterations: 10
Creating Engine with GPU index: 0
RdmaDeviceManager: Found 5 RDMA device(s)
  [0] 
  [1]
  [2]
  [3] 
  [4] irdma-mkp0
RdmaDeviceManager: Initialization complete
GPU 0 uses device 4 (irdma-mkp0)
System assigned port: 46773
Engine initialized for GPU 0
Endpoint initialized successfully
Attempting to connect to <IP>:0 via port 36325
Connected to <IP>:36325 (fd=63)
Accepted connection fd=64 from <IP>:35360
[Client] Connected to <IP>:36325 (GPU 0) conn_id=0
[Client]    256 B :   0.13 Gbps |   0.02 GB/s  | 0.000016 s
[Client]   1.0 KB :   0.52 Gbps |   0.06 GB/s  | 0.000016 s
[Client]   4.0 KB :   2.07 Gbps |   0.26 GB/s  | 0.000016 s
[Client]  16.0 KB :   8.05 Gbps |   1.01 GB/s  | 0.000016 s
[Client]  64.0 KB :  28.73 Gbps |   3.59 GB/s  | 0.000018 s
[Client] 256.0 KB :  76.76 Gbps |   9.59 GB/s  | 0.000027 s
[Client]   1.0 MB : 141.41 Gbps |  17.68 GB/s  | 0.000059 s
[Client]  10.0 MB : 185.05 Gbps |  23.13 GB/s  | 0.000453 s
[Client]  16.0 MB : 187.25 Gbps |  23.41 GB/s  | 0.000717 s
[Client] 100.0 MB : 186.45 Gbps |  23.31 GB/s  | 0.004499 s
[Client] Benchmark complete
Destroying Engine...
Engine destroyed

Type of Change

  • Bug fix
  • New feature
  • Documentation update

How Has This Been Tested?

Include any tests here.

  • Unit tests
  • Integration tests
  • Manual testing

Checklist

  • I have run format.sh to follow the style guidelines.
  • I have run build.sh to verify compilation.
  • I have removed redundant variables and comments.
  • I have updated the documentation.
  • I have added tests.

@andrzej-k andrzej-k changed the title Allow EP and P2P to use congestion control - part 1 (extract CC algos) [P2P] Added congestion control support (Timely, Swift) Mar 30, 2026
@YangZhou1997
Member

Hi @zhongjiechen, perhaps you should review this code, as it marries EQDS into P2P. But I am not sure how we handle chunking in P2P to control the speed? cc @MaoZiming

@zhongjiechen
Member

zhongjiechen commented Apr 9, 2026

Great work! I do have the following issues (issues 2 and 3 both relate to the interaction with P2P's chunking design):

  1. Tx timestamp recording may be a bit too early? Per Swift, ideally the timestamp should be taken when the hardware actually transmits the data. If we are using a software timestamp instead, should cc_.recordSendTsc be moved closer to ibv_post_send?

  2. P2P splits messages into small chunks for a large request, but I think the current congestion window is only enforced on the entire request, not on the posted chunks?

  3. When a message is chunked into small chunks, these chunks will share the same wr_id, and the tracker records the number of chunks associated with this wr_id (see tracker_->acknowledge(cq_data.wr_id);, https://github.com/uccl-project/uccl/blob/main/p2p/rdma/seq_num.h#L243). However, it appears that cc_.onAck(cq_data.wr_id, cq_data.len); only accounts for the first completed chunk: CongestionControlState::onAck() clears the stored send timestamp after the first CQE, and later chunks are ignored by onAck() due to send_tsc = 0. (The sketch below the list illustrates this failure mode.)
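
For clarity, a schematic reconstruction of the failure mode in issue 3, assuming onAck() keys its timestamp table by wr_id and zeroes the entry on the first completion; this is an illustrative sketch, not the PR's actual CongestionControlState:

#include <cstdint>
#include <unordered_map>

struct CongestionControlStateSketch {
  std::unordered_map<uint64_t, uint64_t> send_tsc;  // wr_id -> send-time tsc

  void recordSendTsc(uint64_t wr_id, uint64_t now_tsc) {
    send_tsc[wr_id] = now_tsc;
  }

  void onAck(uint64_t wr_id, uint64_t now_tsc) {
    auto it = send_tsc.find(wr_id);
    // Chunks 2..N of the same message re-enter here with the same
    // wr_id, find send_tsc == 0, and are silently dropped.
    if (it == send_tsc.end() || it->second == 0) return;
    uint64_t rtt_tsc = now_tsc - it->second;
    it->second = 0;  // cleared after the FIRST CQE
    (void)rtt_tsc;   // a real implementation feeds this to Timely/Swift
  }
};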

@zhongjiechen
Member

My guess is that the benchmark works because it does not hit the no-argument processSendRequests() path. Instead, it goes through the processSendRequests(std::shared_ptr<RDMASendRequest> req) overload. The call chain of this benchmark is:

benchmark_uccl.py -> ep.send() -> Endpoint::send() -> uccl_send_async() -> sendWithoutInnerQueue() -> SendConnection::processSendRequests(req)

CC logic is implemented only in processSendRequests(), which is invoked only when auto_start_polling_ is set to true (false by default; see uccl/p2p/engine.cc, lines 203 to 204 at 03f1958):

ep_ = std::shared_ptr<NICEndpoint>(
    new NICEndpoint(local_gpu_idx_, INVALID_RANK_ID, 0, false));

@andrzej-k
Contributor Author

Thanks @zhongjiechen for your comments!

Yes, I agree with all observations - let me work on implementing per-chunk CC control rather than per-request control (sketched below). I would also move the tsc recording closer to the actual send time (ibv_post_send) and start the polling thread when Timely or Swift is used.
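
As a rough sketch of what per-chunk gating could look like (all names here are hypothetical; the real change will live in the polling path):

#include <cstddef>

// Illustrative per-chunk window check: instead of admitting a whole
// request at once, each chunk is posted only while the bytes in
// flight fit under the CC window.
struct ChunkPacer {
  size_t inflight_bytes = 0;

  bool canPostChunk(size_t chunk_len, size_t cwnd_bytes) const {
    return inflight_bytes + chunk_len <= cwnd_bytes;
  }

  void onChunkPosted(size_t chunk_len) { inflight_bytes += chunk_len; }
  void onChunkAcked(size_t chunk_len) { inflight_bytes -= chunk_len; }
};

The polling loop would stop posting further chunks of a request once canPostChunk() fails and resume on the next completion.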

@andrzej-k
Contributor Author

New test results:

UCCL_P2P_RDMA_CC unset

$ torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0   --master_addr=<IP> --master_port=12355 p2p/benchmarks/benchmark_uccl.py
UCCL P2P Benchmark — mode: Standard | API: Sync | role: client
Number of key-value blocks per message: 1
Message sizes: 256 B, 1.0 KB, 4.0 KB, 16.0 KB, 64.0 KB, 256.0 KB, 1.0 MB, 10.0 MB, 16.0 MB, 100.0 MB
Device: gpu | Local GPU idx: 0 | Iterations: 10
Creating Engine with GPU index: 0
RdmaDeviceManager: Found 6 RDMA device(s)
  [0] 
  [1] 
  [2] 
  [3] 
  [4] 
  [5] irdma-mkp0
RdmaDeviceManager: Initialization complete
GPU 0 uses device 5 (irdma-mkp0)
System assigned port: 44319
Engine initialized for GPU 0
Endpoint initialized successfully
Attempting to connect to <IP>:0 via port 35943
Connected to <IP>:35943 (fd=88)
Accepted connection fd=89 from <IP>:56120
[Client] Connected to <IP>:35943 (GPU 0) conn_id=0
[Client]    256 B :   0.04 Gbps |   0.01 GB/s  | 0.000047 s
[Client]   1.0 KB :   0.18 Gbps |   0.02 GB/s  | 0.000046 s
[Client]   4.0 KB :   0.70 Gbps |   0.09 GB/s  | 0.000047 s
[Client]  16.0 KB :   2.79 Gbps |   0.35 GB/s  | 0.000047 s
[Client]  64.0 KB :  10.63 Gbps |   1.33 GB/s  | 0.000049 s
[Client] 256.0 KB :  30.77 Gbps |   3.85 GB/s  | 0.000068 s
[Client]   1.0 MB :  84.69 Gbps |  10.59 GB/s  | 0.000099 s
[Client]  10.0 MB : 170.47 Gbps |  21.31 GB/s  | 0.000492 s
[Client]  16.0 MB : 177.77 Gbps |  22.22 GB/s  | 0.000755 s
[Client] 100.0 MB : 185.59 Gbps |  23.20 GB/s  | 0.004520 s
[Client] Benchmark complete
Server closed connection: <IP>:35943
Destroying Engine...
Engine destroyed


$ env | grep P2P_RDMA_CC
UCCL_P2P_RDMA_CC=timely

$ torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0   --master_addr=<IP> --master_port=12355 p2p/benchmarks/benchmark_uccl.py
UCCL P2P Benchmark — mode: Standard | API: Sync | role: client
Number of key-value blocks per message: 1
Message sizes: 256 B, 1.0 KB, 4.0 KB, 16.0 KB, 64.0 KB, 256.0 KB, 1.0 MB, 10.0 MB, 16.0 MB, 100.0 MB
Device: gpu | Local GPU idx: 0 | Iterations: 10
Creating Engine with GPU index: 0
RdmaDeviceManager: Found 6 RDMA device(s)
  [0] 
  [1] 
  [2] 
  [3] 
  [4] 
  [5] irdma-mkp0
RdmaDeviceManager: Initialization complete
GPU 0 uses device 5 (irdma-mkp0)
System assigned port: 42957
Engine initialized for GPU 0
Endpoint initialized successfully
Attempting to connect to <IP>:0 via port 46169
Connected to <IP>:46169 (fd=89)
Accepted connection fd=88 from <IP>:42114
[Client] Connected to <IP>:46169 (GPU 0) conn_id=0
[Client]    256 B :   0.05 Gbps |   0.01 GB/s  | 0.000038 s
[Client]   1.0 KB :   0.23 Gbps |   0.03 GB/s  | 0.000036 s
[Client]   4.0 KB :   0.90 Gbps |   0.11 GB/s  | 0.000036 s
[Client]  16.0 KB :   3.52 Gbps |   0.44 GB/s  | 0.000037 s
[Client]  64.0 KB :  13.46 Gbps |   1.68 GB/s  | 0.000039 s
[Client] 256.0 KB :  38.45 Gbps |   4.81 GB/s  | 0.000055 s
[Client]   1.0 MB :  98.97 Gbps |  12.37 GB/s  | 0.000085 s
[Client]  10.0 MB : 175.60 Gbps |  21.95 GB/s  | 0.000478 s
[Client]  16.0 MB : 181.03 Gbps |  22.63 GB/s  | 0.000741 s
[Client] 100.0 MB : 186.11 Gbps |  23.26 GB/s  | 0.004507 s
[Client] Benchmark complete
Destroying Engine...
Engine destroyed


$ env | grep P2P_RDMA_CC
UCCL_P2P_RDMA_CC=swift

$ torchrun --nnodes=2 --nproc_per_node=1 --node-rank=0   --master_addr=<IP> --master_port=12355 p2p/benchmarks/benchmark_uccl.py
UCCL P2P Benchmark — mode: Standard | API: Sync | role: client
Number of key-value blocks per message: 1
Message sizes: 256 B, 1.0 KB, 4.0 KB, 16.0 KB, 64.0 KB, 256.0 KB, 1.0 MB, 10.0 MB, 16.0 MB, 100.0 MB
Device: gpu | Local GPU idx: 0 | Iterations: 10
Creating Engine with GPU index: 0
RdmaDeviceManager: Found 6 RDMA device(s)
  [0]
  [1] 
  [2]
  [3] 
  [4]
  [5] irdma-mkp0
RdmaDeviceManager: Initialization complete
GPU 0 uses device 5 (irdma-mkp0)
System assigned port: 42191
Engine initialized for GPU 0
Endpoint initialized successfully
Attempting to connect to <IP>:0 via port 33141
Connected to <IP>:33141 (fd=86)
Accepted connection fd=87 from <IP>:55612
[Client] Connected to <IP>:33141 (GPU 0) conn_id=0
[Client]    256 B :   0.05 Gbps |   0.01 GB/s  | 0.000038 s
[Client]   1.0 KB :   0.22 Gbps |   0.03 GB/s  | 0.000037 s
[Client]   4.0 KB :   0.88 Gbps |   0.11 GB/s  | 0.000037 s
[Client]  16.0 KB :   3.44 Gbps |   0.43 GB/s  | 0.000038 s
[Client]  64.0 KB :  13.16 Gbps |   1.65 GB/s  | 0.000040 s
[Client] 256.0 KB :  39.37 Gbps |   4.92 GB/s  | 0.000053 s
[Client]   1.0 MB :  98.21 Gbps |  12.28 GB/s  | 0.000085 s
[Client]  10.0 MB : 175.33 Gbps |  21.92 GB/s  | 0.000478 s
[Client]  16.0 MB : 181.75 Gbps |  22.72 GB/s  | 0.000738 s
[Client] 100.0 MB : 186.14 Gbps |  23.27 GB/s  | 0.004507 s
[Client] Benchmark complete
Destroying Engine...
Engine destroyed

@YangZhou1997
Member

@zhongjiechen, would be great if you could take a look

@zhongjiechen
Member

@zhongjiechen, would be great if you could take a look

No problem, I will take a look later


@zhongjiechen zhongjiechen left a comment


The following paragraph is generated by AI, just for reference.

Previously Raised Issues

  1. Tx timestamp too early: Fixed. cc_.recordSendTsc(tsc_id) is now right before channel->submitRequest(req) in postRequestOnChannel().

  2. CC window not enforced per-chunk: Fixed. With CC enabled, auto_start_polling_ is now true, routing through the polling thread. The polling thread checks currentInflightLimitBytes() before popping each request. The window enforcement is per-request (not per-chunk), but since all chunks within a request share a wr_id, this is the right granularity for the tracker model.

  3. Per-chunk CC acknowledgment / tsc_id reuse: Fixed. Each chunk now gets a unique tsc_id via chunk_tsc_counter_, encoded in the upper 32 bits of wr_id. Since all sends use IBV_SEND_SIGNALED, every chunk generates a CQE, and each CQE produces an independent RTT sample via cc_.onAck(tsc_id, cq_data.len). (See the pack/unpack sketch below.)
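
The pack/unpack implied by item 3 might look like this (the lower-32-bit field is an assumption; the review only states that tsc_id occupies the upper 32 bits of wr_id):

#include <cstdint>

// Per-chunk tsc_id in the upper 32 bits of wr_id; the request id in
// the lower 32 bits is inferred for illustration, not from the PR.
static inline uint64_t packWrId(uint32_t tsc_id, uint32_t req_id) {
  return (static_cast<uint64_t>(tsc_id) << 32) | req_id;
}

static inline uint32_t tscIdOf(uint64_t wr_id) {
  return static_cast<uint32_t>(wr_id >> 32);
}

static inline uint32_t reqIdOf(uint64_t wr_id) {
  return static_cast<uint32_t>(wr_id & 0xffffffffu);
}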

New Issues

  1. freq_ghz not stored as a member in CongestionControlState (cc_state.h:77)

    onAck() uses the global uccl::freq_ghz (from timer.h) instead of the freq_ghz parameter passed to the constructor. This works because timer.h defines it per-TU via static double freq_ghz = measure_rdtsc_freq(), but it is inconsistent with the constructor API that takes freq_ghz explicitly. Consider storing freq_ghz as a member:

    double freq_ghz_ = 0.0;

    and using it in onAck() (see the sketch after this list).

  2. Both timely_ and swift_ are always constructed

    The constructor initializes both TimelyCC and SwiftCC regardless of mode. Consider using std::variant or initializing only the active one, to avoid spending memory and CPU on the unused CC instance (a variant-based sketch follows the Minor note below).

  3. postWriteOrRead() and read() bypass CC entirely

    These paths call tracker_->sendPacket(req->getLocalLen()) and postChunkedRequest() directly without checking the CC window or assigning per-chunk tsc_ids. If CC should not apply to RDMA Write/Read (only to Send), a brief comment explaining this would help. Otherwise, these paths need CC integration too.

  4. Large requests can overshoot CC window

    The CC window is checked before dequeuing a request, but once a request is dequeued, all its chunks are posted at once via postChunkedRequest(). For a 100MB message with a 1MB CC window, this means 100MB of chunks are posted in one shot. The window only gates the next request. This is typical for message-level CC and probably intentional, but worth a brief comment explaining this design choice (window controls inter-message pacing, not intra-message chunking).

Minor

  • Consider adding a brief log message when CC mode is activated (e.g., in SendConnection constructor) so users can confirm their UCCL_P2P_RDMA_CC env var took effect.
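
And for new issue 2, a variant-based sketch that constructs only the active algorithm (TimelyCC/SwiftCC stand in for the real classes; their bodies and construction arguments are placeholders):

#include <variant>

struct TimelyCC { /* real class lives in the shared CC location */ };
struct SwiftCC  { /* real class lives in the shared CC location */ };

// std::monostate means congestion control is disabled; only the
// selected algorithm is ever constructed.
using ActiveCC = std::variant<std::monostate, TimelyCC, SwiftCC>;

inline ActiveCC makeCC(bool timely, bool swift) {
  if (timely) return TimelyCC{};
  if (swift) return SwiftCC{};
  return std::monostate{};
}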

@zhongjiechen
Member

zhongjiechen commented Apr 22, 2026

@andrzej-k Thanks for the great effort! After discussing with my AI and looking at the code, I've confirmed there are two issues worth addressing (new issues 3 and 4 from the last comment) before merging this PR:

  • We currently only apply CC for two-sided operations. I believe we should also apply it to WRITE (READ? not sure).
  • The CC window is enforced at the per-request level, but I think it really needs to operate at the per-chunk level to make CC work.

The other items are less important.

Thanks again for all your work on this! Feel free to let me know if you'd prefer I take these on myself as I'm quite familiar with this part of the codebase :)

cc @YangZhou1997

@YangZhou1997
Member

Agree. CC on the request level does not make sense, as a request can be arbitrarily long.

Commits:

* Moved CC algos to shared location.
* In P2P added support for Timely and Swift.

Signed-off-by: Andrzej Kuriata <andrzej.kuriata@intel.com>

* …r to the send.

Signed-off-by: Andrzej Kuriata <andrzej.kuriata@intel.com>
Signed-off-by: Andrzej Kuriata <andrzej.kuriata@intel.com>
@andrzej-k
Contributor Author

Hi @zhongjiechen thank you for review, just pushed new changes to (hopefully) address the issues.

@zhongjiechen
Member

@andrzej-k Thanks for addressing my issues!

This PR generally looks good to me. My only concern is that the current CC implementation depends on the polling thread (controlled by auto_start_polling_), which is a legacy design and may be deprecated in the future.

Given that this PR has been pending for quite a while, should we merge it first and address this in another PR?

CC @YangZhou1997

@YangZhou1997
Member

@zhongjiechen @andrzej-k let's merge it! auto_start_polling_ can be fixed later.

@YangZhou1997 YangZhou1997 merged commit fb4147a into uccl-project:main May 1, 2026
4 checks passed