[P2P] Added congestion control support (Timely, Swift) #837
YangZhou1997 merged 4 commits into uccl-project:main
Conversation
Hi @zhongjiechen, perhaps you should review this code, as it marries EQDS into p2p. But I am not sure how we handle chunking in p2p to control the speed? cc @MaoZiming
Great work! I do have the following issues (issues 2 and 3 both correspond to the interaction with p2p's chunking design):
My guess is that the benchmark works because it does not hit the CC logic, which is implemented only in Lines 203 to 204 in 03f1958.
Thanks @zhongjiechen for your comments! Yes, I agree with all observations - let me work on implementing per-chunk CC control rather than per whole request. I would also move tsc recording closer to the actual send time (ibv_post_send) and start the polling thread if using Timely or Swift.
New test results:
@zhongjiechen, it would be great if you could take a look.
No problem, I will take a look later |
The following paragraph is generated by AI, just for reference.
Previously Raised Issues

1. Tx timestamp too early: Fixed. `cc_.recordSendTsc(tsc_id)` is now right before `channel->submitRequest(req)` in `postRequestOnChannel()`.
2. CC window not enforced per-chunk: Fixed. With CC enabled, `auto_start_polling_` is now `true`, routing through the polling thread. The polling thread checks `currentInflightLimitBytes()` before popping each request. The window enforcement is per-request (not per-chunk), but since all chunks within a request share a `wr_id`, this is the right granularity for the tracker model.
3. Per-chunk CC acknowledgment / tsc_id reuse: Fixed. Each chunk now gets a unique `tsc_id` via `chunk_tsc_counter_`, encoded in the upper 32 bits of `wr_id`. Since all sends use `IBV_SEND_SIGNALED`, every chunk generates a CQE, and each CQE produces an independent RTT sample via `cc_.onAck(tsc_id, cq_data.len)`.
New Issues

4. `freq_ghz` not stored as a member in `CongestionControlState` (cc_state.h:77): `onAck()` uses the global `uccl::freq_ghz` (from `timer.h`) instead of the `freq_ghz` parameter passed to the constructor. This works because `timer.h` defines it per-TU via `static double freq_ghz = measure_rdtsc_freq()`, but it's inconsistent with the constructor API that takes `freq_ghz` explicitly. Consider storing `freq_ghz` as a member (`double freq_ghz_ = 0.0;`) and using it in `onAck()`.
5. Both `timely_` and `swift_` always constructed: The constructor initializes both `TimelyCC` and `SwiftCC` regardless of mode. Consider using `std::variant` or only initializing the active one to avoid wasting memory and CPU on the unused CC instance's initialization.
6. `postWriteOrRead()` and `read()` bypass CC entirely: These paths call `tracker_->sendPacket(req->getLocalLen())` and `postChunkedRequest()` directly without checking the CC window or assigning per-chunk tsc_ids. If CC should not apply to RDMA Write/Read (only to Send), a brief comment explaining this would help. Otherwise, these paths need CC integration too.
7. Large requests can overshoot the CC window: The CC window is checked before dequeuing a request, but once a request is dequeued, all its chunks are posted at once via `postChunkedRequest()`. For a 100MB message with a 1MB CC window, this means 100MB of chunks are posted in one shot; the window only gates the next request. This is typical for message-level CC and probably intentional, but worth a brief comment explaining this design choice (the window controls inter-message pacing, not intra-message chunking).
Minor

- Consider adding a brief log message when CC mode is activated (e.g., in the `SendConnection` constructor) so users can confirm their `UCCL_P2P_RDMA_CC` env var took effect.
@andrzej-k Thanks for the great effort! After discussing with my AI and looking at the code, I've confirmed there are two issues worth addressing (6 and 7 from the last comment) before merging this PR:
The other items are less important. Thanks again for all your work on this! Feel free to let me know if you'd prefer I take these on myself, as I'm quite familiar with this part of the codebase :)
Agree. CC on the request level does not make sense, as a request can be arbitrarily long.
* Moved CC algos to shared location.
* In P2P added support for Timely and Swift.

Signed-off-by: Andrzej Kuriata <andrzej.kuriata@intel.com>
…r to the send. Signed-off-by: Andrzej Kuriata <andrzej.kuriata@intel.com>
Signed-off-by: Andrzej Kuriata <andrzej.kuriata@intel.com>
Hi @zhongjiechen, thank you for the review, just pushed new changes to (hopefully) address the issues.
@andrzej-k Thanks for addressing my issues! This PR generally looks good to me. My only concern is that the current CC implementation depends on the polling thread (controlled by `auto_start_polling_`). Given that this PR has been pending for quite a while, should we merge it first and address this in another PR?
@zhongjiechen @andrzej-k let's merge it! |
Description
The intention is to allow RoCE P2P to use congestion control algorithms in addition to the currently supported flow control. That would be useful in environments without PFC support.
This PR adds Timely and Swift support in P2P; as part of that, the CC algorithms (Timely, Swift, EQDS) were moved out of collectives to a shared location.
Test results - two node setup:
Type of Change
How Has This Been Tested?
Include any tests here.
Checklist
- `format.sh` to follow the style guidelines.
- `build.sh` to verify compilation.