[UK] Improve IPC performance #876
Conversation
Pull request overview
Improves UKernel IPC performance by reducing per-request control-plane overhead and enabling stable, versioned, direct IPC metadata paths (plus a new per-peer completion fast path).
Changes:
- Cache local IPC exports by allocation/range and deduplicate steady-state IPC buffer publication (versioned) to avoid repeated handle export and exchanger writes.
- Add a per-peer `done_seq` POSIX SHM completion channel and reduce worker/control-path latency (split CVs, spin-before-sleep, bypass the multi-request wait path for single IPC).
- Update C++/Python benchmarks and Python bindings/build to exercise the direct IPC path and support ROCm builds.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| experimental/ukernel/src/transport/memory/ipc_manager.h | Track allocation size and add range-based local cache hit helper. |
| experimental/ukernel/src/transport/memory/ipc_manager.cc | Reuse exported IPC handles for sub-slices via range checks; update local lookup/delete to avoid address-range queries. |
| experimental/ukernel/src/transport/communicator.h | Add helper for consistent local IPC export; track published IPC buffers; add eager-open hook. |
| experimental/ukernel/src/transport/communicator.cc | Deduplicate IPC buffer publication, add eager remote IPC open, and add fast-path single-request wait handling. |
| experimental/ukernel/src/transport/adapter/ipc_adapter.h | Introduce per-peer done channel structs; split send/recv condition variables; store ring namespace. |
| experimental/ukernel/src/transport/adapter/ipc_adapter.cc | Implement done_seq POSIX SHM channel, reduce wakeup latency, and use done_seq for completion signaling with fallback to legacy ACK. |
| experimental/ukernel/py/ukernel_p2p.cpp | Expose direct IPC send and IPC buffer notify/wait APIs to Python. |
| experimental/ukernel/py/setup.py | Add ROCm-aware build logic and dependency/linking adjustments. |
| experimental/ukernel/py/bench_p2p.py | Update Python benchmark to publish/version IPC recv buffers and use direct IPC sends. |
| experimental/ukernel/benchmarks/bench_transport.cc | Publish/version IPC recv buffers at setup and prefetch remote IPC metadata for direct-path benchmarking. |
```cpp
  IPCItem opened = state;
  opened.direct_ptr = direct_ptr;
  opened.ipc_id = state.ipc_id;
  (void)ipc_manager_.register_remote_ipc(remote_rank, opened);
}
```
try_open_remote_ipc_buffer() re-registers an IPCItem containing an opened direct_ptr via ipc_manager_.register_remote_ipc(...). When a newer binding_version/handle is later registered for the same (rank, ipc_id) (e.g., via wait_ipc_buffer/fetch_ipc_buffer), the previous cached entry can be overwritten without calling gpuIpcCloseMemHandle on the existing direct_ptr, leaking VA mappings. Consider updating IPCManager::register_remote_ipc (or adding an explicit update API) to detect replacement of an existing cached item and close any previous direct_ptr before overwriting, and/or preserve the existing direct_ptr when handle+device are unchanged.
```diff
       << creq->id << " match_seq " << creq->match_seq << std::endl;
     return false;
   }
-  std::this_thread::sleep_for(std::chrono::milliseconds(1));
+  std::this_thread::yield(); // no sleep — yield to other threads only
 }
```
These control-plane polling loops now call std::this_thread::yield() with no backoff. Given kIpcControlTimeoutMs is very large, this can burn a full CPU core per in-flight recv while waiting for peer progress (especially on slow/error paths). Consider adding a small timed sleep/backoff after some spins (e.g., sleep for ~50–100µs after N iterations) or using an adaptive strategy so long waits don’t busy-spin indefinitely.
```diff
       << std::endl;
     return false;
   }
-  std::this_thread::sleep_for(std::chrono::milliseconds(1));
+  std::this_thread::yield(); // no sleep
 }
```
Same as above: this second wait loop also yields with no backoff, which can cause sustained high CPU usage for up to the full control timeout. Adding a bounded backoff (or reusing kIpcControlPollTimeoutMs with a small sleep) would reduce operational impact without regressing latency for short waits.
```cpp
if (done_ch.remote_ptr) {
  if (done_ch.remote_ptr->load(std::memory_order_acquire) >=
      creq->match_seq) {
    return true;
  }
} else {
  // Fallback when done_seq channel is unavailable: poll SHM ring ack.
  uint32_t status = 0;
  int ack_result = wait_sender_ack(/*timeout_ms=*/0, &status);
  if (ack_result < 0) return false;
  if (ack_result > 0) return true;
}
```
The new done_seq fast path is a substantial behavioral change (per-peer shared-memory channel + fallback to SHM ring ACK). There don’t appear to be any unit/integration tests covering IPC adapter control-path behavior, so regressions (e.g., fallback when SHM setup fails, monotonic seq handling, shutdown/teardown) may be hard to catch. Consider adding a transport unit test that runs two IpcAdapters in-process (or via fork) and validates: (1) done_seq completion works, and (2) disabling/teardown correctly falls back to the legacy ACK path without deadlock.
Suggested change:

```diff
-if (done_ch.remote_ptr) {
-  if (done_ch.remote_ptr->load(std::memory_order_acquire) >=
-      creq->match_seq) {
-    return true;
-  }
-} else {
-  // Fallback when done_seq channel is unavailable: poll SHM ring ack.
-  uint32_t status = 0;
-  int ack_result = wait_sender_ack(/*timeout_ms=*/0, &status);
-  if (ack_result < 0) return false;
-  if (ack_result > 0) return true;
-}
+if (done_ch.remote_ptr &&
+    done_ch.remote_ptr->load(std::memory_order_acquire) >=
+        creq->match_seq) {
+  return true;
+}
+// Always poll the legacy SHM ring ack path as a non-blocking fallback.
+// This avoids deadlock if the done_seq fast path was provisioned but
+// becomes unavailable during shutdown/teardown or falls back after setup.
+uint32_t status = 0;
+int ack_result = wait_sender_ack(/*timeout_ms=*/0, &status);
+if (ack_result < 0) return false;
+if (ack_result > 0) return true;
```
Problem
UKernel IPC was 5–7× slower than UCCL on small messages and 2–3× slower on large messages. The root cause was a per-request three-message control round-trip through the SHM ring exchanger (`ipc_cache_req` → `ipc_cache` → `ack`), plus worker threads that could sleep for up to 1 ms waiting for each step.
What This PR Does
The stack eliminates most of the per-request control-plane overhead by moving
the IPC path toward stable pre-published metadata.
1. Cache IPC handle exports by allocation, not by exact pointer

Files: `src/transport/memory/ipc_manager.{cc,h}`, `src/transport/communicator.{cc,h}`

- `create_local_ipc` now caches exported handles by allocation base address and checks by range, so multiple sub-slice requests from the same GPU allocation reuse the same exported handle without calling `gpuMemGetAddressRange`/`gpuIpcGetMemHandle` again.
- An `export_local_ipc_buffer` helper is used consistently by both the transport and the recv-side IPC cache path, replacing duplicated handle-export logic.
- Remote IPC handles are eagerly opened (`gpuIpcOpenMemHandle`) before timed iterations begin, so the first timed send does not pay the open cost.
- `irecv()` deduplicates steady-state metadata publication: if the same buffer address, size, and binding version were already published, the exchanger write is skipped.
- Intra-process peers skip handle export/import entirely; the raw GPU virtual address is passed directly, matching UCCL's intra-process shortcut.
2. Reduce control-path latency in the IPC adapter

Files: `src/transport/adapter/ipc_adapter.{cc,h}`

- The send and recv workers now wait on separate condition variables and counters. Previously both workers shared one CV, so a send wakeup could spuriously wake the recv thread and vice versa.
- Workers spin briefly (`yield` iterations) before sleeping on the CV, reducing wakeup latency for bursts.
- `binding_version` is threaded from `isend` into the IPC send path so the direct metadata lookup is versioned rather than "latest wins".
- `wait_finish` for single IPC requests bypasses the generic multi-request completion path and spins directly on the request slot state.
3. Wire the C++ benchmark to the direct IPC path

File: `benchmarks/bench_transport.cc`

- The benchmark previously never exercised the `mem_id` direct path; it fell back to the `ipc_cache_req`/`cache` handshake on every iteration.
- It now publishes versioned IPC recv buffers and prefetches the peer's metadata before the timed phase starts, so timed iterations use the direct path.
4. Python benchmark and ROCm build

Files: `py/ukernel_p2p.cpp`, `py/bench_p2p.py`, `py/setup.py`

- Expose `notify_ipc_tensor`, `wait_ipc_buffer`, `isend_direct`, and `send_direct` to Python so `bench_p2p.py` can exercise the direct IPC path.
- `bench_p2p.py` now exchanges pinned MR IDs, publishes versioned recv buffers, prefetches peer metadata, and uses `send_direct` on the IPC path.
- `setup.py` gains ROCm-aware build logic and dependency/linking adjustments.

5. Per-peer done_seq completion channel

Files: `src/transport/adapter/ipc_adapter.{cc,h}`

- A per-peer POSIX SHM segment carries two completion counters, `done_seq_lo_to_hi` and `done_seq_hi_to_lo`, plus an `opener_ready` flag; the opener confirms the mapping by setting `opener_ready = 1` before either side enables the `done_seq` fast path. If the peer never confirms readiness, the creator tears the channel down and prints a warning so both sides consistently fall back to the legacy SHM ring ACK path.
- `send_one()` writes `done_seq` instead of sending a direct ACK.
- `recv_one()` polls `done_seq` first, then falls back to the existing `ipc_cache_req`/`cache` and relay handling.

Benchmark Results
Environment: GPU 2 (NUMA 0) ↔ GPU 6 (NUMA 1), two separate processes,
ROCm, AMD MI325X.
C++ bench_transport — IPC direct path sweep

10 iterations / 2 warmup, `--transport ipc --ipc-path auto`.

C++ bench_transport — done_seq anchor

50 iterations / 5 warmup, `--transport ipc --ipc-path auto`.

The first `done_seq` version timed out in the 1 KB case, but the current `opener_ready` handshake version now completes cleanly after a forced rebuild.
Python bench_p2p — UKernel IPC vs UCCL vs NCCL/RCCL
Relay path (regression anchor)
50 iterations / 5 warmup, `--transport ipc --ipc-path relay`.
modified).
Build