
[UK] Improve IPC performance #876

Merged
DanielDanyang merged 22 commits into uccl-project:main from DanielDanyang:ipc-opt
Apr 22, 2026

Conversation

@DanielDanyang
Collaborator

@DanielDanyang DanielDanyang commented Apr 10, 2026

Problem

UKernel IPC was 5–7× slower than UCCL on small messages and 2–3× slower on
large messages. The root cause was a per-request three-message control
round-trip through the SHM ring exchanger (ipc_cache_req → ipc_cache → ack)
plus worker threads that could sleep for up to 1 ms waiting for each step.

What This PR Does

This PR eliminates most of the per-request control-plane overhead by moving
the IPC path to stable, pre-published metadata.

1. Cache IPC handle exports by allocation, not by exact pointer

Files: src/transport/memory/ipc_manager.{cc,h},
src/transport/communicator.{cc,h}

  • create_local_ipc now caches exported handles by allocation base address and
    checks by range, so multiple sub-slice requests from the same GPU allocation
    reuse the same exported handle without calling gpuMemGetAddressRange /
    gpuIpcGetMemHandle again.
  • A shared export_local_ipc_buffer helper is used consistently by both the
    transport and the recv-side IPC cache path, replacing duplicated
    handle-export logic.
  • Remote versioned IPC metadata can be prefetched during setup and
    eagerly opened (gpuIpcOpenMemHandle) before timed iterations begin,
    so the first timed send does not pay the open cost.
  • irecv() deduplicates steady-state metadata publication: if the same buffer
    address, size, and binding version were already published, the exchanger write
    is skipped.
  • Same-process transfers (sender and receiver in the same OS process) skip IPC
    handle export/import entirely; the raw GPU virtual address is passed directly,
    matching UCCL's intra-process shortcut.

2. Reduce control-path latency in the IPC adapter

Files: src/transport/adapter/ipc_adapter.{cc,h}

  • Recv-side control polling sleep reduced from 1 ms to 50 µs.
  • Send and recv workers now use separate condition variables and pending
    counters. Previously both workers shared one CV, so a send wakeup could
    spuriously wake the recv thread and vice versa.
  • Workers now do a short spin (256 yield iterations) before sleeping on the
    CV, reducing wakeup latency for bursts.
  • binding_version is threaded from isend into the IPC send path so the
    direct metadata lookup is versioned rather than "latest wins".
  • wait_finish for single IPC requests bypasses the generic multi-request
    completion path and spins directly on the request slot state.

3. Wire the C++ benchmark to the direct IPC path

File: benchmarks/bench_transport.cc

  • Before this PR the C++ benchmark never exercised the by-mem_id direct path;
    it fell back to the ipc_cache_req/cache handshake on every iteration.
  • The benchmark now publishes stable versioned IPC recv buffers at setup time
    and prefetches the peer's metadata before the timed phase starts, so timed
    iterations use the direct path.

4. Python benchmark and ROCm build

Files: py/ukernel_p2p.cpp, py/bench_p2p.py, py/setup.py

  • Exposed notify_ipc_tensor, wait_ipc_buffer, isend_direct, send_direct
    to Python so bench_p2p.py can exercise the direct IPC path.
  • bench_p2p.py now exchanges pinned MR IDs, publishes versioned recv buffers,
    prefetches peer metadata, and uses send_direct on the IPC path.
  • Fixed ROCm setup.py.

5. Per-peer done_seq completion channel

Files: src/transport/adapter/ipc_adapter.{cc,h}

  • Added a per-peer POSIX SHM channel with:
    • done_seq_lo_to_hi
    • done_seq_hi_to_lo
    • opener_ready
  • Lower rank creates the SHM region before the IPC peer handshake.
  • Higher rank opens the region after the handshake and stores
    opener_ready = 1.
  • Lower rank waits for that ready signal before activating the done_seq
    fast path. If the peer never confirms readiness, the creator tears the
    channel down and prints a warning so both sides consistently fall back to
    the legacy SHM ring ACK path.
  • On the fast path:
    • send_one() writes done_seq instead of sending a direct ACK.
    • recv_one() polls done_seq first, then falls back to the existing
      ipc_cache_req/cache and relay handling.

Benchmark Results

Environment: GPU 2 (NUMA 0) ↔ GPU 6 (NUMA 1), two separate processes,
ROCm, AMD MI325X.

C++ bench_transport — IPC direct path sweep

10 iterations / 2 warmup, --transport ipc --ipc-path auto.

| Size | p50 (µs) | Uni (GB/s) |
|---|---|---|
| 1 KB | 258.13 | 0.06 |
| 4 KB | 186.10 | 0.23 |
| 16 KB | 250.02 | 0.84 |
| 64 KB | 336.91 | 3.69 |
| 256 KB | 338.33 | 11.75 |
| 1 MB | 357.30 | 21.60 |
| 4 MB | 255.95 | 39.53 |
| 16 MB | 795.26 | 42.69 |
| 64 MB | 2808.17 | 44.65 |
| 256 MB | 11086.86 | 45.24 |

C++ bench_transport — done_seq anchor

50 iterations / 5 warmup, --transport ipc --ipc-path auto.

| Size | p50 (µs) | p99 (µs) | Uni (GB/s) |
|---|---|---|---|
| 1 KB | 40.88 | 195.51 | 0.04 |
| 4 KB | 39.53 | 279.06 | 0.24 |
| 64 KB | 49.12 | 435.46 | 2.96 |
| 1 MB | 81.89 | 389.30 | 28.91 |
| 16 MB | 738.23 | 1166.28 | 43.30 |

The first done_seq version timed out at 1 KB, but the current opener_ready
handshake version completes cleanly after a forced rebuild.

Python bench_p2p — UKernel IPC vs UCCL vs NCCL/RCCL

| Size | UKernel (ms) | UKernel (GB/s) | UCCL (ms) | UCCL (GB/s) | NCCL (ms) | NCCL (GB/s) |
|---|---|---|---|---|---|---|
| 1024 B | 0.047 | 0.04 | 0.039 | 0.05 | 0.108 | 0.02 |
| 4096 B | 0.051 | 0.16 | 0.039 | 0.21 | 0.091 | 0.09 |
| 16384 B | 0.052 | 0.62 | 0.039 | 0.83 | 0.092 | 0.36 |
| 65536 B | 0.047 | 2.78 | 0.030 | 4.34 | 0.092 | 1.42 |
| 262144 B | 0.057 | 9.19 | 0.041 | 12.94 | 0.091 | 5.76 |
| 1048576 B | 0.098 | 21.47 | 0.072 | 28.99 | 0.117 | 17.92 |
| 4194304 B | 0.246 | 34.15 | 0.254 | 32.97 | 0.276 | 30.37 |
| 16777216 B | 0.737 | 45.54 | 0.779 | 43.06 | 0.759 | 44.23 |
| 67108864 B | 2.755 | 48.73 | 2.868 | 46.80 | 2.735 | 49.08 |
| 268435456 B | 10.812 | 49.65 | 11.216 | 47.86 | 10.644 | 50.44 |

Relay path (regression anchor)

50 iterations / 5 warmup, --transport ipc --ipc-path relay.

| Size | p50 (µs) |
|---|---|
| 64 KB | 2379.6 |
| 1 MB | 3609.5 |

Relay latency is unchanged from before this PR (expected; relay path was not
modified).

Build

# C++ transport and benchmarks
cd experimental/ukernel && make -f Makefile.rocm bench -j4

# Python extension (ROCm)
cd experimental/ukernel/py && \
LD_LIBRARY_PATH=/home/yangzhou/miniconda3/envs/shawn/lib/python3.12/site-packages/torch/lib:/opt/rocm/lib:/opt/rocm/lib64:/home/yangzhou/miniconda3/lib:${LD_LIBRARY_PATH} \
python setup.py build_ext --inplace

# Device benchmarks
cd experimental/ukernel/src/device && make -C . -f Makefile.rocm bench -j4

@DanielDanyang DanielDanyang added the WIP Work In Progress label Apr 11, 2026
@DanielDanyang DanielDanyang changed the title from [UK-draft] Improve IPC performance to [UK] Improve IPC performance Apr 21, 2026
@DanielDanyang DanielDanyang removed the WIP Work In Progress label Apr 22, 2026

Copilot AI left a comment


Pull request overview

Improves UKernel IPC performance by reducing per-request control-plane overhead and enabling stable, versioned, direct IPC metadata paths (plus a new per-peer completion fast path).

Changes:

  • Cache local IPC exports by allocation/range and deduplicate steady-state IPC buffer publication (versioned) to avoid repeated handle export and exchanger writes.
  • Add a per-peer done_seq POSIX SHM completion channel and reduce worker/control-path latency (split CVs, spin-before-sleep, bypass multi-request wait path for single IPC).
  • Update C++/Python benchmarks and Python bindings/build to exercise the direct IPC path and support ROCm builds.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Summary per file:

| File | Description |
|---|---|
| experimental/ukernel/src/transport/memory/ipc_manager.h | Track allocation size and add range-based local cache hit helper. |
| experimental/ukernel/src/transport/memory/ipc_manager.cc | Reuse exported IPC handles for sub-slices via range checks; update local lookup/delete to avoid address-range queries. |
| experimental/ukernel/src/transport/communicator.h | Add helper for consistent local IPC export; track published IPC buffers; add eager-open hook. |
| experimental/ukernel/src/transport/communicator.cc | Deduplicate IPC buffer publication, add eager remote IPC open, and add fast-path single-request wait handling. |
| experimental/ukernel/src/transport/adapter/ipc_adapter.h | Introduce per-peer done channel structs; split send/recv condition variables; store ring namespace. |
| experimental/ukernel/src/transport/adapter/ipc_adapter.cc | Implement done_seq POSIX SHM channel, reduce wakeup latency, and use done_seq for completion signaling with fallback to legacy ACK. |
| experimental/ukernel/py/ukernel_p2p.cpp | Expose direct IPC send and IPC buffer notify/wait APIs to Python. |
| experimental/ukernel/py/setup.py | Add ROCm-aware build logic and dependency/linking adjustments. |
| experimental/ukernel/py/bench_p2p.py | Update Python benchmark to publish/version IPC recv buffers and use direct IPC sends. |
| experimental/ukernel/benchmarks/bench_transport.cc | Publish/version IPC recv buffers at setup and prefetch remote IPC metadata for direct-path benchmarking. |


Comment on lines +1807 to 1811
IPCItem opened = state;
opened.direct_ptr = direct_ptr;
opened.ipc_id = state.ipc_id;
(void)ipc_manager_.register_remote_ipc(remote_rank, opened);
}

Copilot AI Apr 22, 2026


try_open_remote_ipc_buffer() re-registers an IPCItem containing an opened direct_ptr via ipc_manager_.register_remote_ipc(...). When a newer binding_version/handle is later registered for the same (rank, ipc_id) (e.g., via wait_ipc_buffer/fetch_ipc_buffer), the previous cached entry can be overwritten without calling gpuIpcCloseMemHandle on the existing direct_ptr, leaking VA mappings. Consider updating IPCManager::register_remote_ipc (or adding an explicit update API) to detect replacement of an existing cached item and close any previous direct_ptr before overwriting, and/or preserve the existing direct_ptr when handle+device are unchanged.

Comment on lines 903 to 907
              << creq->id << " match_seq " << creq->match_seq << std::endl;
          return false;
        }
-       std::this_thread::sleep_for(std::chrono::milliseconds(1));
+       std::this_thread::yield(); // no sleep — yield to other threads only
      }

Copilot AI Apr 22, 2026


These control-plane polling loops now call std::this_thread::yield() with no backoff. Given kIpcControlTimeoutMs is very large, this can burn a full CPU core per in-flight recv while waiting for peer progress (especially on slow/error paths). Consider adding a small timed sleep/backoff after some spins (e.g., sleep for ~50–100µs after N iterations) or using an adaptive strategy so long waits don’t busy-spin indefinitely.

Comment on lines 961 to 965
              << std::endl;
          return false;
        }
-       std::this_thread::sleep_for(std::chrono::milliseconds(1));
+       std::this_thread::yield(); // no sleep
      }

Copilot AI Apr 22, 2026


Same as above: this second wait loop also yields with no backoff, which can cause sustained high CPU usage for up to the full control timeout. Adding a bounded backoff (or reusing kIpcControlPollTimeoutMs with a small sleep) would reduce operational impact without regressing latency for short waits.

Comment on lines +876 to 887
if (done_ch.remote_ptr) {
if (done_ch.remote_ptr->load(std::memory_order_acquire) >=
creq->match_seq) {
return true;
}
} else {
// Fallback when done_seq channel is unavailable: poll SHM ring ack.
uint32_t status = 0;
int ack_result = wait_sender_ack(/*timeout_ms=*/0, &status);
if (ack_result < 0) return false;
if (ack_result > 0) return true;
}

Copilot AI Apr 22, 2026


The new done_seq fast path is a substantial behavioral change (per-peer shared-memory channel + fallback to SHM ring ACK). There don’t appear to be any unit/integration tests covering IPC adapter control-path behavior, so regressions (e.g., fallback when SHM setup fails, monotonic seq handling, shutdown/teardown) may be hard to catch. Consider adding a transport unit test that runs two IpcAdapters in-process (or via fork) and validates: (1) done_seq completion works, and (2) disabling/teardown correctly falls back to the legacy ACK path without deadlock.

Suggested change
-    if (done_ch.remote_ptr) {
-      if (done_ch.remote_ptr->load(std::memory_order_acquire) >=
-          creq->match_seq) {
-        return true;
-      }
-    } else {
-      // Fallback when done_seq channel is unavailable: poll SHM ring ack.
-      uint32_t status = 0;
-      int ack_result = wait_sender_ack(/*timeout_ms=*/0, &status);
-      if (ack_result < 0) return false;
-      if (ack_result > 0) return true;
-    }
+    if (done_ch.remote_ptr &&
+        done_ch.remote_ptr->load(std::memory_order_acquire) >=
+            creq->match_seq) {
+      return true;
+    }
+    // Always poll the legacy SHM ring ack path as a non-blocking fallback.
+    // This avoids deadlock if the done_seq fast path was provisioned but
+    // becomes unavailable during shutdown/teardown or falls back after setup.
+    uint32_t status = 0;
+    int ack_result = wait_sender_ack(/*timeout_ms=*/0, &status);
+    if (ack_result < 0) return false;
+    if (ack_result > 0) return true;

@DanielDanyang DanielDanyang merged commit 5ec6286 into uccl-project:main Apr 22, 2026
9 checks passed
