
[UK] Improve IPC performance #876

Merged
DanielDanyang merged 22 commits into uccl-project:main from DanielDanyang:ipc-opt
Apr 22, 2026

Conversation

@DanielDanyang
Collaborator

@DanielDanyang DanielDanyang commented Apr 10, 2026

Problem

UKernel IPC was 5–7× slower than UCCL on small messages and 2–3× slower on
large messages. The root cause was a per-request three-message control
round-trip through the SHM ring exchanger (ipc_cache_req → ipc_cache → ack)
plus worker threads that could sleep for up to 1 ms waiting for each step.

What This PR Does

This PR eliminates most of the per-request control-plane overhead by moving
the IPC path to stable, pre-published metadata.

1. Cache IPC handle exports by allocation, not by exact pointer

Files: src/transport/memory/ipc_manager.{cc,h},
src/transport/communicator.{cc,h}

  • create_local_ipc now caches exported handles by allocation base address and
    checks by range, so multiple sub-slice requests from the same GPU allocation
    reuse the same exported handle without calling gpuMemGetAddressRange /
    gpuIpcGetMemHandle again.
  • A shared export_local_ipc_buffer helper is used consistently by both the
    transport and the recv-side IPC cache path, replacing duplicated
    handle-export logic.
  • Remote versioned IPC metadata can be prefetched during setup and
    eagerly opened (gpuIpcOpenMemHandle) before timed iterations begin,
    so the first timed send does not pay the open cost.
  • irecv() deduplicates steady-state metadata publication: if the same buffer
    address, size, and binding version were already published, the exchanger write
    is skipped.
  • Same-process transfers (sender and receiver in the same OS process) skip IPC
    handle export/import entirely; the raw GPU virtual address is passed directly,
    matching UCCL's intra-process shortcut.

2. Reduce control-path latency in the IPC adapter

Files: src/transport/adapter/ipc_adapter.{cc,h}

  • Recv-side control polling sleep reduced from 1 ms to 50 µs.
  • Send and recv workers now use separate condition variables and pending
    counters. Previously both workers shared one CV, so a send wakeup could
    spuriously wake the recv thread and vice versa.
  • Workers now do a short spin (256 yield iterations) before sleeping on the
    CV, reducing wakeup latency for bursts.
  • binding_version is threaded from isend into the IPC send path so the
    direct metadata lookup is versioned rather than "latest wins".
  • wait_finish for single IPC requests bypasses the generic multi-request
    completion path and spins directly on the request slot state.

3. Wire the C++ benchmark to the direct IPC path

File: benchmarks/bench_transport.cc

  • Before this PR the C++ benchmark never exercised the by-mem_id direct path;
    it fell back to the ipc_cache_req/cache handshake on every iteration.
  • The benchmark now publishes stable versioned IPC recv buffers at setup time
    and prefetches the peer's metadata before the timed phase starts, so timed
    iterations use the direct path.

4. Python benchmark and ROCm build

Files: py/ukernel_p2p.cpp, py/bench_p2p.py, py/setup.py

  • Exposed notify_ipc_tensor, wait_ipc_buffer, isend_direct, send_direct
    to Python so bench_p2p.py can exercise the direct IPC path.
  • bench_p2p.py now exchanges pinned MR IDs, publishes versioned recv buffers,
    prefetches peer metadata, and uses send_direct on the IPC path.
  • Fixed ROCm setup.py.

5. Per-peer done_seq completion channel

Files: src/transport/adapter/ipc_adapter.{cc,h}

  • Added a per-peer POSIX SHM channel with:
    • done_seq_lo_to_hi
    • done_seq_hi_to_lo
    • opener_ready
  • Lower rank creates the SHM region before the IPC peer handshake.
  • Higher rank opens the region after the handshake and stores
    opener_ready = 1.
  • Lower rank waits for that ready signal before activating the done_seq
    fast path. If the peer never confirms readiness, the creator tears the
    channel down and prints a warning so both sides consistently fall back to
    the legacy SHM ring ACK path.
  • On the fast path:
    • send_one() writes done_seq instead of sending a direct ACK.
    • recv_one() polls done_seq first, then falls back to the existing
      ipc_cache_req/cache and relay handling.

Benchmark Results

Environment: GPU 2 (NUMA 0) ↔ GPU 6 (NUMA 1), two separate processes,
ROCm, AMD MI325X.

C++ bench_transport — IPC direct path sweep

10 iterations / 2 warmup, --transport ipc --ipc-path auto.

| Size | p50 (µs) | Uni (GB/s) |
|---|---|---|
| 1 KB | 258.13 | 0.06 |
| 4 KB | 186.10 | 0.23 |
| 16 KB | 250.02 | 0.84 |
| 64 KB | 336.91 | 3.69 |
| 256 KB | 338.33 | 11.75 |
| 1 MB | 357.30 | 21.60 |
| 4 MB | 255.95 | 39.53 |
| 16 MB | 795.26 | 42.69 |
| 64 MB | 2808.17 | 44.65 |
| 256 MB | 11086.86 | 45.24 |

C++ bench_transport — done_seq anchor

50 iterations / 5 warmup, --transport ipc --ipc-path auto.

| Size | p50 (µs) | p99 (µs) | Uni (GB/s) |
|---|---|---|---|
| 1 KB | 40.88 | 195.51 | 0.04 |
| 4 KB | 39.53 | 279.06 | 0.24 |
| 64 KB | 49.12 | 435.46 | 2.96 |
| 1 MB | 81.89 | 389.30 | 28.91 |
| 16 MB | 738.23 | 1166.28 | 43.30 |

The first done_seq version timed out at 1 KB, but the current opener_ready
handshake version completes cleanly after a forced rebuild.

Python bench_p2p — UKernel IPC vs UCCL vs NCCL/RCCL

| Size | UKernel (ms) | UKernel (GB/s) | UCCL (ms) | UCCL (GB/s) | NCCL (ms) | NCCL (GB/s) |
|---|---|---|---|---|---|---|
| 1024 B | 0.047 | 0.04 | 0.039 | 0.05 | 0.108 | 0.02 |
| 4096 B | 0.051 | 0.16 | 0.039 | 0.21 | 0.091 | 0.09 |
| 16384 B | 0.052 | 0.62 | 0.039 | 0.83 | 0.092 | 0.36 |
| 65536 B | 0.047 | 2.78 | 0.030 | 4.34 | 0.092 | 1.42 |
| 262144 B | 0.057 | 9.19 | 0.041 | 12.94 | 0.091 | 5.76 |
| 1048576 B | 0.098 | 21.47 | 0.072 | 28.99 | 0.117 | 17.92 |
| 4194304 B | 0.246 | 34.15 | 0.254 | 32.97 | 0.276 | 30.37 |
| 16777216 B | 0.737 | 45.54 | 0.779 | 43.06 | 0.759 | 44.23 |
| 67108864 B | 2.755 | 48.73 | 2.868 | 46.80 | 2.735 | 49.08 |
| 268435456 B | 10.812 | 49.65 | 11.216 | 47.86 | 10.644 | 50.44 |

Relay path (regression anchor)

50 iterations / 5 warmup, --transport ipc --ipc-path relay.

| Size | p50 (µs) |
|---|---|
| 64 KB | 2379.6 |
| 1 MB | 3609.5 |

Relay latency is unchanged from before this PR (expected; relay path was not
modified).

Build

# C++ transport and benchmarks
cd experimental/ukernel && make -f Makefile.rocm bench -j4

# Python extension (ROCm)
cd experimental/ukernel/py && \
LD_LIBRARY_PATH=/home/yangzhou/miniconda3/envs/shawn/lib/python3.12/site-packages/torch/lib:/opt/rocm/lib:/opt/rocm/lib64:/home/yangzhou/miniconda3/lib:${LD_LIBRARY_PATH} \
python setup.py build_ext --inplace

# Device benchmarks
cd experimental/ukernel/src/device && make -C . -f Makefile.rocm bench -j4

@DanielDanyang DanielDanyang added the WIP Work In Progress label Apr 11, 2026
@DanielDanyang DanielDanyang changed the title from [UK-draft] Improve IPC performance to [UK] Improve IPC performance Apr 21, 2026
@DanielDanyang DanielDanyang removed the WIP Work In Progress label Apr 22, 2026

Copilot AI left a comment


Pull request overview

Improves UKernel IPC performance by reducing per-request control-plane overhead and enabling stable, versioned, direct IPC metadata paths (plus a new per-peer completion fast path).

Changes:

  • Cache local IPC exports by allocation/range and deduplicate steady-state IPC buffer publication (versioned) to avoid repeated handle export and exchanger writes.
  • Add a per-peer done_seq POSIX SHM completion channel and reduce worker/control-path latency (split CVs, spin-before-sleep, bypass multi-request wait path for single IPC).
  • Update C++/Python benchmarks and Python bindings/build to exercise the direct IPC path and support ROCm builds.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Summary per file:

| File | Description |
|---|---|
| experimental/ukernel/src/transport/memory/ipc_manager.h | Track allocation size and add range-based local cache hit helper. |
| experimental/ukernel/src/transport/memory/ipc_manager.cc | Reuse exported IPC handles for sub-slices via range checks; update local lookup/delete to avoid address-range queries. |
| experimental/ukernel/src/transport/communicator.h | Add helper for consistent local IPC export; track published IPC buffers; add eager-open hook. |
| experimental/ukernel/src/transport/communicator.cc | Deduplicate IPC buffer publication, add eager remote IPC open, and add fast-path single-request wait handling. |
| experimental/ukernel/src/transport/adapter/ipc_adapter.h | Introduce per-peer done channel structs; split send/recv condition variables; store ring namespace. |
| experimental/ukernel/src/transport/adapter/ipc_adapter.cc | Implement done_seq POSIX SHM channel, reduce wakeup latency, and use done_seq for completion signaling with fallback to legacy ACK. |
| experimental/ukernel/py/ukernel_p2p.cpp | Expose direct IPC send and IPC buffer notify/wait APIs to Python. |
| experimental/ukernel/py/setup.py | Add ROCm-aware build logic and dependency/linking adjustments. |
| experimental/ukernel/py/bench_p2p.py | Update Python benchmark to publish/version IPC recv buffers and use direct IPC sends. |
| experimental/ukernel/benchmarks/bench_transport.cc | Publish/version IPC recv buffers at setup and prefetch remote IPC metadata for direct-path benchmarking. |


Comment on lines +1807 to 1811
IPCItem opened = state;
opened.direct_ptr = direct_ptr;
opened.ipc_id = state.ipc_id;
(void)ipc_manager_.register_remote_ipc(remote_rank, opened);
}

Copilot AI Apr 22, 2026


try_open_remote_ipc_buffer() re-registers an IPCItem containing an opened direct_ptr via ipc_manager_.register_remote_ipc(...). When a newer binding_version/handle is later registered for the same (rank, ipc_id) (e.g., via wait_ipc_buffer/fetch_ipc_buffer), the previous cached entry can be overwritten without calling gpuIpcCloseMemHandle on the existing direct_ptr, leaking VA mappings. Consider updating IPCManager::register_remote_ipc (or adding an explicit update API) to detect replacement of an existing cached item and close any previous direct_ptr before overwriting, and/or preserve the existing direct_ptr when handle+device are unchanged.

Comment on lines 903 to 907
              << creq->id << " match_seq " << creq->match_seq << std::endl;
          return false;
        }
-       std::this_thread::sleep_for(std::chrono::milliseconds(1));
+       std::this_thread::yield(); // no sleep — yield to other threads only
      }

Copilot AI Apr 22, 2026


These control-plane polling loops now call std::this_thread::yield() with no backoff. Given kIpcControlTimeoutMs is very large, this can burn a full CPU core per in-flight recv while waiting for peer progress (especially on slow/error paths). Consider adding a small timed sleep/backoff after some spins (e.g., sleep for ~50–100µs after N iterations) or using an adaptive strategy so long waits don’t busy-spin indefinitely.

Comment on lines 961 to 965
              << std::endl;
          return false;
        }
-       std::this_thread::sleep_for(std::chrono::milliseconds(1));
+       std::this_thread::yield(); // no sleep
      }

Copilot AI Apr 22, 2026


Same as above: this second wait loop also yields with no backoff, which can cause sustained high CPU usage for up to the full control timeout. Adding a bounded backoff (or reusing kIpcControlPollTimeoutMs with a small sleep) would reduce operational impact without regressing latency for short waits.

Comment on lines +876 to 887
if (done_ch.remote_ptr) {
if (done_ch.remote_ptr->load(std::memory_order_acquire) >=
creq->match_seq) {
return true;
}
} else {
// Fallback when done_seq channel is unavailable: poll SHM ring ack.
uint32_t status = 0;
int ack_result = wait_sender_ack(/*timeout_ms=*/0, &status);
if (ack_result < 0) return false;
if (ack_result > 0) return true;
}

Copilot AI Apr 22, 2026


The new done_seq fast path is a substantial behavioral change (per-peer shared-memory channel + fallback to SHM ring ACK). There don’t appear to be any unit/integration tests covering IPC adapter control-path behavior, so regressions (e.g., fallback when SHM setup fails, monotonic seq handling, shutdown/teardown) may be hard to catch. Consider adding a transport unit test that runs two IpcAdapters in-process (or via fork) and validates: (1) done_seq completion works, and (2) disabling/teardown correctly falls back to the legacy ACK path without deadlock.

Suggested change
-    if (done_ch.remote_ptr) {
-      if (done_ch.remote_ptr->load(std::memory_order_acquire) >=
-          creq->match_seq) {
-        return true;
-      }
-    } else {
-      // Fallback when done_seq channel is unavailable: poll SHM ring ack.
-      uint32_t status = 0;
-      int ack_result = wait_sender_ack(/*timeout_ms=*/0, &status);
-      if (ack_result < 0) return false;
-      if (ack_result > 0) return true;
-    }
+    if (done_ch.remote_ptr &&
+        done_ch.remote_ptr->load(std::memory_order_acquire) >=
+            creq->match_seq) {
+      return true;
+    }
+    // Always poll the legacy SHM ring ack path as a non-blocking fallback.
+    // This avoids deadlock if the done_seq fast path was provisioned but
+    // becomes unavailable during shutdown/teardown or falls back after setup.
+    uint32_t status = 0;
+    int ack_result = wait_sender_ack(/*timeout_ms=*/0, &status);
+    if (ack_result < 0) return false;
+    if (ack_result > 0) return true;

@DanielDanyang DanielDanyang merged commit 5ec6286 into uccl-project:main Apr 22, 2026
9 checks passed
