feat(ipc): async server handlers + server-side per-fork ordering#24210
Closed
charlielye wants to merge 13 commits into
Closed
feat(ipc): async server handlers + server-side per-fork ordering#24210charlielye wants to merge 13 commits into
charlielye wants to merge 13 commits into
Conversation
A sequential-insert response can run to ~1.5 MB, but the default SHM rings were 1 MB, so the response could never be sent and the client hung forever (no error surfaced: the server crashed to a discarded stderr and the SHM client has no peer-disconnect signal). UDS streamed it fine, which is why this only bit the SHM path, and bb.js never hit it (tiny responses, SPSC only). - Size the spawned aztec-wsdb SHM rings to 32 MiB request+response (an SHM frame is capped at half the ring, so this gives ~16 MiB headroom). Keep the WSDB_TRANSPORT env gate (default uds) for opting into SHM. - SpscShm::create: gate the pre-fault memset to small rings so large rings stay demand-paged instead of forcing the whole mapping resident. - IpcServer::run(): catch handler/send failures and shut down cleanly with a logged reason instead of letting an uncaught exception reach std::terminate. - Generated spawned-service backend: capture the child's stdout/stderr to a temp logfile (an fd, not a pipe, to preserve clean process exit) and, on unexpected child exit, reject in-flight calls surfacing the log path — so a server crash is a clear error rather than a silent hang over SHM. - Add single-client MPSC pipelined-flood and burst tests; the SPSC grind never exercised the MPSC path. Verified: world-state native suite 50/50 over both SHM and UDS; avm_bulk (TS world state client 0 + C++ AVM client 1 on one aztec-wsdb) passes over SHM; no perf regression vs UDS.
The IPC-backed wsdb served all requests through a single-threaded run() loop, so parallel reads were capped at the single-thread dispatch rate (~15-16k reads/s regardless of concurrency) where the in-process world state scaled with cores. AVM proving runs N simulator processes against the wsdb, so this serialization is a throughput regression. Add IpcServer::run_reactor(handler, executor): a reactor that owns all ring/socket I/O and never blocks on a handler. It copies each request out, releases the slot, assigns a per-connection sequence, and hands the handler to a caller-supplied executor (a thread pool for wsdb, inline for serial callers). Workers post completed responses to a per-connection reorder stash and notify() the reactor, which is the sole sender, so the response ring stays single-producer/lock-free and per-connection FIFO is preserved with no wire/request-id change. ipc-runtime owns no pool and spawns no thread; concurrency is purely the caller's executor choice, so the serial run() path (e.g. bb) is untouched. notify() wakes the reactor without a timeout poll: sockets use a self-pipe in the epoll/kqueue set; MPSC-SHM bumps the doorbell seq then futex_wakes (mirroring publish, so the wake is not lost against the consumer's armed futex), and the consumer's spin/wait checks the completion predicate inside the same seq-latched window. The wsdb wires a dispatch ThreadPool distinct from WorldState's intra-op pool (mutating handlers wait on the latter; sharing could deadlock) and raises max_shm_clients to cover the AVM pool. Parallel-read throughput now scales ~5x (UDS to ~29k, SHM to ~32k at concurrency 16) instead of staying flat. Single in-flight reads pay a thread-handoff latency (crossover ~concurrency 3).
run_reactor's dispatch model pays a thread-handoff + wakeup per request, which is pure latency when there is no concurrency to exploit: a single in-flight / sequential reader measured ~25-35% slower than the serial run() and the in-process baseline (crossover around concurrency 3). Add an inline fast path: when nothing is in flight AND no further request is already pending, run the handler on the reactor thread instead of dispatching it. The `inflight == 0` test short-circuits the pending check, so a burst (which keeps inflight > 0 once started) never pays for it and stays fully on the dispatch path — preserving the high-concurrency scaling. The pending check is has_pending_request(): sockets poll via wait_for_data(0) (epoll is stateless), while MPSC-SHM uses a new side-effect-free MpscConsumer::has_data() so the frequent peek doesn't disturb the round-robin cursor or adaptive-spin state (a wait_for_data(0) peek there measurably depressed high-concurrency throughput). This restores ~in-process parity at concurrency 1-2 (~7.5-8.4k reads/s, up from ~5.5k) while keeping concurrent throughput unchanged (SHM ~30-33k, UDS ~28-29k at concurrency 16).
parallel_read.bench pipelines N reads down one connection (the TS world-state client's pattern). The AVM proving path is different: one aztec-wsdb with N independent sim connections, each issuing sequential reads. Add a bench that models that — N AsyncApi clients, one in-flight read each — to measure what the simulator pool actually sees. The N-connection topology scales better than one pipelined connection, since load spreads across N request/response rings instead of contending on one: SHM reaches ~40-47k reads/s at 7 connections (at/above the in-process baseline), UDS ~30k. IPC-only; skips on the in-process build.
…-process comparison The raw AsyncApi path skips the WorldStateOpsQueue + facade (~40% overhead) that the in-process numbers go through, so these figures measure server capability, not an in-process win. Document the 7-connection cap as the real SHM ceiling (8 slots minus the world-state client's).
…/Rust) Move the server dispatch to an asynchronous handler pattern so the database, not the client, owns request ordering. A handler receives the decoded command plus a `respond` callback and may run inline or defer to a thread pool, calling respond when the result is ready. The wire protocol is unchanged. Codegen (ipc-codegen): - C++: dispatch is now async — `void handle_x(Ctx&, Cmd&&, Responder<Resp>)`; make_*_handler returns `void(span, RawRespond)`; serve() drives it via run_reactor. Responder encodes ok()/error() frames. - Rust: Handler trait methods take `Responder<Resp>`; handle_request collects a synchronous response (runtime/echo glue unchanged) while leaving the handler free to defer with an async transport. - TS already returns Promises (no change). Zig still to do. wsdb: each of the 40 handlers declares its own ordering — schedule_read (reads run concurrently; committed reads bypass ordering) or schedule_write (exclusive per fork) — extracting its fork from the typed command. WsdbScheduler implements the read-batch / write-barrier model per fork with an inline fast path when idle; the hand-rolled request classifier is gone (read/write + fork come from the typed handler). Distinct dispatch pool from WorldState's intra-op pool. Validated: ipc-runtime reactor/echo gtests green; native_world_state.test.ts (50 tests incl. Concurrent requests) green against the async server.
Zig server dispatch now hands each handler a typed Responder (ok/err) instead of returning !Resp, matching the async pattern adopted for C++ and Rust. The handleRequest entry still collects the response synchronously, so the Zig runtime glue and echo main() are unchanged. Wire format is unchanged; the cross-language echo matrix (goldens + interop) passes.
multi_connection_read.bench.test.ts imports @aztec/ipc-runtime to open extra connections; declare it so eslint import-x/no-extraneous-dependencies passes.
The dispatch pool was sized from std::thread::hardware_concurrency(), which ignores cgroup CPU limits and returns the host core count (192 in a 2-CPU CI container). Each aztec-wsdb spawned ~192 dispatch threads instead of the intended ~16, exhausting the per-UID thread limit (pthread_create EAGAIN -> libc++abi abort) under the AVM simulator e2e. Size it from the caller-provided `threads` budget (= getWsdbThreadCount, capped at 16), matching the WorldState pool.
Read/write ordering is now enforced server-side by the wsdb per-fork scheduler, so the IPC client no longer needs its own per-fork ordering queue: every op is sent immediately and the server serializes writes per fork while running reads concurrently. IpcWorldState.execute() now sends directly; the per-fork queue map, its lifecycle (stop on close/deleteFork), and the WorldStateOpsQueue class are deleted. The WorldStateOperationName label type moves to world_state_operation.ts (still used for instrumentation). The A-1055 delayed-close-fork regression test drops its now-obsolete per-fork-queue-cleanup assertion (the silent-dispose check remains). native_world_state.test.ts (50 tests) passes against the async server, including "Concurrent requests" — which, with the client queue gone, now genuinely exercises server-side per-fork read/write ordering.
f407002 to
317a72f
Compare
Contributor
Author
|
Superseded by the restack. The reactor (#24198) and ordering work have been combined into a single machinery commit on #24198, which now targets |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Moves request ordering from the client into the database. Previously each TS client serialized its own reads/writes per fork (
WorldStateOpsQueue); a second async client could corrupt state by skipping that contract. This makes the wsdb server own per-fork ordering, built on an async server-handler codegen pattern adopted symmetrically across all four languages.Stacked on #24198 (concurrent reactor).
Codegen — async server handlers (C++, Rust, Zig; TS already async)
The generated server dispatch hands each handler the decoded command plus a
respondcallback, instead of returning a value. A handler may run inline or defer to a thread pool and callrespondwhen ready. The wire protocol is unchanged, so the cross-language interop matrix and golden fixtures are untouched.void handle_x(Ctx&, Cmd&&, Responder<Resp>);make_*_handlerreturnsvoid(span, RawRespond);serve()drives it viarun_reactor.Handlertrait methods takeResponder<Resp>;handle_requestcollects synchronously so the runtime/echo glue is unchanged.Responder(ok/err);handleRequestunchanged.Promise-returning — no change.Each language's echo server updated to the new shape (~6 handlers each).
wsdb — server-side per-fork ordering
Each of the 40 handlers declares its own ordering by calling
schedule_read(reads run concurrently; committed reads bypass ordering) orschedule_write(exclusive on its fork), extracting its fork from the typed command.WsdbSchedulerimplements the read-batch / write-barrier model per fork, with an inline fast path when idle and a dispatch pool distinct from WorldState's intra-op pool. The hand-rolled request classifier is gone — read/write + fork come from the typed handler.Testing
aztec-wsdbbuilds; ipc-runtime reactor/echo gtests green.native_world_state.test.ts(50 tests incl. Concurrent requests, which asserts read-your-writes ordering) green against the async server.parallel_read.benchSHM: ~3.7x read scaling (8.1k → 30.4k reads/s, c=1 → c=16).Client queue removed
With the server owning per-fork ordering, the TS client's per-fork
WorldStateOpsQueueis now redundant and has been deleted:IpcWorldStatesends every op immediately and the server serializes writes / parallelizes reads. With the client queue gone,native_world_state.test.ts's Concurrent requests test now genuinely exercises server-side read/write ordering (and passes), and the TS-consumer no longer pays the queue's per-op overhead.🤖 Generated with Claude Code
https://claude.ai/code/session_01XfznXp9wzGWLU6LC8EHtpi