
CS-11125: per-realm advisory lock on data-plane write paths #4840

Open
lukemelia wants to merge 3 commits into withwritelock-refactor from cs-11125-per-realm-advisory-lock-for-data-plane-write-paths

Conversation

@lukemelia
Contributor

Summary

  • Threads the per-realm withWriteLock (now on DBAdapter, via Refactor: move per-realm advisory lock into DBAdapter.withWriteLock #4839) through the data-plane mutations in packages/runtime-common/realm.ts, so two replicas behind a load balancer can no longer race on the same realm URL.
  • Pre-CS-11125, the lock from CS-10891 wrapped only the realm-lifecycle handlers (create / publish / unpublish / delete). POST / PATCH / DELETE <realm>/<path> and POST <realm>/_atomic — the routes real client traffic actually hits — did not take it. Single-replica deploys were fine because the Node event loop was the implicit serializer; the moment a second replica appears, you get last-writer-wins on EFS and interleaved boxel_index rows.

Stacks on #4839.

Linear: CS-11125.

What's in

  • write / writeMany / delete / deleteAll wrap their inner work in dbAdapter.withWriteLock(this.url, ...). Inner methods are renamed _batchWriteUnlocked / _deleteUnlocked / _deleteAllUnlocked so handlers that need the lock around a wider critical section reuse the inner without re-entering the lock (a second pg_advisory_xact_lock on the same key would block on a different pinned pool connection).
  • handleAtomicOperations takes the lock at the handler boundary so the add / update precheck is in the same critical section as the writeMany — fixes the /_atomic TOCTOU spelled out in the ticket.
  • patchCardInstance takes the lock around the indexEntry read + merge + write, so two concurrent PATCHes from different replicas can't both read the same original and clobber each other's merge.

Lock granularity stays per-realm — concurrent writes to different realms remain fully parallel. SQLite (host) is a passthrough, so this is a no-op there.
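
The wrapping pattern above can be sketched in miniature. This is illustrative only — `withWriteLock` here is a toy in-process promise-chain mutex standing in for the real Postgres advisory lock from #4839, and everything except the method names `write`, `withWriteLock`, and `_batchWriteUnlocked` (which come from this PR) is hypothetical:

```typescript
type Job<T> = () => Promise<T>;

// Toy stand-in for DBAdapter: serializes jobs that share a key (a realm URL);
// jobs with different keys still run in parallel.
class FakeDBAdapter {
  private chains = new Map<string, Promise<unknown>>();

  withWriteLock<T>(key: string, job: Job<T>): Promise<T> {
    const prev = this.chains.get(key) ?? Promise.resolve();
    const next = prev.then(job, job); // run after the previous holder settles
    this.chains.set(key, next.catch(() => undefined));
    return next;
  }
}

class Realm {
  log: string[] = [];
  constructor(private url: string, private dbAdapter: FakeDBAdapter) {}

  // Public entry point takes the per-realm lock...
  write(path: string): Promise<void> {
    return this.dbAdapter.withWriteLock(this.url, () =>
      this._batchWriteUnlocked(path),
    );
  }

  // ...while the unlocked inner method stays available to handlers that
  // already hold the lock around a wider critical section.
  private async _batchWriteUnlocked(path: string): Promise<void> {
    this.log.push(`start ${path}`);
    await new Promise((r) => setTimeout(r, 10)); // simulated write latency
    this.log.push(`end ${path}`);
  }
}
```

Two concurrent `write` calls on the same realm URL then produce `start a / end a / start b / end b` rather than interleaved entries, which is the serialization guarantee the real lock provides across replicas.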

Regression test

packages/realm-server/tests/data-plane-write-lock-test.ts:

  • Two concurrent PATCHes against the same card with non-overlapping attribute changes — both must persist. Without the lock the second writer computes its merge from pre-first state and drops the first writer's field.
  • Two concurrent /_atomic add ops on the same href — must yield 201 + 409, not 201 + 201. Without the lock both pass the precheck and both write.
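
The lost-update shape the first test pins down can be modeled in a few lines. This is not the actual test code — `Store`, `patchUnserialized`, and `patchSerialized` are illustrative names, and the "lock" is again an in-process promise chain — but it shows why a read-merge-write PATCH needs the whole critical section serialized:

```typescript
type Doc = Record<string, string>;

class Store {
  private doc: Doc = {};
  private chain: Promise<unknown> = Promise.resolve();

  // Read-merge-write with no serialization: both writers can read the same
  // original, so the last writer's merge drops the other's field.
  async patchUnserialized(attrs: Doc): Promise<void> {
    const original = { ...this.doc };           // read
    await new Promise((r) => setTimeout(r, 5)); // both readers now hold stale state
    this.doc = { ...original, ...attrs };       // merge + write: last writer wins
  }

  // Same operation with the critical section chained per store,
  // analogous to taking the per-realm advisory lock.
  patchSerialized(attrs: Doc): Promise<void> {
    const next = this.chain.then(() => this.patchUnserialized(attrs));
    this.chain = next.catch(() => undefined);
    return next;
  }

  snapshot(): Doc {
    return { ...this.doc };
  }
}
```

Two concurrent unserialized patches with non-overlapping attributes leave only one field behind; the serialized variant keeps both, which is exactly what the regression test asserts over HTTP.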

Compatibility

  • Single-replica deploys: without contention the lock never blocks — the only added cost is one pg_advisory_xact_lock round-trip per data-plane mutation.
  • Lands before CS-10899 (atomicity follow-up) and the cache-invalidation tickets that assume writes are already serialized.

Test plan

  • realm-advisory-locks-test (7/7) + realm-cleanup-transaction-test (3/3) pass after the rebase on Refactor: move per-realm advisory lock into DBAdapter.withWriteLock #4839.
  • pnpm exec tsc -p packages/runtime-common --noEmit and the realm-server / postgres / host projects — zero new type errors vs. main.
  • pnpm test:wait-for-servers against a running stack to drive the new data-plane-write-lock-test end-to-end. (Requires Synapse + base realm-server up — not run locally, deferred to CI.)
  • Eyeballed patchCardInstance / handleAtomicOperations re-indentation diffs (heavy because everything is wrapped in a closure) — please review the structural changes carefully.

🤖 Generated with Claude Code


Copilot AI left a comment


Pull request overview

This PR extends the per-realm advisory write lock (added to DBAdapter.withWriteLock in #4839) to the data-plane mutation paths in runtime-common so that concurrent writers on different realm-server replicas can’t race on the same realm URL (preventing lost updates and TOCTOU issues around PATCH and /_atomic).

Changes:

  • Wrap Realm.write, writeMany, delete, and deleteAll in dbAdapter.withWriteLock(this.url, …) and introduce unlocked inner helpers (e.g. _batchWriteUnlocked) for callers that already hold the lock.
  • Take the lock at the handler boundary for /_atomic and PATCH so reads + validation + writes are serialized in one critical section.
  • Add a new realm-server regression test module and register it in the test index.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| packages/runtime-common/realm.ts | Threads withWriteLock through public mutation entry points and wraps /_atomic + PATCH critical sections; renames inner write/delete helpers to avoid lock re-entrancy. |
| packages/realm-server/tests/index.ts | Registers the new data-plane lock regression test module. |
| packages/realm-server/tests/data-plane-write-lock-test.ts | Adds regression tests for concurrent PATCH lost-update and /_atomic add TOCTOU serialization. |


Comment thread packages/runtime-common/realm.ts
Comment thread packages/realm-server/tests/data-plane-write-lock-test.ts
@lukemelia lukemelia force-pushed the withwritelock-refactor branch from 450e521 to 87eb4d8 Compare May 14, 2026 19:48
The advisory lock from CS-10891 wrapped only the realm-lifecycle handlers
(create / publish / unpublish / delete realm). The data-plane mutation
routes that normal client traffic actually hits (POST/PATCH/DELETE
<realm>/<path>, POST <realm>/_atomic) didn't take it — fine on a single
replica because the Node event loop is the implicit serializer, but a
real lost-update race the moment a second replica appears behind an LB.

This PR threads the per-realm `withWriteLock` (newly exposed on
`DBAdapter` by the parent refactor) through the data-plane mutations in
`packages/runtime-common/realm.ts`:

- `write` / `writeMany` / `delete` / `deleteAll` wrap their inner work
  in the lock. Inner methods are renamed `_batchWriteUnlocked` /
  `_deleteUnlocked` / `_deleteAllUnlocked` so callers that need the
  lock around a wider critical section can re-use the inner without
  re-entering the lock (which would block on a different pinned pool
  connection).
- `handleAtomicOperations` takes the lock at the handler boundary so
  the `add`/`update` precheck is in the same critical section as the
  writeMany — fixes the `/_atomic` TOCTOU spelled out in the ticket.
- `patchCardInstance` takes the lock around the indexEntry read +
  merge + write, so two concurrent PATCHes from different replicas
  can't both read the same `original` and clobber each other's merge.

Lock granularity stays per-realm — concurrent writes to different
realms remain fully parallel. SQLite (host) is a passthrough (no
cross-connection concurrency to coordinate), so this is a no-op there.
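
A plausible shape for the Postgres side of this lock (the actual implementation lives in the parent PR and isn't shown here): `pg_advisory_xact_lock` takes a bigint key, so the realm URL has to be hashed down to the signed 64-bit space first, and the lock releases itself at COMMIT/ROLLBACK because it is transaction-scoped. The helper name below is hypothetical:

```typescript
import { createHash } from 'node:crypto';

// Map an arbitrary realm URL onto the signed 64-bit key space that
// pg_advisory_xact_lock expects. SHA-256 keeps the mapping stable across
// processes and replicas; we take the first 8 bytes as a signed bigint.
function advisoryLockKey(realmURL: string): bigint {
  const digest = createHash('sha256').update(realmURL).digest();
  return digest.readBigInt64BE(0);
}

// Sketch of usage against a pinned connection (client: pg.PoolClient):
//
//   await client.query('BEGIN');
//   await client.query('SELECT pg_advisory_xact_lock($1)', [
//     advisoryLockKey(realmURL),
//   ]);
//   // ...critical section: reads + merges + writes for this realm...
//   await client.query('COMMIT'); // advisory lock released here
```

Because the key is derived only from the realm URL, writes to different realms hash to different keys and proceed in parallel, matching the per-realm granularity described above.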

## Regression test

`packages/realm-server/tests/data-plane-write-lock-test.ts`:
1. Two concurrent PATCHes against the same card with non-overlapping
   attribute changes — both must persist. Without the lock the second
   writer would compute its merge from pre-first state and drop the
   first writer's field.
2. Two concurrent `/_atomic` `add` operations on the same href — must
   yield 201 + 409, not 201 + 201. Without the lock both pass the
   precheck and both write.

Compatible with single-replica deploy: the lock is a no-op without
contention. Lands before CS-10899 (atomicity follow-up) and the
cache-invalidation tickets that assume writes are already serialized.

Stacks on #4839 (DBAdapter.withWriteLock refactor).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lukemelia lukemelia force-pushed the cs-11125-per-realm-advisory-lock-for-data-plane-write-paths branch from 9de891b to 68680de Compare May 14, 2026 19:48

github-actions Bot commented May 14, 2026

Host Test Results

    1 files ±0   1 suites ±0   3h 47m 48s ⏱️ +1h 51m 1s
2 658 tests ±0   2 642 ✅ −1   15 💤 ±0   0 ❌ ±0   1 🔥 +1
5 354 runs +2 677   5 322 ✅ +2 660   30 💤 +15   1 ❌ +1   1 🔥 +1

Results for commit 65b0596. ± Comparison against earlier commit e8f6a07.

For more details on these errors, see this check.

Realm Server Test Results

    1 files ±0   1 suites ±0   9m 36s ⏱️ +25s
1 376 tests ±0   1 376 ✅ ±0   0 💤 ±0   0 ❌ ±0
1 457 runs ±0   1 457 ✅ ±0   0 💤 ±0   0 ❌ ±0

Results for commit 65b0596. ± Comparison against earlier commit e8f6a07.

68680de unintentionally removed Realm.clearLocalCaches() — the
CS-11043 publish-realm cache invalidation primitive. The advisory-lock
work on this branch had no reason to touch it; it's a rebase casualty.

Two callers were still on this API:

* packages/realm-server/handlers/handle-publish-realm.ts:575 (production
  — the synchronous local cache drop that keeps the reindex's prerender
  from seeing pre-swap bytes).
* packages/realm-server/tests/card-source-endpoints-test.ts:263 (the
  regression test pinning that production behavior).

Restoring exactly the CS-11043 implementation. The added note points at
[CS-11156](https://linear.app/cardstack/issue/CS-11156), the follow-up
that replaces the publish handler's local-only call with a cross-replica
NOTIFY broadcast — out of scope here.

Addresses Copilot review feedback on PR #4840.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* runtime-common/realm.ts: collapse a buildCardJsonEtag call back onto
  one line — prettier nit caught by Lint CI, unrelated to the
  advisory-lock work but blocking merge.
* tests/data-plane-write-lock-test.ts: add `?waitForIndex=true` to the
  two concurrent /_atomic POSTs. /_atomic returns on durability, not on
  index completion, so the immediate GET that asserts on the winning
  card's attributes could race the indexer and 404 / read stale. The
  PATCH test elsewhere in this file uses `writeMany` which defaults to
  `waitForIndex: true`, so only /_atomic needed the explicit opt-in.
  Addresses the Copilot comment at PR #4840 review r3243846106.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lukemelia
Contributor Author

Addressed both Copilot review comments and the Lint failure:

  • Lint (prettier on realm.ts:4111) — collapsed the buildCardJsonEtag(...) call back onto a single line. Unrelated to advisory locks, blocking merge.
  • r3243846068 — the Realm.clearLocalCaches() deletion was an unintentional rebase casualty. Restored at the original CS-11043 site in commit e8f6a07.
  • r3243846106 — added ?waitForIndex=true to the two /_atomic POSTs in data-plane-write-lock-test.ts. /_atomic returns on durability, not on index completion, so the immediate GET could race the indexer. The PATCH test in the same file uses writeMany which defaults to waitForIndex: true, so only /_atomic needed the explicit opt-in.

Follow-up CS-11156 (cross-replica cache invalidation via NOTIFY broadcast) is open as draft PR #4842, stacked on this branch.
