from-scratch coalesce: fall back to in-flight candidates (CS-11157)#4850
habdelra wants to merge 4 commits into
chooseFromScratchCoalesceDecision previously consulted only the pending bucket (candidates). A worker claim arriving between two near-simultaneous from-scratch publishes for the same realm moved the first row into inFlightCandidates, so the second publish's pending lookup found nothing and minted a fresh row at its own priority, even though the in-flight job would produce exactly the result the second caller wanted.

Mirror chooseIncrementalCoalesceDecision's in-flight fallback. The from-scratch case is simpler than incremental: same concurrency group + same jobType is sufficient because a from-scratch reindex subsumes any other from-scratch for that realm by definition; no per-args coverage check is needed.

Regression test: publish first job, wait for the worker to claim (moving it to in-flight), then publish a second. Assert second.id === first.id and exactly one row exists in the concurrency group.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
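For readers following along without the diff open, the lookup order described in this commit can be sketched roughly as below. This is an illustration under assumptions, not the code in indexer.ts: the `SketchRow` and `SketchCoalesce` types, the function name, and the way candidates are passed in are all hypothetical.

```typescript
// Hypothetical row shape for illustration; the real queue rows may differ.
interface SketchRow {
  id: number;
  jobType: string;
  concurrencyGroup: string;
  priority: number;
}

// Hypothetical decision type: either attach to an existing row or mint a new one.
type SketchCoalesce =
  | { kind: 'join'; jobId: number; bucket: 'pending' | 'in-flight' }
  | { kind: 'insert' };

function sketchFromScratchCoalesce(
  incoming: { jobType: string; concurrencyGroup: string },
  candidates: SketchRow[], // pending rows, not yet claimed by a worker
  inFlightCandidates: SketchRow[], // rows a worker has already claimed
): SketchCoalesce {
  // Same concurrency group + same jobType is treated as sufficient,
  // per the commit message; no per-args coverage check.
  const matches = (row: SketchRow): boolean =>
    row.jobType === incoming.jobType &&
    row.concurrencyGroup === incoming.concurrencyGroup;

  // 1. Prefer a pending row if one exists.
  const pending = candidates.find(matches);
  if (pending) {
    return { kind: 'join', jobId: pending.id, bucket: 'pending' };
  }

  // 2. Fall back to a row that a worker has already claimed.
  const running = inFlightCandidates.find(matches);
  if (running) {
    return { kind: 'join', jobId: running.id, bucket: 'in-flight' };
  }

  // 3. Nothing to attach to: a fresh row is needed.
  return { kind: 'insert' };
}
```

Before this commit, step 2 was effectively missing for from-scratch jobs, so a publish that arrived just after the claim fell straight through to step 3.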
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1fb17b7e62
The in-flight fallback added in the previous commit is unsafe for a publish that has just nulled boxel_index.last_modified for the realm (handle-publish-realm, handle-reindex, full-reindex). The running job's mtimes snapshot pre-dates the clear, so attaching the clearing publish to it would let the caller observe a successful job that did NOT re-render the swapped files.

Surface the clearing intent as args.clearLastModified on from-scratch-index jobs (always present; required field on FromScratchArgs so the JSON-shape index signature on WorkerArgs is satisfied). The coalesce decision checks the flag and forces a fresh row instead of joining an in-flight candidate when set. Pending coalesce is unchanged: a pending join is still safe because the pending job hasn't read its mtimes snapshot yet.

Regression test: publish a from-scratch and wait for the worker to claim it; publish a second with clearLastModified=true; assert the second got its own row (not attached to the running job).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
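The guard described in this commit could look something like the sketch below. It is a hedged illustration only: `GuardJob`, `GuardIncoming`, `GuardDecision`, and `decideWithClearGuard` are hypothetical names, and only the `clearLastModified` flag and the pending versus in-flight distinction come from the commit message.

```typescript
// Hypothetical shapes; only args.clearLastModified is taken from the commit message.
interface GuardJob {
  id: number;
  jobType: string;
  concurrencyGroup: string;
}

interface GuardIncoming {
  jobType: string;
  concurrencyGroup: string;
  args: { clearLastModified: boolean };
}

type GuardDecision =
  | { action: 'join-pending'; jobId: number }
  | { action: 'join-in-flight'; jobId: number }
  | { action: 'insert' };

function decideWithClearGuard(
  incoming: GuardIncoming,
  pending: GuardJob[],
  inFlight: GuardJob[],
): GuardDecision {
  const same = (job: GuardJob): boolean =>
    job.jobType === incoming.jobType &&
    job.concurrencyGroup === incoming.concurrencyGroup;

  // A pending join stays safe: the pending job has not read its
  // mtimes snapshot yet, so it will observe the cleared values.
  const queued = pending.find(same);
  if (queued) {
    return { action: 'join-pending', jobId: queued.id };
  }

  // A clearing publish must not attach to a running job whose
  // snapshot pre-dates the clear, so it gets its own row.
  if (incoming.args.clearLastModified) {
    return { action: 'insert' };
  }

  const running = inFlight.find(same);
  if (running) {
    return { action: 'join-in-flight', jobId: running.id };
  }
  return { action: 'insert' };
}
```

Placing the flag check after the pending lookup and before the in-flight lookup keeps the pending path unchanged, which matches the stated intent that only the in-flight join is unsafe.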
…ex args

The Grafana reindex path uses clearLastModified: true so every row in boxel_index re-renders even when its mtime hasn't changed. The flag is now surfaced in args (so the from-scratch coalesce can refuse to attach this publish to a running same-realm from-scratch), which means the strict args shape check has to include it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Host Test Results: 1 files ±0, 1 suites ±0, 1h 45m 2s ⏱️ +2m 2s. Results for commit 67bf0ec. ± Comparison against earlier commit a7f2108.

Realm Server Test Results: 1 files ±0, 1 suites ±0, 9m 36s ⏱️ +22s. Results for commit 67bf0ec. ± Comparison against earlier commit a7f2108.
The full-reindex task enqueues from-scratch with clearLastModified: true. The flag is now surfaced in args (so the from-scratch coalesce can refuse to attach this kind of publish to a running same-realm from-scratch), which means the strict args shape check has to include it for both the source and published realm jobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Hardens queue coalescing for from-scratch-index jobs so a second near-simultaneous publish for the same realm can correctly attach to an already-claimed (in-flight) job, avoiding duplicate rows and priority skew during worker-claim races (CS-11157).
Changes:
- Add an `inFlightCandidates` fallback to `chooseFromScratchCoalesceDecision`, mirroring the existing incremental behavior.
- Introduce/propagate `clearLastModified` in from-scratch job args to prevent incorrectly joining a “forced refresh” publish onto an already-running job whose mtime snapshot predates the DB clear.
- Add queue regression tests to cover both the in-flight dedup path and the `clearLastModified` exception.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| packages/runtime-common/tasks/indexer.ts | Adds in-flight fallback coalescing for from-scratch jobs and introduces clearLastModified args flag + guard. |
| packages/runtime-common/jobs/reindex-realm.ts | Plumbs clearLastModified into published from-scratch job args (and normalizes the clear condition). |
| packages/realm-server/tests/server-endpoints/maintenance-endpoints-test.ts | Updates expected job args to include clearLastModified: true for Grafana-triggered reindex. |
| packages/realm-server/tests/queue-test.ts | Adds regression tests for in-flight from-scratch dedup and for the clearLastModified non-join behavior. |
| packages/realm-server/tests/full-reindex-test.ts | Updates expected job args to include clearLastModified: true for full-reindex enqueues. |
Summary
Defense-in-depth fix from CS-11157. Independent of PR #4849 (the realm-creation fix), but useful any time two from-scratch publishes for the same realm race a worker claim.
`chooseFromScratchCoalesceDecision` (`packages/runtime-common/tasks/indexer.ts`) previously consulted only the pending `candidates` bucket. If a worker claimed the first row between two near-simultaneous publishes for the same realm, the row moved out of `candidates` into `inFlightCandidates` before the second publish's coalesce ran, so the pending lookup found nothing and the second publish minted a fresh row at its own priority, even though the in-flight job was going to produce exactly the result the second caller wanted.

`chooseIncrementalCoalesceDecision` in the same file already has an in-flight fallback (`packages/runtime-common/tasks/indexer.ts:220-234`). This PR adds the equivalent for from-scratch. The from-scratch case is simpler than incremental: same concurrency group + same `jobType` is sufficient because a from-scratch reindex subsumes any other from-scratch for that realm by definition; no per-args coverage check needed.

Test plan
- `pnpm lint:types` clean
- `pnpm lint:js` clean on touched files
- New regression test in `packages/realm-server/tests/queue-test.ts`: publish a from-scratch job, wait for the worker to actually claim it (moving it from `candidates` into `inFlightCandidates`), then publish a second from-scratch for the same realm. Asserts: `second.id === first.id` (waiter attaches to the running job)
- Locally: all 30 queue tests pass (`TEST_MODULES=queue-test.ts ./tests/scripts/run-qunit-with-test-pg.sh`).
- Full realm-server suite in CI.
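The publish, claim, publish sequence exercised by the regression test can be reproduced against a toy in-memory queue. The sketch below is not the realm-server queue or its test harness: `TinyQueue`, its methods, and the group name are hypothetical stand-ins that only model the two buckets as a `status` field.

```typescript
// Toy model only: a row is either waiting or claimed by a worker.
type TinyStatus = 'pending' | 'in-flight';

interface TinyRow {
  id: number;
  jobType: string;
  group: string;
  status: TinyStatus;
}

class TinyQueue {
  private rows: TinyRow[] = [];
  private nextId = 1;

  // Publish coalesces onto a pending row first, then onto an in-flight row,
  // and only inserts a new row when neither exists.
  publish(jobType: string, group: string): number {
    const sameJob = (r: TinyRow): boolean =>
      r.jobType === jobType && r.group === group;
    const existing =
      this.rows.find((r) => sameJob(r) && r.status === 'pending') ||
      this.rows.find((r) => sameJob(r) && r.status === 'in-flight');
    if (existing) {
      return existing.id;
    }
    const row: TinyRow = { id: this.nextId++, jobType, group, status: 'pending' };
    this.rows.push(row);
    return row.id;
  }

  // A worker claim moves the oldest pending row to in-flight.
  claim(): TinyRow | undefined {
    const row = this.rows.find((r) => r.status === 'pending');
    if (row) {
      row.status = 'in-flight';
    }
    return row;
  }

  rowsInGroup(group: string): TinyRow[] {
    return this.rows.filter((r) => r.group === group);
  }
}

// The race from the test plan: the claim lands between the two publishes.
const tinyQueue = new TinyQueue();
const firstId = tinyQueue.publish('from-scratch-index', 'realm-a-group');
tinyQueue.claim();
const secondId = tinyQueue.publish('from-scratch-index', 'realm-a-group');
const groupRowCount = tinyQueue.rowsInGroup('realm-a-group').length;
```

In this toy model the second publish returns the first row's id and the group still holds a single row, which is the shape of the assertion the real test makes.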
Relationship to PR #4849
PR #4849 (stacked on the refactor #4846) prevents the realm-creation flow from ever creating a second from-scratch enqueue in the first place, by mounting through `lookupOrMount(..., { fromScratchIndexPriority: userInitiatedPriority })` so the realm's own `#startup` produces the single canonical job. That change is sufficient for the realm-creation hang.

This PR is the orthogonal hardening: any other code path that enqueues a from-scratch and then races a worker claim against another publish (e.g. `handle-publish-realm.ts` and the `realm.start()` enqueue it triggers via `lookupOrMount`) gets correct coalesce behaviour without each call site needing to thread a priority through.

🤖 Generated with Claude Code