from-scratch coalesce: fall back to in-flight candidates (CS-11157) by habdelra · Pull Request #4850 · cardstack/boxel

habdelra · 2026-05-15T14:40:07Z

Summary

Defense-in-depth fix from CS-11157. Independent of PR #4849 (the realm-creation fix), but useful any time two from-scratch publishes for the same realm race a worker claim.

chooseFromScratchCoalesceDecision (packages/runtime-common/tasks/indexer.ts) previously consulted only the pending candidates bucket. If a worker claimed the first row between two near-simultaneous publishes for the same realm, the row moved out of candidates into inFlightCandidates before the second publish's coalesce ran — so the pending lookup found nothing and the second publish minted a fresh row at its own priority, even though the in-flight job was going to produce exactly the result the second caller wanted.

chooseIncrementalCoalesceDecision in the same file already has an in-flight fallback (packages/runtime-common/tasks/indexer.ts:220-234). This PR adds the equivalent for from-scratch. The from-scratch case is simpler than incremental: same concurrency group + same jobType is sufficient because a from-scratch reindex subsumes any other from-scratch for that realm by definition — no per-args coverage check needed.

Test plan

pnpm lint:types clean
pnpm lint:js clean on touched files
New regression test in packages/realm-server/tests/queue-test.ts: publish a from-scratch job, wait for the worker to actually claim it (moving it from candidates into inFlightCandidates), then publish a second from-scratch for the same realm. Asserts:
- second.id === first.id (waiter attaches to the running job)
- exactly one row in the concurrency group
Locally: all 30 queue tests pass (TEST_MODULES=queue-test.ts ./tests/scripts/run-qunit-with-test-pg.sh).
Full realm-server suite in CI.

Relationship to PR #4849

PR #4849 (stacked on the refactor #4846) prevents the realm-creation flow from ever creating a second from-scratch enqueue in the first place, by mounting through lookupOrMount(..., { fromScratchIndexPriority: userInitiatedPriority }) so the realm's own #startup produces the single canonical job. That change is sufficient for the realm-creation hang.

This PR is the orthogonal hardening: any other code path that enqueues a from-scratch and then races a worker claim against another publish (e.g. handle-publish-realm.ts and the realm.start() enqueue it triggers via lookupOrMount) gets correct coalesce behaviour without each call site needing to thread a priority through.

🤖 Generated with Claude Code

chooseFromScratchCoalesceDecision previously consulted only the pending bucket (candidates). A worker claim arriving between two near-simultaneous from-scratch publishes for the same realm moved the first row into inFlightCandidates, so the second publish's pending lookup found nothing and minted a fresh row at its own priority — even though the in-flight job would produce exactly the result the second caller wanted. Mirror chooseIncrementalCoalesceDecision's in-flight fallback. The from-scratch case is simpler than incremental: same concurrency group + same jobType is sufficient because a from-scratch reindex subsumes any other from-scratch for that realm by definition; no per-args coverage check is needed. Regression test: publish first job, wait for the worker to claim (moving it to in-flight), then publish a second. Assert second.id === first.id and exactly one row exists in the concurrency group. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1fb17b7e62

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

The in-flight fallback added in the previous commit is unsafe for a publish that has just nulled boxel_index.last_modified for the realm (handle-publish-realm, handle-reindex, full-reindex). The running job's mtimes snapshot pre-dates the clear, so attaching the clearing publish to it would let the caller observe a successful job that did NOT re-render the swapped files. Surface the clearing intent as args.clearLastModified on from-scratch-index jobs (always present; required field on FromScratchArgs so the JSON-shape index signature on WorkerArgs is satisfied). The coalesce decision checks the flag and forces a fresh row instead of joining an in-flight candidate when set. Pending coalesce is unchanged: a pending join is still safe because the pending job hasn't read its mtimes snapshot yet. Regression test: publish a from-scratch and wait for the worker to claim it; publish a second with clearLastModified=true; assert the second got its own row (not attached to the running job). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ex args The grafana reindex path uses clearLastModified: true so every row in boxel_index re-renders even when its mtime hasn't changed. The flag is now surfaced in args (so the from-scratch coalesce can refuse to attach this publish to a running same-realm from-scratch), which means the strict args shape check has to include it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-15T15:23:09Z

Host Test Results

1 files ±0 1 suites ±0 1h 45m 2s ⏱️ + 2m 2s
2 659 tests ±0 2 644 ✅ ±0 15 💤 ±0 0 ❌ ±0
2 678 runs ±0 2 663 ✅ ±0 15 💤 ±0 0 ❌ ±0

Results for commit 67bf0ec. ± Comparison against earlier commit a7f2108.

Realm Server Test Results

1 files ±0 1 suites ±0 9m 36s ⏱️ +22s
1 379 tests ±0 1 379 ✅ +1 0 💤 ±0 0 ❌ - 1
1 460 runs ±0 1 460 ✅ +1 0 💤 ±0 0 ❌ - 1

Results for commit 67bf0ec. ± Comparison against earlier commit a7f2108.

The full-reindex task enqueues from-scratch with clearLastModified: true. The flag is now surfaced in args (so the from-scratch coalesce can refuse to attach this kind of publish to a running same-realm from-scratch), which means the strict args shape check has to include it for both the source and published realm jobs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Hardens queue coalescing for from-scratch-index jobs so a second near-simultaneous publish for the same realm can correctly attach to an already-claimed (in-flight) job, avoiding duplicate rows and priority skew during worker-claim races (CS-11157).

Changes:

Add an inFlightCandidates fallback to chooseFromScratchCoalesceDecision, mirroring the existing incremental behavior.
Introduce/propagate clearLastModified in from-scratch job args to prevent incorrectly joining a “forced refresh” publish onto an already-running job whose mtime snapshot predates the DB clear.
Add queue regression tests to cover both the in-flight dedup path and the clearLastModified exception.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
packages/runtime-common/tasks/indexer.ts	Adds in-flight fallback coalescing for from-scratch jobs and introduces `clearLastModified` args flag + guard.
packages/runtime-common/jobs/reindex-realm.ts	Plumbs `clearLastModified` into published from-scratch job args (and normalizes the clear condition).
packages/realm-server/tests/server-endpoints/maintenance-endpoints-test.ts	Updates expected job args to include `clearLastModified: true` for Grafana-triggered reindex.
packages/realm-server/tests/queue-test.ts	Adds regression tests for in-flight from-scratch dedup and for the `clearLastModified` non-join behavior.
packages/realm-server/tests/full-reindex-test.ts	Updates expected job args to include `clearLastModified: true` for full-reindex enqueues.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

chatgpt-codex-connector Bot reviewed May 15, 2026

View reviewed changes

Comment thread packages/runtime-common/tasks/indexer.ts Outdated

habdelra and others added 2 commits May 15, 2026 10:49

habdelra requested review from a team and Copilot May 15, 2026 16:39

Copilot started reviewing on behalf of habdelra May 15, 2026 16:42 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

from-scratch coalesce: fall back to in-flight candidates (CS-11157)#4850

from-scratch coalesce: fall back to in-flight candidates (CS-11157)#4850
habdelra wants to merge 4 commits into
mainfrom
cs-11157-coalesce-in-flight

habdelra commented May 15, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

github-actions Bot commented May 15, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

habdelra commented May 15, 2026

Summary

Test plan

Relationship to PR #4849

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

github-actions Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Host Test Results

Realm Server Test Results

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented May 15, 2026 •

edited

Loading