
feat(mcp-server): persist async job registry across restarts (#237) #251

Merged
Neftedollar merged 1 commit into master from feat/237-p0-persist-async-job-registry-and-recover-jobs-after-server-restart on Apr 20, 2026

@Neftedollar

Closes #237

What changed

  • Added pluggable run persistence to @ageflow/server via RunStore and a snapshot hydration/recovery path.
  • Added a new package, @ageflow/server-sqlite, with a SQLite-backed RunStore implementation (supports both the Bun and Node runtimes).
  • Wired @ageflow/mcp-server async jobs to a durable job store:
    • new jobStore abstraction and SQLite loader
    • jobDbPath support in the server and programmatic paths
    • the async job lifecycle is persisted and recoverable after a restart
  • Extended the CLI mcp serve command to expose the durable async job store option (--job-db).
  • Updated the release pipeline's publish order to include @ageflow/server-sqlite.
  • Synced docs/specs/README to reflect that persistence is now available as an opt-in (previously jobs were in-memory only).
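
To illustrate the pluggable-store idea described above, here is a minimal TypeScript sketch. The `RunRecord` shape, the method names `put`/`get`/`listUnfinished`, and the helper `recoverRunIds` are assumptions made for illustration only; the actual RunStore contract in @ageflow/server may look different (for example, it is likely asynchronous).

```typescript
// Hypothetical shapes for illustration; not the real @ageflow/server API.
type RunState = "running" | "awaiting-checkpoint" | "done" | "failed";

interface RunRecord {
  runId: string;
  state: RunState;
  input?: unknown; // original request input, needed to replay after restart
  updatedAt: number;
}

// A synchronous store contract keeps the sketch short.
interface RunStore {
  put(record: RunRecord): void;
  get(runId: string): RunRecord | undefined;
  listUnfinished(): RunRecord[];
}

// In-memory implementation; a SQLite-backed store would satisfy the same contract.
class InMemoryRunStore implements RunStore {
  private readonly rows = new Map<string, RunRecord>();

  put(record: RunRecord): void {
    // Store a copy so later mutation by the caller does not change the row.
    this.rows.set(record.runId, { ...record });
  }

  get(runId: string): RunRecord | undefined {
    const row = this.rows.get(runId);
    return row ? { ...row } : undefined;
  }

  listUnfinished(): RunRecord[] {
    return Array.from(this.rows.values())
      .filter((r) => r.state === "running" || r.state === "awaiting-checkpoint")
      .map((r) => ({ ...r }));
  }
}

// On startup, pick the runs that were still in flight when the process stopped.
function recoverRunIds(store: RunStore): string[] {
  return store
    .listUnfinished()
    .map((r) => r.runId)
    .sort();
}

const store = new InMemoryRunStore();
store.put({ runId: "a", state: "running", input: { n: 1 }, updatedAt: 1 });
store.put({ runId: "b", state: "done", updatedAt: 2 });
store.put({ runId: "c", state: "awaiting-checkpoint", updatedAt: 3 });

const toRecover = recoverRunIds(store);
console.log(toRecover);
```

Because the runner only depends on the store contract, swapping the in-memory map for a durable backend would not require changes to the recovery logic in this sketch.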

Tests

  • bun run --filter @ageflow/server typecheck
  • bun run --filter @ageflow/server test
  • bun run --filter @ageflow/server-sqlite typecheck
  • bun run --filter @ageflow/server-sqlite test
  • bun run --filter @ageflow/mcp-server typecheck
  • bun run --filter @ageflow/mcp-server test
  • bun run --filter @ageflow/cli typecheck
  • bun run --filter @ageflow/cli test

All passed locally.

@Neftedollar merged commit 435e72b into master on Apr 20, 2026
1 of 3 checks passed

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 041adceb07

});

jobId = handle.runId;
persistSnapshot(ctx, jobId, handle);

P1: Preserve original run input when writing async snapshots

dispatchStart immediately persists the synchronous handle returned by runner.fire, but that snapshot does not include the original request input, so this write can overwrite the runner’s own store row with an input-less record. Recovery later depends on record.input (packages/server/src/runner.ts, recover) to replay in-flight jobs, so a restart during a still-running job can replay with undefined input (or fallback static task input), producing wrong outputs or validation failures after restart.
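
One possible way to avoid the overwrite is to merge the snapshot with the existing row instead of replacing it. The sketch below is only an illustration under assumptions: `StoredRun`, `HandleSnapshot`, and `mergeSnapshot` are hypothetical names, not the types or helpers used in the PR.

```typescript
// Hypothetical shapes for illustration; not the actual types in the PR.
interface StoredRun {
  runId: string;
  state: string;
  input?: unknown;
}

// The snapshot taken from the synchronous handle carries no input.
interface HandleSnapshot {
  runId: string;
  state: string;
}

// Merge rather than replace: keep the input from the request if provided,
// otherwise fall back to whatever input the existing row already holds.
function mergeSnapshot(
  existing: StoredRun | undefined,
  snapshot: HandleSnapshot,
  requestInput?: unknown,
): StoredRun {
  const input = requestInput !== undefined ? requestInput : existing?.input;
  const merged: StoredRun = { runId: snapshot.runId, state: snapshot.state };
  if (input !== undefined) {
    merged.input = input;
  }
  return merged;
}

// The runner has already written a row that includes the input.
const rows = new Map<string, StoredRun>();
rows.set("run-1", { runId: "run-1", state: "running", input: { topic: "x" } });

// Naive write: the input-less snapshot replaces the row and loses the input.
const naive: StoredRun = { runId: "run-1", state: "running" };

// Merged write: the input survives the snapshot update.
const safe = mergeSnapshot(rows.get("run-1"), { runId: "run-1", state: "running" });
rows.set("run-1", safe);

console.log(naive.input, safe.input);
```

An alternative with the same effect would be to pass the request input into the snapshot writer directly, so the row is complete regardless of write order.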

Comment on lines +188 to +189
if (h.pendingCheckpoint.recoveredFromStore === true) {
  throw new InvalidRunStateError(runId, h.state);

P2: Don't expose checkpoint state that cannot be resumed yet

Recovered checkpoint runs are rehydrated as state === "awaiting-checkpoint", but resume() explicitly rejects them while pendingCheckpoint.recoveredFromStore is true. After restart, clients can observe awaiting-checkpoint from status and still get INVALID_RUN_STATE from resume_workflow; if replay takes a long time to hit the checkpoint again, this becomes a prolonged false-ready state.
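
One way to close this gap is to derive the externally reported state from the same predicate that resume() uses, so status and resume cannot disagree. The sketch below is an assumption-laden illustration: the `Handle` shape, the `canResume` and `reportedState` helpers, and the `"recovering"` label are hypothetical and do not appear in the PR.

```typescript
// Hypothetical shapes for illustration; not the actual types in the PR.
interface PendingCheckpoint {
  recoveredFromStore?: boolean;
}

interface Handle {
  state: "running" | "awaiting-checkpoint" | "done";
  pendingCheckpoint?: PendingCheckpoint;
}

// Single source of truth: a run is resumable only at a live checkpoint,
// not at one that was merely rehydrated from the store.
function canResume(h: Handle): boolean {
  return (
    h.state === "awaiting-checkpoint" &&
    h.pendingCheckpoint !== undefined &&
    h.pendingCheckpoint.recoveredFromStore !== true
  );
}

// "recovering" is an invented label for a checkpoint that is being replayed.
type ReportedState = Handle["state"] | "recovering";

// Status reporting reuses canResume, so clients never see a ready state
// that resume would reject.
function reportedState(h: Handle): ReportedState {
  if (h.state === "awaiting-checkpoint" && !canResume(h)) {
    return "recovering";
  }
  return h.state;
}

const recovered: Handle = {
  state: "awaiting-checkpoint",
  pendingCheckpoint: { recoveredFromStore: true },
};
const live: Handle = {
  state: "awaiting-checkpoint",
  pendingCheckpoint: {},
};

console.log(reportedState(recovered), reportedState(live));
```

Whether a distinct label or simply reporting the run as still running is preferable depends on what clients already handle; the point of the sketch is only that both code paths consult one predicate.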


Development

Successfully merging this pull request may close these issues.

P0: Persist async job registry and recover jobs after server restart
