Add workflow restart and replay reuse #639

Merged
AbirAbbas merged 12 commits into main from codex/workflow-restart-fork-replay
Jun 22, 2026

@santoshkumarradha

@santoshkumarradha santoshkumarradha commented Jun 9, 2026

Member

Summary

  • add a control-plane restart endpoint for executions with workflow/node scope, replay reuse, fork metadata, and run lineage
  • expose restart from CLI, Python SDK, TypeScript SDK, run detail, DAG node sidebar, and run lifecycle menus
  • surface replay, golden, and fork/restart lineage in run and DAG APIs without adding user-authored replay APIs

Product design

  • v1 gives failed workflow recovery: restart from the failed run or from a specific node, reusing already-successful app.call outputs where the control plane has persisted inputs/outputs
  • v1.5 opens comparison and governance workflows without extra product surface: fork with changed input/model, mark a good run golden, and inspect which nodes were reused vs rerun
  • default recovery is one-click Restart run from the run detail header or row menu; advanced paths stay behind existing overflow/menu patterns
  • Fork with changes is intentional branching, not a separate dashboard; it uses the same restart primitive with changed input/reuse settings
  • run detail and run list show small Golden plus Restarted/Forked metadata chips so users can see lineage without another table column
  • reused nodes are marked only where the user scans execution behavior: compact branch markers in trace rows and graph nodes, plus one source execution line in the graph node sidebar
  • new UI accents use existing statusTone theme tokens; no standalone color palette was added
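
The reuse rule in the first bullet can be sketched as a simple partition. This is a minimal illustration, not the control plane's implementation: `NodeRecord`, `plan_restart`, and the status strings are hypothetical names, and the only rule taken from the description is that a node is reused when it succeeded and its output was persisted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeRecord:
    """Hypothetical view of one app.call execution from the prior run."""
    node_id: str
    status: str                       # assumed values: "succeeded", "failed"
    persisted_output: Optional[dict]  # None when nothing was stored


def plan_restart(records: list[NodeRecord]) -> tuple[list[str], list[str]]:
    """Split the prior run's nodes into (reused, rerun) for a restart."""
    reused, rerun = [], []
    for rec in records:
        # Reuse needs both a successful status and a stored output.
        if rec.status == "succeeded" and rec.persisted_output is not None:
            reused.append(rec.node_id)
        else:
            rerun.append(rec.node_id)
    return reused, rerun


prior_run = [
    NodeRecord("fetch", "succeeded", {"rows": 3}),
    NodeRecord("summarize", "succeeded", None),  # output was not persisted
    NodeRecord("publish", "failed", None),
]
reused, rerun = plan_restart(prior_run)
```

Here `fetch` is reused, while `summarize` and `publish` are rerun: one has no stored output and the other failed.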

Docs PR

Website docs/screenshots are in https://github.com/Agent-Field/website2.0/pull/27. Merge that after this platform PR so the public docs only describe released API/UI behavior.

Screenshots

Golden restarted run detail

Golden-only runs list with lineage chip

DAG graph for restarted run

Reused node sidebar provenance

Restart actions menu

Validation

  • cd control-plane && go test ./internal/handlers ./internal/handlers/ui ./internal/server ./internal/cli
  • cd control-plane && go test ./internal/handlers -run 'TestExecute.*Replay|TestRestartExecutionHandler|TestExecutionReuseInfo'
  • cd control-plane/web/client && npm test -- --run src/components/runs/RunLifecycleMenu.test.tsx src/test/pages/RunsPage.test.tsx src/test/pages/RunDetailPage.test.tsx src/test/components/WorkflowDAG/nodeDetailSidebar.test.tsx src/test/components/RunTrace.test.tsx -> 41 passed
  • cd control-plane/web/client && npm run build
  • ./scripts/coverage-surface.sh web-ui -> 128 files / 665 tests passed
  • cd sdk/typescript && npm run lint && npm run build
  • python3 -m py_compile sdk/python/agentfield/async_execution_manager.py sdk/python/tests/test_async_execution_manager_final90.py tests/functional/agents/restart_replay_agent.py tests/functional/tests/test_restart_replay.py
  • cd sdk/python && python3 -m pytest tests/test_async_execution_manager_final90.py::test_update_execution_from_status_populates_same_status_success tests/test_execution_context_core.py -q -> 11 passed
  • Docker restart/replay E2E: OPENROUTER_MODEL=openrouter/google/gemini-3.1-flash-lite PYTEST_ARGS="-v -n 1 tests/test_restart_replay.py::test_restart_reuses_successful_calls_and_continues_complex_openrouter_graph" docker compose -f docker/docker-compose.local.yml -f /tmp/agentfield-compose-no-host-ports.yml up --build --abort-on-container-exit --exit-code-from test-runner -> 1 passed in 13.41s
  • OpenRouter provider smoke: google/gemini-3.1-flash-lite chat completions request with local OPENROUTER_API_KEY -> 200 ok

Note: the restart/replay E2E graph uses deterministic checkpoints so replay correctness is not coupled to provider latency; the separate OpenRouter smoke verifies the requested model/key path.

@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

Performance

| SDK | Memory | Δ | Latency | Δ | Tests | Status |
|---|---|---|---|---|---|---|
| Python | 9.4 KB | +4% | 0.34 µs | -3% | | |
| TS | 355 B | +1% | 1.71 µs | -15% | | |

✓ No regressions detected

@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

📊 Coverage gate

Thresholds from .coverage-gate.toml: per-surface ≥ 84%, aggregate ≥ 85%, max per-surface regression ≤ 1.0 pp, max aggregate regression ≤ 0.50 pp.

| Surface | Current | Baseline | Δ |
|---|---|---|---|
| control-plane | 87.00% | 87.40% | ↓ -0.40 pp 🟡 |
| sdk-go | 91.80% | 92.00% | ↓ -0.20 pp 🟢 |
| sdk-python | 93.87% | 93.73% | ↑ +0.14 pp 🟢 |
| sdk-typescript | 90.05% | 90.42% | ↓ -0.37 pp 🟢 |
| web-ui | 84.83% | 84.79% | ↑ +0.04 pp 🟡 |
| aggregate | 85.63% | 85.75% | ↓ -0.12 pp 🟡 |

✅ Gate passed

No surface regressed past the allowed threshold and the aggregate stayed above the floor.
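
For readers checking the numbers by hand, the four thresholds quoted in this comment can be applied as below. This is a sketch of the arithmetic only: `Thresholds` and `gate_failures` are hypothetical names, and the real gate script and the layout of .coverage-gate.toml are not shown in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    min_surface: float = 84.0         # per-surface floor, percent
    min_aggregate: float = 85.0       # aggregate floor, percent
    max_surface_drop: float = 1.0     # allowed per-surface regression, pp
    max_aggregate_drop: float = 0.50  # allowed aggregate regression, pp


def gate_failures(surfaces, aggregate, t=Thresholds()):
    """Return failure messages; an empty list means the gate passes.

    surfaces maps name -> (current, baseline); aggregate is (current, baseline).
    """
    failures = []
    for name, (current, baseline) in surfaces.items():
        if current < t.min_surface:
            failures.append(f"{name}: {current:.2f}% is under the floor")
        if baseline - current > t.max_surface_drop:
            failures.append(f"{name}: dropped {baseline - current:.2f} pp")
    agg_current, agg_baseline = aggregate
    if agg_current < t.min_aggregate:
        failures.append(f"aggregate: {agg_current:.2f}% is under the floor")
    if agg_baseline - agg_current > t.max_aggregate_drop:
        failures.append(f"aggregate: dropped {agg_baseline - agg_current:.2f} pp")
    return failures


# Values copied from the table in this comment.
surfaces = {
    "control-plane": (87.00, 87.40),
    "sdk-go": (91.80, 92.00),
    "sdk-python": (93.87, 93.73),
    "sdk-typescript": (90.05, 90.42),
    "web-ui": (84.83, 84.79),
}
failures = gate_failures(surfaces, aggregate=(85.63, 85.75))
```

With these inputs `failures` is empty, which matches the reported result. The tightest margins are web-ui (0.83 pp above its floor) and the aggregate (0.63 pp above its floor).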

@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

📐 Patch coverage gate

Threshold: 80% on lines this PR touches vs origin/main (from .coverage-gate.toml:thresholds.min_patch).

| Surface | Touched lines | Patch coverage | Status |
|---|---|---|---|
| control-plane | 758 | 81.00% | |
| sdk-go | 0 | ➖ | no changes |
| sdk-python | 42 | 100.00% | |
| sdk-typescript | 21 | 95.00% | |
| web-ui | 360 | 86.00% | |

✅ Patch gate passed

Every surface whose lines were touched by this PR has patch coverage at or above the threshold.
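
The patch rule differs from the surface gate in one respect: a surface with no touched lines is skipped rather than scored. A minimal sketch of that rule follows, assuming a hypothetical `patch_gate_failures` helper; the actual gate script is not part of this thread.

```python
def patch_gate_failures(surfaces: dict, min_patch: float = 80.0) -> list:
    """Return the surfaces whose patch coverage is below min_patch.

    surfaces maps name -> (touched_lines, patch_coverage_percent or None).
    """
    failing = []
    for name, (touched, coverage) in surfaces.items():
        if touched == 0:
            continue  # nothing touched by the PR, so nothing to gate
        if coverage is None or coverage < min_patch:
            failing.append(name)
    return failing


# Values copied from the table in this comment.
patch = {
    "control-plane": (758, 81.00),
    "sdk-go": (0, None),
    "sdk-python": (42, 100.00),
    "sdk-typescript": (21, 95.00),
    "web-ui": (360, 86.00),
}
failing = patch_gate_failures(patch)
```

`failing` is empty here; control-plane clears the 80% threshold by one point.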

@santoshkumarradha santoshkumarradha force-pushed the codex/workflow-restart-fork-replay branch from d9f5a54 to c6a37ec Compare June 9, 2026 18:13
@santoshkumarradha santoshkumarradha marked this pull request as ready for review June 9, 2026 19:56
@santoshkumarradha santoshkumarradha requested review from a team and AbirAbbas as code owners June 9, 2026 19:56
@santoshkumarradha

Member Author

@AbirAbbas this is ready for review.

Platform PR #639 adds restart/fork recovery from a failed workflow point with replay reuse for already-succeeded app.call outputs, plus minimal run/detail/graph provenance so users can see which nodes reused prior output.

Docs website PR should merge after this lands: https://github.com/Agent-Field/website2.0/pull/27

pr-af (https://github.com/Agent-Field/pr-af) reviewed the workflow
restart/replay change and recommended these follow-ups, applied here:

- Back-fill ExecutionReuseInfo.source_run_id from the run lineage so the
  reused-node sidebar and graph provenance show the source run, not just the
  source execution. Every reused node in a restarted run shares the run's
  single replay source, so it is taken from lineage rather than re-queried
  per node.
- Document that the restart workflow_runs row is a metadata-only sidecar:
  its status/total_steps are seeded at enqueue and are not kept current, and
  all read paths derive live status from execution aggregation.
- Document the replay-match contract in findReplayHit (keyed on
  node/reasoner/canonical input+context; earliest succeeded source wins;
  position/multiplicity agnostic).

Adds TestFillReuseSourceRun covering the back-fill behavior.
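
The back-fill described in the first item can be pictured as follows. This is an illustrative Python sketch, not the Go handler: `ReuseInfo`, its `source_execution_id` field, and `fill_reuse_source_run` are hypothetical stand-ins, and the only behavior taken from the commit message is that a missing source_run_id is filled from the run's single lineage source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReuseInfo:
    """Hypothetical mirror of ExecutionReuseInfo; only two fields are modeled."""
    source_execution_id: str
    source_run_id: Optional[str] = None


def fill_reuse_source_run(infos: list, lineage_source_run_id: Optional[str]) -> None:
    """Fill missing source_run_id values from the run lineage, in place."""
    if not lineage_source_run_id:
        return  # the run has no replay source, so there is nothing to fill
    for info in infos:
        # Assumption: a value that is already set is left untouched.
        if info.source_run_id is None:
            info.source_run_id = lineage_source_run_id


infos = [ReuseInfo("exec-1"), ReuseInfo("exec-2", source_run_id="run-keep")]
fill_reuse_source_run(infos, "run-src")
```

Because every reused node shares one source run, a single lineage lookup covers all nodes and no per-node query is needed.
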
@CLAassistant

CLAassistant commented Jun 9, 2026


CLA assistant check
All committers have signed the CLA.

Member Author

Code review follow-ups (recommended by pr-af)

pr-af reviewed the workflow restart/replay change and flagged a few tightening items. Pushed in 87eb67f:

  • Reuse provenance now shows the source run. ExecutionReuseInfo.source_run_id was declared (and the node sidebar/graph render it) but the backend never populated it. It's now back-filled from the run lineage — every reused node in a restarted run shares the run's single replay source, so it's taken from lineage rather than re-queried per node.
  • workflow_runs restart row documented as a metadata-only sidecar. Its status/total_steps are seeded at enqueue and intentionally not kept current; every read path (run list, run detail, DAG) derives live status from execution aggregation and only reads the lineage/golden fields. Comment added so the columns aren't mistaken for authoritative.
  • Replay-match contract documented on findReplayHit: keyed on (node, reasoner, canonical input+context), earliest succeeded source wins, position/multiplicity agnostic — correct for deterministic graphs; vary input/context or use reuse=none when a distinct result per identical call is needed.
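
The replay-match contract in the last bullet can be sketched like this. It is an illustration in Python rather than the Go code in findReplayHit: `SourceExecution`, `replay_key`, `find_replay_hit`, the `started_at` ordering field, and the choice of sorted-key JSON as the canonical form are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceExecution:
    """Hypothetical record of one execution in the source run."""
    execution_id: str
    node: str
    reasoner: str
    input: dict
    context: dict
    status: str
    started_at: float  # any orderable timestamp


def replay_key(node: str, reasoner: str, input: dict, context: dict) -> tuple:
    # Canonical form: sorted keys and fixed separators, so dict ordering
    # does not change the key.
    canonical = json.dumps({"input": input, "context": context},
                           sort_keys=True, separators=(",", ":"))
    return (node, reasoner, canonical)


def find_replay_hit(sources, node, reasoner, input, context) -> Optional[SourceExecution]:
    wanted = replay_key(node, reasoner, input, context)
    hits = [s for s in sources
            if s.status == "succeeded"
            and replay_key(s.node, s.reasoner, s.input, s.context) == wanted]
    # Earliest succeeded source wins; call position and repeat count are ignored.
    return min(hits, key=lambda s: s.started_at, default=None)


sources = [
    SourceExecution("e2", "n", "r", {"a": 1, "b": 2}, {}, "succeeded", 2.0),
    SourceExecution("e1", "n", "r", {"b": 2, "a": 1}, {}, "succeeded", 1.0),
    SourceExecution("e0", "n", "r", {"a": 1, "b": 2}, {}, "failed", 0.5),
]
hit = find_replay_hit(sources, "n", "r", {"a": 1, "b": 2}, {})
miss = find_replay_hit(sources, "n", "r", {"a": 2}, {})
```

`hit` is `e1`: the failed `e0` is ignored even though it started first, and `e1` beats `e2` on start time. Two identical calls in one graph would both resolve to `e1`, which is why the bullet recommends varying input/context or using reuse=none when each call needs a distinct result.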

Added TestFillReuseSourceRun for the back-fill. go build, go vet, and go test ./internal/handlers are green locally.

Two larger review items were intentionally left out of this commit for separate discussion: restart drops caller/target DID so restarted/forked runs have no VC provenance chain, and fork replaces input/context wholesale (--model drops other context keys).
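
To make the second deferred item concrete, the snippet below contrasts wholesale replacement with a shallow merge. The context keys are invented for illustration, and the merge is shown only as one option for the later discussion, not as the chosen fix.

```python
# Hypothetical context carried by the original run.
original_context = {"model": "model-a", "temperature": 0.2, "tenant": "acme"}

# A fork that only changes the model.
override = {"model": "model-b"}

# Wholesale replacement (the behavior flagged in review):
# every key not named in the override is lost.
replaced = dict(override)

# Shallow merge: the override wins per key, untouched keys are kept.
merged = {**original_context, **override}
```

After replacement only `model` remains, while the merge keeps `temperature` and `tenant` alongside the new model.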


Generated by Claude Code

@santoshkumarradha

Member Author

@AbirAbbas pushed a final UI polish pass after review: trace/graph reuse markers are now compact branch indicators instead of text badges, restart/golden chips use existing badge typography, restart colors use statusTone tokens, and the screenshots/PR body were refreshed.

All GitHub workflow checks are passing on the latest push. The only remaining pending status is the external CLA assistant.

@santoshkumarradha

Member Author

@AbirAbbas this one is gtg as well

@santoshkumarradha santoshkumarradha left a comment

Member Author

I found one issue in the Python SDK context propagation path before I can approve this.

ExecutionContext.to_headers() is still forwarding parent_execution_id as X-Parent-Execution-ID when that field is present. For child node calls, the control plane expects X-Execution-ID and X-Parent-Execution-ID to both identify the current execution, because it allocates a fresh child execution ID per hop and uses the parent header to point back to the immediate caller. A local follow-up on top of this branch already changes that header to self.execution_id, which suggests the current PR head is still on the wrong side of that contract.

Please update the Python SDK header mapping and keep the matching test expectation with it. After that I can recheck quickly.
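
The requested mapping can be sketched as below. `ExecutionContextSketch` is a hypothetical stand-in with only two fields, not the SDK's ExecutionContext, and the contract it encodes is the one stated in this comment: on an outbound child call both headers carry the current execution ID.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecutionContextSketch:
    """Hypothetical stand-in for the SDK's ExecutionContext."""
    execution_id: str
    parent_execution_id: Optional[str] = None

    def to_headers(self) -> dict:
        # Outbound child call: the current execution is the immediate caller,
        # so it is sent in both headers. The control plane is assumed to
        # allocate the child's own execution ID on receipt.
        return {
            "X-Execution-ID": self.execution_id,
            "X-Parent-Execution-ID": self.execution_id,
        }


ctx = ExecutionContextSketch("exec-b", parent_execution_id="exec-a")
headers = ctx.to_headers()
```

With this mapping the grandparent ID `exec-a` is never forwarded, so the child's parent pointer resolves to `exec-b`, the immediate caller.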

@AbirAbbas

Contributor

taking over this PR

…t-fork-replay

# Conflicts:
#	control-plane/internal/handlers/workflow_dag.go
#	control-plane/web/client/src/components/RunTrace.tsx
#	control-plane/web/client/src/components/WorkflowDAG/WorkflowNode.tsx
#	control-plane/web/client/src/components/WorkflowDAG/workflowDagUtils.ts
@AbirAbbas AbirAbbas enabled auto-merge June 22, 2026 14:07
@AbirAbbas AbirAbbas added this pull request to the merge queue Jun 22, 2026
Merged via the queue into main with commit e4d8596 Jun 22, 2026
37 checks passed
@AbirAbbas AbirAbbas deleted the codex/workflow-restart-fork-replay branch June 22, 2026 14:08

4 participants