Add workflow restart and replay reuse by santoshkumarradha · Pull Request #639 · Agent-Field/agentfield

santoshkumarradha · 2026-06-09T17:02:59Z

Summary

add a control-plane restart endpoint for executions with workflow/node scope, replay reuse, fork metadata, and run lineage
expose restart from CLI, Python SDK, TypeScript SDK, run detail, DAG node sidebar, and run lifecycle menus
surface replay, golden, and fork/restart lineage in run and DAG APIs without adding user-authored replay APIs

Product design

v1 gives failed workflow recovery: restart from the failed run or from a specific node, reusing already-successful app.call outputs where the control plane has persisted inputs/outputs
v1.5 opens comparison and governance workflows without extra product surface: fork with changed input/model, mark a good run golden, and inspect which nodes were reused vs rerun
default recovery is one-click Restart run from the run detail header or row menu; advanced paths stay behind existing overflow/menu patterns
Fork with changes is intentional branching, not a separate dashboard; it uses the same restart primitive with changed input/reuse settings
run detail and run list show small Golden plus Restarted/Forked metadata chips so users can see lineage without another table column
reused nodes are marked only where the user scans execution behavior: compact branch markers in trace rows and graph nodes, plus one source execution line in the graph node sidebar
new UI accents use existing statusTone theme tokens; no standalone color palette was added

Docs PR

Website docs/screenshots are in https://github.com/Agent-Field/website2.0/pull/27. Merge that after this platform PR so the public docs only describe released API/UI behavior.

Screenshots

Validation

cd control-plane && go test ./internal/handlers ./internal/handlers/ui ./internal/server ./internal/cli
cd control-plane && go test ./internal/handlers -run 'TestExecute.*Replay|TestRestartExecutionHandler|TestExecutionReuseInfo'
cd control-plane/web/client && npm test -- --run src/components/runs/RunLifecycleMenu.test.tsx src/test/pages/RunsPage.test.tsx src/test/pages/RunDetailPage.test.tsx src/test/components/WorkflowDAG/nodeDetailSidebar.test.tsx src/test/components/RunTrace.test.tsx -> 41 passed
cd control-plane/web/client && npm run build
./scripts/coverage-surface.sh web-ui -> 128 files / 665 tests passed
cd sdk/typescript && npm run lint && npm run build
python3 -m py_compile sdk/python/agentfield/async_execution_manager.py sdk/python/tests/test_async_execution_manager_final90.py tests/functional/agents/restart_replay_agent.py tests/functional/tests/test_restart_replay.py
cd sdk/python && python3 -m pytest tests/test_async_execution_manager_final90.py::test_update_execution_from_status_populates_same_status_success tests/test_execution_context_core.py -q -> 11 passed
Docker restart/replay E2E: OPENROUTER_MODEL=openrouter/google/gemini-3.1-flash-lite PYTEST_ARGS="-v -n 1 tests/test_restart_replay.py::test_restart_reuses_successful_calls_and_continues_complex_openrouter_graph" docker compose -f docker/docker-compose.local.yml -f /tmp/agentfield-compose-no-host-ports.yml up --build --abort-on-container-exit --exit-code-from test-runner -> 1 passed in 13.41s
OpenRouter provider smoke: google/gemini-3.1-flash-lite chat completions request with local OPENROUTER_API_KEY -> 200 ok

Note: the restart/replay E2E graph uses deterministic checkpoints so replay correctness is not coupled to provider latency; the separate OpenRouter smoke verifies the requested model/key path.

github-actions · 2026-06-09T17:05:02Z

Performance

SDK	Memory	Δ	Latency	Δ	Tests	Status
Python	9.4 KB	+4%	0.34 µs	-3%	✓	✓
TS	355 B	+1%	1.71 µs	-15%	✓	✓

✓ No regressions detected

github-actions · 2026-06-09T17:54:40Z

📊 Coverage gate

Thresholds from .coverage-gate.toml: per-surface ≥ 84%, aggregate ≥ 85%, max per-surface regression ≤ 1.0 pp, max aggregate regression ≤ 0.50 pp.

Surface	Current	Baseline	Δ	Status
`control-plane`	87.00%	87.40%	↓ -0.40 pp	🟡
`sdk-go`	91.80%	92.00%	↓ -0.20 pp	🟢
`sdk-python`	93.87%	93.73%	↑ +0.14 pp	🟢
`sdk-typescript`	90.05%	90.42%	↓ -0.37 pp	🟢
`web-ui`	84.83%	84.79%	↑ +0.04 pp	🟡
aggregate	85.63%	85.75%	↓ -0.12 pp	🟡

✅ Gate passed

No surface regressed past the allowed threshold and the aggregate stayed above the floor.
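The threshold arithmetic behind this report can be checked by hand. The sketch below is a minimal illustration, not the repository's gate script: the `Coverage` class and `gate_failures` function are hypothetical names, and only the four thresholds and the table values come from the comment above. It does not model how the 🟡/🟢 markers are chosen, since the report does not say.

```python
from dataclasses import dataclass

# Thresholds as quoted from .coverage-gate.toml in the report above.
MIN_SURFACE = 84.0            # per-surface floor, percent
MIN_AGGREGATE = 85.0          # aggregate floor, percent
MAX_SURFACE_DROP_PP = 1.0     # allowed per-surface regression, percentage points
MAX_AGGREGATE_DROP_PP = 0.50  # allowed aggregate regression, percentage points


@dataclass(frozen=True)
class Coverage:
    """Hypothetical container for one row of the report."""
    current: float
    baseline: float

    @property
    def drop_pp(self) -> float:
        # Positive means coverage fell; rounded to avoid float noise.
        return round(self.baseline - self.current, 2)


def gate_failures(surfaces, aggregate):
    """Return one message per violated threshold; an empty list means the gate passes."""
    failures = []
    for name, cov in surfaces.items():
        if cov.current < MIN_SURFACE:
            failures.append(f"{name}: {cov.current:.2f}% is below the floor")
        if cov.drop_pp > MAX_SURFACE_DROP_PP:
            failures.append(f"{name}: regressed {cov.drop_pp:.2f} pp")
    if aggregate.current < MIN_AGGREGATE:
        failures.append(f"aggregate: {aggregate.current:.2f}% is below the floor")
    if aggregate.drop_pp > MAX_AGGREGATE_DROP_PP:
        failures.append(f"aggregate: regressed {aggregate.drop_pp:.2f} pp")
    return failures


# Values copied from the table above.
SURFACES = {
    "control-plane": Coverage(87.00, 87.40),
    "sdk-go": Coverage(91.80, 92.00),
    "sdk-python": Coverage(93.87, 93.73),
    "sdk-typescript": Coverage(90.05, 90.42),
    "web-ui": Coverage(84.83, 84.79),
}
AGGREGATE = Coverage(85.63, 85.75)

print(gate_failures(SURFACES, AGGREGATE))
```

With the reported numbers no threshold is violated, which is consistent with the "Gate passed" result.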

github-actions · 2026-06-09T17:54:41Z

📐 Patch coverage gate

Threshold: 80% on lines this PR touches vs origin/main (from .coverage-gate.toml:thresholds.min_patch).

Surface	Touched lines	Patch coverage	Status
`control-plane`	758	81.00%	✅
`sdk-go`	0	—	➖ no changes
`sdk-python`	42	100.00%	✅
`sdk-typescript`	21	95.00%	✅
`web-ui`	360	86.00%	✅

✅ Patch gate passed

Every surface whose lines were touched by this PR has patch coverage at or above the threshold.
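The patch gate rule reads as a per-surface check that skips surfaces with no touched lines. A minimal sketch under that reading follows; `patch_status` and `patch_gate_passed` are hypothetical names, and the 80% threshold and row values are taken from the report above.

```python
from typing import Optional

MIN_PATCH = 80.0  # thresholds.min_patch as quoted in the report above


def patch_status(touched_lines: int, patch_coverage: Optional[float]) -> str:
    """Classify one surface; a surface with no touched lines is not gated."""
    if touched_lines == 0 or patch_coverage is None:
        return "no changes"
    return "pass" if patch_coverage >= MIN_PATCH else "fail"


def patch_gate_passed(report) -> bool:
    """The gate passes when no touched surface falls below the threshold."""
    return all(patch_status(t, c) != "fail" for t, c in report.values())


# (touched lines, patch coverage %) copied from the table above.
REPORT = {
    "control-plane": (758, 81.00),
    "sdk-go": (0, None),
    "sdk-python": (42, 100.00),
    "sdk-typescript": (21, 95.00),
    "web-ui": (360, 86.00),
}

print(patch_gate_passed(REPORT))
```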

santoshkumarradha · 2026-06-09T19:56:20Z

@AbirAbbas this is ready for review.

Platform PR #639 adds restart/fork recovery from a failed workflow point with replay reuse for already-succeeded app.call outputs, plus minimal run/detail/graph provenance so users can see which nodes reused prior output.

Docs website PR should merge after this lands: https://github.com/Agent-Field/website2.0/pull/27

pr-af (https://github.com/Agent-Field/pr-af) reviewed the workflow restart/replay change and recommended these follow-ups, applied here:

- Back-fill ExecutionReuseInfo.source_run_id from the run lineage so the reused-node sidebar and graph provenance show the source run, not just the source execution. Every reused node in a restarted run shares the run's single replay source, so it is taken from lineage rather than re-queried per node.
- Document that the restart workflow_runs row is a metadata-only sidecar: its status/total_steps are seeded at enqueue and are not kept current, and all read paths derive live status from execution aggregation.
- Document the replay-match contract in findReplayHit (keyed on node/reasoner/canonical input+context; earliest succeeded source wins; position/multiplicity agnostic).

Adds TestFillReuseSourceRun covering the back-fill behavior.
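The back-fill described above can be pictured as a single pass over the reused nodes. This is a Python sketch of the idea only; the control plane is written in Go and its actual code is not shown in this thread. `ReuseInfo` and `fill_reuse_source_run` are hypothetical names, and leaving an already-populated `source_run_id` untouched is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ReuseInfo:
    """Hypothetical stand-in for ExecutionReuseInfo."""
    source_execution_id: str
    source_run_id: Optional[str] = None


def fill_reuse_source_run(
    nodes: Dict[str, Optional[ReuseInfo]],
    lineage_source_run_id: Optional[str],
) -> None:
    """Copy the run-level replay source onto every reused node, in place.

    Nodes that were rerun carry no reuse info (None) and are skipped.
    Assumption: an existing source_run_id is not overwritten.
    """
    if not lineage_source_run_id:
        return  # the run has no replay source, so there is nothing to fill
    for info in nodes.values():
        if info is not None and not info.source_run_id:
            info.source_run_id = lineage_source_run_id


nodes = {
    "fetch": ReuseInfo(source_execution_id="exec-1"),
    "summarize": None,  # rerun in the restarted run, so no reuse info
}
fill_reuse_source_run(nodes, "run-original")
print(nodes["fetch"].source_run_id)
```

Because the value comes from the run's lineage, the pass needs no per-node lookup.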

CLAassistant · 2026-06-09T20:33:23Z

All committers have signed the CLA.

santoshkumarradha · 2026-06-09T20:33:26Z

Code review follow-ups (recommended by pr-af)

pr-af reviewed the workflow restart/replay change and flagged a few tightening items. Pushed in 87eb67f:

Reuse provenance now shows the source run. ExecutionReuseInfo.source_run_id was declared (and the node sidebar/graph render it) but the backend never populated it. It's now back-filled from the run lineage — every reused node in a restarted run shares the run's single replay source, so it's taken from lineage rather than re-queried per node.
workflow_runs restart row documented as a metadata-only sidecar. Its status/total_steps are seeded at enqueue and intentionally not kept current; every read path (run list, run detail, DAG) derives live status from execution aggregation and only reads the lineage/golden fields. Comment added so the columns aren't mistaken for authoritative.
Replay-match contract documented on findReplayHit: keyed on (node, reasoner, canonical input+context), earliest succeeded source wins, position/multiplicity agnostic — correct for deterministic graphs; vary input/context or use reuse=none when a distinct result per identical call is needed.
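To make the replay-match contract concrete, here is a minimal Python sketch of a lookup with those properties. It is not the Go `findReplayHit` implementation: `SourceExecution`, `replay_key`, and `find_replay_hit` are hypothetical names, canonicalization via sorted-key JSON is an assumption, and so are the `"succeeded"` status string and the use of `started_at` to decide which source is earliest.

```python
import json
from dataclasses import dataclass
from typing import Any, Iterable, Optional


@dataclass(frozen=True)
class SourceExecution:
    """Hypothetical record of one execution in the source run."""
    execution_id: str
    node: str
    reasoner: str
    input_data: Any
    context: Any
    status: str
    started_at: float
    output: Any = None


def canonical(value: Any) -> str:
    # Assumption: sorted-key JSON stands in for the real canonical form.
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


def replay_key(node: str, reasoner: str, input_data: Any, context: Any) -> tuple:
    """Match key: (node, reasoner, canonical input, canonical context)."""
    return (node, reasoner, canonical(input_data), canonical(context))


def find_replay_hit(
    sources: Iterable[SourceExecution],
    node: str,
    reasoner: str,
    input_data: Any,
    context: Any,
) -> Optional[SourceExecution]:
    """Return the earliest succeeded source execution with a matching key.

    The position of the call in the graph and how many times it has
    already matched play no part in the lookup.
    """
    wanted = replay_key(node, reasoner, input_data, context)
    hits = [
        s for s in sources
        if s.status == "succeeded"
        and replay_key(s.node, s.reasoner, s.input_data, s.context) == wanted
    ]
    return min(hits, key=lambda s: s.started_at, default=None)
```

Under this sketch, two identical calls in the restarted run resolve to the same source execution, which is why the comment recommends varying input/context or using reuse=none when each identical call needs its own result.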

Added TestFillReuseSourceRun for the back-fill. go build, go vet, and go test ./internal/handlers are green locally.

Two larger review items were intentionally left out of this commit for separate discussion: restart drops caller/target DID so restarted/forked runs have no VC provenance chain, and fork replaces input/context wholesale (--model drops other context keys).
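The second deferred item is easy to see with a small example. The sketch below contrasts wholesale replacement, which is the behavior the comment describes, with a shallow merge, shown only for comparison and not as something this PR implements or proposes. The function names and the context keys (`model`, `tenant`) are hypothetical.

```python
def fork_context_replace(original: dict, override: dict) -> dict:
    """Wholesale replacement: only the override's keys survive."""
    return dict(override)


def fork_context_merge(original: dict, override: dict) -> dict:
    """Shallow merge, for comparison: override wins, other keys are kept."""
    return {**original, **override}


# Hypothetical context of the source run and a model-only override.
original_context = {"model": "model-a", "tenant": "acme"}
model_override = {"model": "model-b"}

print(fork_context_replace(original_context, model_override))
print(fork_context_merge(original_context, model_override))
```

With replacement, the forked run sees only the new model and loses the `tenant` key. With a merge it would keep both.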

Generated by Claude Code

santoshkumarradha · 2026-06-09T21:28:01Z

@AbirAbbas pushed a final UI polish pass after review: trace/graph reuse markers are now compact branch indicators instead of text badges, restart/golden chips use existing badge typography, restart colors use statusTone tokens, and the screenshots/PR body were refreshed.

All GitHub workflow checks are passing on the latest push. The only remaining pending status is the external CLA assistant.

santoshkumarradha · 2026-06-17T00:22:19Z

@AbirAbbas this one is gtg as well

santoshkumarradha

I found one issue in the Python SDK context propagation path before I can approve this.

ExecutionContext.to_headers() is still forwarding parent_execution_id as X-Parent-Execution-ID when that field is present. For child node calls, the control plane expects X-Execution-ID and X-Parent-Execution-ID to both identify the current execution, because it allocates a fresh child execution ID per hop and uses the parent header to point back to the immediate caller. A local follow-up on top of this branch already changes that header to self.execution_id, which suggests the current PR head is still on the wrong side of that contract.

Please update the Python SDK header mapping and keep the matching test expectation with it. After that I can recheck quickly.
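The requested mapping can be sketched as follows. This is a simplified stand-in, not the SDK's `ExecutionContext`: the class name `ExecutionContextSketch` is hypothetical, it carries only the two IDs discussed here, and emitting the parent header unconditionally is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ExecutionContextSketch:
    """Hypothetical, trimmed-down stand-in for the SDK's ExecutionContext."""
    execution_id: str
    parent_execution_id: Optional[str] = None

    def to_headers(self) -> Dict[str, str]:
        """Headers for an outgoing child call.

        Both headers carry the current execution ID. The control plane
        allocates a fresh ID for the child, and the parent header points
        back at the immediate caller, which is this execution.
        """
        return {
            "X-Execution-ID": self.execution_id,
            # Not self.parent_execution_id: that would name this
            # execution's own parent instead of the immediate caller.
            "X-Parent-Execution-ID": self.execution_id,
        }


ctx = ExecutionContextSketch(execution_id="exec-2", parent_execution_id="exec-1")
print(ctx.to_headers())
```

A matching test expectation would assert that `X-Parent-Execution-ID` equals the context's own `execution_id`, even when `parent_execution_id` is set.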

AbirAbbas · 2026-06-22T12:24:14Z

taking over this PR

…t-fork-replay
# Conflicts:
# control-plane/internal/handlers/workflow_dag.go
# control-plane/web/client/src/components/RunTrace.tsx
# control-plane/web/client/src/components/WorkflowDAG/WorkflowNode.tsx
# control-plane/web/client/src/components/WorkflowDAG/workflowDagUtils.ts

feat: restart workflow runs with replay reuse

f4e9578

santoshkumarradha and others added 3 commits June 9, 2026 13:36

Merge branch 'main' into codex/workflow-restart-fork-replay

a6e868a

Polish restart recovery UI states

29f5ca5

Fix run detail test mocks for restart hooks

35dd224

santoshkumarradha added 2 commits June 9, 2026 13:57

Cover Python restart replay SDK paths

63cd04a

Show reused nodes in restart views

c6a37ec

santoshkumarradha force-pushed the codex/workflow-restart-fork-replay branch from d9f5a54 to c6a37ec Compare June 9, 2026 18:13

santoshkumarradha added 3 commits June 9, 2026 14:38

Cover restart provenance UX paths

f0f0421

Stabilize restart replay functional test

1edd96a

Stabilize restart replay coverage

97435ea

santoshkumarradha marked this pull request as ready for review June 9, 2026 19:56

santoshkumarradha requested review from a team and AbirAbbas as code owners June 9, 2026 19:56

Polish restart UI typography

812ec69

santoshkumarradha commented Jun 17, 2026

AbirAbbas enabled auto-merge June 22, 2026 14:07

AbirAbbas approved these changes Jun 22, 2026

AbirAbbas added this pull request to the merge queue Jun 22, 2026

Merged via the queue into main with commit e4d8596 Jun 22, 2026
37 checks passed

AbirAbbas deleted the codex/workflow-restart-fork-replay branch June 22, 2026 14:08

AbirAbbas merged 12 commits into main from codex/workflow-restart-fork-replay


4 participants
