Add workflow restart and replay reuse #639
Performance
✓ No regressions detected
📊 Coverage gate
Thresholds from
✅ Gate passed
No surface regressed past the allowed threshold and the aggregate stayed above the floor.
📐 Patch coverage gate
Threshold: 80% on lines this PR touches vs
✅ Patch gate passed
Every surface whose lines were touched by this PR has patch coverage at or above the threshold.
d9f5a54 to c6a37ec
@AbirAbbas this is ready for review. Platform PR #639 adds restart/fork recovery from a failed workflow point with replay reuse for already-succeeded app.call outputs, plus minimal run/detail/graph provenance so users can see which nodes reused prior output. Docs website PR should merge after this lands: https://github.com/Agent-Field/website2.0/pull/27
pr-af (https://github.com/Agent-Field/pr-af) reviewed the workflow restart/replay change and recommended these follow-ups, applied here:
- Back-fill ExecutionReuseInfo.source_run_id from the run lineage so the reused-node sidebar and graph provenance show the source run, not just the source execution. Every reused node in a restarted run shares the run's single replay source, so it is taken from lineage rather than re-queried per node.
- Document that the restart workflow_runs row is a metadata-only sidecar: its status/total_steps are seeded at enqueue and are not kept current, and all read paths derive live status from execution aggregation.
- Document the replay-match contract in findReplayHit (keyed on node/reasoner/canonical input+context; earliest succeeded source wins; position/multiplicity agnostic).
Adds TestFillReuseSourceRun covering the back-fill behavior.
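The replay-match contract named in that commit message (keyed on node/reasoner/canonical input+context, earliest succeeded source wins, position/multiplicity agnostic) can be illustrated with a minimal Python sketch. This is not the control plane's findReplayHit, which lives in the Go handlers; the `SourceExecution` record, the `"succeeded"` status string, the `started_at` ordering field, and the JSON-plus-SHA-256 canonicalization are all assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Iterable, Optional, Tuple


@dataclass(frozen=True)
class SourceExecution:
    """Hypothetical record of one execution in the source run."""
    execution_id: str
    node_id: str
    reasoner_id: str
    input_data: Any
    context: Any
    status: str          # assumed values: "succeeded", "failed", ...
    started_at: float    # assumed ordering field (epoch seconds)
    output: Any = None


def replay_key(node_id: str, reasoner_id: str, input_data: Any, context: Any) -> Tuple[str, str, str]:
    """Build a match key from node, reasoner, and canonicalized input+context.

    Canonicalization here is sorted-key compact JSON (an assumption), so two
    dicts with the same content but different key order produce the same key.
    """
    canonical = json.dumps(
        {"input": input_data, "context": context},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return (node_id, reasoner_id, digest)


def find_replay_hit(
    source: Iterable[SourceExecution],
    node_id: str,
    reasoner_id: str,
    input_data: Any,
    context: Any,
) -> Optional[SourceExecution]:
    """Return the earliest succeeded source execution with a matching key.

    The match ignores where the call sat in the source graph and how many
    times it occurred: only the key and the succeeded status matter.
    """
    wanted = replay_key(node_id, reasoner_id, input_data, context)
    hits = [
        e for e in source
        if e.status == "succeeded"
        and replay_key(e.node_id, e.reasoner_id, e.input_data, e.context) == wanted
    ]
    return min(hits, key=lambda e: e.started_at, default=None)
```

Under these assumptions a failed execution never produces a hit even if its key matches, and when the same call succeeded more than once in the source run, every matching call in the restarted run reuses the earliest one.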
Code review follow-ups (recommended by pr-af)
pr-af reviewed the workflow restart/replay change and flagged a few tightening items. Pushed in
Added
Two larger review items were intentionally left out of this commit for separate discussion: restart drops caller/target DID so restarted/forked runs have no VC provenance chain, and fork replaces input/context wholesale (
Generated by Claude Code
@AbirAbbas pushed a final UI polish pass after review: trace/graph reuse markers are now compact branch indicators instead of text badges, restart/golden chips use existing badge typography, restart colors use statusTone tokens, and the screenshots/PR body were refreshed. All GitHub workflow checks are passing on the latest push. The only remaining pending status is the external CLA assistant.
@AbirAbbas this one is gtg as well
santoshkumarradha
left a comment
I found one issue in the Python SDK context propagation path before I can approve this.
ExecutionContext.to_headers() is still forwarding parent_execution_id as X-Parent-Execution-ID when that field is present. For child node calls, the control plane expects X-Execution-ID and X-Parent-Execution-ID to both identify the current execution, because it allocates a fresh child execution ID per hop and uses the parent header to point back to the immediate caller. A local follow-up on top of this branch already changes that header to self.execution_id, which suggests the current PR head is still on the wrong side of that contract.
Please update the Python SDK header mapping and keep the matching test expectation with it. After that I can recheck quickly.
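The header contract described in this review can be sketched as follows. This is a hedged illustration only: `ExecutionContextSketch` is a hypothetical stand-in for the SDK's ExecutionContext with just the two relevant fields, and the real to_headers() emits other headers that are omitted here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ExecutionContextSketch:
    """Hypothetical stand-in for the Python SDK's ExecutionContext."""
    execution_id: str
    parent_execution_id: Optional[str] = None

    def to_headers(self) -> Dict[str, str]:
        # Both headers carry the current execution's ID. Per the review, the
        # control plane allocates a fresh child execution ID for each hop and
        # reads X-Parent-Execution-ID as a pointer to the immediate caller,
        # so forwarding self.parent_execution_id would name the caller's
        # parent instead of the caller.
        return {
            "X-Execution-ID": self.execution_id,
            "X-Parent-Execution-ID": self.execution_id,
        }


# Usage: execution "exec-b" was itself called by "exec-a" and now calls a child node.
ctx = ExecutionContextSketch(execution_id="exec-b", parent_execution_id="exec-a")
headers = ctx.to_headers()
```

A matching test expectation would then assert that X-Parent-Execution-ID equals the context's own execution_id even when parent_execution_id is set.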
taking over this PR
…t-fork-replay
# Conflicts:
# control-plane/internal/handlers/workflow_dag.go
# control-plane/web/client/src/components/RunTrace.tsx
# control-plane/web/client/src/components/WorkflowDAG/WorkflowNode.tsx
# control-plane/web/client/src/components/WorkflowDAG/workflowDagUtils.ts
Summary
Product design
- app.call outputs where the control plane has persisted inputs/outputs
- Golden plus Restarted/Forked metadata chips so users can see lineage without another table column
- statusTone theme tokens; no standalone color palette was added
Docs PR
Website docs/screenshots are in https://github.com/Agent-Field/website2.0/pull/27. Merge that after this platform PR so the public docs only describe released API/UI behavior.
Screenshots
Validation
- cd control-plane && go test ./internal/handlers ./internal/handlers/ui ./internal/server ./internal/cli
- cd control-plane && go test ./internal/handlers -run 'TestExecute.*Replay|TestRestartExecutionHandler|TestExecutionReuseInfo'
- cd control-plane/web/client && npm test -- --run src/components/runs/RunLifecycleMenu.test.tsx src/test/pages/RunsPage.test.tsx src/test/pages/RunDetailPage.test.tsx src/test/components/WorkflowDAG/nodeDetailSidebar.test.tsx src/test/components/RunTrace.test.tsx -> 41 passed
- cd control-plane/web/client && npm run build
- ./scripts/coverage-surface.sh web-ui -> 128 files / 665 tests passed
- cd sdk/typescript && npm run lint && npm run build
- python3 -m py_compile sdk/python/agentfield/async_execution_manager.py sdk/python/tests/test_async_execution_manager_final90.py tests/functional/agents/restart_replay_agent.py tests/functional/tests/test_restart_replay.py
- cd sdk/python && python3 -m pytest tests/test_async_execution_manager_final90.py::test_update_execution_from_status_populates_same_status_success tests/test_execution_context_core.py -q -> 11 passed
- OPENROUTER_MODEL=openrouter/google/gemini-3.1-flash-lite PYTEST_ARGS="-v -n 1 tests/test_restart_replay.py::test_restart_reuses_successful_calls_and_continues_complex_openrouter_graph" docker compose -f docker/docker-compose.local.yml -f /tmp/agentfield-compose-no-host-ports.yml up --build --abort-on-container-exit --exit-code-from test-runner -> 1 passed in 13.41s
- google/gemini-3.1-flash-lite chat completions request with local OPENROUTER_API_KEY -> 200 ok
Note: the restart/replay E2E graph uses deterministic checkpoints so replay correctness is not coupled to provider latency; the separate OpenRouter smoke verifies the requested model/key path.