Skip to content

fix(sdd): task-scoped review dispatch — single task reviewer, review-package script, eval-tuned#1717

Draft
obra wants to merge 37 commits into
devfrom
sdd-review-dispatch
Draft

fix(sdd): task-scoped review dispatch — single task reviewer, review-package script, eval-tuned#1717
obra wants to merge 37 commits into
devfrom
sdd-review-dispatch

Conversation

@obra

@obra obra commented Jun 10, 2026

Copy link
Copy Markdown
Owner

Targets dev per the template requirement.

Who is submitting this PR? (required)

Field Value
Your model + version Claude Fable 5 (claude-fable-5[1m])
Harness + version Claude Code 2.1.170
All plugins installed superpowers 5.1.0 (marketplace install; work done in a worktree of this repo), superpowers-evals quorum lab as testing apparatus, plus unrelated local plugins (episodic-memory, superpowers-chrome, plugin-dev, agent-sdk-dev, linear, context7, primeradiant-ops, claude-session-driver, github-triage)
Human partner who reviewed this diff Jesse Vincent (@obra) — directed the work and approved the design spec, the implementation plan, every review-gate summary, and each cost-optimization iteration; requested this PR and assigned @arittr for full-diff review

What problem are you trying to solve?

SDD's per-task code quality reviewers routinely did branch-review-scale work on single-task diffs. A field report from a live Serf session flagged reviewers doing repo-wide greps, full-file reads across adjacent systems, package-wide -race runs, and -count=100 test loops — making the parent session appear stuck while a child burned tokens on validation the implementer had already done.

We could not verify that report's cited session (it isn't on this machine), so we mined two real local SDD sessions instead (a1a6719a… sen-core-v2, 0cc1a12d… serf). Confirmed: 7/8 quality reviewers in one session ran repo-wide greps; the most expensive ran 50+ Bash commands over ~200s and ~1.5M tokens in; quality reviewers cost 4-8× what spec reviewers cost on the same tasks. Notably, no reviewer ran heavy tests autonomously — every package-wide or repeated test run was explicitly requested by a controller-written prompt ("check all uses," "run tests if useful, especially race-focused ones"). Spec reviewers, whose prompt has a diff-scope guard (#1595), stayed tight: 6-16 tool calls, 14-65s.

Root causes:

  1. The per-task quality prompt delegated to requesting-code-review/code-reviewer.md — a merge-readiness review (architecture, security, production readiness, "Ready to merge?") — so every per-task review inherited branch-level breadth.
  2. The controller got zero guidance on writing reviewer prompts, so it invented open-ended directives reviewers interpreted literally.
  3. Duplicated work: two reviewers per task each re-derived and re-read the same diff; the quality template re-checked "Plan alignment" the spec reviewer had just verified; reviewers re-ran suites the implementer had already run (and reported with TDD evidence) on identical code.

Live before/after eval runs then surfaced a second problem the field report couldn't see: per-dispatch overhead. Three subagent spin-ups per task (implementer + two reviewers), each re-deriving the task diff with its own git commands, plus controller coordination, made dispatch count — not any single reviewer's behavior — the dominant cost. That evidence (detailed in the spec's "Cost iterations" section) is what drove the design to its final shape.

What does this PR change?

The per-task review is now one reviewer, one reading of the diff, two verdicts:

  • New task-reviewer-prompt.md replaces spec-reviewer-prompt.md + code-quality-reviewer-prompt.md (both deleted). One self-contained, task-scoped template returns Part 1 spec compliance (✅/❌, plus an explicit "⚠️ Cannot verify from diff" channel) and Part 2 code quality (Strengths / Critical / Important / Minor, "Task quality: Approved | Needs fixes"). Scope budget (read changed files first; inspect adjacent code only for a named concrete risk), test budget (the implementer's reported runs cover this exact code; a reviewer runs at most a focused test for a specific doubt), evidence rule (cite file:line, don't narrate), calibration (Important = cannot-trust-until-fixed: incorrect/fragile behavior, missed requirement, or merge-blocking maintainability damage — verbatim logic duplication, swallowed errors, assertion-free tests), and skepticism extended to the implementer's rationales (a stated "per YAGNI" never downgrades a finding's severity).
  • New scripts/review-package generates the reviewer's diff file (commit list + --stat + diff -U10), defaulting to a unique self-describing path (<git-dir>/sdd/review-<base7>..<head7>.diff — worktree- and submodule-safe, so concurrent sessions cannot collide and a re-review after fixes always gets a distinctly named fresh file). Explicit BASE because controllers improvised HEAD~1, which truncates multi-commit tasks. The final whole-branch reviewer gets a branch-wide package the same way — measured 33 turns/23 tool calls → 6 turns/3 calls at controller-model prices.
  • New scripts/task-brief extracts one task's text from the plan to a file the implementer reads directly; implementers write detailed reports to files and return ≤15-line summaries. Dispatch prompts follow a five-part composition recipe (micro-tested: the positive recipe beat a "do not restate" prohibition 3.0 vs 4.4 transcribed values — the prohibition scored worse than no guidance at all).
  • Progress ledger (<git-dir>/sdd/progress.md): one line per completed task (commit range + verdict) plus accumulated Minor findings, written as part of normal bookkeeping. Conversation memory does not survive compaction — transcript mining of real sessions found controllers re-dispatching entire completed task sequences afterwards (269 dispatches for ~22 tasks); the ledger is the durable recovery map. Final-review findings go to ONE omnibus fix subagent (a real session's per-finding fix wave cost more than all its tasks combined); fix dispatches name their covering test files.
  • model: is a REQUIRED line in both prompt templates. "Always specify the model" as prose guidance decayed mid-session in a measured run (17 dispatches silently inherited the most expensive model, +40% run cost); as a template placeholder it held in 3 of 3 validation runs.
  • implementer-prompt.md: run the focused test while iterating, the full suite once before committing; after fixing a review finding, re-run the covering tests and report results (this is what lets reviewers not re-run them).
  • SKILL.md controller guidance: Model Selection rewritten around turn count, not token price (cheap models take 2-3× the turns on multi-step work — mid-tier is the floor; cheapest tier only for single-file mechanical fixes; always specify a model explicitly — omitting it silently inherits the session's, usually most expensive, model). Reviewer-prompt construction rules: no open-ended directives, no test re-runs on unchanged code, never pre-judge findings for the reviewer (phrase-level: "do not flag," "at most Minor," "the plan chose"), include the design's global constraints that bind the task, hand the diff as a file via the script. ⚠️-item handling (controller resolves each itself; confirmed gaps go back to the implementer as failed spec review). Minor findings roll up into the final review for triage. Fix dispatches carry the re-run-tests contract. Final whole-branch review explicitly pointed at ../requesting-code-review/code-reviewer.md (unchanged and still broad).
  • Cross-platform tool tables (using-superpowers/references/{antigravity,gemini}-tools.md): reviewer template names updated to task-reviewer.
  • New eval scenario sdd-quality-reviewer-catches-planted-defect in superpowers-evals (separate repo; submodule pointer bump follows propagation): a fixture plan plants verbatim duplication and an assertion-free test whose name promises verification it never performs; the run passes only if the reviewer flags the duplication openly and the lying test does not survive.

Design spec and implementation plan are in the diff (docs/superpowers/specs/2026-06-09-…, docs/superpowers/plans/2026-06-09-…); the spec's "Cost iterations" section records every optimization round with measurements.

Deliberately preserved: full re-reviews (no re-review narrowing), coordinator model judgment (no forced tier), and skills/requesting-code-review/ untouched — it remains the broad final/ad-hoc review template.

Is this change appropriate for the core library?

Yes. It modifies core SDD workflow files that every superpowers user exercises, regardless of project type. No third-party integrations, no project-specific configuration. The new script is plain bash with no dependencies beyond git.

What alternatives did you consider?

  • Keeping two separate per-task reviewers (the design's original shape, for bias isolation between spec and quality judgment). This was the maintainer's initial preference and the first shipped iteration. Live eval economics overturned it: per-dispatch overhead and duplicate diff ingestion made the two-reviewer round slower and more expensive than baseline (69.9 min vs 65 min on go-fractals) even after prompt tightening. The maintainer explicitly reopened the decision ("everything's on the table") and approved the merge after the single task reviewer beat baseline on every axis with better blind-judged deliverable quality.
  • Scoped re-reviews (the field report's main ask: verify only the prior finding + amendment). Rejected — full re-reviews are intentional; the test budget, not reading-scope narrowing, curbs the report's worst cited example.
  • Turn-count guidance and prompt tightening alone (iteration 1). Measured insufficient: tokens down 29%, wall-clock flat, and controllers adopted optional paste-the-diff guidance in only 2 of 22 dispatches. Optional guidance to agents goes unadopted; the script + REQUIRED placeholder made it structural.
  • Forcing cheaper models for reviews. Rejected by the maintainer in favor of coordinator judgment; the shipped guidance instead explains the turn-count cost model and sets a mid-tier floor for multi-step work.
  • Narrowing requesting-code-review/code-reviewer.md itself. Rejected: it serves final branch review and ad-hoc review, which should stay broad; narrowing it would break the "per-task narrow, final broad" distinction this PR creates.

Does this PR contain multiple unrelated changes?

No. All changes implement one design (task-scoped per-task review dispatch) from one spec, and depend on each other: the reviewer's "don't re-run the implementer's tests" rule requires the implementer's re-run-after-fix rule; the task reviewer's diff-file contract requires the script and the controller guidance that invokes it; the tool-table renames track the prompt-file rename.

Existing PRs

Environment tested

Harness (e.g. Claude Code, Cursor) Harness version Model Model version/ID
Claude Code 2.1.170 Claude Fable 5 claude-fable-5[1m]

New harness support (required if this PR adds a new harness)

Not applicable — no new harness. (The cross-platform tool tables were checked: Antigravity maps reviewer templates to the read-only research type, so the task-reviewer prompt includes "name the test you would run" phrasing, keeping that mapping valid; both tables' template names were updated.)

Evaluation

  • Initial prompt: the human partner provided a field report (2026-06-09-code-quality-reviewer-scope-budget-issue.md) and asked for agents to "study our own quality and spec compliance reviews for issues — we're trying to make the reviews more efficient without sacrificing quality." Three study agents analyzed the prompt chain and mined real session transcripts before any design work.
  • Live before/after quorum runs drove five optimization iterations over two days (full history with measurements in the spec; complete experiment log including negative results in the evals repo, docs/experiments/2026-06-10-sdd-cost-experiments.md). All numbers below are honest ranges — a same-config re-run exposed ±20% run-to-run variance, so we report across all same-design runs rather than cherry-picking:
    • sdd-go-fractals (9 runs of the new design): 44.4-59.6 min / 13.4-20.0M / $11.67-14.84 vs baseline 64.9 min / 21.2M / $16.07 — the worst draw beats baseline on every axis; typical mid-band savings ~20-25%
    • sdd-svelte-todo (final config, 2 runs): 55.0-69.3 min / 19.3-24.1M / $14.99-20.30 vs baseline 79.7 / 27.3M / $20.98 — time and tokens clearly better; cost overlaps baseline at the top of the range (the expensive run hit 9 review-fix waves across 12 tasks — review-strictness variance, with all 34 dispatches model-disciplined)
    • sdd-rejects-extra-features: $1.31-1.37 vs $1.88 baseline; spec-reviewer-catches-planted-flaws: pass, costs flat
    • new sdd-quality-reviewer-catches-planted-defect: pass on the final config ($2.77) — planted duplication flagged openly, the assertion-free test with the lying name caught and fixed
  • Mechanism-level measurements: task reviewers handed a review-package file averaged ~3 turns / 1 tool call vs ~9 turns / 6.4 calls re-deriving diffs (variance note: reviewer escape-hatch appetite swings this run-to-run; the budget bounds it); the final reviewer dropped 33→6 turns with a branch package; reviewers ran zero redundant test suites under the test budget; dispatch-model discipline held 3/3 runs with the REQUIRED template line; blind A/B deliverable comparison (neutral paths, rotated labels, fresh judge) scored the new design's deliverables at parity-or-better with baseline; a 3-run ablation removing the brief/report file mechanism (keeping everything else) measured cost-neutral ($11.61-13.25 vs the same-night control $14.10, inside the band) — the file handoffs are justified by requirements fidelity and compaction durability, not dollars, and we say so rather than claim savings.
  • Tested and declined, with data (recorded so they aren't re-proposed): controller turn-batching and parallel-call review pipelining (the controller emits exactly one tool call per message — zero multi-tool messages in every measured run), background-dispatch pipelining (mechanism adopted but benefit below the noise floor on these scenarios).
  • Honest caveat (also in the spec): per-run deliverable quality is stochastic — defect classes recur across configs run-to-run; quantifying catch-rate differences would need N runs per config, which we did not do.
  • Known eval gap (documented in the spec): no pre-existing scenario measured per-task reviewer catch rate, which is why this PR adds the planted-defect scenario.

Rigor

  • If this is a skills change: I used superpowers:writing-skills-style discipline via the full superpowers pipeline — brainstorming → spec → adversarial spec review → plan → subagent-driven implementation with two-stage review per task → final whole-branch review — followed by three measured optimization iterations against live eval runs
  • This change was tested adversarially, not just on the happy path
  • I did not modify carefully-tuned content (Red Flags table, rationalizations, "human partner" language) without evals showing the change is an improvement — the Red Flags additions and the spec-reviewer skepticism rewrite are covered by the five-scenario matrix above

Adversarial testing summary: two competing adversarial reviewers attacked the design spec (13 findings; 9 accepted and fixed); every implementation task went through subagent review with review-fix loops; the planted-defect eval failed through five distinct suppression mechanisms during development (controller pre-judging findings, severity pre-rating, reviewer calibration, implementer rationale-framing, and an eval bar that was itself wrong) — each fixed in the prompts generally, not by teaching to the test; a final whole-branch reviewer verified the shipped files against the plan.

Human review

  • A human has reviewed the COMPLETE proposed diff before submission — review structure: Jesse Vincent (@obra, repo owner) gated every stage (spec approval, plan approval, per-task review summaries, each optimization iteration, final review summary) and directed this PR; he has assigned @arittr for the complete line-by-line diff review in-PR

obra and others added 17 commits June 9, 2026 18:19
A live eval run of sdd-quality-reviewer-catches-planted-defect caught the
SDD controller fabricating a plan constraint and instructing the quality
reviewer not to flag the planted DRY violation. The duplication shipped.
Constructing Reviewer Prompts now bans suppression directives alongside
open-ended broadening directives.
In live eval runs, controllers given judgment-based model selection
stopped passing a model at all; the omitted parameter inherits the
session's top-tier model, silently making every subagent maximally
expensive (one run dispatched 26/26 reviewers on the session model).
Live eval deliverables shipped five polish defects; tracing each through
the transcripts showed three mechanisms, each now addressed:
- reviewers answered pointed checklist items with unsupported yes
  (evidence rule: every What-to-Check answer needs file:line evidence)
- no reviewer ever saw the design's global constraints (controllers now
  paste binding constraints into task requirements)
- test output noise was invisible everywhere (pristine-output checks in
  implementer self-review and quality review)
Second observed instance: with the Constructing Reviewer Prompts rule
already live, a controller still wrote 'do not treat that duplication as
a defect to fix — the plan chose it; you may note it as a Minor
observation at most' into a quality reviewer dispatch, fabricating plan
intent from the plan's example snippet. Promote the rule to the Red
Flags Never list and name the rationalization.
Resumed the offending eval controller session and asked it why it
pre-judged despite the rule being in context. Its retrospective: the
motive was avoiding a review loop, the abstract rule was read but not
applied at the moment it governs, and a phrase-level trigger ('do not
flag', 'at most Minor', 'don't treat X as a defect', 'the plan chose')
would have fired where the principle did not.
…ence

Round-2 fractals eval regressed to 70min/32.2M tokens (vs round-1's
42.8min/14.5M) while reaching baseline-parity quality. Per-subagent turn
profiling attributed it to: haiku dispatches taking 2-3x the turns of
sonnet (678 of 1197 subagent turns), reviewers re-fetching diffs by hand
(518 Bash calls), and evidence-rule narration. Changes: turn-count-beats-
token-price model guidance; controllers paste small diffs into reviewer
prompts (reviewers then need few or no tool calls); evidence scoped to
findings and would-be-bare-yes checks; Important defined as cannot-trust-
until-fixed with coverage suggestions Minor; fixes dispatched only for
Critical/Important.
Iteration-1 profiling: implementers and per-dispatch overhead dominate
(429 of 686 subagent turns; controller coordination is half the dollars
and scales with dispatch count), reviewers are individually lean, and
the controller pasted the diff in only 2 of 22 review dispatches when
the guidance was phrased as optional.

Changes: spec-reviewer-prompt.md + code-quality-reviewer-prompt.md
replaced by task-reviewer-prompt.md (one reviewer, one reading of a
pasted diff, two verdicts: spec compliance ✅/❌/⚠️ and task quality);
one fix dispatch can address both kinds of findings; controller now
runs git diff itself and pastes it (imperative, not optional);
implementers run focused tests while iterating and the full suite once
before committing; flowchart, example, Red Flags, tool tables updated.
The broad final whole-branch review is unchanged.
With merged review, a planted verbatim-duplication defect shipped: the
reviewer rated it Minor (YAGNI) under the strict cannot-be-trusted
definition of Important, and the Minor-rolls-up rule meant no fix was
ever dispatched and the final review never saw the finding. Calibration
now names merge-blocking maintainability damage (verbatim duplication,
swallowed errors, assertion-free tests) as Important, and controllers
must paste accumulated Minor findings into the final review dispatch.
Adoption was 6/11 reviews on fractals and 0/17 on svelte when phrased
as guidance; reviewers without the diff re-derive it by hand, which is
the single largest remaining reviewer cost. Now a Red Flags Never entry
and a REQUIRED marker on the template placeholder.
Fourth planted-defect failure mode: the implementer's self-report said
'noted mild structural duplication; left unabstracted per YAGNI' and the
reviewer deferred to that framing, rating the duplication no finding at
all. The pre-judging keeps relocating — controller prompt, then reviewer
calibration, now the implementer's report. Rationales are claims; they
never downgrade severity.
Paste adoption stayed at 0/15 even as a Red Flag — and the controller's
reluctance is locally rational: pasting loads the diff into the (most
expensive) controller context permanently, while a reviewer self-fetch
costs a few cheap turns. The diff-file handoff is cheap for both sides:
the controller redirects git diff to /tmp without reading it, and the
reviewer gets the whole change in one Read call.
The skill read as a changelog: 'combined task review,' 'one reviewer,
one reading,' 'one dispatch,' and an example still showing diffs pasted
into prompts. A reader who never saw the two-reviewer design has no
referent for 'combined.' Prose now states the design directly, and the
flowchart/example reflect the diff-file handoff.
scripts/review-package generates the reviewer's input deterministically:
commit list, stat summary, and net diff with -U10 context, written to a
file from an explicit BASE. Live runs showed controllers improvising
'git diff HEAD~1..HEAD', which silently truncates multi-commit tasks,
and svelte's five fix dispatches shipped without re-running any tests —
fix dispatches now explicitly carry the implementer's
re-run-and-report contract.
obra added 9 commits June 10, 2026 12:32
…ackage, REQUIRED model lines, reviewer risk budget

Validated 2026-06-10 (all gates pass): go-fractals 54.1-54.7 min / $12.81-14.31
(baseline 64.9 / $16.07); svelte-todo 55.0 min / 19.3M / $14.99 (baseline
79.7 / 27.3M / $20.98); planted-defect pass $2.77. Dispatch-model discipline
3/3 runs after moving model: into the templates as a REQUIRED line.
Full experiment log: evals docs/experiments/2026-06-10-sdd-cost-experiments.md
Carries the planted-defect + crisp scenarios, batch A-E experiment
logs, claude-sonnet model-variant target, and method docs — rebased
onto the obol migration and pushed to superpowers-evals main.
@obra obra changed the title fix(sdd): task-scoped per-task reviews — self-contained quality prompt, scope/test budgets, controller guidance fix(sdd): task-scoped review dispatch — single task reviewer, review-package script, eval-tuned Jun 11, 2026
@obra obra marked this pull request as draft June 11, 2026 02:57
@obra

obra commented Jun 11, 2026

Copy link
Copy Markdown
Owner Author

Before/after eval results (promised in the PR body)

All runs: superpowers-evals quorum harness driving real Claude Code sessions (claude-fable-5 session model, subagents per the skill's model selection), one scenario per run, LLM-judged verdicts plus measured economics from the session's own token accounting. Baseline = origin/dev skills; "new design" = this branch. All numbers are ranges across every same-design run, not best-run cherry-picks — a same-config re-run measured ±20% variance (44.4 vs 57.1 min on identical prompts), so single-run deltas under that are noise.

End-to-end SDD scenarios

Scenario Config Verdict Wall-clock Tokens Cost
sdd-go-fractals baseline (dev) pass 64.9 min 21.2M $16.07
sdd-go-fractals new design (8 runs) 8/8 pass 44.4-57.1 min 13.4-20.0M $11.67-14.84
sdd-svelte-todo baseline (dev) pass 79.7 min 27.3M $20.98
sdd-svelte-todo final config (run 1) pass 55.0 min (−31%) 19.3M (−29%) $14.99 (−29%)
sdd-svelte-todo final config (run 2) pass 69.3 min (−13%) 24.1M (−12%) $20.30 (−3%)

The worst fractals draw beats the baseline on every axis; typical mid-band savings are ~20-25% across time, tokens, and dollars.

Behavior-gate scenarios

Scenario Config Verdict Cost
sdd-rejects-extra-features baseline pass $1.88
sdd-rejects-extra-features new design (2 runs) pass $1.31-1.37
spec-reviewer-catches-planted-flaws baseline pass $0.50
spec-reviewer-catches-planted-flaws new design pass $0.50
sdd-quality-reviewer-catches-planted-defect (new in this PR) final config pass $2.77

The planted-defect scenario seeds a fixture plan with verbatim duplicated logic and an assertion-free test whose name promises verification it never performs; passing requires the task reviewer to flag the duplication openly and the lying test to not survive the session.

Where the savings come from (per-subagent transcript profiling)

  • One reviewer dispatch per task instead of two (merged spec+quality verdicts, one reading of the diff) — the single largest win.
  • Diff handed as a file, both per-task and final. Task reviewers given a review package averaged ~3 turns / 1 tool call vs ~9 turns / 6.4 calls re-deriving diffs with git; the final whole-branch reviewer dropped 33 turns / 23 calls → 6 / 3 with a branch-wide package — at controller-model prices. Optional "paste the diff" guidance was adopted in 2 of 22 dispatches; a script invoked from a REQUIRED flowchart step is what made it stick.
  • model: as a REQUIRED template line. Prose "always specify the model" decayed mid-session in one measured run — 17 dispatches silently inherited the session's most expensive model (+40% run cost). As a template placeholder: explicit on 57 of 57 dispatches across the three validation runs.
  • Turn-count model guidance. Cheap-tier subagents took 2-3× mid-tier's turns on multi-step work (678 of 1197 subagent turns in one regressed run); cost scales with turns × resident context, so the guidance sets a mid-tier floor.
  • Test budget held: reviewers ran zero redundant suites; fix dispatches name their covering tests.

Durability changes measured from real-session mining (not visible in 45-min evals)

Transcript mining of real local SDD sessions found the most expensive failures only long sessions hit: controllers re-dispatching entire completed task sequences after context compaction (269 dispatches for ~22 tasks), and a final review spawning 7 per-finding fixers whose cost exceeded all preceding tasks. The progress ledger (<git-dir>/sdd/progress.md) and the one-omnibus-fixer rule target these; the eval gates confirm they cost nothing on short runs.

Quality checks

  • Blind A/B deliverable comparison (workdirs copied to neutral paths, labels rotated between comparisons, fresh judge scoring against the fixture's own plan): the merged-reviewer design judged 9/10 vs baseline 7/10 in the deciding round; subsequent configs at parity within observed judge noise (±1.5).
  • Honest caveat (also recorded in the spec): per-run deliverable quality is stochastic — defect classes recur in some runs of both configs; quantifying catch-rate differences rigorously would need N runs per config, which we did not do.

Full iteration history: spec docs/superpowers/specs/2026-06-09-sdd-task-scoped-review-dispatch-design.md ("Cost iterations", iterations 1-5). Complete experiment log including the negative results (turn batching, two pipelining mechanisms — tested, declined, with data): superpowers-evals docs/experiments/2026-06-10-sdd-cost-experiments.md.

cc @arittr — branch is refreshed with the full optimization campaign (body has the summary; complete experiment log incl. negative results lives in superpowers-evals docs/experiments/2026-06-10-sdd-cost-experiments.md). Marked draft while it awaits your review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant