fix(sdd): task-scoped review dispatch — single task reviewer, review-package script, eval-tuned#1717
obra wants to merge 37 commits into
…inter, model judgment
A live eval run of sdd-quality-reviewer-catches-planted-defect caught the SDD controller fabricating a plan constraint and instructing the quality reviewer not to flag the planted DRY violation. The duplication shipped. Constructing Reviewer Prompts now bans suppression directives alongside open-ended broadening directives.
In live eval runs, controllers given judgment-based model selection stopped passing a model at all; the omitted parameter inherits the session's top-tier model, silently making every subagent maximally expensive (one run dispatched 26/26 reviewers on the session model).
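The failure described above is an omitted parameter falling back to an expensive default. A minimal sketch of one guard against it, assuming a hypothetical `build_dispatch` helper and hypothetical tier names (`cheap`, `mid`, `top`); neither comes from the PR, which enforces the rule through prompt templates rather than code:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tier names, used for illustration only.
ALLOWED_MODELS = {"cheap", "mid", "top"}


@dataclass(frozen=True)
class Dispatch:
    role: str    # e.g. "implementer" or "task-reviewer"
    model: str   # always explicit, never inherited from the session
    prompt: str


def build_dispatch(role: str, prompt: str, model: Optional[str] = None) -> Dispatch:
    """Refuse to build a dispatch whose model would be silently inherited."""
    if not model:
        raise ValueError(f"{role}: model is required; omitting it inherits the session model")
    if model not in ALLOWED_MODELS:
        raise ValueError(f"{role}: unknown model tier {model!r}")
    return Dispatch(role=role, model=model, prompt=prompt)
```

Making the missing value an error, rather than a default, is the same idea as turning the model into a required template line: the omission becomes visible at the moment it happens.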
Live eval deliverables shipped five polish defects; tracing each through the transcripts showed three mechanisms, each now addressed:
- reviewers answered pointed checklist items with unsupported yes (evidence rule: every What-to-Check answer needs file:line evidence)
- no reviewer ever saw the design's global constraints (controllers now paste binding constraints into task requirements)
- test output noise was invisible everywhere (pristine-output checks in implementer self-review and quality review)
Second observed instance: with the Constructing Reviewer Prompts rule already live, a controller still wrote 'do not treat that duplication as a defect to fix — the plan chose it; you may note it as a Minor observation at most' into a quality reviewer dispatch, fabricating plan intent from the plan's example snippet. Promote the rule to the Red Flags Never list and name the rationalization.
Resumed the offending eval controller session and asked it why it
pre-judged despite the rule being in context. Its retrospective: the
motive was avoiding a review loop, the abstract rule was read but not
applied at the moment it governs, and a phrase-level trigger ('do not
flag', 'at most Minor', 'don't treat X as a defect', 'the plan chose')
would have fired where the principle did not.
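The phrase-level trigger described in the retrospective can be pictured as a simple scan over a reviewer dispatch prompt. This is a sketch only: the PR implements the rule as skill prose, not code, and the function name and regular expressions below are assumptions built from the four quoted phrases.

```python
import re
from typing import List

# Patterns derived from the phrases quoted in the retrospective.
SUPPRESSION_PATTERNS = [
    r"\bdo not flag\b",
    r"\bat most minor\b",
    r"\b(?:don['’]t|do not) treat\b.*\bas a defect\b",
    r"\bthe plan chose\b",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in SUPPRESSION_PATTERNS]


def find_suppression_phrases(prompt: str) -> List[str]:
    """Return every pre-judging phrase found in a reviewer dispatch prompt."""
    hits = []
    for line in prompt.splitlines():
        for pattern in _COMPILED:
            match = pattern.search(line)
            if match:
                hits.append(match.group(0))
    return hits
```

A literal phrase match is deliberately crude. It fires at the moment the controller writes the sentence, which is where the retrospective says the abstract principle failed to apply.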
…ence
Round-2 fractals eval regressed to 70min/32.2M tokens (vs round-1's 42.8min/14.5M) while reaching baseline-parity quality. Per-subagent turn profiling attributed it to: haiku dispatches taking 2-3x the turns of sonnet (678 of 1197 subagent turns), reviewers re-fetching diffs by hand (518 Bash calls), and evidence-rule narration. Changes: turn-count-beats-token-price model guidance; controllers paste small diffs into reviewer prompts (reviewers then need few or no tool calls); evidence scoped to findings and would-be-bare-yes checks; Important defined as cannot-trust-until-fixed with coverage suggestions Minor; fixes dispatched only for Critical/Important.
Iteration-1 profiling: implementers and per-dispatch overhead dominate (429 of 686 subagent turns; controller coordination is half the dollars and scales with dispatch count), reviewers are individually lean, and the controller pasted the diff in only 2 of 22 review dispatches when the guidance was phrased as optional. Changes: spec-reviewer-prompt.md + code-quality-reviewer-prompt.md replaced by task-reviewer-prompt.md (one reviewer, one reading of a pasted diff, two verdicts: spec compliance ✅/❌/⚠️ and task quality); one fix dispatch can address both kinds of findings; controller now runs git diff itself and pastes it (imperative, not optional); implementers run focused tests while iterating and the full suite once before committing; flowchart, example, Red Flags, tool tables updated. The broad final whole-branch review is unchanged.
With merged review, a planted verbatim-duplication defect shipped: the reviewer rated it Minor (YAGNI) under the strict cannot-be-trusted definition of Important, and the Minor-rolls-up rule meant no fix was ever dispatched and the final review never saw the finding. Calibration now names merge-blocking maintainability damage (verbatim duplication, swallowed errors, assertion-free tests) as Important, and controllers must paste accumulated Minor findings into the final review dispatch.
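The routing rule this commit arrives at (fix dispatches only for Critical and Important, Minor findings carried forward to the final review) can be sketched as a small partition. The `Finding` type and `route_findings` name are hypothetical; the PR expresses this as controller guidance, not code.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Severities that trigger a fix dispatch, per the commit message.
FIX_SEVERITIES = {"Critical", "Important"}


@dataclass(frozen=True)
class Finding:
    severity: str  # "Critical", "Important", or "Minor"
    summary: str


def route_findings(findings: List[Finding]) -> Tuple[List[Finding], List[Finding]]:
    """Split findings into (dispatch a fix now, carry to the final review)."""
    to_fix = [f for f in findings if f.severity in FIX_SEVERITIES]
    carried = [f for f in findings if f.severity == "Minor"]
    return to_fix, carried
```

The second list is the part that was missing before this change: a Minor finding that is never carried anywhere is indistinguishable from no finding at all.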
Adoption was 6/11 reviews on fractals and 0/17 on svelte when phrased as guidance; reviewers without the diff re-derive it by hand, which is the single largest remaining reviewer cost. Now a Red Flags Never entry and a REQUIRED marker on the template placeholder.
Fourth planted-defect failure mode: the implementer's self-report said 'noted mild structural duplication; left unabstracted per YAGNI' and the reviewer deferred to that framing, rating the duplication no finding at all. The pre-judging keeps relocating — controller prompt, then reviewer calibration, now the implementer's report. Rationales are claims; they never downgrade severity.
Paste adoption stayed at 0/15 even as a Red Flag — and the controller's reluctance is locally rational: pasting loads the diff into the (most expensive) controller context permanently, while a reviewer self-fetch costs a few cheap turns. The diff-file handoff is cheap for both sides: the controller redirects git diff to /tmp without reading it, and the reviewer gets the whole change in one Read call.
The skill read as a changelog: 'combined task review,' 'one reviewer, one reading,' 'one dispatch,' and an example still showing diffs pasted into prompts. A reader who never saw the two-reviewer design has no referent for 'combined.' Prose now states the design directly, and the flowchart/example reflect the diff-file handoff.
scripts/review-package generates the reviewer's input deterministically: commit list, stat summary, and net diff with -U10 context, written to a file from an explicit BASE. Live runs showed controllers improvising 'git diff HEAD~1..HEAD', which silently truncates multi-commit tasks, and svelte's five fix dispatches shipped without re-running any tests — fix dispatches now explicitly carry the implementer's re-run-and-report contract.
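The real `scripts/review-package` is a bash script whose source is not shown here. The following Python sketch only illustrates the contract the commit message describes (commit list, stat summary, net diff with `-U10`, all taken from an explicit BASE); the function names are hypothetical, and the output path pattern follows the one given in the PR description.

```python
import subprocess
from pathlib import Path
from typing import List


def package_commands(base: str, head: str = "HEAD") -> List[List[str]]:
    """Git commands for the three sections, all anchored on an explicit base."""
    return [
        ["git", "log", "--oneline", f"{base}..{head}"],  # commit list
        ["git", "diff", "--stat", base, head],            # stat summary
        ["git", "diff", "-U10", base, head],              # net diff, 10 context lines
    ]


def package_path(git_dir: str, base_sha: str, head_sha: str) -> Path:
    """Self-describing file name: review-<base7>..<head7>.diff under <git-dir>/sdd."""
    return Path(git_dir) / "sdd" / f"review-{base_sha[:7]}..{head_sha[:7]}.diff"


def write_package(base: str, head: str = "HEAD") -> Path:
    """Run the commands in the current repository and write one review file."""
    def git(*args: str) -> str:
        result = subprocess.run(["git", *args], check=True, capture_output=True, text=True)
        return result.stdout

    git_dir = git("rev-parse", "--absolute-git-dir").strip()
    base_sha = git("rev-parse", base).strip()
    head_sha = git("rev-parse", head).strip()
    out = package_path(git_dir, base_sha, head_sha)
    out.parent.mkdir(parents=True, exist_ok=True)
    sections = [git(*cmd[1:]) for cmd in package_commands(base_sha, head_sha)]
    out.write_text("\n".join(sections))
    return out
```

Because every section is computed from the same BASE, a task that spans three commits yields the net change across all three, whereas `git diff HEAD~1..HEAD` would show only the last one.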
…d, writing-plans variants
…ackage, REQUIRED model lines, reviewer risk budget
Validated 2026-06-10 (all gates pass): go-fractals 54.1-54.7 min / $12.81-14.31 (baseline 64.9 / $16.07); svelte-todo 55.0 min / 19.3M / $14.99 (baseline 79.7 / 27.3M / $20.98); planted-defect pass $2.77. Dispatch-model discipline 3/3 runs after moving model: into the templates as a REQUIRED line. Full experiment log: evals docs/experiments/2026-06-10-sdd-cost-experiments.md
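To read the validation figures as relative savings, the arithmetic is (baseline − measured) / baseline. A short sketch using only the svelte-todo numbers quoted in this commit message; the helper name is an assumption:

```python
def savings_pct(baseline: float, measured: float) -> float:
    """Percentage by which a measured value falls below its baseline."""
    return (baseline - measured) / baseline * 100.0


# svelte-todo figures from the commit message: (baseline, measured)
svelte_todo = {
    "minutes": (79.7, 55.0),
    "tokens_M": (27.3, 19.3),
    "dollars": (20.98, 14.99),
}

for axis, (baseline, measured) in svelte_todo.items():
    print(f"{axis}: {savings_pct(baseline, measured):.1f}% below baseline")
```

This is a single run, so given the run-to-run variance reported elsewhere in the PR, the result is one draw, not a typical value.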
…eport completeness as checklist
…l validated ranges)
…lan-side crispness first
…con positive, L3 dead)
…nt clean, escalation points unstressed)
Carries the planted-defect + crisp scenarios, batch A-E experiment logs, claude-sonnet model-variant target, and method docs — rebased onto the obol migration and pushed to superpowers-evals main.
Before/after eval results (promised in the PR body)
All runs: superpowers-evals quorum harness driving real Claude Code sessions (…
End-to-end SDD scenarios
The worst fractals draw beats the baseline on every axis; typical mid-band savings are ~20-25% across time, tokens, and dollars.
Behavior-gate scenarios
The planted-defect scenario seeds a fixture plan with verbatim duplicated logic and an assertion-free test whose name promises verification it never performs; passing requires the task reviewer to flag the duplication openly and the lying test to not survive the session.
Where the savings come from (per-subagent transcript profiling)
Durability changes measured from real-session mining (not visible in 45-min evals)
Transcript mining of real local SDD sessions found the most expensive failures only long sessions hit: controllers re-dispatching entire completed task sequences after context compaction (269 dispatches for ~22 tasks), and a final review spawning 7 per-finding fixers whose cost exceeded all preceding tasks. The progress ledger (…
Quality checks
Full iteration history: spec
cc @arittr: branch is refreshed with the full optimization campaign (body has the summary; the complete experiment log, including negative results, lives in superpowers-evals)
Who is submitting this PR? (required)
claude-fable-5[1m])
What problem are you trying to solve?
SDD's per-task code quality reviewers routinely did branch-review-scale work on single-task diffs. A field report from a live Serf session flagged reviewers doing repo-wide greps, full-file reads across adjacent systems, package-wide -race runs, and -count=100 test loops, making the parent session appear stuck while a child burned tokens on validation the implementer had already done.
We could not verify that report's cited session (it isn't on this machine), so we mined two real local SDD sessions instead (a1a6719a… sen-core-v2, 0cc1a12d… serf). Confirmed: 7/8 quality reviewers in one session ran repo-wide greps; the most expensive ran 50+ Bash commands over ~200s and ~1.5M tokens in; quality reviewers cost 4-8× what spec reviewers cost on the same tasks. Notably, no reviewer ran heavy tests autonomously: every package-wide or repeated test run was explicitly requested by a controller-written prompt ("check all uses," "run tests if useful, especially race-focused ones"). Spec reviewers, whose prompt has a diff-scope guard (#1595), stayed tight: 6-16 tool calls, 14-65s.
Root causes:
requesting-code-review/code-reviewer.md, a merge-readiness review (architecture, security, production readiness, "Ready to merge?"), so every per-task review inherited branch-level breadth.
Live before/after eval runs then surfaced a second problem the field report couldn't see: per-dispatch overhead. Three subagent spin-ups per task (implementer + two reviewers), each re-deriving the task diff with its own git commands, plus controller coordination, made dispatch count, not any single reviewer's behavior, the dominant cost. That evidence (detailed in the spec's "Cost iterations" section) is what drove the design to its final shape.
What does this PR change?
The per-task review is now one reviewer, one reading of the diff, two verdicts:
- task-reviewer-prompt.md replaces spec-reviewer-prompt.md + code-quality-reviewer-prompt.md (both deleted). One self-contained, task-scoped template returns Part 1 spec compliance (✅/❌, plus an explicit "…
- scripts/review-package generates the reviewer's diff file (commit list + --stat + diff -U10), defaulting to a unique self-describing path (<git-dir>/sdd/review-<base7>..<head7>.diff; worktree- and submodule-safe, so concurrent sessions cannot collide and a re-review after fixes always gets a distinctly named fresh file). Explicit BASE because controllers improvised HEAD~1, which truncates multi-commit tasks. The final whole-branch reviewer gets a branch-wide package the same way: measured 33 turns/23 tool calls → 6 turns/3 calls at controller-model prices.
- scripts/task-brief extracts one task's text from the plan to a file the implementer reads directly; implementers write detailed reports to files and return ≤15-line summaries. Dispatch prompts follow a five-part composition recipe (micro-tested: the positive recipe beat a "do not restate" prohibition 3.0 vs 4.4 transcribed values; the prohibition scored worse than no guidance at all).
- <git-dir>/sdd/progress.md): one line per completed task (commit range + verdict) plus accumulated Minor findings, written as part of normal bookkeeping. Conversation memory does not survive compaction: transcript mining of real sessions found controllers re-dispatching entire completed task sequences afterwards (269 dispatches for ~22 tasks); the ledger is the durable recovery map. Final-review findings go to ONE omnibus fix subagent (a real session's per-finding fix wave cost more than all its tasks combined); fix dispatches name their covering test files.
- model: is a REQUIRED line in both prompt templates. "Always specify the model" as prose guidance decayed mid-session in a measured run (17 dispatches silently inherited the most expensive model, +40% run cost); as a template placeholder it held in 3 of 3 validation runs.
- implementer-prompt.md: run the focused test while iterating, the full suite once before committing; after fixing a review finding, re-run the covering tests and report results (this is what lets reviewers not re-run them).
- SKILL.md controller guidance: Model Selection rewritten around turn count, not token price (cheap models take 2-3× the turns on multi-step work, so mid-tier is the floor; cheapest tier only for single-file mechanical fixes; always specify a model explicitly, since omitting it silently inherits the session's, usually most expensive, model). Reviewer-prompt construction rules: no open-ended directives, no test re-runs on unchanged code, never pre-judge findings for the reviewer (phrase-level: "do not flag," "at most Minor," "the plan chose"), include the design's global constraints that bind the task, hand the diff as a file via the script.
- ../requesting-code-review/code-reviewer.md (unchanged and still broad).
- using-superpowers/references/{antigravity,gemini}-tools.md): reviewer template names updated to task-reviewer.
- sdd-quality-reviewer-catches-planted-defect in superpowers-evals (separate repo; submodule pointer bump follows propagation): a fixture plan plants verbatim duplication and an assertion-free test whose name promises verification it never performs; the run passes only if the reviewer flags the duplication openly and the lying test does not survive.
Design spec and implementation plan are in the diff (docs/superpowers/specs/2026-06-09-…, docs/superpowers/plans/2026-06-09-…); the spec's "Cost iterations" section records every optimization round with measurements.
Deliberately preserved: full re-reviews (no re-review narrowing), coordinator model judgment (no forced tier), and skills/requesting-code-review/ untouched; it remains the broad final/ad-hoc review template.
Is this change appropriate for the core library?
Yes. It modifies core SDD workflow files that every superpowers user exercises, regardless of project type. No third-party integrations, no project-specific configuration. The new script is plain bash with no dependencies beyond git.
What alternatives did you consider?
requesting-code-review/code-reviewer.md itself. Rejected: it serves final branch review and ad-hoc review, which should stay broad; narrowing it would break the "per-task narrow, final broad" distinction this PR creates.
Does this PR contain multiple unrelated changes?
No. All changes implement one design (task-scoped per-task review dispatch) from one spec, and depend on each other: the reviewer's "don't re-run the implementer's tests" rule requires the implementer's re-run-after-fix rule; the task reviewer's diff-file contract requires the script and the controller guidance that invokes it; the tool-table renames track the prompt-file rename.
Existing PRs
Environment tested
New harness support (required if this PR adds a new harness)
Not applicable; no new harness. (The cross-platform tool tables were checked: Antigravity maps reviewer templates to the read-only research type, so the task-reviewer prompt includes "name the test you would run" phrasing, keeping that mapping valid; both tables' template names were updated.)
Evaluation
2026-06-09-code-quality-reviewer-scope-budget-issue.md) and asked for agents to "study our own quality and spec compliance reviews for issues — we're trying to make the reviews more efficient without sacrificing quality." Three study agents analyzed the prompt chain and mined real session transcripts before any design work.
docs/experiments/2026-06-10-sdd-cost-experiments.md). All numbers below are honest ranges: a same-config re-run exposed ±20% run-to-run variance, so we report across all same-design runs rather than cherry-picking:
- sdd-go-fractals (9 runs of the new design): 44.4-59.6 min / 13.4-20.0M / $11.67-14.84 vs baseline 64.9 min / 21.2M / $16.07; the worst draw beats baseline on every axis; typical mid-band savings ~20-25%
- sdd-svelte-todo (final config, 2 runs): 55.0-69.3 min / 19.3-24.1M / $14.99-20.30 vs baseline 79.7 / 27.3M / $20.98; time and tokens clearly better; cost overlaps baseline at the top of the range (the expensive run hit 9 review-fix waves across 12 tasks, which is review-strictness variance, with all 34 dispatches model-disciplined)
- sdd-rejects-extra-features: $1.31-1.37 vs $1.88 baseline; spec-reviewer-catches-planted-flaws: pass, costs flat
- sdd-quality-reviewer-catches-planted-defect: pass on the final config ($2.77); planted duplication flagged openly, the assertion-free test with the lying name caught and fixed
Rigor
superpowers:writing-skills-style discipline via the full superpowers pipeline (brainstorming → spec → adversarial spec review → plan → subagent-driven implementation with two-stage review per task → final whole-branch review), followed by three measured optimization iterations against live eval runs
Adversarial testing summary: two competing adversarial reviewers attacked the design spec (13 findings; 9 accepted and fixed); every implementation task went through subagent review with review-fix loops; the planted-defect eval failed through five distinct suppression mechanisms during development (controller pre-judging findings, severity pre-rating, reviewer calibration, implementer rationale-framing, and an eval bar that was itself wrong), each fixed in the prompts generally, not by teaching to the test; a final whole-branch reviewer verified the shipped files against the plan.
Human review