fix(sdd): task-scoped review dispatch — single task reviewer, review-package script, eval-tuned by obra · Pull Request #1717 · obra/superpowers

obra · 2026-06-10T00:53:36Z

Targets dev per the template requirement.

Who is submitting this PR? (required)

Field	Value
Your model + version	Claude Fable 5 (`claude-fable-5[1m]`)
Harness + version	Claude Code 2.1.170
All plugins installed	superpowers 5.1.0 (marketplace install; work done in a worktree of this repo), superpowers-evals quorum lab as testing apparatus, plus unrelated local plugins (episodic-memory, superpowers-chrome, plugin-dev, agent-sdk-dev, linear, context7, primeradiant-ops, claude-session-driver, github-triage)
Human partner who reviewed this diff	Jesse Vincent (@obra) — directed the work and approved the design spec, the implementation plan, every review-gate summary, and each cost-optimization iteration; requested this PR and assigned @arittr for full-diff review

What problem are you trying to solve?

SDD's per-task code quality reviewers routinely did branch-review-scale work on single-task diffs. A field report from a live Serf session flagged reviewers doing repo-wide greps, full-file reads across adjacent systems, package-wide -race runs, and -count=100 test loops — making the parent session appear stuck while a child burned tokens on validation the implementer had already done.

We could not verify that report's cited session (it isn't on this machine), so we mined two real local SDD sessions instead (a1a6719a… sen-core-v2, 0cc1a12d… serf). Confirmed: 7/8 quality reviewers in one session ran repo-wide greps; the most expensive ran 50+ Bash commands over ~200s and ~1.5M tokens in; quality reviewers cost 4-8× what spec reviewers cost on the same tasks. Notably, no reviewer ran heavy tests autonomously — every package-wide or repeated test run was explicitly requested by a controller-written prompt ("check all uses," "run tests if useful, especially race-focused ones"). Spec reviewers, whose prompt has a diff-scope guard (#1595), stayed tight: 6-16 tool calls, 14-65s.

Root causes:

The per-task quality prompt delegated to requesting-code-review/code-reviewer.md — a merge-readiness review (architecture, security, production readiness, "Ready to merge?") — so every per-task review inherited branch-level breadth.
The controller got zero guidance on writing reviewer prompts, so it invented open-ended directives reviewers interpreted literally.
Duplicated work: two reviewers per task each re-derived and re-read the same diff; the quality template re-checked "Plan alignment" the spec reviewer had just verified; reviewers re-ran suites the implementer had already run (and reported with TDD evidence) on identical code.

Live before/after eval runs then surfaced a second problem the field report couldn't see: per-dispatch overhead. Three subagent spin-ups per task (implementer + two reviewers), each re-deriving the task diff with its own git commands, plus controller coordination, made dispatch count — not any single reviewer's behavior — the dominant cost. That evidence (detailed in the spec's "Cost iterations" section) is what drove the design to its final shape.

What does this PR change?

The per-task review is now one reviewer, one reading of the diff, two verdicts:

New task-reviewer-prompt.md replaces spec-reviewer-prompt.md + code-quality-reviewer-prompt.md (both deleted). One self-contained, task-scoped template returns Part 1 spec compliance (✅/❌, plus an explicit "⚠️ Cannot verify from diff" channel) and Part 2 code quality (Strengths / Critical / Important / Minor, "Task quality: Approved | Needs fixes"). Scope budget (read changed files first; inspect adjacent code only for a named concrete risk), test budget (the implementer's reported runs cover this exact code; a reviewer runs at most a focused test for a specific doubt), evidence rule (cite file:line, don't narrate), calibration (Important = cannot-trust-until-fixed: incorrect/fragile behavior, missed requirement, or merge-blocking maintainability damage — verbatim logic duplication, swallowed errors, assertion-free tests), and skepticism extended to the implementer's rationales (a stated "per YAGNI" never downgrades a finding's severity).
New scripts/review-package generates the reviewer's diff file (commit list + --stat + diff -U10), defaulting to a unique self-describing path (<git-dir>/sdd/review-<base7>..<head7>.diff — worktree- and submodule-safe, so concurrent sessions cannot collide and a re-review after fixes always gets a distinctly named fresh file). Explicit BASE because controllers improvised HEAD~1, which truncates multi-commit tasks. The final whole-branch reviewer gets a branch-wide package the same way — measured 33 turns/23 tool calls → 6 turns/3 calls at controller-model prices.
New scripts/task-brief extracts one task's text from the plan to a file the implementer reads directly; implementers write detailed reports to files and return ≤15-line summaries. Dispatch prompts follow a five-part composition recipe (micro-tested: the positive recipe beat a "do not restate" prohibition 3.0 vs 4.4 transcribed values — the prohibition scored worse than no guidance at all).
Progress ledger (<git-dir>/sdd/progress.md): one line per completed task (commit range + verdict) plus accumulated Minor findings, written as part of normal bookkeeping. Conversation memory does not survive compaction — transcript mining of real sessions found controllers re-dispatching entire completed task sequences afterwards (269 dispatches for ~22 tasks); the ledger is the durable recovery map. Final-review findings go to ONE omnibus fix subagent (a real session's per-finding fix wave cost more than all its tasks combined); fix dispatches name their covering test files.
model: is a REQUIRED line in both prompt templates. "Always specify the model" as prose guidance decayed mid-session in a measured run (17 dispatches silently inherited the most expensive model, +40% run cost); as a template placeholder it held in 3 of 3 validation runs.
implementer-prompt.md: run the focused test while iterating, the full suite once before committing; after fixing a review finding, re-run the covering tests and report results (this is what lets reviewers not re-run them).
SKILL.md controller guidance: Model Selection rewritten around turn count, not token price (cheap models take 2-3× the turns on multi-step work — mid-tier is the floor; cheapest tier only for single-file mechanical fixes; always specify a model explicitly — omitting it silently inherits the session's, usually most expensive, model). Reviewer-prompt construction rules: no open-ended directives, no test re-runs on unchanged code, never pre-judge findings for the reviewer (phrase-level: "do not flag," "at most Minor," "the plan chose"), include the design's global constraints that bind the task, hand the diff as a file via the script. ⚠️-item handling (controller resolves each itself; confirmed gaps go back to the implementer as failed spec review). Minor findings roll up into the final review for triage. Fix dispatches carry the re-run-tests contract. Final whole-branch review explicitly pointed at ../requesting-code-review/code-reviewer.md (unchanged and still broad).
Cross-platform tool tables (using-superpowers/references/{antigravity,gemini}-tools.md): reviewer template names updated to task-reviewer.
New eval scenario sdd-quality-reviewer-catches-planted-defect in superpowers-evals (separate repo; submodule pointer bump follows propagation): a fixture plan plants verbatim duplication and an assertion-free test whose name promises verification it never performs; the run passes only if the reviewer flags the duplication openly and the lying test does not survive.

Design spec and implementation plan are in the diff (docs/superpowers/specs/2026-06-09-…, docs/superpowers/plans/2026-06-09-…); the spec's "Cost iterations" section records every optimization round with measurements.

Deliberately preserved: full re-reviews (no re-review narrowing), coordinator model judgment (no forced tier), and skills/requesting-code-review/ untouched — it remains the broad final/ad-hoc review template.

Is this change appropriate for the core library?

Yes. It modifies core SDD workflow files that every superpowers user exercises, regardless of project type. No third-party integrations, no project-specific configuration. The new script is plain bash with no dependencies beyond git.

What alternatives did you consider?

Keeping two separate per-task reviewers (the design's original shape, for bias isolation between spec and quality judgment). This was the maintainer's initial preference and the first shipped iteration. Live eval economics overturned it: per-dispatch overhead and duplicate diff ingestion made the two-reviewer round slower and more expensive than baseline (69.9 min vs 65 min on go-fractals) even after prompt tightening. The maintainer explicitly reopened the decision ("everything's on the table") and approved the merge after the single task reviewer beat baseline on every axis with better blind-judged deliverable quality.
Scoped re-reviews (the field report's main ask: verify only the prior finding + amendment). Rejected — full re-reviews are intentional; the test budget, not reading-scope narrowing, curbs the report's worst cited example.
Turn-count guidance and prompt tightening alone (iteration 1). Measured insufficient: tokens down 29%, wall-clock flat, and controllers adopted optional paste-the-diff guidance in only 2 of 22 dispatches. Optional guidance to agents goes unadopted; the script + REQUIRED placeholder made it structural.
Forcing cheaper models for reviews. Rejected by the maintainer in favor of coordinator judgment; the shipped guidance instead explains the turn-count cost model and sets a mid-tier floor for multi-step work.
Narrowing requesting-code-review/code-reviewer.md itself. Rejected: it serves final branch review and ad-hoc review, which should stay broad; narrowing it would break the "per-task narrow, final broad" distinction this PR creates.

Does this PR contain multiple unrelated changes?

No. All changes implement one design (task-scoped per-task review dispatch) from one spec, and depend on each other: the reviewer's "don't re-run the implementer's tests" rule requires the implementer's re-run-after-fix rule; the task reviewer's diff-file contract requires the script and the controller guidance that invokes it; the tool-table renames track the prompt-file rename.

Existing PRs

I have reviewed all open AND closed PRs for duplicates or prior art
Related PRs: fix: scope SDD spec reviewer to task diff #1595 (closed/merged — scoped the spec reviewer to the task diff and made reviewers read-only; this PR extends the same discipline to the whole per-task review and the controller), fix(skills): SDD review fanout scales with the change #1716 (open — scales review fanout with change size; complementary mechanism, same motivation; no file-level conflict expected beyond SKILL.md context), [codex] Clarify SDD code review overlap #1572 (open — clarifies requesting-code-review/SDD overlap wording; this PR's Integration-bullet change addresses the same confusion from the SDD side without touching requesting-code-review)

Environment tested

Harness (e.g. Claude Code, Cursor)	Harness version	Model	Model version/ID
Claude Code	2.1.170	Claude Fable 5	claude-fable-5[1m]

New harness support (required if this PR adds a new harness)

Not applicable — no new harness. (The cross-platform tool tables were checked: Antigravity maps reviewer templates to the read-only research type, so the task-reviewer prompt includes "name the test you would run" phrasing, keeping that mapping valid; both tables' template names were updated.)

Evaluation

Initial prompt: the human partner provided a field report (2026-06-09-code-quality-reviewer-scope-budget-issue.md) and asked for agents to "study our own quality and spec compliance reviews for issues — we're trying to make the reviews more efficient without sacrificing quality." Three study agents analyzed the prompt chain and mined real session transcripts before any design work.
Live before/after quorum runs drove five optimization iterations over two days (full history with measurements in the spec; complete experiment log including negative results in the evals repo, docs/experiments/2026-06-10-sdd-cost-experiments.md). All numbers below are honest ranges — a same-config re-run exposed ±20% run-to-run variance, so we report across all same-design runs rather than cherry-picking:
- sdd-go-fractals (9 runs of the new design): 44.4-59.6 min / 13.4-20.0M / $11.67-14.84 vs baseline 64.9 min / 21.2M / $16.07 — the worst draw beats baseline on every axis; typical mid-band savings ~20-25%
- sdd-svelte-todo (final config, 2 runs): 55.0-69.3 min / 19.3-24.1M / $14.99-20.30 vs baseline 79.7 / 27.3M / $20.98 — time and tokens clearly better; cost overlaps baseline at the top of the range (the expensive run hit 9 review-fix waves across 12 tasks — review-strictness variance, with all 34 dispatches model-disciplined)
- sdd-rejects-extra-features: $1.31-1.37 vs $1.88 baseline; spec-reviewer-catches-planted-flaws: pass, costs flat
- new sdd-quality-reviewer-catches-planted-defect: pass on the final config ($2.77) — planted duplication flagged openly, the assertion-free test with the lying name caught and fixed
Mechanism-level measurements: task reviewers handed a review-package file averaged ~3 turns / 1 tool call vs ~9 turns / 6.4 calls re-deriving diffs (variance note: reviewer escape-hatch appetite swings this run-to-run; the budget bounds it); the final reviewer dropped 33→6 turns with a branch package; reviewers ran zero redundant test suites under the test budget; dispatch-model discipline held 3/3 runs with the REQUIRED template line; blind A/B deliverable comparison (neutral paths, rotated labels, fresh judge) scored the new design's deliverables at parity-or-better with baseline; a 3-run ablation removing the brief/report file mechanism (keeping everything else) measured cost-neutral ($11.61-13.25 vs the same-night control $14.10, inside the band) — the file handoffs are justified by requirements fidelity and compaction durability, not dollars, and we say so rather than claim savings.
Tested and declined, with data (recorded so they aren't re-proposed): controller turn-batching and parallel-call review pipelining (the controller emits exactly one tool call per message — zero multi-tool messages in every measured run), background-dispatch pipelining (mechanism adopted but benefit below the noise floor on these scenarios).
Honest caveat (also in the spec): per-run deliverable quality is stochastic — defect classes recur across configs run-to-run; quantifying catch-rate differences would need N runs per config, which we did not do.
Known eval gap (documented in the spec): no pre-existing scenario measured per-task reviewer catch rate, which is why this PR adds the planted-defect scenario.

Rigor

If this is a skills change: I used superpowers:writing-skills-style discipline via the full superpowers pipeline — brainstorming → spec → adversarial spec review → plan → subagent-driven implementation with two-stage review per task → final whole-branch review — followed by three measured optimization iterations against live eval runs
This change was tested adversarially, not just on the happy path
I did not modify carefully-tuned content (Red Flags table, rationalizations, "human partner" language) without evals showing the change is an improvement — the Red Flags additions and the spec-reviewer skepticism rewrite are covered by the five-scenario matrix above

Adversarial testing summary: two competing adversarial reviewers attacked the design spec (13 findings; 9 accepted and fixed); every implementation task went through subagent review with review-fix loops; the planted-defect eval failed through five distinct suppression mechanisms during development (controller pre-judging findings, severity pre-rating, reviewer calibration, implementer rationale-framing, and an eval bar that was itself wrong) — each fixed in the prompts generally, not by teaching to the test; a final whole-branch reviewer verified the shipped files against the plan.

Human review

A human has reviewed the COMPLETE proposed diff before submission — review structure: Jesse Vincent (@obra, repo owner) gated every stage (spec approval, plan approval, per-task review summaries, each optimization iteration, final review summary) and directed this PR; he has assigned @arittr for the complete line-by-line diff review in-PR

…hannel

…inter, model judgment

A live eval run of sdd-quality-reviewer-catches-planted-defect caught the SDD controller fabricating a plan constraint and instructing the quality reviewer not to flag the planted DRY violation. The duplication shipped. Constructing Reviewer Prompts now bans suppression directives alongside open-ended broadening directives.

In live eval runs, controllers given judgment-based model selection stopped passing a model at all; the omitted parameter inherits the session's top-tier model, silently making every subagent maximally expensive (one run dispatched 26/26 reviewers on the session model).

Live eval deliverables shipped five polish defects; tracing each through the transcripts showed three mechanisms, each now addressed: - reviewers answered pointed checklist items with unsupported yes (evidence rule: every What-to-Check answer needs file:line evidence) - no reviewer ever saw the design's global constraints (controllers now paste binding constraints into task requirements) - test output noise was invisible everywhere (pristine-output checks in implementer self-review and quality review)

Second observed instance: with the Constructing Reviewer Prompts rule already live, a controller still wrote 'do not treat that duplication as a defect to fix — the plan chose it; you may note it as a Minor observation at most' into a quality reviewer dispatch, fabricating plan intent from the plan's example snippet. Promote the rule to the Red Flags Never list and name the rationalization.

Resumed the offending eval controller session and asked it why it pre-judged despite the rule being in context. Its retrospective: the motive was avoiding a review loop, the abstract rule was read but not applied at the moment it governs, and a phrase-level trigger ('do not flag', 'at most Minor', 'don't treat X as a defect', 'the plan chose') would have fired where the principle did not.

…ence Round-2 fractals eval regressed to 70min/32.2M tokens (vs round-1's 42.8min/14.5M) while reaching baseline-parity quality. Per-subagent turn profiling attributed it to: haiku dispatches taking 2-3x the turns of sonnet (678 of 1197 subagent turns), reviewers re-fetching diffs by hand (518 Bash calls), and evidence-rule narration. Changes: turn-count-beats- token-price model guidance; controllers paste small diffs into reviewer prompts (reviewers then need few or no tool calls); evidence scoped to findings and would-be-bare-yes checks; Important defined as cannot-trust- until-fixed with coverage suggestions Minor; fixes dispatched only for Critical/Important.

Iteration-1 profiling: implementers and per-dispatch overhead dominate (429 of 686 subagent turns; controller coordination is half the dollars and scales with dispatch count), reviewers are individually lean, and the controller pasted the diff in only 2 of 22 review dispatches when the guidance was phrased as optional. Changes: spec-reviewer-prompt.md + code-quality-reviewer-prompt.md replaced by task-reviewer-prompt.md (one reviewer, one reading of a pasted diff, two verdicts: spec compliance ✅/❌/⚠️ and task quality); one fix dispatch can address both kinds of findings; controller now runs git diff itself and pastes it (imperative, not optional); implementers run focused tests while iterating and the full suite once before committing; flowchart, example, Red Flags, tool tables updated. The broad final whole-branch review is unchanged.

With merged review, a planted verbatim-duplication defect shipped: the reviewer rated it Minor (YAGNI) under the strict cannot-be-trusted definition of Important, and the Minor-rolls-up rule meant no fix was ever dispatched and the final review never saw the finding. Calibration now names merge-blocking maintainability damage (verbatim duplication, swallowed errors, assertion-free tests) as Important, and controllers must paste accumulated Minor findings into the final review dispatch.

Adoption was 6/11 reviews on fractals and 0/17 on svelte when phrased as guidance; reviewers without the diff re-derive it by hand, which is the single largest remaining reviewer cost. Now a Red Flags Never entry and a REQUIRED marker on the template placeholder.

Fourth planted-defect failure mode: the implementer's self-report said 'noted mild structural duplication; left unabstracted per YAGNI' and the reviewer deferred to that framing, rating the duplication no finding at all. The pre-judging keeps relocating — controller prompt, then reviewer calibration, now the implementer's report. Rationales are claims; they never downgrade severity.

Paste adoption stayed at 0/15 even as a Red Flag — and the controller's reluctance is locally rational: pasting loads the diff into the (most expensive) controller context permanently, while a reviewer self-fetch costs a few cheap turns. The diff-file handoff is cheap for both sides: the controller redirects git diff to /tmp without reading it, and the reviewer gets the whole change in one Read call.

The skill read as a changelog: 'combined task review,' 'one reviewer, one reading,' 'one dispatch,' and an example still showing diffs pasted into prompts. A reader who never saw the two-reviewer design has no referent for 'combined.' Prose now states the design directly, and the flowchart/example reflect the diff-file handoff.

scripts/review-package generates the reviewer's input deterministically: commit list, stat summary, and net diff with -U10 context, written to a file from an explicit BASE. Live runs showed controllers improvising 'git diff HEAD~1..HEAD', which silently truncates multi-commit tasks, and svelte's five fix dispatches shipped without re-running any tests — fix dispatches now explicitly carry the implementer's re-run-and-report contract.

…d, writing-plans variants

…ackage, REQUIRED model lines, reviewer risk budget Validated 2026-06-10 (all gates pass): go-fractals 54.1-54.7 min / $12.81-14.31 (baseline 64.9 / $16.07); svelte-todo 55.0 min / 19.3M / $14.99 (baseline 79.7 / 27.3M / $20.98); planted-defect pass $2.77. Dispatch-model discipline 3/3 runs after moving model: into the templates as a REQUIRED line. Full experiment log: evals docs/experiments/2026-06-10-sdd-cost-experiments.md

…eport completeness as checklist

…l validated ranges)

…lan-side crispness first

…con positive, L3 dead)

…nt clean, escalation points unstressed)

Carries the planted-defect + crisp scenarios, batch A-E experiment logs, claude-sonnet model-variant target, and method docs — rebased onto the obol migration and pushed to superpowers-evals main.

obra · 2026-06-11T02:57:26Z

Before/after eval results (promised in the PR body)

All runs: superpowers-evals quorum harness driving real Claude Code sessions (claude-fable-5 session model, subagents per the skill's model selection), one scenario per run, LLM-judged verdicts plus measured economics from the session's own token accounting. Baseline = origin/dev skills; "new design" = this branch. All numbers are ranges across every same-design run, not best-run cherry-picks — a same-config re-run measured ±20% variance (44.4 vs 57.1 min on identical prompts), so single-run deltas under that are noise.

End-to-end SDD scenarios

Scenario	Config	Verdict	Wall-clock	Tokens	Cost
sdd-go-fractals	baseline (dev)	pass	64.9 min	21.2M	$16.07
sdd-go-fractals	new design (8 runs)	8/8 pass	44.4-57.1 min	13.4-20.0M	$11.67-14.84
sdd-svelte-todo	baseline (dev)	pass	79.7 min	27.3M	$20.98
sdd-svelte-todo	final config (run 1)	pass	55.0 min (−31%)	19.3M (−29%)	$14.99 (−29%)
sdd-svelte-todo	final config (run 2)	pass	69.3 min (−13%)	24.1M (−12%)	$20.30 (−3%)

The worst fractals draw beats the baseline on every axis; typical mid-band savings are ~20-25% across time, tokens, and dollars.

Behavior-gate scenarios

Scenario	Config	Verdict	Cost
sdd-rejects-extra-features	baseline	pass	$1.88
sdd-rejects-extra-features	new design (2 runs)	pass	$1.31-1.37
spec-reviewer-catches-planted-flaws	baseline	pass	$0.50
spec-reviewer-catches-planted-flaws	new design	pass	$0.50
sdd-quality-reviewer-catches-planted-defect (new in this PR)	final config	pass	$2.77

The planted-defect scenario seeds a fixture plan with verbatim duplicated logic and an assertion-free test whose name promises verification it never performs; passing requires the task reviewer to flag the duplication openly and the lying test to not survive the session.

Where the savings come from (per-subagent transcript profiling)

One reviewer dispatch per task instead of two (merged spec+quality verdicts, one reading of the diff) — the single largest win.
Diff handed as a file, both per-task and final. Task reviewers given a review package averaged ~3 turns / 1 tool call vs ~9 turns / 6.4 calls re-deriving diffs with git; the final whole-branch reviewer dropped 33 turns / 23 calls → 6 / 3 with a branch-wide package — at controller-model prices. Optional "paste the diff" guidance was adopted in 2 of 22 dispatches; a script invoked from a REQUIRED flowchart step is what made it stick.
model: as a REQUIRED template line. Prose "always specify the model" decayed mid-session in one measured run — 17 dispatches silently inherited the session's most expensive model (+40% run cost). As a template placeholder: explicit on 57 of 57 dispatches across the three validation runs.
Turn-count model guidance. Cheap-tier subagents took 2-3× mid-tier's turns on multi-step work (678 of 1197 subagent turns in one regressed run); cost scales with turns × resident context, so the guidance sets a mid-tier floor.
Test budget held: reviewers ran zero redundant suites; fix dispatches name their covering tests.

Durability changes measured from real-session mining (not visible in 45-min evals)

Transcript mining of real local SDD sessions found the most expensive failures only long sessions hit: controllers re-dispatching entire completed task sequences after context compaction (269 dispatches for ~22 tasks), and a final review spawning 7 per-finding fixers whose cost exceeded all preceding tasks. The progress ledger (<git-dir>/sdd/progress.md) and the one-omnibus-fixer rule target these; the eval gates confirm they cost nothing on short runs.

Quality checks

Blind A/B deliverable comparison (workdirs copied to neutral paths, labels rotated between comparisons, fresh judge scoring against the fixture's own plan): the merged-reviewer design judged 9/10 vs baseline 7/10 in the deciding round; subsequent configs at parity within observed judge noise (±1.5).
Honest caveat (also recorded in the spec): per-run deliverable quality is stochastic — defect classes recur in some runs of both configs; quantifying catch-rate differences rigorously would need N runs per config, which we did not do.

Full iteration history: spec docs/superpowers/specs/2026-06-09-sdd-task-scoped-review-dispatch-design.md ("Cost iterations", iterations 1-5). Complete experiment log including the negative results (turn batching, two pipelining mechanisms — tested, declined, with data): superpowers-evals docs/experiments/2026-06-10-sdd-cost-experiments.md.

cc @arittr — branch is refreshed with the full optimization campaign (body has the summary; complete experiment log incl. negative results lives in superpowers-evals docs/experiments/2026-06-10-sdd-cost-experiments.md). Marked draft while it awaits your review.

obra added 11 commits June 9, 2026 16:26

Add design spec: task-scoped review dispatch for SDD

5da15d7

Harden review-dispatch spec per adversarial review findings

4192572

Add implementation plan for task-scoped review dispatch

f8dcd1e

Make per-task quality reviewer prompt self-contained and task-scoped

2cc449b

Use bare placeholder names in quality reviewer prompt body

da41209

Spec reviewer: judge from the diff, grounded skepticism, ⚠️ verdict c…

be8a626

…hannel

Scope spec reviewer's Your Job wording to the diff

c14c1de

Implementer prompt: re-run covering tests after fixing review findings

b3281c0

SDD controller: reviewer prompt budgets, ⚠️ handling, final-review po…

5aea3dc

…inter, model judgment

Sync plan's Task 5 blocks with review fixes

71dc271

Fix plan doc: correct Task 1 grep expectation; sync Task 5 story block

b3bb9a6

obra requested a review from arittr June 10, 2026 00:53

svenbledt mentioned this pull request Jun 10, 2026

fix(skills): account for Claude Fable in model selection and testing guidance #1719

Open

5 tasks

obra and others added 17 commits June 9, 2026 18:19

Sync plan: escaped pre() pattern in Task 5 checks block

09cb4d7

Spec: document cost iterations and the per-task review consolidation

e532f24

Spec: record iterations 2-3 results and final frozen-config matrix

7cf7843

Shared: unique review-package collateral names

a995af2

obra added 9 commits June 10, 2026 12:32

Spec: positive-instruction redesign — audit results, micro-test metho…

926096a

…d, writing-plans variants

Adopt audited positive phrasings: evidence rule leads positive; fix-r…

fe90d6c

…eport completeness as checklist

Spec: record iterations 4-5 (variance honesty, structural fixes, fina…

43a6ee2

…l validated ranges)

Record writing-plans micro-test result: resolved, no change needed

60fa4f6

Spec: strict-cost SDD experiment ladder — judgment as co-invariant, p…

9a25a75

…lan-side crispness first

Strict-cost spec: record batch A-E rung verdicts (L1 validated, L2 re…

27788fd

…con positive, L3 dead)

Strict-cost spec: L2 recon n=2 (sonnet controller $6.68/$8.05, judgme…

eba16f6

…nt clean, escalation points unstressed)

Bump evals submodule to merged superpowers-evals main (ac264b1)

ec014e7

Carries the planted-defect + crisp scenarios, batch A-E experiment logs, claude-sonnet model-variant target, and method docs — rebased onto the obol migration and pushed to superpowers-evals main.

obra changed the title ~~fix(sdd): task-scoped per-task reviews — self-contained quality prompt, scope/test budgets, controller guidance~~ fix(sdd): task-scoped review dispatch — single task reviewer, review-package script, eval-tuned Jun 11, 2026

obra marked this pull request as draft June 11, 2026 02:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(sdd): task-scoped review dispatch — single task reviewer, review-package script, eval-tuned#1717

fix(sdd): task-scoped review dispatch — single task reviewer, review-package script, eval-tuned#1717
obra wants to merge 37 commits into
devfrom
sdd-review-dispatch

obra commented Jun 10, 2026 •

edited

Loading

Uh oh!

obra commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

obra commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Who is submitting this PR? (required)

What problem are you trying to solve?

What does this PR change?

Is this change appropriate for the core library?

What alternatives did you consider?

Does this PR contain multiple unrelated changes?

Existing PRs

Environment tested

New harness support (required if this PR adds a new harness)

Evaluation

Rigor

Human review

Uh oh!

obra commented Jun 11, 2026

Before/after eval results (promised in the PR body)

End-to-end SDD scenarios

Behavior-gate scenarios

Where the savings come from (per-subagent transcript profiling)

Durability changes measured from real-session mining (not visible in 45-min evals)

Quality checks

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

obra commented Jun 10, 2026 •

edited

Loading