SDD: cost-tiering stack + plan-mandated-defect escalation + pre-flight plan review (stacked on #1717) by obra · Pull Request #1744 · obra/superpowers

obra · 2026-06-11T22:40:25Z

Stacked on #1717 (base branch sdd-review-dispatch): this PR contains only the optimization/judgment layer on top of the task-scoped review redesign. It retargets dev when #1717 lands. Draft until the #1717 eval pass is finished.

@arittr — sequencing: please don't stop the eval run you have going on #1717 / sdd-e27-stack. Once that one is put to bed, this branch is the next thing to eval — it is sdd-e27-stack plus three commits (the two plan-mandated-defect rules, the pre-flight plan review) and the evals/spec bumps.

Who is submitting this PR? (required)

| Field | Value |
| --- | --- |
| Your model + version | claude-fable-5 (Fable 5), session `7c4a7741-5e94-44b9-8c0f-3800d1241f89` |
| Harness + version | Claude Code 2.1.173 |
| All plugins installed | superpowers 5.1.0, episodic-memory, linear, context7, superpowers-chrome, plugin-dev, github-triage, agent-sdk-dev, code-simplifier |
| Human partner who reviewed this diff | @obra: directed each layer; reviewed the L2b rule texts, the pre-flight rule text, and the E27-stack layer verbatim before each push |

What problem are you trying to solve?

Three measured failure modes in subagent-driven development, each from real eval sessions:

- SDD runs cost more than they need to. Controllers dispatched implementers on the most capable model for transcription-grade tasks, reviewers narrated process instead of reporting findings, and controllers narrated every step. Measured across the 2026-06-10/11 cost campaign (46 runs + 30 micro-experiments).
- Reviewers advocate for plan-mandated defects. When a plan explicitly mandates something the quality rubric calls a defect (our planted fixture: a test named "renders correctly" that asserts nothing), reviewers praised it ("no assertion, as required" under Strengths) and the defect shipped. Run e8e9 showed a controller noticing the defect at plan-read and deliberately deferring to the review loop, because nothing told it that raising plan conflicts up front was allowed.
- Conflicts surface mid-plan instead of up front. By then the defect has been implemented, reviewed, and escalated, so the run pays for a fix dispatch and re-review that an up-front question would have avoided, and it interrupts the human at the worst time (hours in) rather than the best (kickoff, when they have just issued the command).

What does this PR change?

Five additions to skills/subagent-driven-development/ on top of #1717: (1) conditional implementer tiering — cheapest tier when the plan carries complete code; (2) final whole-branch review pinned to the most capable model; (3) a one-line narration recipe and a terse reviewer report contract; (4) two plan-mandated-defect rules — reviewer tripwire ("that IS a finding — Important, labeled plan-mandated; the human decides") and controller escalation rule ("present the finding and the plan text, ask which governs"); (5) a Pre-Flight Plan Review section — scan once before Task 1, batch all plan conflicts into one question. Plus the evals-submodule bump making the planted-defect scenario escalation-aware.
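Changes (1) and (2) amount to a small routing decision. The skill itself is prose guidance for the controller, not code, so the following is only a minimal sketch of that decision. The tier labels, the `Task` type, and the `default` fallback are hypothetical names chosen for illustration; the PR does not say which tier a task gets when the plan lacks complete code.

```python
from dataclasses import dataclass

# Hypothetical tier labels; the PR does not map tiers to model IDs.
CHEAPEST = "cheapest"
DEFAULT = "default"            # assumption: whatever tier the controller would otherwise use
MOST_CAPABLE = "most-capable"


@dataclass
class Task:
    name: str
    plan_has_complete_code: bool  # True when the plan spells out the code to write


def implementer_tier(task: Task) -> str:
    """Change (1): use the cheapest tier only when the plan carries complete code."""
    return CHEAPEST if task.plan_has_complete_code else DEFAULT


def final_review_tier() -> str:
    """Change (2): the final whole-branch review is pinned to the most capable model."""
    return MOST_CAPABLE


if __name__ == "__main__":
    for task in (Task("transcribe formatter", True), Task("design retry policy", False)):
        print(task.name, "->", implementer_tier(task))
    print("final review ->", final_review_tier())
```

The point of the condition is that the cheap tier is reserved for transcription-grade work; tasks that still need design judgment are not downgraded in this sketch.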

Is this change appropriate for the core library?

Yes — all changes are to the general-purpose SDD skill and benefit any project type. No third-party dependencies, no domain-specific content.

What alternatives did you consider?

- Cheaper controller (sonnet) instead of cheaper subagents: failed its quality gates. Explicit escalation held 5/5, but planted-defect adjudication collapsed into plan-advocacy 4/5 (E34).
- Haiku task reviewers: rejected. 0/10 planted defects were flagged at the correct severity; haiku advocates for defects (batch D).
- Reviewer template as a read-once file (to kill paraphrase drift): failed its gate, with 0/3 first-pass catches vs 3/5 inline; read-once delivery dilutes the template (E33).
- Thinking caps for cost: backfired. They raise the turn floor, and output tokens went up ~80% (E06).
- Tripwire placement within the dispatch (Calibration section vs inside the constraints block): refuted as the variable. The live failure is attention decay across the reviewer's tool reads, which is why the pre-flight rule (controller's own context, no dispatch channel) is the load-bearing fix (E35/E36/E37).
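The pre-flight rule that the last alternative points to (scan the plan once before Task 1, batch all plan conflicts into one question) can be sketched as follows. This is an illustration of the batching shape only, not the skill's rule text: `PlanConflict`, `preflight_question`, and the question wording are hypothetical, and the second conflict in the usage example is invented for the demo.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PlanConflict:
    task: str       # which plan task the conflict sits in
    plan_text: str  # what the plan mandates
    concern: str    # why that looks like a defect or contradiction


def preflight_question(conflicts: List[PlanConflict]) -> Optional[str]:
    """Batch every conflict found in the single pre-Task-1 scan into one question.

    Returns None when the scan found nothing, so a clean plan causes no interruption.
    """
    if not conflicts:
        return None
    lines = ["Before I dispatch Task 1, the plan has conflicts I need you to resolve:"]
    for i, c in enumerate(conflicts, start=1):
        lines.append(f"{i}. {c.task}: plan says '{c.plan_text}'; concern: {c.concern}")
    lines.append("For each item, which governs: the plan text or the concern?")
    return "\n".join(lines)


if __name__ == "__main__":
    found = [
        PlanConflict("Task 3", "test 'renders correctly' asserts nothing",
                     "a test with no assertion cannot fail"),
        # Hypothetical second conflict, for illustration only.
        PlanConflict("Task 5", "two steps give different output paths",
                     "the plan contradicts itself"),
    ]
    print(preflight_question(found))
```

Returning a single string for any number of conflicts is the design point: the human is asked once, at kickoff, rather than once per conflict mid-plan.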

Does this PR contain multiple unrelated changes?

The commits are one coherent layer: the cost stack and the judgment rules were co-developed against the same eval battery, and the pre-flight rule exists because the L2b battery exposed the dispatch-transmission gap. Splitting them would ship a cost stack whose planted-defect gate we know is weaker without the judgment rules.

Existing PRs

I have reviewed all open AND closed PRs for duplicates or prior art
Related PRs: #1717, "fix(sdd): task-scoped review dispatch - single task reviewer, review-package script, eval-tuned" (base of this stack); #1715, "fix(skills): plans reference the spec instead of restating it - end to end" (compatible: we eval-tested the interaction; see the comment on #1715)

Environment tested

| Harness | Harness version | Model | Model version/ID |
| --- | --- | --- | --- |
| Claude Code | 2.1.173 | Opus (controller + subagents) | claude-opus-4-8 |
| Claude Code | 2.1.173 | Sonnet (controller) / mixed subagents | claude-sonnet-4-6 |

New harness support

N/A — no new harness.

Evaluation

Initial prompt: quorum-driven eval sessions ("execute docs/superpowers/plans/report-plan.md with subagent-driven-development" and the fractals/svelte build scenarios), run via the superpowers-evals harness.
Sessions after the change: 30+ full quorum runs plus ~90 micro-test samples across the campaign; for the newest layer specifically: 9 full runs (L2b battery) + 5 full runs (E37) + 48 micro samples, every automated score manually verified.
Outcomes before → after:
- Cost: fractals $16.07 baseline → $6.24–6.60 on this stack; svelte $20.98 → $10.59–11.25.
- Plan-mandated defect, opus: reviewer praised it before; now 2/2 runs escalate to the human with the rule quoted, sanctioned fix lands.
- Pre-flight: without the rule, 0/12 micro samples and 0/5 prior full runs asked before dispatching; with it, 12/12 micro and 5/5 full sonnet runs ask before dispatch #0 (zero variance), including one run batching both planted conflicts into a single question.
- Escalation sanity (plan self-contradiction scenario): 2/2 pass with the new rules — the working behavior is preserved, and the catch moves from mid-plan to time zero.
- Known limits, stated plainly: sonnet-controller planted-defect remains 1/5 at the per-task reviewer gate (the pre-flight rule is what rescues the run outcome); the gauntlet judge is documented unreliable on this scenario (PRI-2160); benign-plan false-positive rate for the pre-flight scan is unmeasured at large-plan scale.
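For readers who want the Cost bullet as relative savings, the arithmetic is a one-liner. The dollar figures below are copied from that bullet; the helper name `reduction_pct` is just an illustrative choice.

```python
def reduction_pct(baseline: float, after: float) -> float:
    """Percentage cost reduction relative to the baseline, rounded to one decimal."""
    return round(100 * (1 - after / baseline), 1)


# (baseline, (cheapest run, most expensive run)) in dollars, from the Cost bullet.
runs = {
    "fractals": (16.07, (6.24, 6.60)),
    "svelte": (20.98, (10.59, 11.25)),
}

if __name__ == "__main__":
    for name, (baseline, (low, high)) in runs.items():
        # The most expensive run gives the smallest reduction, so it is printed first.
        print(f"{name}: {reduction_pct(baseline, high)}% to {reduction_pct(baseline, low)}% cheaper")
    # fractals: 58.9% to 61.2% cheaper
    # svelte: 46.4% to 49.5% cheaper
```

So the stack cuts the fractals scenario by roughly three fifths and the svelte scenario by a little under half.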

Rigor

If this is a skills change: I used superpowers:writing-skills methodology — micro-tests with no-guidance controls before full runs, pre-registered predictions, negative results logged at equal billing
This change was tested adversarially, not just on the happy path (planted-defect fixtures, plan self-contradiction fixture, frozen-input replays)
I did not modify carefully-tuned content (Red Flags table, rationalizations, "human partner" language) without extensive evals showing the change is an improvement

Human review

A human has reviewed the COMPLETE proposed diff before submission — @obra directed and approved each commit in-session; rule texts were reviewed verbatim before pushing

Reviewer tripwire (Calibration): a plan-mandated defect IS a finding, reported as Important and labeled plan-mandated — the plan's authorship does not grade its own work. Controller rule (review loop): a plan-mandated finding, or any finding conflicting with the plan's text, escalates to the human like any plan contradiction — never dismissed because the plan mandates it. E35 micro (frozen 0a98 replay, sonnet reviewer, 6v6): without the tripwire 0/6 reports give the controller anything to escalate on (all Approved, defect endorsed as spec-required); with it 6/6 report the defect as a labeled finding.
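The two rules in this commit message can be read as a small contract between reviewer and controller. The sketch below models that contract under stated assumptions: the `Finding` shape, the `MINOR` severity, and the action strings are hypothetical, since the real rules are prose in the reviewer and controller prompts rather than code.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    MINOR = "Minor"          # hypothetical level, not named in the PR
    IMPORTANT = "Important"  # the severity the tripwire assigns


@dataclass
class Finding:
    summary: str
    severity: Severity
    plan_mandated: bool = False        # reviewer label required by the tripwire
    conflicts_with_plan: bool = False  # finding contradicts the plan's text


def report_plan_mandated_defect(summary: str) -> Finding:
    """Reviewer tripwire: a defect the plan mandates is still a finding."""
    return Finding(summary, Severity.IMPORTANT, plan_mandated=True)


def controller_action(finding: Finding) -> str:
    """Controller rule: plan-mandated or plan-conflicting findings go to the human."""
    if finding.plan_mandated or finding.conflicts_with_plan:
        return "escalate_to_human"
    return "handle_in_review_loop"


if __name__ == "__main__":
    finding = report_plan_mandated_defect('test "renders correctly" asserts nothing')
    print(finding.severity.value, "->", controller_action(finding))
```

Note that nothing in `controller_action` lets the plan's mandate dismiss the finding; the only branch for a plan-mandated finding is escalation, which matches the "never dismissed because the plan mandates it" wording.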

obra and others added 6 commits June 11, 2026 13:17

- `90b5433` E03: cheapest-tier implementers when plan carries complete code (transcription hypothesis)
- `35464d6` E27 stack: conditional impl tier + final-review tier pin + narration recipe + terse reviewer contract
- `a2a4190` Spec: L2b tested - opus structural win, sonnet transmission+attention gap (E35/E36); bump evals to 9919b27
- `c87f336` E37: pre-flight plan review - surface plan conflicts as one batched question before Task 1
- `08447d0` Bump evals submodule: escalation-aware planted-defect scenario (7dc0be7)

GamingwithJJ mentioned this pull request Jun 13, 2026

SDD has no effort-level dimension: subagents run at session effort regardless of task complexity #1747

Open

obra wants to merge 6 commits into sdd-review-dispatch from sdd-l2b-plan-mandated
