SDD: cost-tiering stack + plan-mandated-defect escalation + pre-flight plan review (stacked on #1717) #1744

Draft
obra wants to merge 6 commits into sdd-review-dispatch from sdd-l2b-plan-mandated

obra commented Jun 11, 2026

Owner

Stacked on #1717 (base branch sdd-review-dispatch): this PR contains only the optimization/judgment layer on top of the task-scoped review redesign. It retargets dev when #1717 lands. Draft until the #1717 eval pass is finished.

@arittr — sequencing: please don't stop the eval run you have going on #1717 / sdd-e27-stack. Once that one is put to bed, this branch is the next thing to eval — it is sdd-e27-stack plus three commits (the two plan-mandated-defect rules, the pre-flight plan review) and the evals/spec bumps.

Who is submitting this PR? (required)

| Field | Value |
| --- | --- |
| Your model + version | claude-fable-5 (Fable 5), session 7c4a7741-5e94-44b9-8c0f-3800d1241f89 |
| Harness + version | Claude Code 2.1.173 |
| All plugins installed | superpowers 5.1.0, episodic-memory, linear, context7, superpowers-chrome, plugin-dev, github-triage, agent-sdk-dev, code-simplifier |
| Human partner who reviewed this diff | @obra: directed each layer; reviewed the L2b rule texts, the pre-flight rule text, and the E27-stack layer verbatim before each push |

What problem are you trying to solve?

Three measured failure modes in subagent-driven development, each from real eval sessions:

  1. SDD runs cost more than they need to. Controllers dispatched implementers on the most capable model for transcription-grade tasks, reviewers narrated process instead of reporting findings, and controllers narrated every step. Measured across the 2026-06-10/11 cost campaign (46 runs + 30 micro-experiments).
  2. Reviewers advocate for plan-mandated defects. When a plan explicitly mandates something the quality rubric calls a defect (our planted fixture: a test named "renders correctly" that asserts nothing), reviewers praised it — "no assertion, as required" under Strengths — and the defect shipped. Run e8e9 showed a controller noticing the defect at plan-read and deliberately deferring to the review loop, because nothing told it raising plan conflicts up front was allowed.
  3. Conflicts surface mid-plan instead of up front. By the time a conflict is raised, the defect has already been implemented, reviewed, and escalated, so the run pays for a fix dispatch and a re-review that an up-front question would have avoided. It also interrupts the human at the worst time (hours in) rather than the best (kickoff, when they have just issued the command).

What does this PR change?

Five additions to skills/subagent-driven-development/ on top of #1717:

  1. Conditional implementer tiering: cheapest tier when the plan carries complete code.
  2. Final whole-branch review pinned to the most capable model.
  3. A one-line narration recipe and a terse reviewer report contract.
  4. Two plan-mandated-defect rules: a reviewer tripwire ("that IS a finding — Important, labeled plan-mandated; the human decides") and a controller escalation rule ("present the finding and the plan text, ask which governs").
  5. A Pre-Flight Plan Review section: scan once before Task 1, batch all plan conflicts into one question.

Plus the evals-submodule bump making the planted-defect scenario escalation-aware.
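For readers who think better in code, the tiering, pinned-review, and pre-flight additions described above might be sketched roughly as follows. This is an illustration only: the skill itself is prose instructions, and every name here (`Task`, `plan_has_complete_code`, the tier labels, the `DEFAULT` fallback for tasks without complete code) is a hypothetical stand-in, not something defined in the PR.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tier labels; the PR does not map tiers to concrete model IDs.
CHEAPEST = "cheapest"
DEFAULT = "default"            # assumed fallback when the plan lacks complete code
MOST_CAPABLE = "most-capable"


@dataclass
class Task:
    title: str
    plan_has_complete_code: bool  # assumed flag: the plan spells out the full code


def implementer_tier(task: Task) -> str:
    """Conditional tiering: cheapest tier only when the plan carries complete code."""
    return CHEAPEST if task.plan_has_complete_code else DEFAULT


def final_review_tier() -> str:
    """The final whole-branch review is pinned to the most capable model."""
    return MOST_CAPABLE


def preflight_question(conflicts: list[str]) -> Optional[str]:
    """Batch every plan conflict found before Task 1 into a single question."""
    if not conflicts:
        return None  # nothing to ask; dispatch can proceed
    numbered = [f"{i}. {c}" for i, c in enumerate(conflicts, start=1)]
    header = "Before Task 1: which governs for each of these plan conflicts?"
    return "\n".join([header, *numbered])
```

The point of `preflight_question` returning a single string is the batching: however many conflicts the scan finds, the human is asked once, at kickoff.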

Is this change appropriate for the core library?

Yes — all changes are to the general-purpose SDD skill and benefit any project type. No third-party dependencies, no domain-specific content.

What alternatives did you consider?

  • Cheaper controller (sonnet) instead of cheaper subagents: died at its quality gates — explicit escalation held 5/5 but planted-defect adjudication collapsed into plan-advocacy 4/5 (E34).
  • Haiku task reviewers: dead — 0/10 planted defects flagged at correct severity; haiku advocates for defects (batch D).
  • Reviewer template as a read-once file (kill paraphrase drift): failed its gate — 0/3 first-pass catches vs 3/5 inline; read-once dilutes (E33).
  • Thinking caps for cost: backfire — raise the turn floor, output tokens up ~80% (E06).
  • Tripwire placement within the dispatch (Calibration section vs inside the constraints block): refuted as the variable — the live failure is attention decay across the reviewer's tool reads, which is why the pre-flight rule (controller's own context, no dispatch channel) is the load-bearing fix (E35/E36/E37).

Does this PR contain multiple unrelated changes?

The commits are one coherent layer: the cost stack and the judgment rules were co-developed against the same eval battery, and the pre-flight rule exists because the L2b battery exposed the dispatch-transmission gap. Splitting them would ship a cost stack whose planted-defect gate we know is weaker without the judgment rules.

Existing PRs

Environment tested

| Harness | Harness version | Model | Model version/ID |
| --- | --- | --- | --- |
| Claude Code | 2.1.173 | Opus (controller + subagents) | claude-opus-4-8 |
| Claude Code | 2.1.173 | Sonnet (controller) / mixed subagents | claude-sonnet-4-6 |

New harness support

N/A — no new harness.

Evaluation

  • Initial prompt: quorum-driven eval sessions ("execute docs/superpowers/plans/report-plan.md with subagent-driven-development" and the fractals/svelte build scenarios), run via the superpowers-evals harness.
  • Sessions after the change: 30+ full quorum runs plus ~90 micro-test samples across the campaign; for the newest layer specifically: 9 full runs (L2b battery) + 5 full runs (E37) + 48 micro samples, every automated score manually verified.
  • Outcomes before → after:
    • Cost: fractals $16.07 baseline → $6.24–6.60 on this stack; svelte $20.98 → $10.59–11.25.
    • Plan-mandated defect, opus: reviewer praised it before; now 2/2 runs escalate to the human with the rule quoted, sanctioned fix lands.
    • Pre-flight: without the rule, 0/12 micro samples and 0/5 prior full runs asked before dispatching; with it, 12/12 micro and 5/5 full sonnet runs ask before dispatch #0 (zero variance), including one run batching both planted conflicts into a single question.
    • Escalation sanity (plan self-contradiction scenario): 2/2 pass with the new rules — the working behavior is preserved, and the catch moves from mid-plan to time zero.
    • Known limits, stated plainly: sonnet-controller planted-defect remains 1/5 at the per-task reviewer gate (the pre-flight rule is what rescues the run outcome); the gauntlet judge is documented as unreliable on this scenario (PRI-2160); the benign-plan false-positive rate for the pre-flight scan is unmeasured at large-plan scale.
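As a quick check on the cost line above, the before/after dollar figures can be turned into percentage reductions. The sketch below uses only the numbers reported in the outcomes; the function and variable names are illustrative.

```python
# Costs are the before/after figures reported in the outcomes above (USD per run).
RUNS = {
    "fractals": (16.07, (6.24, 6.60)),
    "svelte": (20.98, (10.59, 11.25)),
}


def reduction_pct(baseline: float, after: float) -> float:
    """Percentage cost reduction relative to the baseline, rounded to one decimal."""
    return round(100 * (baseline - after) / baseline, 1)


def reduction_range(name: str) -> tuple[float, float]:
    """Return (smallest, largest) reduction for a scenario's post-change cost range."""
    baseline, (low_cost, high_cost) = RUNS[name]
    # The higher post-change cost gives the smaller reduction, so it comes first.
    return reduction_pct(baseline, high_cost), reduction_pct(baseline, low_cost)


if __name__ == "__main__":
    for name in RUNS:
        smallest, largest = reduction_range(name)
        print(f"{name}: {smallest}% to {largest}% cheaper")
```

On the reported figures this works out to roughly 59 to 61 percent cheaper for fractals and roughly 46 to 50 percent cheaper for svelte.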

Rigor

  • If this is a skills change: I used superpowers:writing-skills methodology (micro-tests with no-guidance controls before full runs, pre-registered predictions, negative results logged with equal billing)
  • This change was tested adversarially, not just on the happy path (planted-defect fixtures, plan self-contradiction fixture, frozen-input replays)
  • I did not modify carefully-tuned content (Red Flags table, rationalizations, "human partner" language) without extensive evals showing the change is an improvement

Human review

  • A human has reviewed the COMPLETE proposed diff before submission — @obra directed and approved each commit in-session; rule texts were reviewed verbatim before pushing

obra and others added 6 commits June 11, 2026 13:17
Reviewer tripwire (Calibration): a plan-mandated defect IS a finding,
reported as Important and labeled plan-mandated — the plan's authorship
does not grade its own work.

Controller rule (review loop): a plan-mandated finding, or any finding
conflicting with the plan's text, escalates to the human like any plan
contradiction — never dismissed because the plan mandates it.

E35 micro (frozen 0a98 replay, sonnet reviewer, 6v6): without the
tripwire 0/6 reports give the controller anything to escalate on (all
Approved, defect endorsed as spec-required); with it 6/6 report the
defect as a labeled finding.
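
Read as a decision rule, the two rules in this commit message might look something like the sketch below. It is a hedged illustration, not the skill's implementation: the rules are prose in the skill files, and the `Finding` shape, its field names, and the message wording are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    summary: str
    severity: str                      # e.g. "Important"
    plan_mandated: bool = False        # reviewer tripwire: the plan required the defect
    conflicts_with_plan: bool = False  # any other conflict with the plan's text


def must_escalate(finding: Finding) -> bool:
    """Controller rule: plan-mandated or plan-conflicting findings go to the human."""
    return finding.plan_mandated or finding.conflicts_with_plan


def escalation_message(finding: Finding, plan_text: str) -> str:
    """Present the finding and the plan text, then ask which governs."""
    label = "plan-mandated" if finding.plan_mandated else "conflicts with plan"
    return (
        f"Finding ({finding.severity}, {label}): {finding.summary}\n"
        f"Plan text: {plan_text}\n"
        "Which governs?"
    )
```

Under this reading, the planted fixture (a test named "renders correctly" that asserts nothing) would arrive as `Finding(..., severity="Important", plan_mandated=True)` and always reach the human, never being dismissed because the plan mandates it.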