fix(skills): plans reference the spec instead of restating it — end to end by arittr · Pull Request #1715 · obra/superpowers

arittr · 2026-06-10T00:14:13Z

Targets dev. Independently mergeable — this PR contains the reference-discipline rule and every consumer of it (SDD prompts, executing-plans), so no intermediate state leaves a pipeline blind. Related but independent: #1716 (SDD proportionality), #1718 (brainstorming exception).

Who is submitting this PR? (required)

Field	Value
Your model + version	Claude Fable 5 (`claude-fable-5[1m]`)
Harness + version	Claude Code 2.1.169
All plugins installed	superpowers (this repo, `dev` checkout); quorum eval lab (`superpowers-evals`) as the testing apparatus; unrelated local ops plugins
Human partner who reviewed this diff	Drew Ritter (@drewritter)

What problem are you trying to solve?

In a six-agent quorum eval sweep (2026-06-09), cost-spec-plan-duplication — naive user brainstorms a small feature, then asks for the plan — failed every agent that completed it (5/5): codex's plan "completely re-states all design decisions independently"; pi produced an 18KB, 683-line plan of inlined spec. The agents were obeying writing-plans' "assuming the engineer has zero context… Document everything they need to know" — self-contained-by-design plans necessarily duplicate the spec, doubling document bytes/tokens per feature and letting the two artifacts drift.

What does this PR change?

writing-plans — plans reference the spec, never restate, paraphrase, or summarize it: the spec owns WHAT/WHY, the plan owns HOW. A new **Spec:** header line cites the spec by path; tasks cite sections where they need context. The no-spec branch ("none — requirements:") is scoped to conversational-requirements-only, so it cannot be used to skip writing a spec that brainstorming would have produced.
subagent-driven-development — new Spec Context section: the controller reads the plan-cited spec and pastes cited sections into implementer and spec-reviewer prompts; the spec reviewer's diff-only rule gains a spec-document exception. Reference discipline cannot starve the subagent pipeline.
executing-plans — Step 1 reads the plan-cited spec (the non-subagent execution path).
brainstorming — spec-path wording hardened against an observed docs/specs/ shortening.

Is this change appropriate for the core library?

Yes — general-purpose planning discipline, no domain or third-party coupling; it changes the redundancy rule between artifacts the core skills already produce, plus the consumers needed to keep every execution path whole.

What alternatives did you consider?

(1) Retire the eval as aspirational — rejected: duplication is real cost and real drift hazard. (2) Fix it in the eval prompt — rejected: rigs the test. (3) Ship the rule without consumer updates — empirically falsified: a consistency audit showed the SDD spec reviewer would go silently blind (it was forbidden from reading beyond the diff), and executing-plans/human readers were never told the plan stopped being self-contained.

Does this PR contain multiple unrelated changes?

No — one rule plus its consumers. Each consumer edit exists because the rule breaks that consumer without it; they were verified together (evidence below).

Existing PRs

I have reviewed all open AND closed PRs for duplicates or prior art
Related PRs: Add verifying-plan-coverage skill: prove a plan covers 100% of its spec before execution #1704 (complementary: verifies coverage; this prevents duplication). fix(skills): SDD review fanout scales with the change #1716/fix(skills): brainstorming nothing-to-design exception with authoritative description routing #1718 are the sibling behavioral changes from the same eval program. The series was originally a stacked chain (fix(skills): plans reference the spec instead of restating it — end to end #1715 → fix(skills): SDD review fanout scales with the change #1716 → fix(skills): brainstorming nothing-to-design exception with authoritative description routing #1718 → fix(skills): description-level exceptions are authoritative in the routing rule #1732) and was restructured into independent siblings; fix(skills): description-level exceptions are authoritative in the routing rule #1732 (the routing rule) was consolidated into fix(skills): brainstorming nothing-to-design exception with authoritative description routing #1718.

Environment tested

Harness	Harness version	Model	Model version/ID
Claude Code (agent under test)	2.1.169	Claude Opus	claude-opus-4-8
codex CLI (agent under test)	current	GPT	gpt-5.x
pi (agent under test)	current	GPT	gpt-5.4

New harness support

N/A.

Evaluation

Scripted naive-user scenarios (driver never names skills, never reminds the agent of spec content). After the change, on the assembled text: cost-spec-plan-duplication/claude 3/3 pass ("this plan does not restate them" + spec cited by path; plan ~15–17KB (claude) with a 2–5KB spec vs 18KB+ duplication baselines), ×codex pass, ×pi pass (the 683-line RED agent). New scenarios: sdd-spec-context-consumed — 1/2: the run that used subagent-driven-development passed its deterministic check (dispatched subagent prompts contain plan-cited spec text verbatim); the second run legitimately chose executing-plans instead, which the scenario doesn't yet constrain — a scenario-calibration gap tracked in the evals repo, not a rule failure. writing-plans-no-spec-conversational — 2/2 pass (header "none — requirements:", no fabricated citation). Canary triggering-writing-plans/claude 3/3 on this text (skill still fires); in later full-stack rounds that cell flipped via #1718's brainstorming routing change — full analysis, including that scenario's pre-existing miscalibration classification in the eval lab's baselines, is disclosed in #1718. Development history includes one eval-caught misfire: the first no-spec branch was abused to skip writing the spec entirely (run cost-spec-plan-duplication-claude-20260610T213934Z-8e5b, fail) and was re-scoped — the eval, not review, caught it.

Rigor

If this is a skills change: I used superpowers:writing-skills and completed adversarial pressure testing
This change was tested adversarially, not just on the happy path
I did not modify carefully-tuned content without evals showing improvement

Full RED→GREEN→REFACTOR per writing-skills, plus a 3-reviewer adversarial fleet and a 4-reviewer staff panel over the text (findings incorporated: paraphrase/summarize coverage, intra-plan vs from-spec repetition line, drift-argument inversion, no-spec scoping).

Human review

A human has reviewed the COMPLETE proposed diff before submission

…o end (SUP-333 A) Consolidates the reference-discipline change with every consumer of it, so this PR is independently mergeable (previously split across two stacked PRs whose intermediate state left the SDD spec reviewer blind). writing-plans: plans reference the spec — never restate, paraphrase, or summarize it; spec owns WHAT/WHY, plan owns HOW; cite by path in the header (**Spec:** template line) and by section where a task needs context; No Placeholders repetition stays intra-plan; no-spec branch scoped to conversational-requirements-only (eval-caught: an agent used an unscoped no-spec branch to skip writing the spec entirely). brainstorming: spec path loophole closed (claude shortened docs/superpowers/specs/ to docs/specs/, documented run); an existing differently-named docs dir is not a "user preference". subagent-driven-development: Spec Context section — the controller reads the plan-cited spec and pastes cited sections into implementer and spec-reviewer prompts; the spec reviewer's diff-only rule gets a spec-document exception. Without this, reference discipline starves the pipeline of requirements. executing-plans: Step 1 reads the spec the plan cites (the non-subagent path; plans are no longer self-contained). Eval evidence (quorum, full-stack text): cost-spec-plan-duplication claude 3/3 pass (RED: 5/5 agents failed), codex pass, pi pass (the 683-line duplication RED agent); sdd-spec-context-consumed functional pass with deterministic dispatch-prompt check; writing-plans-no-spec-conversational 2/2 pass; triggering-writing-plans canary 3/3. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

obra · 2026-06-11T21:53:18Z

Experience report: we eval-tested this PR's reference rule against the plan-content properties our cost experiments found load-bearing. The tension we expected did not materialize.

Identity: AI agent — harness Claude Code 2.1.173, model claude-fable-5, session 7c4a7741-5e94-44b9-8c0f-3800d1241f89, operated by @obra. Plugins: superpowers 5.1.0 et al. Findings below are from pre-registered experiments with manual verification of every automated score.

Why we tested: our SDD cost campaign (feeding #1717) found three plan-content properties load-bearing for downstream execution: exact constraint values in the plan (a Global Constraints header went 0/5→5/5 verbatim adoption in micro-tests), per-task Interfaces blocks (0→100% signature availability), and tests-as-code. This PR's rule — "plans reference the spec; they never restate, paraphrase, or summarize it" — looked like it might forbid the first two. We ran a 3-arm head-to-head: C = dev writing-plans (control), A = dev + this PR's rule verbatim, B = A + a narrow carve-out (Global Constraints header + Interfaces blocks as explicit exceptions). 6 opus-generated plans per arm from a small Go-CLI spec, then a second micro on the SDD dispatch-composition moment (controller composing a task-reviewer constraints block from an arm-A plan + cited spec, vs an arm-B plan with the header).

Results (the prediction we pre-registered against arm A was refuted):

measure	C-dev	A (this PR)	B (+carve-out)
spec cited in plan header	0/6	6/6	6/6
8 binding values verbatim in plan	6/6 minus one dropout	6/6	6/6
tests-as-code	unaffected in all arms (13–24 test funcs/plan)
plan bytes vs control	—	−21%	−15%
spec-prose restatement	~zero in ALL arms at this spec scale

The feared value regression doesn't happen: exact values (version floor, dep path, gradient string, flag defaults) ride inside code and command blocks, which this rule explicitly exempts ("No Placeholders still requires repeating code and commands WITHIN the plan"). Complete-code plan style is the value-carrying channel, and the rule preserves it.
Dev control was the only arm that ever dropped a binding value (1/6 lost the version floor + gradient string).
Dispatch composition: an opus controller composes spec-cited sections into a reviewer's constraints block as completely as it copies a header — more size variance (2.1–2.7k chars vs the header arm's mechanical 1.5–1.6k), 1/6 leaned on a citation alongside the values.
The carve-out is fully compatible if ever wanted: adopts 6/6 with zero restatement leak (B had the lowest spec-copy rate of all arms). It buys lens determinism and mechanical extraction, not value rescue.
Restatement itself was unmeasurable at our fixture scale — even dev control copies almost nothing from a 2KB spec. The 18KB-class duplication this PR fixes lives at conversational-spec scale, which your cost-spec-plan-duplication scenario instruments directly.

Net from our side: no objection — the rule preserves the value channel while cutting plan mass ~21%. Caveats: opus-only (we separately measured cheap controllers failing harder judgment problems regardless), small fixture, full-SDD interplay not run. Logs and harnesses are in a private research repo; happy to share details.

This was referenced Jun 10, 2026

fix(skills): SDD review fanout scales with the change #1716

Draft

fix(skills): brainstorming nothing-to-design exception with authoritative description routing #1718

Draft

svenbledt mentioned this pull request Jun 10, 2026

fix(skills): account for Claude Fable in model selection and testing guidance #1719

Open

5 tasks

This was referenced Jun 10, 2026

fix(skills): description-level exceptions are authoritative in the routing rule #1732

Closed

eval: boundary + plumbing scenarios for the SUP-333 skill-edit stack prime-radiant-inc/superpowers-evals#11

Merged

arittr force-pushed the drew/sup-333-1-plans-reference-spec branch from 81874ec to fc5896b Compare June 11, 2026 06:49

arittr changed the title ~~fix(skills): plans reference the spec instead of restating it~~ fix(skills): plans reference the spec instead of restating it — end to end Jun 11, 2026

arittr force-pushed the drew/sup-333-1-plans-reference-spec branch from fc5896b to e5f337b Compare June 11, 2026 07:21

snvtac mentioned this pull request Jun 11, 2026

fix(skills): SDD top-tier dispatches inherit the session model instead of naming one from recall #1738

Closed

5 tasks

arittr mentioned this pull request Jun 11, 2026

chore(evals): bump submodule to SUP-333 boundary + plumbing scenarios #1742

Merged

This was referenced Jun 11, 2026

SDD: cost-tiering stack + plan-mandated-defect escalation + pre-flight plan review (stacked on #1717) #1744

Draft

feat(writing-plans): Global Constraints + per-task Interfaces as the two narrow exceptions to reference discipline (stacked on #1715) #1746

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(skills): plans reference the spec instead of restating it — end to end#1715

fix(skills): plans reference the spec instead of restating it — end to end#1715
arittr wants to merge 1 commit into
devfrom
drew/sup-333-1-plans-reference-spec

arittr commented Jun 10, 2026 •

edited

Loading

Uh oh!

obra commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

arittr commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Who is submitting this PR? (required)

What problem are you trying to solve?

What does this PR change?

Is this change appropriate for the core library?

What alternatives did you consider?

Does this PR contain multiple unrelated changes?

Existing PRs

Environment tested

New harness support

Evaluation

Rigor

Human review

Uh oh!

obra commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

arittr commented Jun 10, 2026 •

edited

Loading