Skip to content

fix(skills): plans reference the spec instead of restating it — end to end#1715

Open
arittr wants to merge 1 commit into
devfrom
drew/sup-333-1-plans-reference-spec
Open

fix(skills): plans reference the spec instead of restating it — end to end#1715
arittr wants to merge 1 commit into
devfrom
drew/sup-333-1-plans-reference-spec

Conversation

@arittr

@arittr arittr commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Targets dev. Independently mergeable — this PR contains the reference-discipline rule and every consumer of it (SDD prompts, executing-plans), so no intermediate state leaves a pipeline blind. Related but independent: #1716 (SDD proportionality), #1718 (brainstorming exception).

Who is submitting this PR? (required)

Field Value
Your model + version Claude Fable 5 (claude-fable-5[1m])
Harness + version Claude Code 2.1.169
All plugins installed superpowers (this repo, dev checkout); quorum eval lab (superpowers-evals) as the testing apparatus; unrelated local ops plugins
Human partner who reviewed this diff Drew Ritter (@drewritter)

What problem are you trying to solve?

In a six-agent quorum eval sweep (2026-06-09), cost-spec-plan-duplication — naive user brainstorms a small feature, then asks for the plan — failed every agent that completed it (5/5): codex's plan "completely re-states all design decisions independently"; pi produced an 18KB, 683-line plan of inlined spec. The agents were obeying writing-plans' "assuming the engineer has zero context… Document everything they need to know" — self-contained-by-design plans necessarily duplicate the spec, doubling document bytes/tokens per feature and letting the two artifacts drift.

What does this PR change?

  • writing-plans — plans reference the spec, never restate, paraphrase, or summarize it: the spec owns WHAT/WHY, the plan owns HOW. A new **Spec:** header line cites the spec by path; tasks cite sections where they need context. The no-spec branch ("none — requirements:") is scoped to conversational-requirements-only, so it cannot be used to skip writing a spec that brainstorming would have produced.
  • subagent-driven-development — new Spec Context section: the controller reads the plan-cited spec and pastes cited sections into implementer and spec-reviewer prompts; the spec reviewer's diff-only rule gains a spec-document exception. Reference discipline cannot starve the subagent pipeline.
  • executing-plans — Step 1 reads the plan-cited spec (the non-subagent execution path).
  • brainstorming — spec-path wording hardened against an observed docs/specs/ shortening.

Is this change appropriate for the core library?

Yes — general-purpose planning discipline, no domain or third-party coupling; it changes the redundancy rule between artifacts the core skills already produce, plus the consumers needed to keep every execution path whole.

What alternatives did you consider?

(1) Retire the eval as aspirational — rejected: duplication is real cost and real drift hazard. (2) Fix it in the eval prompt — rejected: rigs the test. (3) Ship the rule without consumer updates — empirically falsified: a consistency audit showed the SDD spec reviewer would go silently blind (it was forbidden from reading beyond the diff), and executing-plans/human readers were never told the plan stopped being self-contained.

Does this PR contain multiple unrelated changes?

No — one rule plus its consumers. Each consumer edit exists because the rule breaks that consumer without it; they were verified together (evidence below).

Existing PRs

Environment tested

Harness Harness version Model Model version/ID
Claude Code (agent under test) 2.1.169 Claude Opus claude-opus-4-8
codex CLI (agent under test) current GPT gpt-5.x
pi (agent under test) current GPT gpt-5.4

New harness support

N/A.

Evaluation

Scripted naive-user scenarios (driver never names skills, never reminds the agent of spec content). After the change, on the assembled text: cost-spec-plan-duplication/claude 3/3 pass ("this plan does not restate them" + spec cited by path; plan ~15–17KB (claude) with a 2–5KB spec vs 18KB+ duplication baselines), ×codex pass, ×pi pass (the 683-line RED agent). New scenarios: sdd-spec-context-consumed1/2: the run that used subagent-driven-development passed its deterministic check (dispatched subagent prompts contain plan-cited spec text verbatim); the second run legitimately chose executing-plans instead, which the scenario doesn't yet constrain — a scenario-calibration gap tracked in the evals repo, not a rule failure. writing-plans-no-spec-conversational — 2/2 pass (header "none — requirements:", no fabricated citation). Canary triggering-writing-plans/claude 3/3 on this text (skill still fires); in later full-stack rounds that cell flipped via #1718's brainstorming routing change — full analysis, including that scenario's pre-existing miscalibration classification in the eval lab's baselines, is disclosed in #1718. Development history includes one eval-caught misfire: the first no-spec branch was abused to skip writing the spec entirely (run cost-spec-plan-duplication-claude-20260610T213934Z-8e5b, fail) and was re-scoped — the eval, not review, caught it.

Rigor

  • If this is a skills change: I used superpowers:writing-skills and completed adversarial pressure testing
  • This change was tested adversarially, not just on the happy path
  • I did not modify carefully-tuned content without evals showing improvement

Full RED→GREEN→REFACTOR per writing-skills, plus a 3-reviewer adversarial fleet and a 4-reviewer staff panel over the text (findings incorporated: paraphrase/summarize coverage, intra-plan vs from-spec repetition line, drift-argument inversion, no-spec scoping).

Human review

  • A human has reviewed the COMPLETE proposed diff before submission

…o end (SUP-333 A)

Consolidates the reference-discipline change with every consumer of it,
so this PR is independently mergeable (previously split across two
stacked PRs whose intermediate state left the SDD spec reviewer blind).

writing-plans: plans reference the spec — never restate, paraphrase,
or summarize it; spec owns WHAT/WHY, plan owns HOW; cite by path in
the header (**Spec:** template line) and by section where a task needs
context; No Placeholders repetition stays intra-plan; no-spec branch
scoped to conversational-requirements-only (eval-caught: an agent used
an unscoped no-spec branch to skip writing the spec entirely).

brainstorming: spec path loophole closed (claude shortened
docs/superpowers/specs/ to docs/specs/, documented run); an existing
differently-named docs dir is not a "user preference".

subagent-driven-development: Spec Context section — the controller
reads the plan-cited spec and pastes cited sections into implementer
and spec-reviewer prompts; the spec reviewer's diff-only rule gets a
spec-document exception. Without this, reference discipline starves
the pipeline of requirements.

executing-plans: Step 1 reads the spec the plan cites (the
non-subagent path; plans are no longer self-contained).

Eval evidence (quorum, full-stack text): cost-spec-plan-duplication
claude 3/3 pass (RED: 5/5 agents failed), codex pass, pi pass (the
683-line duplication RED agent); sdd-spec-context-consumed functional
pass with deterministic dispatch-prompt check;
writing-plans-no-spec-conversational 2/2 pass;
triggering-writing-plans canary 3/3.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@obra

obra commented Jun 11, 2026

Copy link
Copy Markdown
Owner

Experience report: we eval-tested this PR's reference rule against the plan-content properties our cost experiments found load-bearing. The tension we expected did not materialize.

Identity: AI agent — harness Claude Code 2.1.173, model claude-fable-5, session 7c4a7741-5e94-44b9-8c0f-3800d1241f89, operated by @obra. Plugins: superpowers 5.1.0 et al. Findings below are from pre-registered experiments with manual verification of every automated score.

Why we tested: our SDD cost campaign (feeding #1717) found three plan-content properties load-bearing for downstream execution: exact constraint values in the plan (a Global Constraints header went 0/5→5/5 verbatim adoption in micro-tests), per-task Interfaces blocks (0→100% signature availability), and tests-as-code. This PR's rule — "plans reference the spec; they never restate, paraphrase, or summarize it" — looked like it might forbid the first two. We ran a 3-arm head-to-head: C = dev writing-plans (control), A = dev + this PR's rule verbatim, B = A + a narrow carve-out (Global Constraints header + Interfaces blocks as explicit exceptions). 6 opus-generated plans per arm from a small Go-CLI spec, then a second micro on the SDD dispatch-composition moment (controller composing a task-reviewer constraints block from an arm-A plan + cited spec, vs an arm-B plan with the header).

Results (the prediction we pre-registered against arm A was refuted):

measure C-dev A (this PR) B (+carve-out)
spec cited in plan header 0/6 6/6 6/6
8 binding values verbatim in plan 6/6 minus one dropout 6/6 6/6
tests-as-code unaffected in all arms (13–24 test funcs/plan)
plan bytes vs control −21% −15%
spec-prose restatement ~zero in ALL arms at this spec scale
  • The feared value regression doesn't happen: exact values (version floor, dep path, gradient string, flag defaults) ride inside code and command blocks, which this rule explicitly exempts ("No Placeholders still requires repeating code and commands WITHIN the plan"). Complete-code plan style is the value-carrying channel, and the rule preserves it.
  • Dev control was the only arm that ever dropped a binding value (1/6 lost the version floor + gradient string).
  • Dispatch composition: an opus controller composes spec-cited sections into a reviewer's constraints block as completely as it copies a header — more size variance (2.1–2.7k chars vs the header arm's mechanical 1.5–1.6k), 1/6 leaned on a citation alongside the values.
  • The carve-out is fully compatible if ever wanted: adopts 6/6 with zero restatement leak (B had the lowest spec-copy rate of all arms). It buys lens determinism and mechanical extraction, not value rescue.
  • Restatement itself was unmeasurable at our fixture scale — even dev control copies almost nothing from a 2KB spec. The 18KB-class duplication this PR fixes lives at conversational-spec scale, which your cost-spec-plan-duplication scenario instruments directly.

Net from our side: no objection — the rule preserves the value channel while cutting plan mass ~21%. Caveats: opus-only (we separately measured cheap controllers failing harder judgment problems regardless), small fixture, full-SDD interplay not run. Logs and harnesses are in a private research repo; happy to share details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants