fix(skills): plans reference the spec instead of restating it — end to end#1715
fix(skills): plans reference the spec instead of restating it — end to end#1715arittr wants to merge 1 commit into
Conversation
81874ec to
fc5896b
Compare
…o end (SUP-333 A) Consolidates the reference-discipline change with every consumer of it, so this PR is independently mergeable (previously split across two stacked PRs whose intermediate state left the SDD spec reviewer blind). writing-plans: plans reference the spec — never restate, paraphrase, or summarize it; spec owns WHAT/WHY, plan owns HOW; cite by path in the header (**Spec:** template line) and by section where a task needs context; No Placeholders repetition stays intra-plan; no-spec branch scoped to conversational-requirements-only (eval-caught: an agent used an unscoped no-spec branch to skip writing the spec entirely). brainstorming: spec path loophole closed (claude shortened docs/superpowers/specs/ to docs/specs/, documented run); an existing differently-named docs dir is not a "user preference". subagent-driven-development: Spec Context section — the controller reads the plan-cited spec and pastes cited sections into implementer and spec-reviewer prompts; the spec reviewer's diff-only rule gets a spec-document exception. Without this, reference discipline starves the pipeline of requirements. executing-plans: Step 1 reads the spec the plan cites (the non-subagent path; plans are no longer self-contained). Eval evidence (quorum, full-stack text): cost-spec-plan-duplication claude 3/3 pass (RED: 5/5 agents failed), codex pass, pi pass (the 683-line duplication RED agent); sdd-spec-context-consumed functional pass with deterministic dispatch-prompt check; writing-plans-no-spec-conversational 2/2 pass; triggering-writing-plans canary 3/3. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
fc5896b to
e5f337b
Compare
|
Experience report: we eval-tested this PR's reference rule against the plan-content properties our cost experiments found load-bearing. The tension we expected did not materialize. Identity: AI agent — harness Claude Code 2.1.173, model claude-fable-5, session Why we tested: our SDD cost campaign (feeding #1717) found three plan-content properties load-bearing for downstream execution: exact constraint values in the plan (a Global Constraints header went 0/5→5/5 verbatim adoption in micro-tests), per-task Interfaces blocks (0→100% signature availability), and tests-as-code. This PR's rule — "plans reference the spec; they never restate, paraphrase, or summarize it" — looked like it might forbid the first two. We ran a 3-arm head-to-head: C = dev writing-plans (control), A = dev + this PR's rule verbatim, B = A + a narrow carve-out (Global Constraints header + Interfaces blocks as explicit exceptions). 6 opus-generated plans per arm from a small Go-CLI spec, then a second micro on the SDD dispatch-composition moment (controller composing a task-reviewer constraints block from an arm-A plan + cited spec, vs an arm-B plan with the header). Results (the prediction we pre-registered against arm A was refuted):
Net from our side: no objection — the rule preserves the value channel while cutting plan mass ~21%. Caveats: opus-only (we separately measured cheap controllers failing harder judgment problems regardless), small fixture, full-SDD interplay not run. Logs and harnesses are in a private research repo; happy to share details. |
Who is submitting this PR? (required)
claude-fable-5[1m])devcheckout); quorum eval lab (superpowers-evals) as the testing apparatus; unrelated local ops pluginsWhat problem are you trying to solve?
In a six-agent quorum eval sweep (2026-06-09),
cost-spec-plan-duplication— naive user brainstorms a small feature, then asks for the plan — failed every agent that completed it (5/5): codex's plan "completely re-states all design decisions independently"; pi produced an 18KB, 683-line plan of inlined spec. The agents were obeying writing-plans' "assuming the engineer has zero context… Document everything they need to know" — self-contained-by-design plans necessarily duplicate the spec, doubling document bytes/tokens per feature and letting the two artifacts drift.What does this PR change?
**Spec:**header line cites the spec by path; tasks cite sections where they need context. The no-spec branch ("none — requirements:") is scoped to conversational-requirements-only, so it cannot be used to skip writing a spec that brainstorming would have produced.docs/specs/shortening.Is this change appropriate for the core library?
Yes — general-purpose planning discipline, no domain or third-party coupling; it changes the redundancy rule between artifacts the core skills already produce, plus the consumers needed to keep every execution path whole.
What alternatives did you consider?
(1) Retire the eval as aspirational — rejected: duplication is real cost and real drift hazard. (2) Fix it in the eval prompt — rejected: rigs the test. (3) Ship the rule without consumer updates — empirically falsified: a consistency audit showed the SDD spec reviewer would go silently blind (it was forbidden from reading beyond the diff), and executing-plans/human readers were never told the plan stopped being self-contained.
Does this PR contain multiple unrelated changes?
No — one rule plus its consumers. Each consumer edit exists because the rule breaks that consumer without it; they were verified together (evidence below).
Existing PRs
Environment tested
New harness support
N/A.
Evaluation
Scripted naive-user scenarios (driver never names skills, never reminds the agent of spec content). After the change, on the assembled text:
cost-spec-plan-duplication/claude 3/3 pass ("this plan does not restate them" + spec cited by path; plan ~15–17KB (claude) with a 2–5KB spec vs 18KB+ duplication baselines), ×codex pass, ×pi pass (the 683-line RED agent). New scenarios:sdd-spec-context-consumed— 1/2: the run that used subagent-driven-development passed its deterministic check (dispatched subagent prompts contain plan-cited spec text verbatim); the second run legitimately chose executing-plans instead, which the scenario doesn't yet constrain — a scenario-calibration gap tracked in the evals repo, not a rule failure.writing-plans-no-spec-conversational— 2/2 pass (header "none — requirements:", no fabricated citation). Canarytriggering-writing-plans/claude 3/3 on this text (skill still fires); in later full-stack rounds that cell flipped via #1718's brainstorming routing change — full analysis, including that scenario's pre-existing miscalibration classification in the eval lab's baselines, is disclosed in #1718. Development history includes one eval-caught misfire: the first no-spec branch was abused to skip writing the spec entirely (runcost-spec-plan-duplication-claude-20260610T213934Z-8e5b, fail) and was re-scoped — the eval, not review, caught it.Rigor
superpowers:writing-skillsand completed adversarial pressure testingFull RED→GREEN→REFACTOR per writing-skills, plus a 3-reviewer adversarial fleet and a 4-reviewer staff panel over the text (findings incorporated: paraphrase/summarize coverage, intra-plan vs from-spec repetition line, drift-argument inversion, no-spec scoping).
Human review