fix(skills): plans reference the spec instead of restating it by arittr · Pull Request #1715 · obra/superpowers

arittr · 2026-06-10T00:14:13Z

Targets dev per the template requirement.

Who is submitting this PR? (required)

Field	Value
Your model + version	Claude Fable 5 (`claude-fable-5[1m]`)
Harness + version	Claude Code 2.1.169
All plugins installed	superpowers (this repo, `dev` checkout); quorum eval lab (`superpowers-evals`) as the testing apparatus; unrelated local ops plugins (decision-log, episodic-memory, superpowers-chrome, primeradiant-ops)
Human partner who reviewed this diff	Drew Ritter (@drewritter)

What problem are you trying to solve?

In a six-agent behavioral eval sweep (quorum/Gauntlet lab, 2026-06-09; agents: claude, codex, antigravity, kimi, opencode/gpt-5.5, pi), the cost-spec-plan-duplication scenario — naive user brainstorms a small feature, then says "Looks good, please write the plan" — failed every agent that completed it (5/5). Each plan restated the entire spec inline instead of referencing it:

codex: plan "completely re-states all design decisions independently"
opencode: "contains zero references to the spec… re-derives and duplicates all design decisions from scratch"
pi: 18,213-byte, 683-line plan of inlined spec content

The agents were obeying the skill: writing-plans' overview says "assuming the engineer has zero context… Document everything they need to know." Self-contained-by-design plans necessarily duplicate the spec — doubling document bytes/tokens and letting the two artifacts drift apart. The eval and the skill text were in direct contradiction; no agent could satisfy both.

A second defect surfaced while verifying the fix: claude shortened the spec path docs/superpowers/specs/ to plain docs/specs/ in 2/2 runs (the path sits mid-checklist and loses to the model's docs/specs prior).

What does this PR change?

States the division of labor in writing-plans: the spec owns WHAT/WHY, the plan owns HOW — cite the spec by path, never restate it ("zero context" now explicitly means mechanically executable steps, not duplication), and adds a **Spec:** line to the plan header template. In brainstorming, closes the observed path loophole: both docs/superpowers/specs/ mentions now explicitly forbid shortening to docs/specs/.

Is this change appropriate for the core library?

Yes — general-purpose planning discipline. It applies to any project that produces a spec and a plan, is not tied to any domain, tool, or third-party service, and changes no workflow structure — only the redundancy rule between two artifacts the core skills already produce.

What alternatives did you consider?

Retire/re-tier the eval scenario as aspirational — rejected: the duplication is a real cost (doubled doc bytes/tokens per feature) and a real drift hazard, not a test calibration artifact.
Fix it in the eval prompt (tell the agent not to restate) — rejected: that rigs the test instead of fixing the product behavior; real users won't say it either.
Remove the "zero context" framing entirely — rejected: it earns its keep for execution detail (exact paths, complete code, exact commands). The fix scopes it rather than deletes it.

Does this PR contain multiple unrelated changes?

Two skills, one dependency: the brainstorming path fix was exposed by this change's own eval cycle (first post-edit run passed the reference criterion but failed the path criterion, 2/2 reproducible), and the same scenario's deterministic checks gate both (file-exists docs/superpowers/specs/*.md + the reference criterion). The GREEN evidence below requires the pair.

Existing PRs

I have reviewed all open AND closed PRs for duplicates or prior art
Related PRs: Add verifying-plan-coverage skill: prove a plan covers 100% of its spec before execution #1704 (verifying-plan-coverage — complementary: it verifies the plan covers the spec; this PR prevents the plan duplicating the spec. No overlap in edited content.)

Environment tested

Harness (e.g. Claude Code, Cursor)	Harness version	Model	Model version/ID
Claude Code (agent under test in eval)	2.1.169	Claude Opus	claude-opus-4-8

New harness support (required if this PR adds a new harness)

N/A — no harness changes.

Evaluation

Initial prompt: the scenario's scripted naive user: "I want to add a feature to mark habits as completed for the day. Help me think through this then plan it." → (after spec) "Looks good, please write the plan." The QA driver is scripted not to remind the agent of the spec's content.
Eval sessions after the change: 3 — (1) cost-spec-plan-duplication post-edit: reference criterion flipped to pass, exposed the path defect; (2) after the path fix: full pass (run cost-spec-plan-duplication-claude-20260609T234142Z-9625); (3) regression canary triggering-writing-plans: pass (skill still auto-fires; Skill(superpowers:writing-plans) in the session log).
Before/after: baseline 5/5 agents fail with inline-restated plans (e.g. 18KB/683 lines); after, the plan cites the spec by path, states "this plan does not restate them," and lands at 12.8KB with a 4.3KB spec — both in the prescribed docs/superpowers/ locations, all deterministic post-checks green.

Rigor

If this is a skills change: I used superpowers:writing-skills and completed adversarial pressure testing (paste results below)
This change was tested adversarially, not just on the happy path
I did not modify carefully-tuned content (Red Flags table, rationalizations, "human partner" language) without extensive evals showing the change is an improvement

RED→GREEN→REFACTOR per writing-skills: RED = 5-agent baseline failures with verbatim rationalizations (above) against the skill text unchanged since May 4; GREEN = minimal edit, scenario passes; REFACTOR = the path loophole found in GREEN-attempt-1 closed with explicit counters ("exactly this path — not docs/specs/") and re-verified. The pressure scenario is adversarial by construction: a scripted naive user who never names skills and never reminds the agent of spec content, graded by an independent QA agent plus deterministic filesystem checks.

Human review

A human has reviewed the COMPLETE proposed diff before submission

Hardening round (adversarial review fleet + eval re-verification)

Three parallel reviewers (red-team, cross-corpus consistency, evidence verification) audited the stack. Findings addressed here by two follow-up commits:

"never restate" did not cover paraphrase/summary — the actual baseline failure mode (pi's 683-line plan paraphrased). Now "never restate, paraphrase, or summarize."
The No Placeholders intra-plan repetition mandate gave a symmetric argument for re-inlining the spec; the rule now draws the line explicitly: repetition WITHIN the plan (code, commands) required, copying FROM the spec forbidden.
Drift argument was invertible ("snapshot to avoid drift"); now: snapshots hide drift.
The **Spec:** header gained a no-spec branch — which the eval immediately caught being abused: the agent skipped writing the spec entirely "to avoid duplication" (cost-spec-plan-duplication-claude-20260610T213934Z-8e5b, FAIL — a reviewer-recommended fix that misfired, caught only because every change re-runs the eval). Re-scoped: if brainstorming happened the spec exists and is cited; "none — requirements:" applies only when requirements arrived conversationally and no spec doc was ever produced. Re-run: clean pass (cost-spec-plan-duplication-claude-20260610T221053Z-bf8e — "The plan contains implementation HOW details… rather than restating the WHAT/WHY").
Brainstorming path bullet: an existing differently-named docs directory is not a "user preference" override.
Execution Handoff now forward-references SDD's Proportionality rule instead of promising unconditional two-stage review.

Final verification battery on the assembled 4-PR stack: 5/5 pass (cost-checkbox/claude, cost-spec-plan/claude, cost-trivial-task-review-fanout/opencode, sdd-rejects-extra-features/claude, triggering-writing-plans/claude).

…#1) writing-plans told agents to "document everything they need to know" assuming zero context — every agent in the 2026-06-09 six-agent quorum sweep obeyed and restated the entire spec inline in the plan (cost-spec-plan-duplication failed 5/5 completed agents; pi's plan was 683 lines of duplicated spec). - writing-plans: state the division of labor — spec owns WHAT/WHY, plan owns HOW; cite the spec by path/section, never restate it. "Zero context" means mechanically executable steps, not duplication. Add a **Spec:** line to the plan header template. - brainstorming: close the path loophole the re-run exposed — claude shortened docs/superpowers/specs/ to docs/specs/ in 2/2 runs; both path mentions now explicitly forbid the shortening. TDD evidence (quorum): - RED: batch-20260609T023452Z-68aa et al — 5/5 agents fail - GREEN: cost-spec-plan-duplication-claude-20260609T234142Z-9625 pass (plan: "this plan does not restate them" + spec cited by path; both docs in docs/superpowers/) - Canary: triggering-writing-plans-claude pass (skill still fires) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

… reference rule Adversarial review findings (C1, C2, C3, C5, A8, F3): - "never restate" did not cover paraphrase/summary — the actual failure mode in the RED evidence; now "never restate, paraphrase, or summarize". - The No Placeholders intra-plan repetition mandate gave a symmetric argument for re-inlining the spec; the rule now draws the line: repetition WITHIN the plan is required, copying FROM the spec is not. - Drift argument was invertible ("snapshot to avoid drift"); now states snapshots hide drift. - **Spec:** header gets a no-spec branch (state requirements once in the header, not per task) instead of inviting "no spec, rule is moot". - Brainstorming path bullet: an existing differently-named docs dir is not a "user preference" override. - Execution Handoff now notes review fanout scales (forward-ref to SDD's Proportionality rule) instead of promising unconditional two-stage review. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Eval-caught regression: the no-spec branch added to the **Spec:** header gave the agent a sanctioned path to skip the spec doc entirely ("avoiding duplication by skipping the spec" — cost-spec-plan-duplication-claude-20260610T213934Z-8e5b, fail). The branch is now scoped: if brainstorming happened the spec exists and must be cited; "none — requirements:" applies only when requirements arrived conversationally and no spec doc was ever produced. The reference-discipline paragraph states the same rule up front. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

arittr · 2026-06-10T22:31:05Z

This change is part of the following stack:

fix(skills): plans reference the spec instead of restating it #1715 ◀
- fix(skills): SDD review fanout scales with the change #1716
  - fix(skills): brainstorming gate exempts requests with nothing to design #1718

_{Change managed by git-spice.}

This was referenced Jun 10, 2026

fix(skills): SDD review fanout scales with the change #1716

Draft

fix(skills): brainstorming gate exempts requests with nothing to design #1718

Draft

svenbledt mentioned this pull request Jun 10, 2026

fix(skills): account for Claude Fable in model selection and testing guidance #1719

Open

5 tasks

arittr and others added 2 commits June 10, 2026 14:34

arittr mentioned this pull request Jun 10, 2026

fix(skills): description-level exceptions are authoritative in the routing rule #1732

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(skills): plans reference the spec instead of restating it#1715

fix(skills): plans reference the spec instead of restating it#1715
arittr wants to merge 3 commits into
devfrom
drew/sup-333-1-plans-reference-spec

arittr commented Jun 10, 2026 •

edited

Loading

Uh oh!

arittr commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

arittr commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Who is submitting this PR? (required)

What problem are you trying to solve?

What does this PR change?

Is this change appropriate for the core library?

What alternatives did you consider?

Does this PR contain multiple unrelated changes?

Existing PRs

Environment tested

New harness support (required if this PR adds a new harness)

Evaluation

Rigor

Human review

Hardening round (adversarial review fleet + eval re-verification)

Uh oh!

arittr commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

arittr commented Jun 10, 2026 •

edited

Loading