Skip to content

fix(skills): plans reference the spec instead of restating it#1715

Open
arittr wants to merge 3 commits into
devfrom
drew/sup-333-1-plans-reference-spec
Open

fix(skills): plans reference the spec instead of restating it#1715
arittr wants to merge 3 commits into
devfrom
drew/sup-333-1-plans-reference-spec

Conversation

@arittr

@arittr arittr commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Targets dev per the template requirement.

Who is submitting this PR? (required)

Field Value
Your model + version Claude Fable 5 (claude-fable-5[1m])
Harness + version Claude Code 2.1.169
All plugins installed superpowers (this repo, dev checkout); quorum eval lab (superpowers-evals) as the testing apparatus; unrelated local ops plugins (decision-log, episodic-memory, superpowers-chrome, primeradiant-ops)
Human partner who reviewed this diff Drew Ritter (@drewritter)

What problem are you trying to solve?

In a six-agent behavioral eval sweep (quorum/Gauntlet lab, 2026-06-09; agents: claude, codex, antigravity, kimi, opencode/gpt-5.5, pi), the cost-spec-plan-duplication scenario — naive user brainstorms a small feature, then says "Looks good, please write the plan" — failed every agent that completed it (5/5). Each plan restated the entire spec inline instead of referencing it:

  • codex: plan "completely re-states all design decisions independently"
  • opencode: "contains zero references to the spec… re-derives and duplicates all design decisions from scratch"
  • pi: 18,213-byte, 683-line plan of inlined spec content

The agents were obeying the skill: writing-plans' overview says "assuming the engineer has zero context… Document everything they need to know." Self-contained-by-design plans necessarily duplicate the spec — doubling document bytes/tokens and letting the two artifacts drift apart. The eval and the skill text were in direct contradiction; no agent could satisfy both.

A second defect surfaced while verifying the fix: claude shortened the spec path docs/superpowers/specs/ to plain docs/specs/ in 2/2 runs (the path sits mid-checklist and loses to the model's docs/specs prior).

What does this PR change?

States the division of labor in writing-plans: the spec owns WHAT/WHY, the plan owns HOW — cite the spec by path, never restate it ("zero context" now explicitly means mechanically executable steps, not duplication), and adds a **Spec:** line to the plan header template. In brainstorming, closes the observed path loophole: both docs/superpowers/specs/ mentions now explicitly forbid shortening to docs/specs/.

Is this change appropriate for the core library?

Yes — general-purpose planning discipline. It applies to any project that produces a spec and a plan, is not tied to any domain, tool, or third-party service, and changes no workflow structure — only the redundancy rule between two artifacts the core skills already produce.

What alternatives did you consider?

  1. Retire/re-tier the eval scenario as aspirational — rejected: the duplication is a real cost (doubled doc bytes/tokens per feature) and a real drift hazard, not a test calibration artifact.
  2. Fix it in the eval prompt (tell the agent not to restate) — rejected: that rigs the test instead of fixing the product behavior; real users won't say it either.
  3. Remove the "zero context" framing entirely — rejected: it earns its keep for execution detail (exact paths, complete code, exact commands). The fix scopes it rather than deletes it.

Does this PR contain multiple unrelated changes?

Two skills, one dependency: the brainstorming path fix was exposed by this change's own eval cycle (first post-edit run passed the reference criterion but failed the path criterion, 2/2 reproducible), and the same scenario's deterministic checks gate both (file-exists docs/superpowers/specs/*.md + the reference criterion). The GREEN evidence below requires the pair.

Existing PRs

Environment tested

Harness (e.g. Claude Code, Cursor) Harness version Model Model version/ID
Claude Code (agent under test in eval) 2.1.169 Claude Opus claude-opus-4-8

New harness support (required if this PR adds a new harness)

N/A — no harness changes.

Evaluation

  • Initial prompt: the scenario's scripted naive user: "I want to add a feature to mark habits as completed for the day. Help me think through this then plan it." → (after spec) "Looks good, please write the plan." The QA driver is scripted not to remind the agent of the spec's content.
  • Eval sessions after the change: 3 — (1) cost-spec-plan-duplication post-edit: reference criterion flipped to pass, exposed the path defect; (2) after the path fix: full pass (run cost-spec-plan-duplication-claude-20260609T234142Z-9625); (3) regression canary triggering-writing-plans: pass (skill still auto-fires; Skill(superpowers:writing-plans) in the session log).
  • Before/after: baseline 5/5 agents fail with inline-restated plans (e.g. 18KB/683 lines); after, the plan cites the spec by path, states "this plan does not restate them," and lands at 12.8KB with a 4.3KB spec — both in the prescribed docs/superpowers/ locations, all deterministic post-checks green.

Rigor

  • If this is a skills change: I used superpowers:writing-skills and completed adversarial pressure testing (paste results below)
  • This change was tested adversarially, not just on the happy path
  • I did not modify carefully-tuned content (Red Flags table, rationalizations, "human partner" language) without extensive evals showing the change is an improvement

RED→GREEN→REFACTOR per writing-skills: RED = 5-agent baseline failures with verbatim rationalizations (above) against the skill text unchanged since May 4; GREEN = minimal edit, scenario passes; REFACTOR = the path loophole found in GREEN-attempt-1 closed with explicit counters ("exactly this path — not docs/specs/") and re-verified. The pressure scenario is adversarial by construction: a scripted naive user who never names skills and never reminds the agent of spec content, graded by an independent QA agent plus deterministic filesystem checks.

Human review

  • A human has reviewed the COMPLETE proposed diff before submission

Hardening round (adversarial review fleet + eval re-verification)

Three parallel reviewers (red-team, cross-corpus consistency, evidence verification) audited the stack. Findings addressed here by two follow-up commits:

  • "never restate" did not cover paraphrase/summary — the actual baseline failure mode (pi's 683-line plan paraphrased). Now "never restate, paraphrase, or summarize."
  • The No Placeholders intra-plan repetition mandate gave a symmetric argument for re-inlining the spec; the rule now draws the line explicitly: repetition WITHIN the plan (code, commands) required, copying FROM the spec forbidden.
  • Drift argument was invertible ("snapshot to avoid drift"); now: snapshots hide drift.
  • The **Spec:** header gained a no-spec branch — which the eval immediately caught being abused: the agent skipped writing the spec entirely "to avoid duplication" (cost-spec-plan-duplication-claude-20260610T213934Z-8e5b, FAIL — a reviewer-recommended fix that misfired, caught only because every change re-runs the eval). Re-scoped: if brainstorming happened the spec exists and is cited; "none — requirements:" applies only when requirements arrived conversationally and no spec doc was ever produced. Re-run: clean pass (cost-spec-plan-duplication-claude-20260610T221053Z-bf8e — "The plan contains implementation HOW details… rather than restating the WHAT/WHY").
  • Brainstorming path bullet: an existing differently-named docs directory is not a "user preference" override.
  • Execution Handoff now forward-references SDD's Proportionality rule instead of promising unconditional two-stage review.

Final verification battery on the assembled 4-PR stack: 5/5 pass (cost-checkbox/claude, cost-spec-plan/claude, cost-trivial-task-review-fanout/opencode, sdd-rejects-extra-features/claude, triggering-writing-plans/claude).

…#1)

writing-plans told agents to "document everything they need to know"
assuming zero context — every agent in the 2026-06-09 six-agent quorum
sweep obeyed and restated the entire spec inline in the plan
(cost-spec-plan-duplication failed 5/5 completed agents; pi's plan was
683 lines of duplicated spec).

- writing-plans: state the division of labor — spec owns WHAT/WHY,
  plan owns HOW; cite the spec by path/section, never restate it.
  "Zero context" means mechanically executable steps, not duplication.
  Add a **Spec:** line to the plan header template.
- brainstorming: close the path loophole the re-run exposed — claude
  shortened docs/superpowers/specs/ to docs/specs/ in 2/2 runs; both
  path mentions now explicitly forbid the shortening.

TDD evidence (quorum):
- RED: batch-20260609T023452Z-68aa et al — 5/5 agents fail
- GREEN: cost-spec-plan-duplication-claude-20260609T234142Z-9625 pass
  (plan: "this plan does not restate them" + spec cited by path;
  both docs in docs/superpowers/)
- Canary: triggering-writing-plans-claude pass (skill still fires)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
arittr and others added 2 commits June 10, 2026 14:34
… reference rule

Adversarial review findings (C1, C2, C3, C5, A8, F3):
- "never restate" did not cover paraphrase/summary — the actual failure
  mode in the RED evidence; now "never restate, paraphrase, or summarize".
- The No Placeholders intra-plan repetition mandate gave a symmetric
  argument for re-inlining the spec; the rule now draws the line:
  repetition WITHIN the plan is required, copying FROM the spec is not.
- Drift argument was invertible ("snapshot to avoid drift"); now states
  snapshots hide drift.
- **Spec:** header gets a no-spec branch (state requirements once in
  the header, not per task) instead of inviting "no spec, rule is moot".
- Brainstorming path bullet: an existing differently-named docs dir is
  not a "user preference" override.
- Execution Handoff now notes review fanout scales (forward-ref to
  SDD's Proportionality rule) instead of promising unconditional
  two-stage review.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Eval-caught regression: the no-spec branch added to the **Spec:**
header gave the agent a sanctioned path to skip the spec doc entirely
("avoiding duplication by skipping the spec" —
cost-spec-plan-duplication-claude-20260610T213934Z-8e5b, fail). The
branch is now scoped: if brainstorming happened the spec exists and
must be cited; "none — requirements:" applies only when requirements
arrived conversationally and no spec doc was ever produced. The
reference-discipline paragraph states the same rule up front.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@arittr

arittr commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant