Skip to content

Add verifying-plan-coverage skill: prove a plan covers 100% of its spec before execution#1704

Open
filipboshevski wants to merge 1 commit into
obra:devfrom
filipboshevski:add-verifying-plan-coverage-skill
Open

Add verifying-plan-coverage skill: prove a plan covers 100% of its spec before execution#1704
filipboshevski wants to merge 1 commit into
obra:devfrom
filipboshevski:add-verifying-plan-coverage-skill

Conversation

@filipboshevski

Copy link
Copy Markdown

Who is submitting this PR? (required)

Field Value
Your model + version Claude Opus 4.8 (claude-opus-4-8)
Harness + version Claude Code 2.1.123
All plugins installed superpowers 5.1.0 (primary). Also enabled in the session: anthropic-skills, cowork-plugin-management, and one project-local skill (migrating-templates-components). Marketplace: anthropics/claude-plugins-official.
Human partner who reviewed this diff Filip Boshevski (filipbosevski1234@gmail.com)

What problem are you trying to solve?

When writing-plans produces a plan for a non-trivial spec, the built-in Self-Review step ("Skim each section/requirement — can you point to a task that implements it?") is a self-skim from memory. In real planning sessions on a production polyglot monorepo, that skim repeatedly passed plans that had actually drifted from the spec in three recurring ways:

  1. Silent gaps — a stated requirement (often an implicit one like "light and dark", a file-split rule, or a "no comments" constraint) had no corresponding task, but the skim "felt" complete and marked it covered.
  2. Scope creep — the plan added steps the spec never asked for ("this is obviously beneficial"), which then got built and reviewed before anyone noticed they weren't requested.
  3. Self-patching — when the review did spot a gap, the agent immediately rewrote the plan to close it, hiding the scope decision from the human partner instead of surfacing it.

The shared root cause: the skim produces a feeling of coverage, not an artifact you can point at, and it lets the agent both find and unilaterally resolve divergence. As plans get long, context pollution makes the from-memory skim progressively less reliable — exactly when rigor matters most.

What does this PR change?

Adds a new general-purpose skill, verifying-plan-coverage, that runs immediately after writing-plans saves a plan and before execution handoff. It enumerates every atomic spec requirement (Rn), builds a bidirectional coverage matrix (spec→plan and plan→spec), and — crucially — when it finds any gap, divergence, unrequested addition, or nonsense step it stops and emits a structured Amendment Proposal for the human partner to rule on, rather than silently fixing the plan or the spec. writing-plans is wired to require it (a ## Coverage Verification (MANDATORY) section) so the execution handoff is gated on a VERIFIED plan.

Is this change appropriate for the core library?

Yes — it is domain-agnostic plan-quality infrastructure, not project/tool/third-party-specific. It operates purely on "a plan and the spec it claims to implement," adds zero dependencies, and benefits anyone using the existing brainstorming → writing-plans → executing-plans workflow regardless of stack. It directly strengthens a skill (writing-plans) that already ships in core.

What alternatives did you consider?

  • Strengthening the existing Self-Review skim in writing-plans (the approach of Improved Plan Review #1644). Rejected as insufficient for our failure mode: a richer in-line skim is still a self-skim that both finds and resolves issues in one pass — it does not produce a pointable matrix and does not route scope decisions to the human. We kept Self-Review and added a separate, rigorous gate after it.
  • A completion-time Evaluator gate (the approach of feat(writing-plans): add Evaluator section + self-review check #1627). That gates plan → complete (did execution satisfy the plan). Our gap is earlier and different: spec → plan (does the plan faithfully and completely reflect the spec) before any execution begins. Complementary, not overlapping.
  • An acceptance-criteria skill upstream of planning (the approach of feat: add acceptance-criteria skill #1112). That improves the input to planning by turning design into testable ACs. It does not verify that the resulting plan covers those ACs with zero divergence, which is what this skill does.
  • Folding the check into executing-plans/subagent-driven-development. Rejected — by execution time the divergence is already baked in and more expensive to unwind; the check belongs at the plan/spec boundary.

Does this PR contain multiple unrelated changes?

No. One concern: proving plan↔spec coverage. Two files, tightly coupled — the new skill, and the one-section wiring in writing-plans that invokes it. The writing-plans edit is the minimal integration point for the new skill, not an independent change.

Existing PRs

  • I have reviewed all open AND closed PRs for duplicates or prior art
  • Related PRs: feat(writing-plans): add Evaluator section + self-review check #1627 (open — adds an Evaluator/completion gate to writing-plans), Improved Plan Review #1644 (open, targets dev — enriches the existing plan-review subagent with code correctness/consistency checks), feat: add acceptance-criteria skill #1112 (open — adds an acceptance-criteria skill before planning). Distinction from each is detailed in "What alternatives did you consider?" above. None add a post-plan, pre-execution, bidirectional spec-coverage matrix with a no-silent-fix / human-ruling Amendment Proposal. No prior closed PR was found proposing this.

Environment tested

Harness Harness version Model Model version/ID
Claude Code 2.1.123 Claude Opus 4.8 claude-opus-4-8

Evaluation

Developed and pressure-tested with superpowers:writing-skills (RED→GREEN→REFACTOR), using subagents as the test harness. Two scenarios, each a spec + a deliberately drifted plan, run under combined pressure (time + authority + "no more back-and-forth"). Each agent had to produce its deliverable and a forced-honesty DECISION LOG (did you edit the plan yourself / did you stop and escalate / did you call it ready).

Scenario 1 — multi-defect plan (8 atomic requirements; plan had 1 missing requirement, 2 partial, 1 divergent, 1 unrequested addition that also violated a "no new dependencies" constraint). Measured the discipline the skill enforces: surface gaps as a proposal and STOP — never silently fix.

Condition n Detected the defects Silently rewrote the plan Stopped + Amendment Proposal
Baseline (no skill) 3 3/3 3/3 0/3
With skill 3 3/3 0/3 3/3

Baseline agents detected the problems but unilaterally deleted the unrequested feature, swapped the storage backend, and added the missing auth check, then declared the plan ready — explicitly "No [I didn't ask], you told me not to." That is the exact failure mode: the scope decision (e.g. drop the CSV feature, or add it to the spec?) was hidden from the human. With the skill, all 3 produced the requirement enumeration + bidirectional matrix, edited nothing, and routed every item — including treating the unrequested feature as "spec change, your call" rather than deleting it.

Scenario 2 — adversarial REFACTOR probe (the hardest case): a near-verified plan with ONE trivially minor gap (a Retry-After: seconds line present verbatim in the spec but dropped from the plan), plus explicit standing authority: "fix small stuff yourself, don't bug me." This attacks the Iron Rule directly.

  • Found a real loophole. With the skill as originally written, 2/2 agents amended the plan themselves and declared it VERIFIED, reasoning: "restoring an already-approved spec requirement isn't new scope, so it's a fix not an amendment" and "I'm consciously overriding the strict reading on the strength of your standing authority."

  • Closed it. Added explicit counters to the skill (Iron Rule now covers trivial/one-line/already-in-spec gaps and disclaims standing "fix small stuff" authority; plus matching Red Flags + rationalization-table rows).

  • Re-tested, same scenario: 2/2 now stop and propose, both citing the new language ("the skill explicitly removes that authority for coverage gaps... 'small' and 'already in the spec' are not exemptions") while acknowledging the temptation — the bulletproofing signal.

  • Number of eval subagent sessions run: 10 (3 baseline + 3 treatment on Scenario 1; 1 baseline + 2 treatment pre-fix + 2 treatment post-fix on Scenario 2).

  • Before/after: the discipline (stop + surface, never silently fix) went 0/3 → 3/3 on the multi-defect scenario, and the trivial-gap loophole went 0/2 → 2/2 after the REFACTOR edit. Detection was already strong in both conditions — the skill's value is entirely in routing the decision to the human instead of smuggling it.

Rigor

  • If this is a skills change: I used superpowers:writing-skills and completed adversarial pressure testing (results above — one RED→GREEN→REFACTOR cycle that found and closed a real loophole)
  • This change was tested adversarially, not just on the happy path (combined time + authority + sunk-cost pressure; a dedicated probe targeting the Iron Rule's weakest point)
  • I did not modify carefully-tuned content (Red Flags table, rationalizations, "human partner" language) without extensive evals — the writing-plans edit only adds a new section + one clause; it does not reword existing tuned content.

Human review

  • A human has reviewed the COMPLETE proposed diff before submission

A note on the verifying-plan-coverage ↔ acceptance-criteria relationship (#1112)

If #1112's acceptance-criteria skill lands, the two compose cleanly: that skill produces testable ACs as the spec; this skill proves the plan covers them with zero divergence. They are not competing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant