
feat(writing-plans): Global Constraints + per-task Interfaces as the two narrow exceptions to reference discipline (stacked on #1715) #1746

Draft. obra wants to merge 1 commit into drew/sup-333-1-plans-reference-spec from writing-plans-crisp

obra (Owner) commented Jun 12, 2026

Stacked PR — targets drew/sup-333-1-plans-reference-spec (#1715), not dev. This change is written as an amendment to #1715's reference-discipline rule and is meaningless without it; targeting dev would bury these 24 lines inside #1715's full diff. When #1715 merges to dev, this retargets to dev automatically (or I'll rebase --onto dev if it squash-merges). Don't review this before #1715 settles.

Who is submitting this PR? (required)

| Field | Value |
|---|---|
| Your model + version | Fable 5 (claude-fable-5) |
| Harness + version | Claude Code 2.1.173–2.1.175 (work spanned several days; session 7c4a7741-5e94-44b9-8c0f-3800d1241f89) |
| All plugins installed | superpowers, superpowers-lab, superpowers-chrome, superpowers-developing-for-claude-code, plugin-dev, agent-sdk-dev, mcp-server-dev, episodic-memory, context7, linear, frontend-design, elements-of-style, code-simplifier, code-review, claude-code-setup, claude-session-driver, github-triage, summarize-meetings, worldview-synthesis, release-radar, security-guidance, primeradiant-ops (private) |
| Human partner who reviewed this diff | Jesse Vincent (@obra): directed the change and the PR; reviewed the exception wording and the three structures in-session. Complete-diff sign-off is the gate before this leaves draft (full diff inlined below for exactly that). |

What problem are you trying to solve?

Three failure modes observed in real SDD (subagent-driven development) eval runs (quorum harness driving actual Claude Code sessions; sdd-go-fractals and svelte scenarios):

  1. Project-wide constraints scatter or silently drop. Plans written under current guidance never collect binding values (version floors, dependency choices, exact copy strings) in one place. The only outright constraint-value drop we ever measured came from a control plan (dropped the go 1.21 floor AND the gradient string). In execution, scattered constraints surfaced as review fix waves — e.g. a go.mod version-floor fix that was only catchable because the experimental plan had a constraints header to check against.
  2. Implementers can't see their neighbors' signatures. Under SDD, a task's implementer sees only their own task — never the spec, never other tasks. Cross-task function signatures were the controller's main legitimate "restating" work, re-derived at every dispatch. Control plans ran 2–4 fix waves per run (including a real Sierpinski formula bug that shipped in the plan's own code, both runs); plans carrying exact signatures ran 1.
  3. Task over-splitting at larger scale. At svelte scale, 4/5 control plans emitted a standalone "Types" micro-task; each unnecessary task costs a full implementer+reviewer cycle.

Separately, #1715 ("plans reference the spec; they never restate it") landed while this branch existed, and the obvious worry was that reference discipline would strip exactly the verbatim values these structures carry. We ran a head-to-head to find out (PRI-2173, below) instead of assuming.

What does this PR change?

+24 lines to skills/writing-plans/SKILL.md, nothing else: a Task Right-Sizing section, a ## Global Constraints section in the plan-header template, a per-task **Interfaces:** block in the task template, and one paragraph framing the latter two as the two narrow exceptions to #1715's reference discipline (subagents see the plan, never the spec, so these two kinds of spec content must travel in the plan — everything else stays referenced).

Complete diff (24 insertions, 1 file)
diff --git a/skills/writing-plans/SKILL.md b/skills/writing-plans/SKILL.md
index d2fb1b8..2d25c07 100644
--- a/skills/writing-plans/SKILL.md
+++ b/skills/writing-plans/SKILL.md
@@ -13,6 +13,8 @@ Assume they are a skilled developer, but know almost nothing about our toolset o
 
 **Plans reference the spec; they never restate, paraphrase, or summarize it.** [...]
 
+**Two narrow exceptions to reference discipline** — subagents executing the plan see the plan (or a single task of it), never the spec, so two kinds of spec content travel in the plan itself: the `## Global Constraints` section (the spec's project-wide requirements, exact values copied verbatim) and each task's `**Interfaces:**` block (exact signatures). Copy those values exactly; everything else stays referenced, never restated.
+
 **Announce at start:** "I'm using the writing-plans skill to create the implementation plan."
 
@@ -35,6 +37,15 @@ Before defining tasks, map out which files will be created or modified [...]
 
 This structure informs the task decomposition. Each task should produce self-contained changes that make sense independently.
 
+## Task Right-Sizing
+
+A task is the smallest unit that carries its own test cycle and is worth a
+fresh reviewer's gate. When drawing task boundaries: fold setup,
+configuration, scaffolding, and documentation steps into the task whose
+deliverable needs them; split only where a reviewer could meaningfully
+reject one task while approving its neighbor. Each task ends with an
+independently testable deliverable.
+
 ## Bite-Sized Task Granularity
 
@@ -61,6 +72,13 @@ [plan header template]
 
 **Tech Stack:** [Key technologies/libraries]
 
+## Global Constraints
+
+[The spec's project-wide requirements — version floors, dependency limits,
+naming and copy rules, platform requirements — one line each, with exact
+values copied verbatim from the spec. Every task's requirements implicitly
+include this section.]
+
 ---

@@ -74,6 +92,12 @@ [task template]
 - Modify: exact/path/to/existing.py:123-145
 - Test: tests/exact/path/to/test.py
 
+**Interfaces:**
+- Consumes: [what this task uses from earlier tasks — exact signatures]
+- Produces: [what later tasks rely on — exact function names, parameter
+  and return types. A task's implementer sees only their own task; this
+  block is how they learn the names and types neighboring tasks use.]
+
 - Step 1: Write the failing test
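
How a dispatcher might consume these two structures can be sketched in a few lines. This is an illustration only, not code from the PR: `section`, `build_task_brief`, the `### Task N:` heading convention, and the sample plan text (including the Go signatures) are all assumptions made for the example. It pulls the `## Global Constraints` section and a single task out of a plan, so the brief an implementer receives carries both without the spec.

```python
import re

# Hypothetical plan text; the heading conventions are assumed for this sketch.
PLAN = """\
# Fractals CLI Implementation Plan

## Global Constraints

- Go version floor: go 1.21
- CLI framework: cobra

---

### Task 1: Mandelbrot renderer

**Interfaces:**
- Consumes: nothing
- Produces: `func Render(w, h int) []string`

### Task 2: CLI command

**Interfaces:**
- Consumes: `func Render(w, h int) []string`
- Produces: `func NewRootCmd() *cobra.Command`
"""


def section(text: str, heading: str) -> str:
    """Return the body under `heading`, up to the next heading or `---` rule."""
    pattern = rf"^{re.escape(heading)}[^\n]*\n(.*?)(?=^#{{1,6}} |^---$|\Z)"
    match = re.search(pattern, text, flags=re.MULTILINE | re.DOTALL)
    if match is None:
        raise ValueError(f"plan has no section starting with {heading!r}")
    return match.group(1).strip()


def build_task_brief(plan: str, task_number: int) -> str:
    """Compose what one implementer sees: global constraints plus their task."""
    constraints = section(plan, "## Global Constraints")
    task = section(plan, f"### Task {task_number}:")
    return f"## Global Constraints\n\n{constraints}\n\n## Your task\n\n{task}\n"


brief = build_task_brief(PLAN, 2)
print(brief)
```

The brief for Task 2 contains the version floor and the exact `Render` signature it consumes, but nothing else from Task 1. Because both pieces are copied rather than re-derived, the extraction is a string operation and not a composition step.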

## Is this change appropriate for the core library?

Yes. `writing-plans` is a core skill; the change benefits any multi-task plan executed by subagents on any kind of project (the eval fixtures were a Go CLI and a Svelte web app, and the structures behaved identically on both). No third-party dependencies, no domain-specific content.

## What alternatives did you consider?

- **Rely on #1715's spec-citation + the SDD controller pasting spec context at dispatch time** (no exceptions at all). Tested head-to-head (PRI-2173 E2, below): it *works* at opus, but with higher variance and occasional citation-substitution; the header arm is mechanically tight (near-zero size variance — the signature of a verbatim copy rather than a composition). The exception buys determinism, not rescue.
- **Hand-crisping plans without skill guidance.** Validated in effect (−21% dispatches, gates 3/3) but doesn't scale — the point is eliciting the structure from the skill.
- **Controller-side signature restating at dispatch time** (status quo). The Interfaces block moves that per-dispatch re-derivation to plan time, done once.
- **Plan word budgets for tightness.** Refuted in a separate experiment — budgets slash test content first. Not pursued.

## Does this PR contain multiple unrelated changes?

The three structures + the exception paragraph are one coherent change: the plan-side content contract for subagent execution. Global Constraints and Interfaces are the two exception channels the paragraph names; Task Right-Sizing governs the unit those channels attach to. They were validated together as one variant stack (the "C variant" below) and land in one file.

## Existing PRs
- [x] I have reviewed all open AND closed PRs for duplicates or prior art
- Related PRs: #1715 (the base this stacks on — reference discipline; this PR is its two exceptions), #1717 / #1744 (SDD execution-side siblings that *consume* these plan structures at dispatch time), #1062 and #1704 (plan-review/coverage gates — different problem: they verify plans after writing; this changes what plans contain). No open or closed PR adds Global Constraints, per-task Interfaces, or task right-sizing to writing-plans.

## Environment tested

| Harness (e.g. Claude Code, Cursor) | Harness version | Model | Model version/ID |
|-------------------------------------|-----------------|-------|------------------|
| Claude Code (orchestration + live SDD eval runs via quorum/drill tmux harness) | 2.1.173–2.1.175 | Opus (eval subjects), Fable (orchestrator) | claude-opus-4-5 / claude-fable-5 |
| Anthropic API direct (plan-generation + dispatch-composition micros) | n/a | Opus | claude-opus-4-5 |

## New harness support (required if this PR adds a new harness)

N/A — no harness change.

## Evaluation

**Initial prompt:** this came out of an autoresearch campaign on SDD build-loop cost/reliability (Jesse's directive: instrument real plan execution, find what's load-bearing), not a single broken session. The specific motivating evidence is the fix-wave and constraint-drop data below. The #1715 reconciliation was a Linear ticket (PRI-2173): "settle whether #1715's reference discipline regresses the L1 plan structures, with data."

**Eval sessions after the change:** 15 generated plans (L1 micros) + 4 full SDD quorum runs directly attributing this guidance (plus 3 hand-crisped precursor runs) + 18 generated plans and 12 composed dispatches for the #1715 head-to-head. Every automated score was manually inspected; all flagged samples read in full.

### 1. L1 elicitation micros — does the guidance elicit the structures? (~$3, micro tier)

Opus, one API call per plan, 5 reps/variant, fractals + svelte design fixtures. Variants: **A** control (current skill), **B** +right-sizing, **C** +Global Constraints header +Interfaces blocks.

| Structure | Control | With guidance |
|---|---|---|
| Global Constraints header, exact values verbatim (gradient string, cobra path, go 1.21) | 0/5 | **5/5** |
| Per-task Interfaces blocks with exact signatures, consumed verbatim by later tasks | 0% of tasks | **100% of tasks** |
| Task count, svelte scale | 9.4 mean (standalone "Types" micro-task in 4/5) | 8.4 mean, sensible merges |
| Task count, fractals scale | 6–7 | no change (control already right-sized at this scale) |

Watch item logged honestly: one fractals C-sample over-merged to 3 tasks — coarse-gate risk when stacking guidance.
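
The "exact values verbatim" row lends itself to a mechanical check. The sketch below assumes the scorer is a plain substring test against the header; `constraints_header`, `score`, the value list, and both plan snippets are illustrative stand-ins, not the eval's actual scorer or fixtures.

```python
# Illustrative binding values; the real fixtures' values are not reproduced here.
BINDING_VALUES = ["go 1.21", "github.com/spf13/cobra"]


def constraints_header(plan: str) -> str:
    """Return the lines under '## Global Constraints', or '' if absent."""
    lines = plan.splitlines()
    try:
        start = lines.index("## Global Constraints") + 1
    except ValueError:
        return ""
    body = []
    for line in lines[start:]:
        # Stop at the next heading or at the horizontal rule closing the header.
        if line.startswith("#") or line.strip() == "---":
            break
        body.append(line)
    return "\n".join(body)


def score(plan: str, values: list[str]) -> dict[str, bool]:
    """Map each binding value to whether the header carries it verbatim."""
    header = constraints_header(plan)
    return {value: value in header for value in values}


with_header = (
    "# Plan\n\n## Global Constraints\n\n"
    "- Go floor: go 1.21\n- CLI: github.com/spf13/cobra\n\n---\n"
)
control = "# Plan\n\nUse a recent Go toolchain and cobra for the CLI.\n"

print(score(with_header, BINDING_VALUES))
print(score(control, BINDING_VALUES))
```

A substring check like this is brittle against wording variants, which is one reason to pair any automated score with a manual read of the plan.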

### 2. L1 full-run attribution — does it matter in execution? (4 quorum runs ≈ $30; 3 precursors ≈ $35)

Real Claude Code SDD sessions (opus controller), elicited C-variant plan vs elicited control plan, same scenario, all gates green. Coding-agent cost per run:

| Plan | Coding $ | Fix waves |
|---|---|---|
| Elicited with this guidance ×2 | $6.34 / $8.49 | **1** (a go.mod version floor — catchable *because* the constraints header existed) |
| Elicited control ×2 | $7.59 / $7.73 | **2–4** (incl. a real Sierpinski formula bug in the plan's own code, both runs) |
| Hand-crisped precursor ×3 | $9.51–12.65 | 5–9 |
| Prior hand-written 10-task fixture band | $11.67–14.84 | ~7 |

**Honest attribution:** the big dollar win belongs to opus-written complete-code plans, *not* to this guidance — the control plan lands inside the elicited cost range. What this guidance measurably buys is **fidelity and variance**: deterministic constraint propagation, exact cross-task signatures, fix waves 1 vs 2–4, −1 task at svelte scale. This PR claims those grounds, not dollars.

### 3. PRI-2173 — head-to-head vs #1715's reference discipline (≈ $11 Tier 1 + a few $ Tier 1.5, within a $40 cap)

Pre-registered three-arm plan-generation micro, opus, 6 reps/arm (18 plans): **C-dev** control, **A-1715** (reference-discipline paragraph verbatim from the PR diff), **B-carveout** (A + this PR's exception wording).

| Metric | C-dev | A-1715 | B-carveout |
|---|---|---|---|
| Spec cited in header | 0/6 | 6/6 | 6/6 |
| Global Constraints header | 0/6 | 0/6 | **6/6** |
| 8 binding values verbatim | 5.8/6 avg (one rep **dropped** go-1.21 + gradient) | 6/6 (riding code/command blocks) | 6/6 (header AND code) |
| Interfaces blocks | 0/6 | 0/6 | **6/6** |
| AC prose copied (restatement leak) | ≤2/7 trivial lines | ≤2/7 trivial | **0–1/7 (lowest of all arms)** |
| Plan bytes (avg) | 26,622 | 20,920 (−21%) | 22,703 (−15% vs control) |

Two findings worth stating against my own interest: **(a)** my pre-registered prediction that #1715 alone would strip constraint values was **refuted** — under reference discipline the values ride code and command blocks, which #1715 explicitly permits. #1715 is safe as written. **(b)** the exception does **not** re-open the restatement door — arm B has the *lowest* spec-prose copy rate and is still −15% bytes vs control. The exception buys structure determinism (6/6 zero-variance adoption), not value rescue.
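
The byte percentages in the table follow directly from the reported averages. A quick arithmetic check, using only the figures in the table (the helper name `pct_change` is just for this example):

```python
# Average plan sizes in bytes, taken from the table above.
avg_bytes = {"C-dev": 26_622, "A-1715": 20_920, "B-carveout": 22_703}


def pct_change(value: int, baseline: int) -> float:
    """Signed percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100


baseline = avg_bytes["C-dev"]
for arm in ("A-1715", "B-carveout"):
    print(f"{arm}: {pct_change(avg_bytes[arm], baseline):+.1f}% vs C-dev")
# A-1715: -21.4% vs C-dev
# B-carveout: -14.7% vs C-dev
```

By the same arithmetic, the carve-out arm is about 8.5% larger than A-1715, which is the byte cost of carrying the header and Interfaces blocks.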

Follow-up dispatch-composition micro (12 composed reviewer dispatches, 6/arm, opus): both mechanisms (spec-citation + controller paste vs. header copy) compose high-quality constraint lenses; the header arm is mechanically tight (1,507–1,638 chars, near-zero variance — the signature of verbatim copy) where the composed arm ranged 2,070–2,713 with one rep substituting a citation for inline values. Tier 2 full-SDD interplay runs (est. $25–34) were not spent — judged a second-order question after both micros agreed.
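
The tightness comparison can be put in numbers from the reported ranges alone. Only the minimum and maximum sizes are given above, so this sketch compares range width relative to the midpoint rather than a true variance; `relative_spread` is a name chosen for the example.

```python
# (min, max) composed-dispatch sizes in characters, from the paragraph above.
ranges = {
    "header copy": (1_507, 1_638),
    "citation + controller paste": (2_070, 2_713),
}


def relative_spread(lo: int, hi: int) -> float:
    """Range width as a percentage of the range midpoint."""
    midpoint = (lo + hi) / 2
    return (hi - lo) / midpoint * 100


for arm, (lo, hi) in ranges.items():
    print(f"{arm}: width {hi - lo} chars, {relative_spread(lo, hi):.1f}% of midpoint")
```

On these endpoints the header arm spans about 8% of its midpoint and the composed arm about 27%, so the header arm's range is roughly a third as wide relative to its size.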

**Total measured spend across everything cited: ≈ $80.**

## Rigor

- [x] If this is a skills change: I used `superpowers:writing-skills` and completed adversarial pressure testing (results above — predictions pre-registered before each run, including one designed to kill this PR's premise, which died and is reported)
- [x] This change was tested adversarially, not just on the happy path (the PRI-2173 head-to-head exists *because* #1715 looked like it might falsify this PR; two scorer artifacts — an Interfaces wording variant and a marker-scoping error — were caught by mandatory manual reads and corrected before they could flatter the results)
- [x] I did not modify carefully-tuned content (Red Flags table, rationalizations, "human partner" language) without extensive evals showing the change is an improvement (none of that content is touched)

## Human review
- [ ] A human has reviewed the COMPLETE proposed diff before submission

Deliberately unchecked while in draft: @obra directed this PR and reviewed the exception paragraph and structures in-session; the complete 24-line diff is inlined above for his sign-off, which gates undrafting.

---

Known limits, stated plainly: all evals ran at opus (sonnet planners adopt the structures mechanically but choose coarser task structure — separate finding; plan-writing stays at opus); fixtures are small (a Go fractals CLI and a Svelte todo app); the exception *wording* is validated at plan-generation and dispatch-composition micro level, and the full-SDD Tier 2 interplay run is available but unspent.

Submitted by **Fable 5** (`claude-fable-5`), Claude Code 2.1.175, session `7c4a7741-5e94-44b9-8c0f-3800d1241f89`, operated by Jesse Vincent (@obra). I wrote this PR body and ran the evals it reports; the experiment logs live in our private autoresearch repo and the eval scenarios in the `evals/` submodule.

… Interfaces blocks

Builds on #1715's reference discipline: the two structures are framed as
the narrow exceptions for spec content subagents must see (they get the
plan or one task of it, never the spec). Exception wording micro-validated
as PRI-2173 arm B: adopts 6/6 with zero restatement leak (lowest spec-copy
rate of all arms, plans -15% bytes vs dev control); Global Constraints
header elicited 0/5->5/5 with verbatim values, Interfaces 0->100% signature
availability (L1 micros). Value = lens determinism + mechanical extraction
into task briefs and reviewer constraints blocks, plus fix-wave reduction
(1 vs 2-4 in L1 full runs); the values themselves survive reference
discipline by riding code blocks.