diff --git a/docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md b/docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md index ab51c43e4c..dfec87cfde 100644 --- a/docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md +++ b/docs/superpowers/specs/2026-06-10-strict-cost-sdd-design.md @@ -133,8 +133,29 @@ opus controller flagged it 5/5. Cheap controllers handle explicit escalation; they absorb implicit authority-vs-quality adjudication. A possible L2b (discrete rule: "a reviewer finding that conflicts with the plan's text is the human's decision — escalate it") would route the -failing judgment through the escalation behavior that held; untested. -Original recon notes follow. +failing judgment through the escalation behavior that held. + +**L2b tested 2026-06-11 (E35/E36, evals +`docs/experiments/2026-06-11-build-loop-autoresearch.md`): improves the +opus stack, does NOT rescue the sonnet rung.** Two rules: a reviewer +tripwire (a plan-mandated defect IS a finding — Important, labeled +plan-mandated; the human decides) and a controller escalation rule +(plan-mandated findings go to the human like any plan contradiction). +Micro on frozen sonnet-composed inputs: 0/6 → 6/6 labeled findings. +Full battery: opus controllers 2/2 internalized the rule, caught their +reviewer's miss as self-described backstop, and escalated for a +sanctioned fix (the 4241 ad-hoc behavior made structural); escalation +sanity 2/2 unbroken. Sonnet controllers: 1/5 full pass — paraphrase +drops the tripwire from dispatches (2/5 transmitted), transmission +alone doesn't fire it live (read-once dilution across the reviewer's +tool reads; placement within the dispatch refuted as the variable), +and no sonnet controller showed backstop behavior; 1/5 shipped the +defect. The L2b rules are a candidate commit for the opus stack. +A future L2c for the sonnet rung would pair the SKILL.md +constraints-recipe (the one channel sonnet transmits verbatim) with a +mandatory output-format slot for plan-mandated findings (the skeleton +survives every observed paraphrase and is consulted at composition +time); untested. Original recon notes follow. **Recon (superseded):** Sonnet-controller runs (claude-sonnet coding-agent): all gates green at diff --git a/evals b/evals index af05326467..7dc0be79bb 160000 --- a/evals +++ b/evals @@ -1 +1 @@ -Subproject commit af0532646754b0a1d164800c609ddfc16e6f91dc +Subproject commit 7dc0be79bb40bd00f757a64b1101ae6d52f262c2 diff --git a/skills/subagent-driven-development/SKILL.md b/skills/subagent-driven-development/SKILL.md index eb4623ab7f..2676043000 100644 --- a/skills/subagent-driven-development/SKILL.md +++ b/skills/subagent-driven-development/SKILL.md @@ -11,6 +11,9 @@ Execute plan by dispatching a fresh implementer subagent per task, a task review **Core principle:** Fresh subagent per task + task review (spec + quality) + broad final review = high quality, fast iteration +**Narration:** between tool calls, narrate at most one short line — the +ledger and the tool results carry the record. + **Continuous execution:** Do not pause to check in with your human partner between tasks. Execute all tasks from the plan without stopping. The only reasons to stop are: BLOCKED status you cannot resolve, ambiguity that genuinely prevents progress, or all tasks complete. "Should I continue?" prompts and progress summaries waste their time — they asked you to execute the plan, so execute it. ## When to Use @@ -79,6 +82,20 @@ digraph process { } ``` +## Pre-Flight Plan Review + +Before dispatching Task 1, scan the plan once for conflicts: + +- tasks that contradict each other or the plan's Global Constraints +- anything the plan explicitly mandates that the review rubric treats as a + defect (a test that asserts nothing, verbatim duplication of a logic block) + +Present everything you find to your human partner as one batched question — +each finding beside the plan text that mandates it, asking which governs — +before execution begins, not one interrupt per discovery mid-plan. If the +scan is clean, proceed without comment. The review loop remains the net for +conflicts that only emerge from implementation. + ## Model Selection Use the least powerful model that can handle each role to conserve cost and increase speed. @@ -88,6 +105,8 @@ Use the least powerful model that can handle each role to conserve cost and incr **Integration and judgment tasks** (multi-file coordination, pattern matching, debugging): use a standard model. **Architecture and design tasks**: use the most capable available model. +The final whole-branch review is one of these — dispatch it on the most +capable available model, not the session default. **Review tasks**: choose the model with the same judgment, scaled to the diff's size, complexity, and risk. A small mechanical diff does not need the @@ -100,8 +119,10 @@ most expensive — which silently defeats this section. **Turn count beats token price.** Wall-clock and context cost scale with how many turns a subagent takes, and the cheapest models routinely take 2-3× the turns on multi-step work — costing more overall. Use a mid-tier model as the -floor for implementers and reviewers; reserve the cheapest tier for -single-file mechanical fixes. +floor for reviewers and for implementers working from prose descriptions. +When the task's plan text contains the complete code to write, the +implementation is transcription plus testing: use the cheapest tier for +that implementer. Single-file mechanical fixes also take the cheapest tier. **Task complexity signals (implementation tasks):** - Touches 1-2 files with a complete spec → cheap model @@ -174,6 +195,11 @@ final whole-branch review. When you fill a reviewer template: findings in the progress ledger as you go, and point the final whole-branch review at that list so it can triage which must be fixed before merge. A roll-up nobody reads is a silent discard. +- A finding labeled plan-mandated — or any finding that conflicts with + what the plan's text requires — is the human's decision, like any plan + contradiction: present the finding and the plan text, ask which governs. + Do not dismiss the finding because the plan mandates it, and do not + dispatch a fix that contradicts the plan without asking. - The final whole-branch review gets a package too: run `scripts/review-package MERGE_BASE HEAD` (MERGE_BASE = the commit the branch started from, e.g. `git merge-base main HEAD`) and include the diff --git a/skills/subagent-driven-development/task-reviewer-prompt.md b/skills/subagent-driven-development/task-reviewer-prompt.md index d2177f0f54..588a40227a 100644 --- a/skills/subagent-driven-development/task-reviewer-prompt.md +++ b/skills/subagent-driven-development/task-reviewer-prompt.md @@ -115,6 +115,11 @@ Subagent (general-purpose): "yes." A tight report that cites lines gives the controller everything it needs. + Your final message is the report itself: begin directly with the + spec-compliance verdict. Every line is a verdict, a finding with + file:line, or a check you ran — no preamble, no process narration, + no closing summary. + ## Calibration Categorize issues by actual severity. Not everything is Critical. @@ -123,6 +128,11 @@ Subagent (general-purpose): would block a merge over — verbatim duplication of a logic block, swallowed errors, tests that assert nothing. "Coverage could be broader" and polish suggestions are Minor. + If the plan or brief explicitly mandates something this rubric calls a + defect (a test that asserts nothing, verbatim duplication of a logic + block), that IS a finding — report it as Important, labeled + plan-mandated. The plan's authorship does not grade its own work; the + human decides. Acknowledge what was done well before listing issues — accurate praise helps the implementer trust the rest of the feedback.