fix(skills): SDD review fanout scales with the change #1716
Draft
arittr wants to merge 1 commit into
5 tasks
d71eb57 to 162ac4b
This was referenced Jun 10, 2026
162ac4b to 60f6174
81874ec to fc5896b
60f6174 to 70b52fd
subagent-driven-development mandated implementer + two-stage review + final reviewer unconditionally: antigravity (agy) and opencode each dispatched 4 subagents for a one-line console.log (cost-trivial-task-review-fanout), and agents that passed did so only by disobeying the skill.

- Proportionality rule: a plan that is entirely one trivial, fully-specified mechanical change is implemented directly, verified per superpowers:verification-before-completion, and committed, with no review fanout. Trivial is a property of the diff (no logic, control flow, or security-relevant change), not the plan's self-description; "a constant bump" is qualified (no security or behavioral consequences). Any doubt = full pipeline. Multi-task plans never skip reviews regardless of task size.
- Flowchart gets the matching trivial-exit diamond (the failing agents follow the flowchart literally).
- Red Flags "never skip reviews" points at the sole exception instead of contradicting it.
- writing-plans' execution handoff notes fanout scales (forward reference resolves within this PR's base expectations: the Proportionality rule ships here).

Independently mergeable: no dependency on the reference-discipline or brainstorming-exception PRs.

Eval evidence (quorum): RED 4 dispatches for 1 line (agy, opencode); GREEN cost-trivial-task-review-fanout opencode 3/3 pass (0 dispatches, deterministic tool-count check) + antigravity pass (the formerly deterministic failer); containment canary sdd-rejects-extra-features claude 3/3 pass (full pipeline per task).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
70b52fd to f9d11b3
Who is submitting this PR? (required)
claude-fable-5[1m]) dev checkout); quorum eval lab (superpowers-evals) as the testing apparatus; unrelated local ops plugins
What problem are you trying to solve?
In the 2026-06-09 six-agent quorum sweep, cost-trivial-task-review-fanout (a scripted naive user asks the agent to execute a plan whose entire content is one one-line console.log insertion) showed that subagent-driven-development's pipeline has no proportionality exit: antigravity "dispatched 4 subagents: Implementer, Spec Reviewer, Code Quality Reviewer, and Final Code Reviewer" for the one-line change, exactly as the skill mandates; opencode did the same. The agents that passed did so by NOT following the skill. A skill that only produces good outcomes when disobeyed is miscalibrated.
What does this PR change?
Adds a Proportionality rule: a plan that is entirely one trivial, fully-specified mechanical change is implemented directly, verified per superpowers:verification-before-completion, and committed — no review fanout. Trivial is a property of the diff (no logic, control-flow, or security-relevant change — "a constant bump" is qualified with "no security or behavioral consequences"), not of the plan's self-description; any doubt means the full pipeline; multi-task plans never skip reviews regardless of task size. The process flowchart gets the matching trivial-exit diamond (the failing agents follow the flowchart literally), the Red Flags "never skip reviews" line points at the sole exception instead of contradicting it, and writing-plans' execution handoff notes that fanout scales.
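As a rough illustration only, the rule above can be read as a small decision function. This is a sketch, not the skill's actual text or implementation: the `Task` fields, the `choose_pipeline` name, and the `"direct"` / `"full"` labels are all hypothetical, introduced here to make the conditions explicit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    """Hypothetical summary of one plan task; field names are illustrative."""
    fully_specified: bool        # the plan states exactly what to change
    changes_logic: bool          # judged from the diff, not the plan's wording
    changes_control_flow: bool
    security_relevant: bool
    any_doubt: bool = False      # any uncertainty about the answers above


def choose_pipeline(tasks: List[Task]) -> str:
    """Return "direct" (implement, verify, commit) or "full" (review fanout)."""
    # Multi-task plans never skip reviews, regardless of task size.
    if len(tasks) != 1:
        return "full"
    task = tasks[0]
    # Trivial is a property of the diff itself.
    diff_is_trivial = not (
        task.changes_logic or task.changes_control_flow or task.security_relevant
    )
    if task.fully_specified and diff_is_trivial and not task.any_doubt:
        return "direct"
    # Any doubt, or any non-trivial property, means the full pipeline.
    return "full"


# A one-line console.log insertion: no logic, control-flow, or security change.
log_line = Task(fully_specified=True, changes_logic=False,
                changes_control_flow=False, security_relevant=False)
# A one-line behavioral change such as adding "|| user.isOwner".
owner_check = Task(fully_specified=True, changes_logic=True,
                   changes_control_flow=False, security_relevant=True)

print(choose_pipeline([log_line]))            # direct
print(choose_pipeline([owner_check]))         # full
print(choose_pipeline([log_line, log_line]))  # full
```

Under these assumptions the default is the full pipeline, and the direct path is reached only when every condition holds, which mirrors the "any doubt means the full pipeline" wording.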
Is this change appropriate for the core library?
Yes — tunes the core execution workflow's cost behavior for all users; non-trivial plans are unchanged (verified below).
What alternatives did you consider?
(1) Gate at entry (trivial plans never engage SDD) — viable but larger; the in-skill clause is the smaller diff. (2) Retire the eval — rejected: 4 full-context dispatches for one line is real measured waste. (3) Prose without the flowchart exit — rejected from evidence: the failing agents execute the flowchart, not the prose.
Does this PR contain multiple unrelated changes?
No — one rule, with the flowchart/Red-Flags/handoff touches required to keep the skill self-consistent about it.
Existing PRs
Environment tested
New harness support
N/A.
Evaluation
Scripted naive user ("Please execute the plan in docs/superpowers/plans/."; answers "Use your judgment" on subagent questions). Runs were measured on the assembled three-branch text (all three sibling PRs applied to dev); the Proportionality rule is the only change in the set that touches the fanout path. After the change: cost-trivial-task-review-fanout/opencode 3/3 pass, with zero subagents dispatched (deterministic tool-count Agent lte 2), change landed on the main checkout, ~$0.02–0.08 coding cost per Gauntlet token estimates vs implied $0.50–2 for the 4-dispatch baseline; ×antigravity pass, the only deterministic pre-fix failer (0/3). Containment canary: sdd-rejects-extra-features/claude 3/3 pass, meaning a real multi-task plan still runs implementer + two-stage review per task + final reviewer (spec reviewer as YAGNI gate after each task, 8/8 deterministic checks). Honest baseline note: opencode's pre-fix pass rate on this scenario was ~50%, so its cell leans on n=3 plus the antigravity flip for significance.
Rigor
superpowers:writing-skills and completed adversarial pressure testing
Red-team findings incorporated: the Red Flags line was rewritten after a reviewer showed its first form licensed per-task skipping inside multi-task plans; "a one-line edit" was dropped from the examples after a reviewer showed it blessed one-line behavioral changes (|| user.isOwner); a staff-panel-caught internal contradiction (spec access wording) was fixed; the flowchart diamond carries "fully-specified" and "any doubt = no" to match the prose exactly.
Human review