SDD: cost-tiering stack + plan-mandated-defect escalation + pre-flight plan review (stacked on #1717) #1744
Draft
obra wants to merge 6 commits into
…scription hypothesis)
…recipe + terse reviewer contract
Reviewer tripwire (Calibration): a plan-mandated defect IS a finding, reported as Important and labeled plan-mandated — the plan's authorship does not grade its own work. Controller rule (review loop): a plan-mandated finding, or any finding conflicting with the plan's text, escalates to the human like any plan contradiction — never dismissed because the plan mandates it. E35 micro (frozen 0a98 replay, sonnet reviewer, 6v6): without the tripwire 0/6 reports give the controller anything to escalate on (all Approved, defect endorsed as spec-required); with it 6/6 report the defect as a labeled finding.
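The two rules in the commit message above can be pictured as a small routing decision. This is a minimal sketch only: the skill expresses these rules as prose instructions, not code, and the names `Finding`, `Action`, and `route_finding` are hypothetical, introduced here for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    FIX = "fix"            # ordinary finding: goes back through the review loop
    ESCALATE = "escalate"  # ask the human which governs, the finding or the plan


@dataclass
class Finding:
    summary: str
    severity: str                      # e.g. "Important"
    plan_mandated: bool = False        # reviewer labeled the defect plan-mandated
    conflicts_with_plan: bool = False  # finding contradicts the plan's text


def route_finding(finding: Finding) -> Action:
    """Controller rule: never dismiss a finding because the plan mandates it."""
    if finding.plan_mandated or finding.conflicts_with_plan:
        return Action.ESCALATE
    return Action.FIX
```

Under this sketch, a reviewer following the tripwire reports the defect as `Finding(..., severity="Important", plan_mandated=True)`, which gives the controller something to escalate on instead of an approval.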
… gap (E35/E36); bump evals to 9919b27
…uestion before Task 1
@arittr, sequencing: please don't stop the eval run you have going on #1717 / sdd-e27-stack. Once that one is put to bed, this branch is the next thing to eval: it is sdd-e27-stack plus three commits (the two plan-mandated-defect rules, the pre-flight plan review) and the evals/spec bumps.
Who is submitting this PR? (required)
7c4a7741-5e94-44b9-8c0f-3800d1241f89
What problem are you trying to solve?
Three measured failure modes in subagent-driven development, each from real eval sessions:
e8e9 showed a controller noticing the defect at plan-read and deliberately deferring to the review loop, because nothing told it that raising plan conflicts up front was allowed.
What does this PR change?
Five additions to skills/subagent-driven-development/ on top of #1717:
(1) conditional implementer tiering: cheapest tier when the plan carries complete code;
(2) final whole-branch review pinned to the most capable model;
(3) a one-line narration recipe and a terse reviewer report contract;
(4) two plan-mandated-defect rules: a reviewer tripwire ("that IS a finding — Important, labeled plan-mandated; the human decides") and a controller escalation rule ("present the finding and the plan text, ask which governs");
(5) a Pre-Flight Plan Review section: scan once before Task 1, batch all plan conflicts into one question.
Plus the evals-submodule bump making the planted-defect scenario escalation-aware.
Is this change appropriate for the core library?
Yes — all changes are to the general-purpose SDD skill and benefit any project type. No third-party dependencies, no domain-specific content.
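For readers who want a concrete picture of items (1), (2), and (5) in the change list, here is a minimal sketch. It assumes a tier label of `"standard"` for the cases the PR description does not specify, and all function names are hypothetical; the skill itself states these rules as prose.

```python
from typing import Optional


def implementer_tier(plan_has_complete_code: bool) -> str:
    # Item (1): cheapest tier only when the plan already carries complete code.
    # "standard" is an assumed label for every other case.
    return "cheapest" if plan_has_complete_code else "standard"


def reviewer_tier(is_final_branch_review: bool) -> str:
    # Item (2): the final whole-branch review is pinned to the most capable model.
    # "standard" is again an assumed label for per-task reviews.
    return "most-capable" if is_final_branch_review else "standard"


def preflight_question(conflicts: list[str]) -> Optional[str]:
    # Item (5): scan once before Task 1 and batch all conflicts into one question.
    if not conflicts:
        return None  # nothing to ask; proceed to Task 1
    numbered = [f"{i}. {c}" for i, c in enumerate(conflicts, 1)]
    header = "Before Task 1: the plan has conflicts. Which governs in each case?"
    return "\n".join([header, *numbered])
```

The point of batching is that the human is interrupted at most once before work starts, rather than once per conflict as tasks reach review.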
What alternatives did you consider?
Does this PR contain multiple unrelated changes?
The commits are one coherent layer: the cost stack and the judgment rules were co-developed against the same eval battery, and the pre-flight rule exists because the L2b battery exposed the dispatch-transmission gap. Splitting them would ship a cost stack whose planted-defect gate we know is weaker without the judgment rules.
Existing PRs
Environment tested
New harness support
N/A — no new harness.
Evaluation
Rigor
superpowers:writing-skills methodology: micro-tests with no-guidance controls before full runs, pre-registered predictions, negative results logged at equal billing
Human review