docs(spec): arm the full response-review stack end-to-end [DRAFT, approved:false] #389
Draft
JKHeadley wants to merge 3 commits into
…roved:false)

Draft spec for the dark-guard problem on the response-review safety stack, co-designed with instar-codey (thread 33fbbe35):

- four independently-darkenable layers (Claude absent-from-Stop, Codex trust-disabled, config-off, gate unconstructed);
- a layered liveness model that fails closed if any layer is dark;
- org-policy-pinned safety slots with boot-time reassertion + audit;
- a test matrix asserting each layer AND the end-to-end chain.

Awaiting Justin's ratification + codey's Mode A/B clean-room repro.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
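The layered liveness model can be sketched as fail-closed aggregation over the four layers. This is a minimal TypeScript illustration under stated assumptions, not the spec's implementation: the layer IDs, `LayerProbe`, `StackHealth`, and `evaluateStack` are hypothetical names, and how each probe is actually gathered (reading `settings.json`, the Codex trust state, config, a `/review/evaluate` call) is deliberately left out. A layer counts as live only on a positive probe, so a missing or inconclusive probe produces a non-pass result that names the dark layer.

```typescript
// Hypothetical layer IDs, one per darkenable layer named in the spec.
type LayerId =
  | "claude-stop" // response-review.js registered in settings.json Stop[]
  | "codex-trust" // Codex hook enabled for its trusted_hash
  | "config" // responseReview.enabled is truthy
  | "gate"; // CoherenceGate constructed (no 501 from /review/evaluate)

// Result of probing one layer: true = live, false = dark,
// undefined = the probe could not decide.
interface LayerProbe {
  layer: LayerId;
  live: boolean | undefined;
}

interface StackHealth {
  pass: boolean;
  darkLayers: LayerId[];
}

// Fail-closed aggregation: every required layer must have a probe that
// positively reports live; anything else is recorded as dark.
function evaluateStack(probes: LayerProbe[], required: LayerId[]): StackHealth {
  const darkLayers: LayerId[] = [];
  for (const layer of required) {
    const probe = probes.find((p) => p.layer === layer);
    if (probe === undefined || probe.live !== true) {
      darkLayers.push(layer);
    }
  }
  return { pass: darkLayers.length === 0, darkLayers };
}

// Assumed: a Claude install depends on the Stop[] registration plus the
// server floor; a Codex install would swap "claude-stop" for "codex-trust".
const claudeInstall: LayerId[] = ["claude-stop", "config", "gate"];

// The gate was never probed here, so the stack must not pass.
const health = evaluateStack(
  [
    { layer: "claude-stop", live: true },
    { layer: "config", live: true },
  ],
  claudeInstall,
);
console.log(JSON.stringify(health)); // {"pass":false,"darkLayers":["gate"]}
```

Passing `required` per install keeps the same aggregation usable for both harnesses while the server-floor layers stay common to each.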
… final render bytes + four-part artifact identity; add repro evidence package

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the arm-the-full-response-review-stack draft with the design points from the latest thread (33fbbe35) that were not yet captured:

- §1 dual-install corroboration: the server floor (L2/L3) is dark on both the echo (Claude) and codey (Codex) installs, so the problem is fleet-wide, not Codex-specific.
- §1.3 four-state taxonomy + explicit dependency order (server floor first, harness second); host-arming is necessary but insufficient.
- §1.2 provenance non-claim: we do not assert that the installer wrote enabled=false; the live config was remediated at 15:14:24 PDT, and there is no pre-fix history.
- R3: "armed-but-dark" is a named, first-class non-pass health state; distinguish server-config-dark from server-intelligence-dark.
- R6 interim behavior: fail open WITH an explicit dark-state signal, never silent success (signal-vs-authority).
- §4.5 reversible self-test as the named, spec-owned acceptance procedure (baseline -> floor -> arm -> trigger -> assert produced+consumed -> restore).
- ELI16 companion updated to match.

Still DRAFT (approved:false); awaiting Justin's ratification per the instar-dev gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
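The §4.5 reversible self-test sequence can be sketched as follows. All names (`StackState`, `SelfTestHooks`, `runReversibleSelfTest`) are hypothetical, and the two booleans are assumed stand-ins for whatever the spec treats as the server floor and harness arming. The sketch encodes the stated dependency order (floor before harness), requires both `produced` and `consumed`, and restores the baseline in a `finally` so an aborted run never leaves the stack modified.

```typescript
// Hypothetical model of the two things the self-test toggles.
interface StackState {
  serverFloorEnabled: boolean; // server side: config on + gate constructed
  harnessArmed: boolean; // host side: hook registered/trusted in the harness
}

// Hypothetical adapter the self-test drives; a real one would touch config,
// settings.json / Codex hook state, and the review endpoint.
interface SelfTestHooks {
  readState(): StackState;
  writeState(s: StackState): void;
  // Fire a synthetic Stop and report whether a verdict was produced by the
  // server and consumed by the hook.
  trigger(): { produced: boolean; consumed: boolean };
}

interface SelfTestResult {
  pass: boolean;
  produced: boolean;
  consumed: boolean;
  restored: boolean;
}

function runReversibleSelfTest(h: SelfTestHooks): SelfTestResult {
  const baseline = h.readState(); // baseline
  let produced = false;
  let consumed = false;
  try {
    // Dependency order: server floor first, harness second.
    h.writeState({ ...h.readState(), serverFloorEnabled: true }); // floor
    h.writeState({ ...h.readState(), harnessArmed: true }); // arm
    const r = h.trigger(); // trigger
    produced = r.produced;
    consumed = r.consumed;
  } finally {
    h.writeState(baseline); // restore, even if a step threw
  }
  const after = h.readState();
  const restored =
    after.serverFloorEnabled === baseline.serverFloorEnabled &&
    after.harnessArmed === baseline.harnessArmed;
  // Assert produced + consumed, and that the restore actually took effect.
  return { pass: produced && consumed && restored, produced, consumed, restored };
}

// In-memory fake: a verdict is produced only with the floor on, and consumed
// only when the harness is armed as well.
let state: StackState = { serverFloorEnabled: false, harnessArmed: false };
const fake: SelfTestHooks = {
  readState: () => ({ ...state }),
  writeState: (s) => {
    state = { ...s };
  },
  trigger: () => ({
    produced: state.serverFloorEnabled,
    consumed: state.serverFloorEnabled && state.harnessArmed,
  }),
};

const result = runReversibleSelfTest(fake);
console.log(JSON.stringify(result)); // {"pass":true,"produced":true,"consumed":true,"restored":true}
console.log(JSON.stringify(state)); // {"serverFloorEnabled":false,"harnessArmed":false}
```

The fake mirrors the point that host-arming is necessary but insufficient: with the floor off, arming the harness alone yields neither a produced nor a consumed verdict.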
Review surface only (`approved: false`, docs-only, no `src/` changes). Not for merge until Justin ratifies.

Draft spec for the dark-guard problem on the response-review safety stack, co-designed with instar-codey (Threadline thread 33fbbe35-065b-4024-88bf-acb4779480e6).

What it covers
The response-review Stop-hook gate (which catches unjustified self-termination, unsupported claims, and tone violations) can be silently dark even when every individual layer reports healthy. There are four independently-darkenable layers, all verified on JKHeadley/main @ v1.2.80:

1. Claude absent-from-Stop: `response-review.js` is written to disk and listed as managed, but never added to `settings.json` `Stop[]` (`PostUpdateMigrator.ts:1766`/`1885`; the only `Stop[].unshift` is the autonomous hook at `:2258`).
2. Codex trust-disabled: `enabled=false` on a matching `trusted_hash`; arm rule F3 never re-enables it (`codexHookArm.ts:16`) → dark forever.
3. Config-off: `responseReview.enabled` is falsy → the hook calls `exit(0)` before calling the server (`response-review.js:33-38`).
4. Gate unconstructed: `CoherenceGate` is only constructed if `responseReview.enabled && sharedIntelligence`; otherwise `/review/evaluate` → 501 (`server.ts:7311`, `routes.ts:13131`; truth source `CapabilityIndex.ts:553`).

Key requirements
- `trusted_hash` re-stamp on every managed-hook body rewrite (primary defense if codey's repro confirms Mode B drift-quarantine).
- `response-review`/`UnjustifiedStopGate` org-policy-pinned: boot-time `enabled=false` → reassert + audit trail; non-pinned hooks surface as drift instead.

Open before ratification
- … `[hooks.state]` block from codey.
- … `approved: true`.
- … `/ultrareview`).

Files: `docs/specs/arm-the-full-response-review-stack.md` + `.eli16.md` companion.

🤖 Generated with Claude Code
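The two defenses listed under Key requirements (the `trusted_hash` re-stamp and org-policy pinning with boot-time reassertion) can be sketched together. This is illustrative only: `ManagedHook`, `rewriteManagedHook`, `reassertAtBoot`, and `some-other-hook` are hypothetical, and `toyHash` (32-bit FNV-1a) is a stand-in because the real `trusted_hash` algorithm is not stated here. Pinned slots that boot with `enabled=false` are re-enabled with an audit entry; non-pinned hooks are left as found and reported as drift.

```typescript
// Hypothetical shape of a managed hook entry; only `enabled` and
// `trusted_hash` come from the spec.
interface ManagedHook {
  name: string;
  body: string;
  enabled: boolean;
  trusted_hash: string;
}

interface AuditEntry {
  at: string;
  hook: string;
  action: "reasserted" | "drift-reported";
}

// Org-policy-pinned safety slots named in the spec.
const PINNED = new Set<string>(["response-review", "UnjustifiedStopGate"]);

// Stand-in hash (32-bit FNV-1a), NOT the real trusted_hash algorithm.
function toyHash(s: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// Every managed-body rewrite re-stamps trusted_hash from the new body, so
// the stored hash never lags behind the bytes that were written.
function rewriteManagedHook(
  hook: ManagedHook,
  newBody: string,
  hash: (s: string) => string,
): ManagedHook {
  return { ...hook, body: newBody, trusted_hash: hash(newBody) };
}

// Boot-time pass: a pinned slot found disabled is re-enabled and audited;
// a non-pinned disabled hook is left as-is and reported as drift.
function reassertAtBoot(
  hooks: ManagedHook[],
  now: Date,
): { hooks: ManagedHook[]; audit: AuditEntry[] } {
  const at = now.toISOString();
  const audit: AuditEntry[] = [];
  const out = hooks.map((h): ManagedHook => {
    if (h.enabled) return h;
    if (PINNED.has(h.name)) {
      audit.push({ at, hook: h.name, action: "reasserted" });
      return { ...h, enabled: true };
    }
    audit.push({ at, hook: h.name, action: "drift-reported" });
    return h;
  });
  return { hooks: out, audit };
}

const boot = reassertAtBoot(
  [
    { name: "response-review", body: "v1", enabled: false, trusted_hash: toyHash("v1") },
    { name: "some-other-hook", body: "x", enabled: false, trusted_hash: toyHash("x") },
  ],
  new Date(0),
);
console.log(boot.audit.map((a) => `${a.hook}:${a.action}`).join(", "));
// response-review:reasserted, some-other-hook:drift-reported

const rewritten = rewriteManagedHook(
  { name: "response-review", body: "v1", enabled: true, trusted_hash: toyHash("v1") },
  "v2",
  toyHash,
);
console.log(rewritten.trusted_hash === toyHash("v2")); // true
```

Injecting the hash function keeps the re-stamp rule independent of whichever digest the harness actually uses.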