docs(spec): arm the full response-review stack end-to-end [DRAFT, approved:false]#389

Draft
JKHeadley wants to merge 3 commits into main from echo/response-review-stack-spec

@JKHeadley

Owner

Review surface only — approved: false, docs-only, no src/ changes. Not for merge until Justin ratifies.

Draft spec for the dark-guard problem on the response-review safety stack, co-designed with instar-codey (Threadline thread 33fbbe35-065b-4024-88bf-acb4779480e6).

What it covers

The response-review Stop-hook gate (catches unjustified self-termination, unsupported claims, tone violations) can be silently dark even when every individual layer reports healthy. Four independently-darkenable layers, all verified on JKHeadley/main @ v1.2.80:

  • L1 Claude: response-review.js written to disk + listed as managed, but never added to settings.json Stop[] (PostUpdateMigrator.ts:1766/1885; the only Stop[].unshift is the autonomous hook at :2258).
  • L1 Codex: slot carries enabled=false on a matching trusted_hash; arm rule F3 never re-enables (codexHookArm.ts:16) → dark forever.
  • L2 Config: responseReview.enabled falsy → hook exit(0) before calling server (response-review.js:33-38).
  • L3 Runtime gate: CoherenceGate only constructed if responseReview.enabled && sharedIntelligence; else /review/evaluate → 501 (server.ts:7311, routes.ts:13131; truth source CapabilityIndex.ts:553).
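
To make the failure mode concrete, here is a minimal sketch of how the four layers could be checked together so that one dark layer fails the whole stack and is reported by name. The type and field names (`StackSnapshot`, `hookInStopArray`, `evaluateLiveness`, and so on) are hypothetical and chosen for illustration; they are not taken from the instar source.

```typescript
// Hypothetical model of the four layers; names are illustrative, not instar's real types.
type Layer = "L1 Claude" | "L1 Codex" | "L2 Config" | "L3 Runtime gate";

interface StackSnapshot {
  harness: "claude" | "codex";
  hookInStopArray: boolean;    // Claude: response-review.js listed in settings.json Stop[]
  codexSlotEnabled: boolean;   // Codex: the slot's enabled flag
  reviewEnabled: boolean;      // responseReview.enabled
  sharedIntelligence: boolean; // second precondition for constructing CoherenceGate
}

interface Liveness {
  live: boolean;
  dark: Layer[];
}

// Fail closed: the stack counts as live only when no layer is dark.
function evaluateLiveness(s: StackSnapshot): Liveness {
  const dark: Layer[] = [];
  if (s.harness === "claude" && !s.hookInStopArray) dark.push("L1 Claude");
  if (s.harness === "codex" && !s.codexSlotEnabled) dark.push("L1 Codex");
  if (!s.reviewEnabled) dark.push("L2 Config");
  // The gate is only constructed when both preconditions hold.
  if (!(s.reviewEnabled && s.sharedIntelligence)) dark.push("L3 Runtime gate");
  return { live: dark.length === 0, dark };
}

// A Claude install whose hook file exists on disk but is missing from Stop[]:
const exampleReport = evaluateLiveness({
  harness: "claude",
  hookInStopArray: false,
  codexSlotEnabled: false, // not consulted for a Claude harness
  reviewEnabled: true,
  sharedIntelligence: true,
});
console.log(exampleReport.live, exampleReport.dark);
```

In this example the config and runtime gate would each report healthy, yet `exampleReport.live` is false and `exampleReport.dark` names only the Claude hook layer.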

Key requirements

  • Layered liveness model that fails closed if any layer is dark and names the dark layer (no single-green pass).
  • Atomic trusted_hash re-stamp on every managed-hook body rewrite (primary defense if codey's repro confirms Mode B drift-quarantine).
  • response-review/UnjustifiedStopGate org-policy-pinned: boot-time enabled=false → reassert + audit trail; non-pinned hooks surface as drift instead.
  • Migration parity for existing agents; test matrix with two independent assertions per slot + an end-to-end chain test.
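
A rough sketch of the second and third requirements, under stated assumptions. It reads "atomic" as "body and hash always change in the same step", and it takes the hash function as a parameter because this summary does not say which digest the trust store uses. `HookSlot`, `AuditEntry`, `rewriteBody` and `reconcileAtBoot` are hypothetical names, not instar APIs.

```typescript
// Hypothetical slot record; field names are illustrative only.
interface HookSlot {
  name: string;
  body: string;
  trustedHash: string;
  enabled: boolean;
}

interface AuditEntry {
  slot: string;
  action: "reasserted" | "drift-reported";
}

// Re-stamp the hash in the same step as the body rewrite, so a new body
// is never observable next to a stale trusted hash.
function rewriteBody(
  slot: HookSlot,
  newBody: string,
  hash: (body: string) => string, // assumption: supplied by the trust store
): HookSlot {
  return { ...slot, body: newBody, trustedHash: hash(newBody) };
}

// Boot-time reconciliation: pinned slots are re-enabled and audited,
// non-pinned slots are left disabled and surfaced as drift.
function reconcileAtBoot(
  slot: HookSlot,
  pinned: ReadonlySet<string>,
  audit: AuditEntry[],
): HookSlot {
  if (slot.enabled) return slot;
  if (pinned.has(slot.name)) {
    audit.push({ slot: slot.name, action: "reasserted" });
    return { ...slot, enabled: true };
  }
  audit.push({ slot: slot.name, action: "drift-reported" });
  return slot;
}
```

Both functions return new records rather than mutating their input, which keeps the before and after states available for the audit trail.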

Open before ratification

  1. codey's Mode A vs Mode B clean-room repro (config.toml diffs across a hash-drifting body edit).
  2. Full [hooks.state] block from codey.
  3. Justin's approved: true.
  4. External cross-model round (/ultrareview).

Files: docs/specs/arm-the-full-response-review-stack.md + .eli16.md companion.

🤖 Generated with Claude Code

…roved:false)

Draft spec for the dark-guard problem on the response-review safety stack,
co-designed with instar-codey (thread 33fbbe35). Four independently-darkenable
layers (Claude absent-from-Stop, Codex trust-disabled, config-off, gate
unconstructed); layered liveness model that fails closed if any layer is dark;
org-policy-pinned safety slots with boot-time reassertion + audit; test matrix
asserting each layer AND the end-to-end chain. Awaiting Justin ratification +
codey's Mode A/B clean-room repro.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented May 25, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| instar | Ready | Preview, Comment | May 25, 2026 10:28pm |

… final render bytes + four-part artifact identity; add repro evidence package

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the arm-the-full-response-review-stack draft with the design
points from the latest thread (33fbbe35) not yet captured:

- §1 dual-install corroboration: server floor (L2/L3) dark on both echo
  (Claude) and codey (Codex) installs — fleet-wide, not Codex-specific.
- §1.3 four-state taxonomy + explicit dependency order (server floor
  first, harness second); host-arming necessary but insufficient.
- §1.2 provenance non-claim: we do not assert the installer wrote
  enabled=false; live config remediated 15:14:24 PDT, no pre-fix history.
- R3 "armed-but-dark" named first-class non-pass health state; distinguish
  server-config-dark from server-intelligence-dark.
- R6 interim behavior: fail-open WITH explicit dark-state signal, never
  silent success (signal-vs-authority).
- §4.5 reversible self-test as the named, spec-owned acceptance procedure
  (baseline -> floor -> arm -> trigger -> assert produced+consumed -> restore).
- ELI16 companion updated to match.
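
One way the R3 and R6 points above could fit together, as a sketch only. It assumes the four states are the product of server floor (live or dark) and harness (armed or not). Only "armed-but-dark", "server-config-dark" and "server-intelligence-dark" come from the thread; the state names "floor-only" and "dark", and every type and function name here, are hypothetical.

```typescript
type ServerFloor = "live" | "server-config-dark" | "server-intelligence-dark";
type HealthState = "live" | "armed-but-dark" | "floor-only" | "dark";

interface Probe {
  harnessArmed: boolean;       // host hook registered and enabled
  reviewEnabled: boolean;      // responseReview.enabled
  sharedIntelligence: boolean; // needed to construct the runtime gate
}

interface Health {
  state: HealthState;
  serverFloor: ServerFloor;
}

interface InterimOutcome {
  allowStop: true;    // fail open: the stop is not blocked
  darkSignal: string; // but the dark state is always reported
}

function probeServerFloor(p: Probe): ServerFloor {
  if (!p.reviewEnabled) return "server-config-dark";
  if (!p.sharedIntelligence) return "server-intelligence-dark";
  return "live";
}

// Dependency order: the server floor is evaluated first, the harness second.
function classify(p: Probe): Health {
  const floor = probeServerFloor(p);
  const state: HealthState =
    floor === "live"
      ? (p.harnessArmed ? "live" : "floor-only")
      : (p.harnessArmed ? "armed-but-dark" : "dark");
  return { state, serverFloor: floor };
}

// Interim behavior: returns null when live (the gate itself decides);
// otherwise fails open with an explicit signal, never a silent success.
function interimOutcome(h: Health): InterimOutcome | null {
  if (h.state === "live") return null;
  return {
    allowStop: true,
    darkSignal: `response-review ${h.state} (server floor: ${h.serverFloor})`,
  };
}
```

Keeping the server floor cause on every `Health` value lets a non-pass state say whether the config or the intelligence dependency is the dark part.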

Still DRAFT (approved:false) — awaiting Justin ratification per instar-dev gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>