JKHeadley · May 31, 2026
diff --git a/docs/specs/agent-hard-sleep-controller.eli16.md b/docs/specs/agent-hard-sleep-controller.eli16.md
@@ -0,0 +1,55 @@
+# ELI16 — Teaching an idle agent when it's safe to "go to sleep"
+
+## What this is, in plain English
+
+Every instar agent runs a full background server all the time — even when nobody
+has talked to it for hours. On a machine running ~9 of them, that idle cost is the
+biggest drain on the laptop. The end goal (Stage B of the agent-sleep design) is:
+when an agent has been completely idle for a while, it drops almost everything to
+near-zero and instantly wakes back up the moment a message arrives — like a laptop
+sleeping and waking.
+
+That's a risky thing to build, because if an agent sleeps at the wrong moment it
+could miss a message or get stuck. So this change builds the SAFE HALF first: the
+part that decides *"is it actually safe to sleep right now?"* — and nothing else.
+It watches, it decides, it writes down what it would have done — but it never
+actually stops anything yet.
+
+## How it decides
+
+It answers with one of four words:
+
+- **awake** — a work session is running, or someone was active in the last couple
+  of minutes.
+- **idle-shallow** — quiet, but not quiet long enough yet.
+- **keep-awake** — quiet long enough to consider sleeping, BUT a safety guard says
+  no.
+- **would-sleep** — quiet long enough AND every safety guard is clear.
+
+The safety guards are the important part. It will NOT say "would-sleep" if:
+
+- this machine is the one currently in charge of answering messages (in a
+  multi-machine setup, it must hand that off first), or
+- there's work in flight (a message being handled, a recovery running), or
+- a scheduled job is about to fire in the next couple of minutes.
+
+Each guard names itself in the reason, so when you ask "why is this agent still
+awake?" you get a plain answer like "holds the multi-machine serving lease."
+
+## Why this is safe to ship right now
+
+It ships **off by default**, and even when turned on it runs in **dry-run** — it
+only writes its decision to a log file (`agent-sleep-events.jsonl`) and serves it at
+a `/sleep` status check. It has no power to stop a server. The whole point of
+shipping it dark first is to watch real agents for a while and confirm: does a real
+idle agent actually reach "would-sleep," and was every "keep-awake" correct? Only
+once that's proven does the next slice wire the part that actually stops and wakes
+the server.
+
+## What you need to decide
+
+Nothing risky. This is the foundation slice of the Stage B you asked me to build
+now. It can't break anything because it never acts — it just makes the sleep
+decision visible and testable. If it's ever wrong, you'd see it in the log without
+any agent ever having slept. The next slice is the actual stop-and-wake mechanism,
+and it'll only get built on top of a decision layer we've watched behave correctly.
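The four verdicts and three guards described in this document can be sketched as one pure function. This is a minimal sketch, not the shipped `SleepController.ts` (its body is not part of this diff). The input fields mirror the sampler in `src/commands/server.ts`; the check order, the exact-threshold comparisons, most reason strings, and the handling of "nothing recorded yet" are assumptions made for illustration.

```typescript
// Hypothetical sketch of the decision; not the shipped SleepController.ts.
type SleepVerdict = 'awake' | 'idle-shallow' | 'keep-awake' | 'would-sleep';

interface SleepInput {
  now: number;
  runningSessions: number;
  lastInboundAt: number | null;
  lastActivityAt: number | null;
  leaseActive: boolean;
  holdsLease: boolean;
  inflightWork: boolean;
  nextScheduledJobAt: number | null;
}

interface SleepThresholds {
  idleGraceMs: number;
  deepIdleMs: number;
  wakeLeadMs: number;
}

interface SleepDecision {
  verdict: SleepVerdict;
  reason: string;
}

function evaluateSleep(input: SleepInput, t: SleepThresholds): SleepDecision {
  if (input.runningSessions > 0) {
    return { verdict: 'awake', reason: 'a session is running' };
  }
  // Idle time is measured from the most recent of inbound vs. activity.
  const stamps = [input.lastInboundAt, input.lastActivityAt].filter(
    (v): v is number => v !== null,
  );
  if (stamps.length === 0) {
    // Assumption: with nothing recorded, stay on the safe (awake) side.
    return { verdict: 'awake', reason: 'no activity recorded yet' };
  }
  const idleMs = input.now - Math.max(...stamps);
  // Assumption: a threshold counts as crossed once idleMs reaches it exactly.
  if (idleMs < t.idleGraceMs) {
    return { verdict: 'awake', reason: 'activity within idleGraceMs' };
  }
  if (idleMs < t.deepIdleMs) {
    return { verdict: 'idle-shallow', reason: 'idle, but not yet deep-idle' };
  }
  // Deep-idle: any single guard blocks sleep and names itself in the reason.
  if (input.leaseActive && input.holdsLease) {
    return { verdict: 'keep-awake', reason: 'holds the multi-machine serving lease' };
  }
  if (input.inflightWork) {
    return { verdict: 'keep-awake', reason: 'work is in flight' };
  }
  if (
    input.nextScheduledJobAt !== null &&
    input.nextScheduledJobAt - input.now <= t.wakeLeadMs
  ) {
    return { verdict: 'keep-awake', reason: 'a scheduled job fires within wakeLeadMs' };
  }
  return { verdict: 'would-sleep', reason: 'deep-idle and every guard clear' };
}
```

Under these assumptions and the default thresholds from this PR (2 minutes of grace, 15 minutes to deep-idle), an agent with no running session, no guard tripped, and 15 minutes of silence yields `would-sleep`; the same agent holding the lease yields `keep-awake` with the lease named in the reason.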
diff --git a/docs/specs/agent-hard-sleep-controller.md b/docs/specs/agent-hard-sleep-controller.md
@@ -0,0 +1,81 @@
+---
+title: Agent hard-sleep — SleepController decision foundation (Stage B, slice 1)
+slug: agent-hard-sleep-controller
+status: approved
+review-convergence: 2026-05-31T03:45:00+00:00
+approved: true
+author: echo
+approval-note: >
+  Self-approved by Echo under the delegated deploy mandate. Justin directed
+  (topic 16782, 2026-05-31) to build Stage B agent-sleep now, in-session, and not
+  defer it. This is the first slice: the sleep DECISION logic + every safety
+  guard, shipped dark + dry-run, so the "is it safe to sleep?" reasoning is proven
+  and observable BEFORE the next slice wires the mechanism that actually stops the
+  server. Umbrella design: docs/specs/agent-sleep-mode.md (PR #594).
+---
+
+# Agent hard-sleep — SleepController decision foundation
+
+## Problem
+
+Stage B of the agent-sleep design (the deepest lever of the Responsible Resource
+Usage standard) lets a deeply-idle agent drop its server to near-zero footprint and
+wake on the next message. The risky part is the MECHANISM: the supervisor stopping
+the server and the lifeline respawning it without losing a message. Before any of
+that is wired, the DECISION — "is it actually safe for this agent to sleep right
+now?" — must be correct and observable, because a wrong decision (sleeping while it
+holds the multi-machine lease, or while a job is about to fire, or while work is in
+flight) is how hard-sleep would brick an agent.
+
+## What's new
+
+`src/monitoring/SleepController.ts` — a pure, exhaustively-testable decision module:
+
+- **`evaluateSleep(input, thresholds)`** returns one of four verdicts:
+  - `awake` — a session is running, or activity within `idleGraceMs`.
+  - `idle-shallow` — idle past grace but before `deepIdleMs`.
+  - `keep-awake` — deep-idle but a **safety guard** blocks sleep.
+  - `would-sleep` — deep-idle and every guard clear.
+- **Safety guards** (any one ⇒ `keep-awake`, named in the reason): this machine
+  holds the multi-machine serving lease; in-flight work (forward / recovery /
+  queued message); a scheduled job fires within `wakeLeadMs`.
+- **`SleepController`** ticks the decision on a cadence. It audits only on a
+  decision TRANSITION (low-noise, like the reaper audit) to
+  `logs/agent-sleep-events.jsonl`. In **dry-run (the default)** it never acts. In
+  live mode (`enabled && !dryRun`, the mechanism slice wires the consumer) it calls
+  `requestSleep` once per would-sleep episode.
+
+Config (`monitoring.agentSleep`, default OFF + dry-run, mirrors the reaper):
+`{ enabled: false, dryRun: true, tickIntervalSec, idleGraceMs, deepIdleMs, wakeLeadMs }`.
+Status route `GET /sleep` exposes the latest verdict + thresholds for inspection.
+
+## What is explicitly NOT in this slice
+
+The mechanism: the supervisor consuming a sleep-request to stop the server, the
+lifeline writing a wake-request + respawning + replaying the buffered message, and
+the watchdog treating a slept agent as healthy. Those are the next slice; this one
+ships the decision + guards dark so they can be validated against real agent
+behavior first (does a real agent ever reach `would-sleep`, and was every
+`keep-awake` correct?). <!-- tracked: topic-16782 -->
+
+## Safeguards
+
+- Default OFF + dry-run: the controller only observes; nothing stops a server.
+- Every guard defaults to the SAFE side: unknown lease/in-flight/job state is
+  sampled conservatively (treated as a reason to stay awake) so a sampling gap can
+  never produce a spurious would-sleep in live mode.
+- Signal-only in this slice — no blocking authority over any message.
+
+## Testing
+
+- Unit (`SleepController.test.ts`): both sides of every boundary (grace, deep-idle,
+  each guard), exact-threshold boundaries, most-recent-of-inbound-vs-activity, the
+  dry-run-never-acts contract, once-per-episode latching, and transition-only audit.
+- Integration: `GET /sleep` returns 200 with the current verdict when enabled;
+  503-stub semantics consistent with the other dark monitors when disabled.
+
+## Rollback
+
+Pure additive source + a default-off config block (auto-migrated, existence-checked).
+Revert the commit → the controller and route disappear; nothing else changes. No
+persistent state beyond the best-effort audit log.
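The spec's tick behavior (audit only on a verdict transition, never act in dry-run, call `requestSleep` once per would-sleep episode) can be sketched as a small state machine. This is a sketch under stated assumptions, not the shipped controller: `SleepTicker`, `TickDeps`, and the audit event shape are hypothetical names, and treating a disabled controller as a no-op is an assumption.

```typescript
type Verdict = 'awake' | 'idle-shallow' | 'keep-awake' | 'would-sleep';

interface AuditEvent {
  at: number;
  from: Verdict | null;
  to: Verdict;
  reason: string;
}

// Hypothetical dependency bundle; the real controller's deps are not shown in this diff.
interface TickDeps {
  decide: () => { verdict: Verdict; reason: string };
  audit: (event: AuditEvent) => void;
  requestSleep: () => void;
  now: () => number;
}

class SleepTicker {
  private deps: TickDeps;
  private enabled: boolean;
  private dryRun: boolean;
  private last: Verdict | null = null;
  private requestedThisEpisode = false;

  constructor(deps: TickDeps, opts: { enabled: boolean; dryRun: boolean }) {
    this.deps = deps;
    this.enabled = opts.enabled;
    this.dryRun = opts.dryRun;
  }

  tick(): void {
    // Assumption: a disabled controller does nothing at all.
    if (!this.enabled) return;
    const { verdict, reason } = this.deps.decide();

    // Transition-only audit: repeated identical verdicts write nothing.
    if (verdict !== this.last) {
      this.deps.audit({ at: this.deps.now(), from: this.last, to: verdict, reason });
      this.last = verdict;
    }

    // Leaving would-sleep ends the episode and re-arms the latch.
    if (verdict !== 'would-sleep') {
      this.requestedThisEpisode = false;
      return;
    }

    // Live mode is enabled && !dryRun; dry-run never acts.
    const live = this.enabled && !this.dryRun;
    if (live && !this.requestedThisEpisode) {
      this.requestedThisEpisode = true;
      this.deps.requestSleep();
    }
  }
}
```

Keeping the latch separate from the audit comparison means a long would-sleep episode produces one audit line and at most one sleep request, while a brief return to `awake` starts a fresh episode.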
diff --git a/src/commands/server.ts b/src/commands/server.ts
@@ -8897,6 +8897,58 @@ export async function startServer(options: StartOptions): Promise<void> {
       ));
     }
 
+    // ── Agent hard-sleep — SleepController (RESPONSIBLE-RESOURCE-USAGE, Stage B) ──
+    // Decides "is it safe for this idle agent to drop to near-zero footprint?" with
+    // every safety guard. Ships OFF + dry-run: observes + audits to
+    // logs/agent-sleep-events.jsonl, never stops a server. The mechanism
+    // (supervisor stop + lifeline respawn) is a later slice. GET /sleep exposes the
+    // live verdict. The shared idle signal (AgentActivityState) is bumped at the
+    // inbound-message chokepoint (/internal/telegram-forward).
+    const { AgentActivityState } = await import('../monitoring/AgentActivityState.js');
+    const agentActivityState = new AgentActivityState();
+    const { SleepController, sleepAuditSink } = await import('../monitoring/SleepController.js');
+    const _sleepCfg = config.monitoring?.agentSleep;
+    const sleepController = new SleepController(
+      {
+        sample: () => {
+          const act = agentActivityState.snapshot();
+          return {
+            now: Date.now(),
+            runningSessions: sessionManager.listRunningSessions().length,
+            lastInboundAt: act.lastInboundAt,
+            lastActivityAt: act.lastActivityAt,
+            // Lease guard: only relevant when multi-machine coordination is active.
+            leaseActive: coordinator.enabled,
+            holdsLease: coordinator.enabled ? coordinator.holdsLease() : false,
+            // In-flight: an inbound message currently being handled. (The relay/forward
+            // in-flight + scheduler-wake signals are wired with the stop mechanism in
+            // the next slice — this slice is dry-run, so it never acts on them.)
+            inflightWork: (currentInboundByTopic?.size ?? 0) > 0,
+            nextScheduledJobAt: null,
+          };
+        },
+        audit: sleepAuditSink(config.stateDir),
+      },
+      {
+        enabled: _sleepCfg?.enabled ?? false,
+        dryRun: _sleepCfg?.dryRun ?? true,
+        tickIntervalMs: (_sleepCfg?.tickIntervalSec ?? 60) * 1000,
+        thresholds: {
+          idleGraceMs: _sleepCfg?.idleGraceMs ?? 120_000,
+          deepIdleMs: _sleepCfg?.deepIdleMs ?? 900_000,
+          wakeLeadMs: _sleepCfg?.wakeLeadMs ?? 120_000,
+        },
+      },
+    );
+    sleepController.start();
+    if (_sleepCfg?.enabled) {
+      console.log(pc.green(
+        _sleepCfg.dryRun === false
+          ? '  SleepController enabled (agent hard-sleep — LIVE decision)'
+          : '  SleepController enabled (agent hard-sleep — dry-run, observe only)',
+      ));
+    }
+
     // ── Unkillability backstop (UNIFIED-SESSION-LIFECYCLE §P5) ───────────────
     // Signal-only: raises ONE deduped Attention item (never auto-kills) when a
     // session is KEPT forever despite faking work, or is stuck indeterminate.
@@ -9485,7 +9537,7 @@ export async function startServer(options: StartOptions): Promise<void> {
       console.log(pc.dim(`  [session-pool] rollout gate not wired: ${err instanceof Error ? err.message : String(err)}`));
     }
 
-    const server = new AgentServer({ config, sessionManager, state, scheduler, telegram, relationships, feedback, feedbackAnomalyDetector, dispatches, updateChecker, autoUpdater, autoDispatcher, quotaTracker, quotaManager, publisher, viewer, tunnel, evolution, watchdog, topicMemory, triageNurse, projectMapper, coherenceGate: scopeVerifier, contextHierarchy, canonicalState, operationGate, sentinel, adaptiveTrust, memoryMonitor, orphanReaper, coherenceMonitor, commitmentTracker, semanticMemory, activitySentinel, rateLimitSentinel, releaseReadinessSentinel: releaseReadinessSentinel ?? undefined, messageRouter, summarySentinel, spawnManager, systemReviewer, capabilityMapper, selfKnowledgeTree, coverageAuditor, topicResumeMap: _topicResumeMap ?? undefined, sessionRefresh: _sessionRefresh ?? undefined, autonomyManager, trustElevationTracker, autonomousEvolution, coordinator: coordinator.enabled ? coordinator : undefined, localSigningKeyPem, leaseTransport, liveTailReceiver, handoffWireTransport, onHandoffBegin, onHandoffInitiate: handoffInitiate, handoffInProgress: handoffSentinelInProgress, messageLedger, currentInboundByTopic, replyMarkerTransport, onReplyMarker: messageLedger ? (marker: unknown) => { const m = marker as { dedupeKey: string; platform: string; replyIdempotencyKey: string; epoch: number; topic?: string | null }; messageLedger!.applyRemoteReplyMarker(m.dedupeKey, { platform: m.platform, replyIdempotencyKey: m.replyIdempotencyKey, epoch: m.epoch, topic: m.topic ?? null }); } : undefined, whatsapp: whatsappAdapter, slack: slackAdapter, imessage: imessageAdapter, whatsappBusinessBackend, messageBridge, hookEventReceiver, worktreeMonitor, subagentTracker, instructionsVerifier, handshakeManager: threadlineHandshake, threadlineRouter, conversationStore, warrantsReplyGate, collaborationSurfacer, threadResumeMap, topicLinkageHandler: topicLinkageHandler ?? undefined, threadlineRelayClient, threadlineReplyWaiters, listenerManager: listenerManager ?? undefined, responseReviewGate, messagingToneGate, outboundDedupGate, telemetryHeartbeat, pasteManager, featureRegistry, discoveryEvaluator, completionEvaluator, unifiedTrust, liveConfig, sharedStateLedger, ledgerSessionRegistry, worktreeManager, oidcEnrolledRepos: parallelDevConfig?.oidcEnrolledRepos, initiativeTracker, projectRoundRunner, projectDriftChecker, machineHeartbeat, machinePoolRegistry, meshRpcDispatcher, sessionOwnershipRegistry, sessionPoolE2EResultStore, proxyCoordinator, topicIntentStore, topicIntentArcCheck, usherSignalStore, intelligence: sharedIntelligence ?? undefined, telegramBridgeConfig, telegramBridge: telegramBridge ?? undefined, threadlineObservability, briefDeps, workingMemory, taskFlowRegistry, threadlineFlowBridge, sessionReaper, agentWorktreeReaper, reapLog, sleepWakeDetector, unjustifiedStopGate, stopGateDb, stopNotifier });
+    const server = new AgentServer({ config, sessionManager, state, scheduler, telegram, relationships, feedback, feedbackAnomalyDetector, dispatches, updateChecker, autoUpdater, autoDispatcher, quotaTracker, quotaManager, publisher, viewer, tunnel, evolution, watchdog, topicMemory, triageNurse, projectMapper, coherenceGate: scopeVerifier, contextHierarchy, canonicalState, operationGate, sentinel, adaptiveTrust, memoryMonitor, orphanReaper, coherenceMonitor, commitmentTracker, semanticMemory, activitySentinel, rateLimitSentinel, releaseReadinessSentinel: releaseReadinessSentinel ?? undefined, messageRouter, summarySentinel, spawnManager, systemReviewer, capabilityMapper, selfKnowledgeTree, coverageAuditor, topicResumeMap: _topicResumeMap ?? undefined, sessionRefresh: _sessionRefresh ?? undefined, autonomyManager, trustElevationTracker, autonomousEvolution, coordinator: coordinator.enabled ? coordinator : undefined, localSigningKeyPem, leaseTransport, liveTailReceiver, handoffWireTransport, onHandoffBegin, onHandoffInitiate: handoffInitiate, handoffInProgress: handoffSentinelInProgress, messageLedger, currentInboundByTopic, replyMarkerTransport, onReplyMarker: messageLedger ? (marker: unknown) => { const m = marker as { dedupeKey: string; platform: string; replyIdempotencyKey: string; epoch: number; topic?: string | null }; messageLedger!.applyRemoteReplyMarker(m.dedupeKey, { platform: m.platform, replyIdempotencyKey: m.replyIdempotencyKey, epoch: m.epoch, topic: m.topic ?? null }); } : undefined, whatsapp: whatsappAdapter, slack: slackAdapter, imessage: imessageAdapter, whatsappBusinessBackend, messageBridge, hookEventReceiver, worktreeMonitor, subagentTracker, instructionsVerifier, handshakeManager: threadlineHandshake, threadlineRouter, conversationStore, warrantsReplyGate, collaborationSurfacer, threadResumeMap, topicLinkageHandler: topicLinkageHandler ?? undefined, threadlineRelayClient, threadlineReplyWaiters, listenerManager: listenerManager ?? undefined, responseReviewGate, messagingToneGate, outboundDedupGate, telemetryHeartbeat, pasteManager, featureRegistry, discoveryEvaluator, completionEvaluator, unifiedTrust, liveConfig, sharedStateLedger, ledgerSessionRegistry, worktreeManager, oidcEnrolledRepos: parallelDevConfig?.oidcEnrolledRepos, initiativeTracker, projectRoundRunner, projectDriftChecker, machineHeartbeat, machinePoolRegistry, meshRpcDispatcher, sessionOwnershipRegistry, sessionPoolE2EResultStore, proxyCoordinator, topicIntentStore, topicIntentArcCheck, usherSignalStore, intelligence: sharedIntelligence ?? undefined, telegramBridgeConfig, telegramBridge: telegramBridge ?? undefined, threadlineObservability, briefDeps, workingMemory, taskFlowRegistry, threadlineFlowBridge, sessionReaper, agentWorktreeReaper, sleepController, agentActivityState, reapLog, sleepWakeDetector, unjustifiedStopGate, stopGateDb, stopNotifier });
     // Boot-recovery (tunnel-failure-resilience spec Part 6): if the agent
     // died mid-relay-episode, the persisted tunnel.json carries
     // rotationPending=true. Rotate the dashboard PIN + authToken BEFORE
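The spec's Safeguards section says unknown lease, in-flight, or job state is sampled conservatively, so a sampling gap cannot produce a spurious would-sleep. One way a sampler like the `sample` callback added earlier in this file could get that property is to wrap each probe so that a throw or an `undefined` result falls back to the stay-awake value. The names `conservative`, `sampleGuards`, and the probe shapes below are hypothetical and are not part of this diff.

```typescript
// Hypothetical helper: run a probe and fall back to the stay-awake value
// if it throws or cannot produce an answer.
function conservative<T>(probe: () => T | undefined, safeFallback: T): T {
  try {
    const value = probe();
    return value === undefined ? safeFallback : value;
  } catch {
    return safeFallback;
  }
}

interface GuardSample {
  holdsLease: boolean;
  inflightWork: boolean;
}

// Hypothetical probe bundle standing in for the coordinator and the
// in-flight message map used by the real sampler.
interface GuardProbes {
  holdsLease: () => boolean | undefined;
  inflightCount: () => number | undefined;
}

function sampleGuards(probes: GuardProbes): GuardSample {
  return {
    // Unknown lease state is treated as "holds the lease".
    holdsLease: conservative(probes.holdsLease, true),
    // Unknown in-flight count is treated as "work is in flight".
    inflightWork: conservative(() => {
      const n = probes.inflightCount();
      return n === undefined ? undefined : n > 0;
    }, true),
  };
}
```

With this shape, a failing probe can only ever push the verdict toward `keep-awake`, which is the direction the spec asks for.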

diff --git a/src/config/ConfigDefaults.ts b/src/config/ConfigDefaults.ts
@@ -102,6 +102,20 @@ const SHARED_DEFAULTS: Record<string, unknown> = {
       reapIntervalMs: 86_400_000,
       maxReapsPerPass: 20,
     },
+    // Agent hard-sleep — SleepController decision foundation (Stage B, slice 1;
+    // docs/specs/agent-hard-sleep-controller.md). Decides "is it safe for this
+    // idle agent to drop its server to near-zero footprint?" with every safety
+    // guard (held lease / in-flight work / imminent scheduled job). Ships OFF +
+    // dry-run: observes + audits to logs/agent-sleep-events.jsonl, never stops a
+    // server. The mechanism (supervisor stop + lifeline respawn) is a later slice.
+    agentSleep: {
+      enabled: false,
+      dryRun: true,
+      tickIntervalSec: 60,
+      idleGraceMs: 120_000,
+      deepIdleMs: 900_000,
+      wakeLeadMs: 120_000,
+    },
     // Unkillability backstop (UNIFIED-SESSION-LIFECYCLE §P5). Default ON, signal-
     // only: raises ONE deduped Attention item (never auto-kills) when a session is
     // KEPT forever despite faking work, or is stuck indeterminate. The escalation
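How the `agentSleep` defaults combine with a partial user config can be sketched as below. The fallback values and the seconds-to-milliseconds conversion mirror the wiring in `src/commands/server.ts`, and `live` follows the spec's `enabled && !dryRun`; the function name `resolveAgentSleep` and the returned shape are hypothetical.

```typescript
interface AgentSleepConfig {
  enabled?: boolean;
  dryRun?: boolean;
  tickIntervalSec?: number;
  idleGraceMs?: number;
  deepIdleMs?: number;
  wakeLeadMs?: number;
}

// Hypothetical helper: fill in the defaults from this PR and derive "live".
function resolveAgentSleep(cfg: AgentSleepConfig | undefined) {
  const enabled = cfg?.enabled ?? false; // OFF unless explicitly enabled
  const dryRun = cfg?.dryRun ?? true;    // observe-only unless explicitly disabled
  return {
    enabled,
    dryRun,
    live: enabled && !dryRun,
    tickIntervalMs: (cfg?.tickIntervalSec ?? 60) * 1000,
    thresholds: {
      idleGraceMs: cfg?.idleGraceMs ?? 120_000, // 2 minutes
      deepIdleMs: cfg?.deepIdleMs ?? 900_000,   // 15 minutes
      wakeLeadMs: cfg?.wakeLeadMs ?? 120_000,   // 2 minutes
    },
  };
}
```

Because both flags default to the safe side, setting only `enabled: true` still leaves the controller in dry-run; reaching live mode takes two explicit settings.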

diff --git a/src/core/types.ts b/src/core/types.ts
@@ -3092,6 +3092,21 @@ export interface MonitoringConfig {
     reapIntervalMs?: number;
     maxReapsPerPass?: number;
   };
+  /**
+   * Agent hard-sleep — SleepController decision foundation (RESPONSIBLE-RESOURCE-
+   * USAGE, Stage B; docs/specs/agent-hard-sleep-controller.md). Decides whether a
+   * deeply-idle agent may drop its server to near-zero footprint, with safety
+   * guards (held lease / in-flight / imminent job). Ships OFF + dry-run: observes
+   * + audits, never stops a server. GET /sleep exposes the live verdict.
+   */
+  agentSleep?: {
+    enabled?: boolean;
+    dryRun?: boolean;
+    tickIntervalSec?: number;
+    idleGraceMs?: number;
+    deepIdleMs?: number;
+    wakeLeadMs?: number;
+  };
   /**
    * Unkillability backstop (UNIFIED-SESSION-LIFECYCLE §P5). Watches for sessions
    * the conservative KEEP-rules would protect forever — one that FAKES work, or

diff --git a/src/monitoring/AgentActivityState.ts b/src/monitoring/AgentActivityState.ts
@@ -0,0 +1,35 @@
+/**
+ * AgentActivityState — the single shared "when was this agent last active?" signal
+ * (agent-sleep design, docs/specs/agent-sleep-mode.md → "Define a single shared
+ * idle signal"). The SleepController samples it to decide deep-idle; the server
+ * bumps it at the inbound-message chokepoint and on session spawn.
+ *
+ * Deliberately tiny + in-memory: "activity" for sleep purposes is a real inbound
+ * message or a session starting — NOT internal health-check traffic (which must
+ * never keep an otherwise-idle agent awake). So the server bumps this only at
+ * genuine activity points, not on every HTTP request.
+ */
+export interface ActivitySnapshot {
+  lastInboundAt: number | null;
+  lastActivityAt: number | null;
+}
+
+export class AgentActivityState {
+  private lastInboundAt: number | null = null;
+  private lastActivityAt: number | null = null;
+
+  /** A genuine inbound user/agent message arrived. */
+  markInbound(now: number): void {
+    this.lastInboundAt = now;
+    this.lastActivityAt = now;
+  }
+
+  /** Non-message activity that should still defer sleep (e.g. a session spawn). */
+  markActivity(now: number): void {
+    this.lastActivityAt = now;
+  }
+
+  snapshot(): ActivitySnapshot {
+    return { lastInboundAt: this.lastInboundAt, lastActivityAt: this.lastActivityAt };
+  }
+}
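A short usage sketch for `AgentActivityState`, showing how a consumer might turn a snapshot into an idle duration. The class is repeated in compact form so the example runs on its own; `idleMsSince` is a hypothetical helper and is not part of this diff.

```typescript
interface ActivitySnapshot {
  lastInboundAt: number | null;
  lastActivityAt: number | null;
}

// Compact copy of the class from this diff, so the example is self-contained.
class AgentActivityState {
  private lastInboundAt: number | null = null;
  private lastActivityAt: number | null = null;

  markInbound(now: number): void {
    this.lastInboundAt = now;
    this.lastActivityAt = now;
  }

  markActivity(now: number): void {
    this.lastActivityAt = now;
  }

  snapshot(): ActivitySnapshot {
    return { lastInboundAt: this.lastInboundAt, lastActivityAt: this.lastActivityAt };
  }
}

// Hypothetical helper: milliseconds since the most recent signal,
// or null when nothing has been recorded yet.
function idleMsSince(snap: ActivitySnapshot, now: number): number | null {
  const stamps = [snap.lastInboundAt, snap.lastActivityAt].filter(
    (v): v is number => v !== null,
  );
  return stamps.length === 0 ? null : now - Math.max(...stamps);
}

const activity = new AgentActivityState();
activity.markInbound(1_000);  // a real message arrives
activity.markActivity(5_000); // later, a session spawns
const idle = idleMsSince(activity.snapshot(), 8_000); // measured from 5_000
```

Since `markInbound` bumps both fields, `lastActivityAt` is never older than `lastInboundAt` as long as callers pass non-decreasing timestamps, so taking the maximum is a defensive choice rather than a requirement.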