55 changes: 55 additions & 0 deletions docs/specs/agent-hard-sleep-controller.eli16.md
@@ -0,0 +1,55 @@
# ELI16 — Teaching an idle agent when it's safe to "go to sleep"

## What this is, in plain English

Every instar agent runs a full background server all the time — even when nobody
has talked to it for hours. On a machine running ~9 of them, that idle cost is the
biggest drain on the laptop. The end goal (Stage B of the agent-sleep design) is:
when an agent has been completely idle for a while, it drops almost everything to
near-zero and instantly wakes back up the moment a message arrives — like a laptop
sleeping and waking.

That's a risky thing to build, because if an agent sleeps at the wrong moment it
could miss a message or get stuck. So this change builds the SAFE HALF first: the
part that decides *"is it actually safe to sleep right now?"* — and nothing else.
It watches, it decides, it writes down what it would have done — but it never
actually stops anything yet.

## How it decides

It answers with one of four words:

- **awake** — a work session is running, or someone was active in the last couple
of minutes.
- **idle-shallow** — quiet, but not quiet long enough yet.
- **keep-awake** — quiet long enough to consider sleeping, BUT a safety guard says
no.
- **would-sleep** — quiet long enough AND every safety guard is clear.

The safety guards are the important part. It will NOT say "would-sleep" if:

- this machine is the one currently in charge of answering messages (in a
multi-machine setup, it must hand that off first), or
- there's work in flight (a message being handled, a recovery running), or
- a scheduled job is about to fire in the next couple of minutes.

Each guard names itself in the reason, so when you ask "why is this agent still
awake?" you get a plain answer like "holds the multi-machine serving lease."

## Why this is safe to ship right now

It ships **off by default**, and even when turned on it runs in **dry-run** — it
only writes its decision to a log file (`agent-sleep-events.jsonl`) and serves it at
a `/sleep` status check. It has no power to stop a server. The whole point of
shipping it dark first is to watch real agents for a while and confirm: does a real
idle agent actually reach "would-sleep," and was every "keep-awake" correct? Only
once that's proven does the next slice wire the part that actually stops and wakes
the server.

## What you need to decide

Nothing risky. This is the foundation slice of the Stage B you asked me to build
now. It can't break anything because it never acts — it just makes the sleep
decision visible and testable. If it's ever wrong, you'd see it in the log without
any agent ever having slept. The next slice is the actual stop-and-wake mechanism,
and it'll only get built on top of a decision layer we've watched behave correctly.
81 changes: 81 additions & 0 deletions docs/specs/agent-hard-sleep-controller.md
@@ -0,0 +1,81 @@
---
title: Agent hard-sleep — SleepController decision foundation (Stage B, slice 1)
slug: agent-hard-sleep-controller
status: approved
review-convergence: 2026-05-31T03:45:00+00:00
approved: true
author: echo
approval-note: >
  Self-approved by Echo under the delegated deploy mandate. Justin directed
  (topic 16782, 2026-05-31) to build Stage B agent-sleep now, in-session, and not
  defer it. This is the first slice: the sleep DECISION logic + every safety
  guard, shipped dark + dry-run, so the "is it safe to sleep?" reasoning is proven
  and observable BEFORE the next slice wires the mechanism that actually stops the
  server. Umbrella design: docs/specs/agent-sleep-mode.md (PR #594).
---

# Agent hard-sleep — SleepController decision foundation

## Problem

Stage B of the agent-sleep design (the deepest lever of the Responsible Resource
Usage standard) lets a deeply-idle agent drop its server to near-zero footprint and
wake on the next message. The risky part is the MECHANISM: the supervisor stopping
the server and the lifeline respawning it without losing a message. Before any of
that is wired, the DECISION — "is it actually safe for this agent to sleep right
now?" — must be correct and observable, because a wrong decision (sleeping while it
holds the multi-machine lease, or while a job is about to fire, or while work is in
flight) is how hard-sleep would brick an agent.

## What's new

`src/monitoring/SleepController.ts` — a pure, exhaustively-testable decision module:

- **`evaluateSleep(input, thresholds)`** returns one of four verdicts:
  - `awake`: a session is running, or activity within `idleGraceMs`.
  - `idle-shallow`: idle past grace but before `deepIdleMs`.
  - `keep-awake`: deep-idle but a **safety guard** blocks sleep.
  - `would-sleep`: deep-idle and every guard clear.
- **Safety guards** (any one ⇒ `keep-awake`, named in the reason): this machine
  holds the multi-machine serving lease; in-flight work (forward / recovery /
  queued message); a scheduled job fires within `wakeLeadMs`.
- **`SleepController`** ticks the decision on a cadence. It audits only on a
  decision TRANSITION (low-noise, like the reaper audit) to
  `logs/agent-sleep-events.jsonl`. In **dry-run (the default)** it never acts. In
  live mode (`enabled && !dryRun`; the mechanism slice wires the consumer) it calls
  `requestSleep` once per would-sleep episode.
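
The verdict ladder above can be sketched as a pure function. This is a minimal illustration, not the contents of `SleepController.ts`: the input field names follow the `sample()` callback wired in `src/commands/server.ts`, but the name `evaluateSleepSketch`, the `SleepDecision` shape, the reason strings, the exact boundary comparisons, and the handling of `null` timestamps are all assumptions.

```typescript
type Verdict = 'awake' | 'idle-shallow' | 'keep-awake' | 'would-sleep';

interface SleepInput {
  now: number;
  runningSessions: number;
  lastInboundAt: number | null;
  lastActivityAt: number | null;
  leaseActive: boolean;
  holdsLease: boolean;
  inflightWork: boolean;
  nextScheduledJobAt: number | null;
}

interface SleepThresholds {
  idleGraceMs: number;
  deepIdleMs: number;
  wakeLeadMs: number;
}

// Hypothetical result shape: a verdict plus a human-readable reason.
interface SleepDecision {
  verdict: Verdict;
  reason: string;
}

function evaluateSleepSketch(input: SleepInput, t: SleepThresholds): SleepDecision {
  if (input.runningSessions > 0) {
    return { verdict: 'awake', reason: 'a session is running' };
  }

  // Idle time is measured from the most recent of inbound vs. other activity.
  const stamps = [input.lastInboundAt, input.lastActivityAt].filter(
    (s): s is number => s !== null,
  );
  if (stamps.length === 0) {
    // Assumption: with nothing recorded yet, stay on the safe side.
    return { verdict: 'awake', reason: 'no activity recorded yet' };
  }
  const idleForMs = input.now - Math.max(...stamps);

  if (idleForMs < t.idleGraceMs) {
    return { verdict: 'awake', reason: 'activity within idleGraceMs' };
  }
  if (idleForMs < t.deepIdleMs) {
    return { verdict: 'idle-shallow', reason: 'idle, but not yet past deepIdleMs' };
  }

  // Deep-idle: any single guard blocks sleep and names itself in the reason.
  if (input.leaseActive && input.holdsLease) {
    return { verdict: 'keep-awake', reason: 'holds the multi-machine serving lease' };
  }
  if (input.inflightWork) {
    return { verdict: 'keep-awake', reason: 'work is in flight' };
  }
  if (
    input.nextScheduledJobAt !== null &&
    input.nextScheduledJobAt - input.now <= t.wakeLeadMs
  ) {
    return { verdict: 'keep-awake', reason: 'a scheduled job fires within wakeLeadMs' };
  }

  return { verdict: 'would-sleep', reason: 'deep-idle and every guard is clear' };
}
```

Because the function takes `now` as data and reads no clock or global state, each boundary can be tested by constructing an input on either side of it.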

Config (`monitoring.agentSleep`, default OFF + dry-run, mirrors the reaper):
`{ enabled: false, dryRun: true, tickIntervalSec, idleGraceMs, deepIdleMs, wakeLeadMs }`.
Status route `GET /sleep` exposes the latest verdict + thresholds for inspection.
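
As an illustration of how the config block resolves, the sketch below applies the same defaults that `ConfigDefaults.ts` and the wiring in `server.ts` use in this PR. The helper names `resolveAgentSleep` and `isLive` are hypothetical and exist only for this example.

```typescript
interface AgentSleepConfig {
  enabled?: boolean;
  dryRun?: boolean;
  tickIntervalSec?: number;
  idleGraceMs?: number;
  deepIdleMs?: number;
  wakeLeadMs?: number;
}

interface ResolvedAgentSleep {
  enabled: boolean;
  dryRun: boolean;
  tickIntervalMs: number;
  thresholds: { idleGraceMs: number; deepIdleMs: number; wakeLeadMs: number };
}

// Hypothetical helper: fill every missing field with the shipped default.
function resolveAgentSleep(cfg: AgentSleepConfig | undefined): ResolvedAgentSleep {
  return {
    enabled: cfg?.enabled ?? false,                       // default OFF
    dryRun: cfg?.dryRun ?? true,                          // default dry-run
    tickIntervalMs: (cfg?.tickIntervalSec ?? 60) * 1000,  // seconds -> ms
    thresholds: {
      idleGraceMs: cfg?.idleGraceMs ?? 120_000,  // 2 minutes
      deepIdleMs: cfg?.deepIdleMs ?? 900_000,    // 15 minutes
      wakeLeadMs: cfg?.wakeLeadMs ?? 120_000,    // 2 minutes
    },
  };
}

// Live mode requires BOTH switches: enabled, and dry-run explicitly turned off.
function isLive(resolved: ResolvedAgentSleep): boolean {
  return resolved.enabled && !resolved.dryRun;
}
```

With these defaults, setting only `enabled: true` still leaves the controller in dry-run; going live takes a second, explicit `dryRun: false`.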

## What is explicitly NOT in this slice

The mechanism: the supervisor consuming a sleep-request to stop the server, the
lifeline writing a wake-request + respawning + replaying the buffered message, and
the watchdog treating a slept agent as healthy. Those are the next slice; this one
ships the decision + guards dark so they can be validated against real agent
behavior first (does a real agent ever reach `would-sleep`, and was every
`keep-awake` correct?). <!-- tracked: topic-16782 -->

## Safeguards

- Default OFF + dry-run: the controller only observes; nothing stops a server.
- Every guard defaults to the SAFE side: unknown lease/in-flight/job state is
sampled conservatively (treated as a reason to stay awake) so a sampling gap can
never produce a spurious would-sleep in live mode.
- Signal-only in this slice — no blocking authority over any message.
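
One way to read "defaults to the SAFE side" is a sampling wrapper that turns any failed or unknown probe into a reason to stay awake. The sketch below is an assumption about how that could look; `probeOrStayAwake` is a hypothetical helper and is not taken from this PR.

```typescript
// Hypothetical helper: run a guard probe and fall back to "guard is raised"
// whenever the answer is unknown or the probe throws.
function probeOrStayAwake(probe: () => boolean | undefined): boolean {
  try {
    const value = probe();
    // undefined means "could not tell": treat it as a reason to stay awake.
    return value ?? true;
  } catch {
    // A sampling failure must never look like an all-clear.
    return true;
  }
}
```

A guard sampled this way can only err toward `keep-awake`, which costs some idle footprint but never a missed message.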

## Testing

- Unit (`SleepController.test.ts`): both sides of every boundary (grace, deep-idle,
each guard), exact-threshold boundaries, most-recent-of-inbound-vs-activity, the
dry-run-never-acts contract, once-per-episode latching, and transition-only audit.
- Integration: `GET /sleep` returns 200 with the current verdict when enabled;
503-stub semantics consistent with the other dark monitors when disabled.
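
The three controller contracts named in the unit list (dry-run never acts, once-per-episode latching, transition-only audit) can be sketched as a small tick loop. This is an illustration under assumptions: the class name `SleepTickerSketch`, the `TickDeps` shape, and the choice to do nothing at all when disabled are hypothetical, and the real `SleepController` may differ.

```typescript
type Verdict = 'awake' | 'idle-shallow' | 'keep-awake' | 'would-sleep';

interface TickDeps {
  evaluate: () => Verdict;        // stands in for sample() + evaluateSleep()
  audit: (verdict: Verdict) => void;
  requestSleep: () => void;
}

interface TickMode {
  enabled: boolean;
  dryRun: boolean;
}

class SleepTickerSketch {
  private readonly deps: TickDeps;
  private readonly mode: TickMode;
  private lastVerdict: Verdict | null = null;
  private latched = false;

  constructor(deps: TickDeps, mode: TickMode) {
    this.deps = deps;
    this.mode = mode;
  }

  tick(): void {
    if (!this.mode.enabled) return; // assumption: a disabled controller is inert

    const verdict = this.deps.evaluate();

    // Transition-only audit: write one record when the verdict changes.
    if (verdict !== this.lastVerdict) {
      this.deps.audit(verdict);
      this.lastVerdict = verdict;
    }

    if (verdict !== 'would-sleep') {
      this.latched = false; // the episode ended, so re-arm the latch
      return;
    }

    // Dry-run never acts; live mode acts once per would-sleep episode.
    if (!this.mode.dryRun && !this.latched) {
      this.latched = true;
      this.deps.requestSleep();
    }
  }
}
```

Feeding the sequence `idle-shallow, would-sleep, would-sleep, awake, would-sleep` through a live instance yields four audit records and two `requestSleep` calls; the same sequence in dry-run yields four audit records and no calls.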

## Rollback

Pure additive source + a default-off config block (auto-migrated, existence-checked).
Revert the commit → the controller and route disappear; nothing else changes. No
persistent state beyond the best-effort audit log.
54 changes: 53 additions & 1 deletion src/commands/server.ts
@@ -8897,6 +8897,58 @@ export async function startServer(options: StartOptions): Promise<void> {
));
}

// ── Agent hard-sleep — SleepController (RESPONSIBLE-RESOURCE-USAGE, Stage B) ──
// Decides "is it safe for this idle agent to drop to near-zero footprint?" with
// every safety guard. Ships OFF + dry-run: observes + audits to
// logs/agent-sleep-events.jsonl, never stops a server. The mechanism
// (supervisor stop + lifeline respawn) is a later slice. GET /sleep exposes the
// live verdict. The shared idle signal (AgentActivityState) is bumped at the
// inbound-message chokepoint (/internal/telegram-forward).
const { AgentActivityState } = await import('../monitoring/AgentActivityState.js');
const agentActivityState = new AgentActivityState();
const { SleepController, sleepAuditSink } = await import('../monitoring/SleepController.js');
const _sleepCfg = config.monitoring?.agentSleep;
const sleepController = new SleepController(
  {
    sample: () => {
      const act = agentActivityState.snapshot();
      return {
        now: Date.now(),
        runningSessions: sessionManager.listRunningSessions().length,
        lastInboundAt: act.lastInboundAt,
        lastActivityAt: act.lastActivityAt,
        // Lease guard: only relevant when multi-machine coordination is active.
        leaseActive: coordinator.enabled,
        holdsLease: coordinator.enabled ? coordinator.holdsLease() : false,
        // In-flight: an inbound message currently being handled. (The relay/forward
        // in-flight + scheduler-wake signals are wired with the stop mechanism in
        // the next slice; this slice is dry-run, so it never acts on them.)
        inflightWork: (currentInboundByTopic?.size ?? 0) > 0,
        // Scheduler-wake signal is not wired in this slice (see above), so no
        // job time is sampled here.
        nextScheduledJobAt: null,
      };
    },
    audit: sleepAuditSink(config.stateDir),
  },
  {
    enabled: _sleepCfg?.enabled ?? false,
    dryRun: _sleepCfg?.dryRun ?? true,
    tickIntervalMs: (_sleepCfg?.tickIntervalSec ?? 60) * 1000,
    thresholds: {
      idleGraceMs: _sleepCfg?.idleGraceMs ?? 120_000,
      deepIdleMs: _sleepCfg?.deepIdleMs ?? 900_000,
      wakeLeadMs: _sleepCfg?.wakeLeadMs ?? 120_000,
    },
  },
);
sleepController.start();
if (_sleepCfg?.enabled) {
console.log(pc.green(
_sleepCfg.dryRun === false
? ' SleepController enabled (agent hard-sleep — LIVE decision)'
: ' SleepController enabled (agent hard-sleep — dry-run, observe only)',
));
}

// ── Unkillability backstop (UNIFIED-SESSION-LIFECYCLE §P5) ───────────────
// Signal-only: raises ONE deduped Attention item (never auto-kills) when a
// session is KEPT forever despite faking work, or is stuck indeterminate.
@@ -9485,7 +9537,7 @@ export async function startServer(options: StartOptions): Promise<void> {
console.log(pc.dim(` [session-pool] rollout gate not wired: ${err instanceof Error ? err.message : String(err)}`));
}

const server = new AgentServer({ config, sessionManager, state, scheduler, telegram, relationships, feedback, feedbackAnomalyDetector, dispatches, updateChecker, autoUpdater, autoDispatcher, quotaTracker, quotaManager, publisher, viewer, tunnel, evolution, watchdog, topicMemory, triageNurse, projectMapper, coherenceGate: scopeVerifier, contextHierarchy, canonicalState, operationGate, sentinel, adaptiveTrust, memoryMonitor, orphanReaper, coherenceMonitor, commitmentTracker, semanticMemory, activitySentinel, rateLimitSentinel, releaseReadinessSentinel: releaseReadinessSentinel ?? undefined, messageRouter, summarySentinel, spawnManager, systemReviewer, capabilityMapper, selfKnowledgeTree, coverageAuditor, topicResumeMap: _topicResumeMap ?? undefined, sessionRefresh: _sessionRefresh ?? undefined, autonomyManager, trustElevationTracker, autonomousEvolution, coordinator: coordinator.enabled ? coordinator : undefined, localSigningKeyPem, leaseTransport, liveTailReceiver, handoffWireTransport, onHandoffBegin, onHandoffInitiate: handoffInitiate, handoffInProgress: handoffSentinelInProgress, messageLedger, currentInboundByTopic, replyMarkerTransport, onReplyMarker: messageLedger ? (marker: unknown) => { const m = marker as { dedupeKey: string; platform: string; replyIdempotencyKey: string; epoch: number; topic?: string | null }; messageLedger!.applyRemoteReplyMarker(m.dedupeKey, { platform: m.platform, replyIdempotencyKey: m.replyIdempotencyKey, epoch: m.epoch, topic: m.topic ?? null }); } : undefined, whatsapp: whatsappAdapter, slack: slackAdapter, imessage: imessageAdapter, whatsappBusinessBackend, messageBridge, hookEventReceiver, worktreeMonitor, subagentTracker, instructionsVerifier, handshakeManager: threadlineHandshake, threadlineRouter, conversationStore, warrantsReplyGate, collaborationSurfacer, threadResumeMap, topicLinkageHandler: topicLinkageHandler ?? undefined, threadlineRelayClient, threadlineReplyWaiters, listenerManager: listenerManager ?? 
undefined, responseReviewGate, messagingToneGate, outboundDedupGate, telemetryHeartbeat, pasteManager, featureRegistry, discoveryEvaluator, completionEvaluator, unifiedTrust, liveConfig, sharedStateLedger, ledgerSessionRegistry, worktreeManager, oidcEnrolledRepos: parallelDevConfig?.oidcEnrolledRepos, initiativeTracker, projectRoundRunner, projectDriftChecker, machineHeartbeat, machinePoolRegistry, meshRpcDispatcher, sessionOwnershipRegistry, sessionPoolE2EResultStore, proxyCoordinator, topicIntentStore, topicIntentArcCheck, usherSignalStore, intelligence: sharedIntelligence ?? undefined, telegramBridgeConfig, telegramBridge: telegramBridge ?? undefined, threadlineObservability, briefDeps, workingMemory, taskFlowRegistry, threadlineFlowBridge, sessionReaper, agentWorktreeReaper, reapLog, sleepWakeDetector, unjustifiedStopGate, stopGateDb, stopNotifier });
const server = new AgentServer({ config, sessionManager, state, scheduler, telegram, relationships, feedback, feedbackAnomalyDetector, dispatches, updateChecker, autoUpdater, autoDispatcher, quotaTracker, quotaManager, publisher, viewer, tunnel, evolution, watchdog, topicMemory, triageNurse, projectMapper, coherenceGate: scopeVerifier, contextHierarchy, canonicalState, operationGate, sentinel, adaptiveTrust, memoryMonitor, orphanReaper, coherenceMonitor, commitmentTracker, semanticMemory, activitySentinel, rateLimitSentinel, releaseReadinessSentinel: releaseReadinessSentinel ?? undefined, messageRouter, summarySentinel, spawnManager, systemReviewer, capabilityMapper, selfKnowledgeTree, coverageAuditor, topicResumeMap: _topicResumeMap ?? undefined, sessionRefresh: _sessionRefresh ?? undefined, autonomyManager, trustElevationTracker, autonomousEvolution, coordinator: coordinator.enabled ? coordinator : undefined, localSigningKeyPem, leaseTransport, liveTailReceiver, handoffWireTransport, onHandoffBegin, onHandoffInitiate: handoffInitiate, handoffInProgress: handoffSentinelInProgress, messageLedger, currentInboundByTopic, replyMarkerTransport, onReplyMarker: messageLedger ? (marker: unknown) => { const m = marker as { dedupeKey: string; platform: string; replyIdempotencyKey: string; epoch: number; topic?: string | null }; messageLedger!.applyRemoteReplyMarker(m.dedupeKey, { platform: m.platform, replyIdempotencyKey: m.replyIdempotencyKey, epoch: m.epoch, topic: m.topic ?? null }); } : undefined, whatsapp: whatsappAdapter, slack: slackAdapter, imessage: imessageAdapter, whatsappBusinessBackend, messageBridge, hookEventReceiver, worktreeMonitor, subagentTracker, instructionsVerifier, handshakeManager: threadlineHandshake, threadlineRouter, conversationStore, warrantsReplyGate, collaborationSurfacer, threadResumeMap, topicLinkageHandler: topicLinkageHandler ?? undefined, threadlineRelayClient, threadlineReplyWaiters, listenerManager: listenerManager ?? 
undefined, responseReviewGate, messagingToneGate, outboundDedupGate, telemetryHeartbeat, pasteManager, featureRegistry, discoveryEvaluator, completionEvaluator, unifiedTrust, liveConfig, sharedStateLedger, ledgerSessionRegistry, worktreeManager, oidcEnrolledRepos: parallelDevConfig?.oidcEnrolledRepos, initiativeTracker, projectRoundRunner, projectDriftChecker, machineHeartbeat, machinePoolRegistry, meshRpcDispatcher, sessionOwnershipRegistry, sessionPoolE2EResultStore, proxyCoordinator, topicIntentStore, topicIntentArcCheck, usherSignalStore, intelligence: sharedIntelligence ?? undefined, telegramBridgeConfig, telegramBridge: telegramBridge ?? undefined, threadlineObservability, briefDeps, workingMemory, taskFlowRegistry, threadlineFlowBridge, sessionReaper, agentWorktreeReaper, sleepController, agentActivityState, reapLog, sleepWakeDetector, unjustifiedStopGate, stopGateDb, stopNotifier });
// Boot-recovery (tunnel-failure-resilience spec Part 6): if the agent
// died mid-relay-episode, the persisted tunnel.json carries
// rotationPending=true. Rotate the dashboard PIN + authToken BEFORE
14 changes: 14 additions & 0 deletions src/config/ConfigDefaults.ts
@@ -102,6 +102,20 @@ const SHARED_DEFAULTS: Record<string, unknown> = {
reapIntervalMs: 86_400_000,
maxReapsPerPass: 20,
},
// Agent hard-sleep — SleepController decision foundation (Stage B, slice 1;
// docs/specs/agent-hard-sleep-controller.md). Decides "is it safe for this
// idle agent to drop its server to near-zero footprint?" with every safety
// guard (held lease / in-flight work / imminent scheduled job). Ships OFF +
// dry-run: observes + audits to logs/agent-sleep-events.jsonl, never stops a
// server. The mechanism (supervisor stop + lifeline respawn) is a later slice.
agentSleep: {
  enabled: false,
  dryRun: true,
  tickIntervalSec: 60,
  idleGraceMs: 120_000,
  deepIdleMs: 900_000,
  wakeLeadMs: 120_000,
},
// Unkillability backstop (UNIFIED-SESSION-LIFECYCLE §P5). Default ON, signal-
// only: raises ONE deduped Attention item (never auto-kills) when a session is
// KEPT forever despite faking work, or is stuck indeterminate. The escalation
15 changes: 15 additions & 0 deletions src/core/types.ts
@@ -3092,6 +3092,21 @@ export interface MonitoringConfig {
reapIntervalMs?: number;
maxReapsPerPass?: number;
};
/**
* Agent hard-sleep — SleepController decision foundation (RESPONSIBLE-RESOURCE-
* USAGE, Stage B; docs/specs/agent-hard-sleep-controller.md). Decides whether a
* deeply-idle agent may drop its server to near-zero footprint, with safety
* guards (held lease / in-flight / imminent job). Ships OFF + dry-run: observes
* + audits, never stops a server. GET /sleep exposes the live verdict.
*/
agentSleep?: {
  enabled?: boolean;
  dryRun?: boolean;
  tickIntervalSec?: number;
  idleGraceMs?: number;
  deepIdleMs?: number;
  wakeLeadMs?: number;
};
/**
* Unkillability backstop (UNIFIED-SESSION-LIFECYCLE §P5). Watches for sessions
* the conservative KEEP-rules would protect forever — one that FAKES work, or
35 changes: 35 additions & 0 deletions src/monitoring/AgentActivityState.ts
@@ -0,0 +1,35 @@
/**
 * AgentActivityState: the single shared "when was this agent last active?" signal
 * (agent-sleep design, docs/specs/agent-sleep-mode.md → "Define a single shared
 * idle signal"). The SleepController samples it to decide deep-idle; the server
 * bumps it at the inbound-message chokepoint and on session spawn.
 *
 * Deliberately tiny + in-memory: "activity" for sleep purposes is a real inbound
 * message or a session starting, NOT internal health-check traffic (which must
 * never keep an otherwise-idle agent awake). So the server bumps this only at
 * genuine activity points, not on every HTTP request.
 */
export interface ActivitySnapshot {
  lastInboundAt: number | null;
  lastActivityAt: number | null;
}

export class AgentActivityState {
  private lastInboundAt: number | null = null;
  private lastActivityAt: number | null = null;

  /** A genuine inbound user/agent message arrived. */
  markInbound(now: number): void {
    this.lastInboundAt = now;
    this.lastActivityAt = now;
  }

  /** Non-message activity that should still defer sleep (e.g. a session spawn). */
  markActivity(now: number): void {
    this.lastActivityAt = now;
  }

  snapshot(): ActivitySnapshot {
    return { lastInboundAt: this.lastInboundAt, lastActivityAt: this.lastActivityAt };
  }
}