feat(agent-sleep): hard-sleep stop+wake mechanism (Stage B slice 2, dark + dry-run)#603

Merged
JKHeadley merged 1 commit into main from echo/agent-hard-sleep-mech
May 31, 2026

@JKHeadley

What & why

Stage B's mechanism — the part that ACTS on the slice-1 SleepController decision. When sleep is enabled and an agent is deeply idle + safe, its heavy server is stopped to save the machine's resources and respawned the instant a message arrives. Ships dark + dry-run (monitoring.agentSleep, off by default; the live requestSleep flag is only written when enabled && !dryRun). Builds on merged #599.
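
The gating described above can be sketched as a small pure function. This is an illustration only: the `AgentSleepConfig` shape, the `SleepAction` names, and `resolveSleepAction` are hypothetical, and the idea that dry-run mode logs the verdict is an assumption. The PR only states that the live flag is written when `enabled && !dryRun`.

```typescript
// Hypothetical shape of the monitoring.agentSleep block; only the two
// fields implied by the PR description are modelled here.
interface AgentSleepConfig {
  enabled: boolean;
  dryRun: boolean;
}

// Hypothetical action names, introduced for this sketch.
type SleepAction = "none" | "log-only" | "write-flag";

// Map a "safe to sleep" verdict onto what the controller may actually do.
function resolveSleepAction(cfg: AgentSleepConfig, verdictSleep: boolean): SleepAction {
  // Dark by default: a disabled feature never acts, whatever the verdict.
  if (!cfg.enabled || !verdictSleep) return "none";
  // Live only when enabled && !dryRun; otherwise (assumed) just record the verdict.
  return cfg.dryRun ? "log-only" : "write-flag";
}
```

Under these assumptions, the only path that writes state/sleep-requested.json is `{ enabled: true, dryRun: false }` combined with a sleep verdict.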

The handshake (reuses the proven restart lifecycle)

  • Sleep: live SleepController writes state/sleep-requested.json → ServerSupervisor.checkSleepRequest() stops the server tmux session, sets slept, writes state/slept-marker.json.
  • Suppress auto-respawn: the health loop short-circuits at the top — if (this.slept) { checkWakeRequest(); return; } — so a slept server isn't treated as crashed. This is the only loop-flow change, and it's a no-op until a live sleep-request lands.
  • Wake: an inbound message → TelegramLifeline.requestWakeIfSlept() (at the top of processUpdate, ungated) writes state/wake-requested.json → checkWakeRequest() respawns via spawnServer(). The held message replays through the existing forward-retry queue once healthy (zero loss).
  • Brick defense: a slept-marker keeps a rebooted (or fleet-watchdog-bounced) supervisor asleep; wakeFromSleep() is the operator escape hatch wired into /lifeline restart + /reset.
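
The handshake above can be modelled with an in-memory stand-in for the state/ directory. This is a rough sketch, not the actual ServerSupervisor: `SketchSupervisor`, `StateStore`, `healthTick`, and the `serverRunning`/`spawnCount` fields are hypothetical, where `checkSleepRequest` is called in the loop is an assumption, and TTL handling is left out.

```typescript
// In-memory stand-in for the state/ directory (assumption: the real code uses files).
type StateStore = Map<string, string>;

const SLEEP_REQ = "state/sleep-requested.json";
const WAKE_REQ = "state/wake-requested.json";
const SLEPT_MARKER = "state/slept-marker.json";

class SketchSupervisor {
  slept: boolean;
  serverRunning: boolean;
  spawnCount = 0;
  private store: StateStore;

  constructor(store: StateStore) {
    this.store = store;
    // Brick defense: a marker left by a previous process keeps a rebooted supervisor asleep.
    this.slept = store.has(SLEPT_MARKER);
    this.serverRunning = !this.slept;
  }

  healthTick(): void {
    if (this.slept) {
      // Slept is not crashed: only watch for a wake request, never auto-respawn.
      this.checkWakeRequest();
      return;
    }
    this.checkSleepRequest();
    if (!this.slept && !this.serverRunning) this.spawnServer(); // ordinary crash recovery
  }

  checkSleepRequest(): void {
    if (!this.store.has(SLEEP_REQ)) return;
    this.store.delete(SLEEP_REQ);
    this.serverRunning = false; // stands in for stopping the tmux session
    this.slept = true;
    this.store.set(SLEPT_MARKER, JSON.stringify({ sleptAt: Date.now() }));
  }

  checkWakeRequest(): void {
    if (!this.slept || !this.store.has(WAKE_REQ)) return;
    this.store.delete(WAKE_REQ);
    this.wakeFromSleep();
    this.spawnServer();
  }

  // Operator escape hatch: clears slept state and the marker, nothing else.
  wakeFromSleep(): void {
    this.slept = false;
    this.store.delete(SLEPT_MARKER);
  }

  spawnServer(): void {
    this.serverRunning = true;
    this.spawnCount++;
  }
}
```

In this model a second `SketchSupervisor` built over the same store starts asleep, which mirrors the reboot case; one `healthTick()` after a wake request is written clears the marker and respawns exactly once.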

⚠️ Adversarial second-pass review caught a critical brick — fixed

The review found the wake-trigger was originally inside forwardToServer(), which is gated on supervisor.healthy — but a slept server is not healthy, so an inbound message would queue and never write the wake flag → the server would never wake. Fixed: moved requestWakeIfSlept() to the top of processUpdate(), before any health gate. The review also found /lifeline restart didn't clear the slept-marker (manual recovery also bricked) — fixed via wakeFromSleep(). Both fixes are tested. The reviewer confirmed the dark code is inert by default.
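
The ordering fix can be illustrated with a minimal model of the lifeline. Everything here is a sketch under assumptions: `SketchLifeline`, `LifelineDeps`, `retryQueue`, and `flushQueue` are hypothetical names, and the real TelegramLifeline is certainly more involved. The point is only that the wake trigger runs before the health gate, so a message to a slept (unhealthy) server still writes the wake request while it waits in the queue.

```typescript
// Hypothetical dependencies injected into the sketch so it runs without files or a server.
interface LifelineDeps {
  isServerHealthy: () => boolean;
  sleptMarkerExists: () => boolean;
  writeWakeRequest: () => void;
  deliver: (text: string) => void;
}

class SketchLifeline {
  readonly retryQueue: string[] = [];
  private deps: LifelineDeps;

  constructor(deps: LifelineDeps) {
    this.deps = deps;
  }

  // marker present -> write wake request; no marker -> no-op
  requestWakeIfSlept(): void {
    if (this.deps.sleptMarkerExists()) this.deps.writeWakeRequest();
  }

  processUpdate(text: string): void {
    this.requestWakeIfSlept(); // ungated: runs before any health check
    this.forwardToServer(text);
  }

  // Health-gated path. Placing the wake trigger in here was the original bug.
  private forwardToServer(text: string): void {
    if (!this.deps.isServerHealthy()) {
      this.retryQueue.push(text); // held until the server is healthy again
      return;
    }
    this.deps.deliver(text);
  }

  // Replays held messages in order once the server reports healthy.
  flushQueue(): void {
    while (this.deps.isServerHealthy()) {
      const next = this.retryQueue.shift();
      if (next === undefined) break;
      this.deps.deliver(next);
    }
  }
}
```

If `requestWakeIfSlept()` were called only inside the healthy branch of `forwardToServer`, the message would be queued and no wake request would ever be written, which is the brick the review describes.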

Tests (unit + regression)

  • ServerSupervisor-sleep-wake.test.ts: sleep stops + marks + enters slept; no-request no-op; expired-request ignored; wake respawns + clears; wake-when-not-slept no-op; idempotent re-sleep; boot-marker signal; wakeFromSleep clears slept+marker (escape hatch).
  • agentSleepWake.test.ts: marker→wake-request; no-marker→no-op.
  • SleepController.test.ts: sleepRequestWriter writes the TTL-stamped flag.
  • Regression: ServerSupervisor-handshake / supervisor-health-check / supervisor-cpu-starvation stay green — existing crash-recovery is byte-identical when slept===false.
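
One way the TTL-stamped flag and the "expired-request ignored" case could fit together is sketched below. The payload fields (`requestedAt`, `ttlMs`) and both function names are assumptions made for illustration; the PR does not show the actual format of state/sleep-requested.json.

```typescript
// Hypothetical payload for state/sleep-requested.json; field names are assumptions.
interface SleepRequestFlag {
  requestedAt: number; // epoch milliseconds when the request was written
  ttlMs: number;       // how long the request stays valid
}

// Serialise a request stamped with the time it was made and its TTL.
function makeSleepRequest(now: number, ttlMs: number): string {
  const flag: SleepRequestFlag = { requestedAt: now, ttlMs };
  return JSON.stringify(flag);
}

// A request counts only if it parses, has both fields, and is within its TTL.
function isFreshSleepRequest(raw: string, now: number): boolean {
  try {
    const parsed = JSON.parse(raw) as Partial<SleepRequestFlag>;
    if (typeof parsed.requestedAt !== "number" || typeof parsed.ttlMs !== "number") {
      return false;
    }
    const age = now - parsed.requestedAt;
    return age >= 0 && age <= parsed.ttlMs;
  } catch {
    return false; // unreadable flag: treat it as absent rather than act on it
  }
}
```

Treating a stale or malformed flag as absent means a request left behind by an earlier run cannot put the server to sleep long after the idle verdict that produced it.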

Safety / process

  • Dark + dry-run + additive. No new config/route/agent-installed-file. Revert = the handlers + the slept short-circuit disappear; supervisor/lifeline behave as before.
  • Spec: docs/specs/agent-hard-sleep-mechanism.md (converged + approved: true) + ELI16. Side-effects: upgrades/side-effects/agent-hard-sleep-mechanism.md.
  • ⚠️ Self-approved under the delegated deploy mandate (Justin: build Stage B now, topic 16782). The enablement — not this dark ship — is the reviewed gate: turn it on first on a test agent with Justin watching. A few enablement-gated refinements (scheduler-wake, lifeline-serves-health-while-asleep) are tracked.

🤖 Generated with Claude Code

…ark + dry-run)

Acts on the slice-1 SleepController verdict: in live mode it writes
state/sleep-requested.json; the ServerSupervisor stops the server tmux session +
enters 'slept' (the health loop short-circuits, suppressing auto-respawn) and only
watches for state/wake-requested.json, which the lifeline writes on the next inbound
message → supervisor respawns + the existing forward-retry queue replays the buffered
message. A slept-marker keeps a rebooted/watchdog-bounced supervisor asleep.

- ServerSupervisor: checkSleepRequest/checkWakeRequest, slept short-circuit (the only
  loop-flow change; no-op until a live sleep-request lands), boot-marker stay-asleep,
  wakeFromSleep() escape hatch.
- SleepController.sleepRequestWriter (live-mode flag, TTL-stamped).
- TelegramLifeline: requestWakeIfSlept() at the TOP of processUpdate (ungated) via the
  pure agentSleepWake helper; /lifeline restart + reset clear slept state.

⚠️ Adversarial 2nd-pass review CAUGHT A CRITICAL BRICK (wake-trigger was gated on
supervisor.healthy → a slept server is unhealthy → inbound queued, never woke) + a
broken manual escape hatch (/lifeline restart didn't clear the marker). BOTH fixed +
tested before commit. Dark code verified inert. 50 tests incl. regression on the
existing supervisor suites (green). Self-approved under the deploy mandate; ships dark,
enablement is the reviewed gate (topic 16782).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@vercel

vercel Bot commented May 31, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
instar | Ready | Preview, Comment | May 31, 2026 4:40am

@JKHeadley JKHeadley merged commit 588f9ab into main May 31, 2026
18 checks passed