feat(agent-sleep): hard-sleep stop+wake mechanism (Stage B slice 2, dark + dry-run) #603
Merged
…ark + dry-run)

Acts on the slice-1 SleepController verdict: in live mode it writes state/sleep-requested.json; the ServerSupervisor stops the server tmux session + enters 'slept' (the health loop short-circuits, suppressing auto-respawn) and only watches for state/wake-requested.json, which the lifeline writes on the next inbound message → supervisor respawns + the existing forward-retry queue replays the buffered message. A slept-marker keeps a rebooted/watchdog-bounced supervisor asleep.

- ServerSupervisor: checkSleepRequest/checkWakeRequest, slept short-circuit (the only loop-flow change; no-op until a live sleep-request lands), boot-marker stay-asleep, wakeFromSleep() escape hatch.
- SleepController.sleepRequestWriter (live-mode flag, TTL-stamped).
- TelegramLifeline: requestWakeIfSlept() at the TOP of processUpdate (ungated) via the pure agentSleepWake helper; /lifeline restart + reset clear slept state.

⚠️ Adversarial 2nd-pass review CAUGHT A CRITICAL BRICK (wake-trigger was gated on supervisor.healthy → a slept server is unhealthy → inbound queued, never woke) + a broken manual escape hatch (/lifeline restart didn't clear the marker). BOTH fixed + tested before commit. Dark code verified inert. 50 tests incl. regression on the existing supervisor suites (green). Self-approved under the deploy mandate; ships dark, enablement is the reviewed gate (topic 16782).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
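The supervisor side of the handshake in the commit message can be sketched as a small state machine. This is a minimal illustration, not the real `ServerSupervisor`: the class name `SupervisorSketch`, the in-memory `Flags` map standing in for the files under `state/`, the `expiresAt` field, and the choice to consume an expired request are all assumptions made for the example.

```typescript
// Hypothetical sketch: an in-memory map stands in for the flag files under state/.
type Flags = Map<string, { expiresAt?: number }>;

const SLEEP_REQUEST = "state/sleep-requested.json";
const WAKE_REQUEST = "state/wake-requested.json";
const SLEPT_MARKER = "state/slept-marker.json";

class SupervisorSketch {
  slept: boolean;
  running: boolean;
  respawns = 0;
  private readonly flags: Flags;
  private readonly now: () => number;

  constructor(flags: Flags, now: () => number = Date.now) {
    this.flags = flags;
    this.now = now;
    // Boot-marker stay-asleep: a rebooted supervisor that finds the marker stays slept.
    this.slept = flags.has(SLEPT_MARKER);
    this.running = !this.slept;
  }

  // One pass of the health loop.
  tick(): void {
    if (this.slept) {
      // Slept short-circuit: watch for a wake request only, never auto-respawn.
      this.checkWakeRequest();
      return;
    }
    this.checkSleepRequest();
    if (!this.slept && !this.running) this.spawnServer(); // ordinary crash recovery
  }

  private checkSleepRequest(): void {
    const request = this.flags.get(SLEEP_REQUEST);
    if (request === undefined) return; // no request: no-op
    this.flags.delete(SLEEP_REQUEST);  // assumption: a request is consumed once read
    if (request.expiresAt !== undefined && request.expiresAt <= this.now()) return; // expired: ignored
    this.running = false;              // stands in for stopping the server tmux session
    this.slept = true;
    this.flags.set(SLEPT_MARKER, {});
  }

  private checkWakeRequest(): void {
    if (!this.flags.has(WAKE_REQUEST)) return;
    this.flags.delete(WAKE_REQUEST);
    this.wakeFromSleep();
    this.spawnServer();
  }

  // Escape hatch: clears the slept state and the marker.
  wakeFromSleep(): void {
    this.slept = false;
    this.flags.delete(SLEPT_MARKER);
  }

  private spawnServer(): void {
    this.running = true;
    this.respawns += 1;
  }
}
```

In this sketch, repeated `tick()` calls while slept do nothing until a wake request appears, which is the property that keeps a slept server from being treated as a crash.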
What & why
Stage B's mechanism: the part that ACTS on the slice-1 SleepController decision. When sleep is enabled and an agent is deeply idle + safe, its heavy server is stopped to save the machine's resources and respawned the instant a message arrives. Ships dark + dry-run (`monitoring.agentSleep`, off by default; the live `requestSleep` flag is only written when `enabled && !dryRun`). Builds on merged #599.

The handshake (reuses the proven restart lifecycle)
- `state/sleep-requested.json` → `ServerSupervisor.checkSleepRequest()` stops the server tmux session, sets `slept`, writes `state/slept-marker.json`.
- The health loop short-circuits with `if (this.slept) { checkWakeRequest(); return; }`, so a slept server isn't treated as crashed. This is the only loop-flow change, and it's a no-op until a live sleep-request lands.
- `TelegramLifeline.requestWakeIfSlept()` (at the top of `processUpdate`, ungated) writes `state/wake-requested.json` → `checkWakeRequest()` respawns via `spawnServer()`. The held message replays through the existing forward-retry queue once healthy (zero loss).
- `slept-marker` keeps a rebooted (or fleet-watchdog-bounced) supervisor asleep; `wakeFromSleep()` is the operator escape hatch wired into `/lifeline restart` + `/reset`.

The review found the wake-trigger was originally inside `forwardToServer()`, which is gated on `supervisor.healthy`. A slept server is not healthy, so an inbound message would queue and never write the wake flag, and the server would never wake. Fixed: moved `requestWakeIfSlept()` to the top of `processUpdate()`, before any health gate. The review also found `/lifeline restart` didn't clear the slept-marker (manual recovery also bricked); fixed via `wakeFromSleep()`. Both fixes are tested. The reviewer confirmed the dark code is inert by default.

Tests (unit + regression)
- `ServerSupervisor-sleep-wake.test.ts`: sleep stops + marks + enters slept; no-request no-op; expired-request ignored; wake respawns + clears; wake-when-not-slept no-op; idempotent re-sleep; boot-marker signal; wakeFromSleep clears slept+marker (escape hatch).
- `agentSleepWake.test.ts`: marker→wake-request; no-marker→no-op.
- `SleepController.test.ts`: `sleepRequestWriter` writes the TTL-stamped flag.
- `ServerSupervisor-handshake` / `supervisor-health-check` / `supervisor-cpu-starvation` stay green: existing crash-recovery is byte-identical when `slept===false`.

Safety / process
- `slept` short-circuit disappear; supervisor/lifeline behave as before.
- `docs/specs/agent-hard-sleep-mechanism.md` (converged + `approved: true`) + ELI16. Side-effects: `upgrades/side-effects/agent-hard-sleep-mechanism.md`.

🤖 Generated with Claude Code
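
The ordering fix from the review (wake trigger ahead of any health gate) can be illustrated with a small sketch. Everything here is hypothetical scaffolding: `LifelineDeps`, `LifelineSketch`, `deliver`, and `flushRetryQueue` are names invented for the example, and only `requestWakeIfSlept`, `processUpdate`, and `forwardToServer` echo names from the description. It assumes the wake helper is pure and driven by the presence of the slept-marker.

```typescript
// Hypothetical sketch: not the real TelegramLifeline internals.
interface LifelineDeps {
  sleptMarkerExists(): boolean;   // stands in for checking state/slept-marker.json
  writeWakeRequest(): void;       // stands in for writing state/wake-requested.json
  supervisorHealthy(): boolean;
  deliver(message: string): void; // stands in for the actual forward to the server
}

// Pure helper: marker present -> write a wake request; no marker -> no-op.
function requestWakeIfSlept(
  deps: Pick<LifelineDeps, "sleptMarkerExists" | "writeWakeRequest">,
): boolean {
  if (!deps.sleptMarkerExists()) return false;
  deps.writeWakeRequest();
  return true;
}

class LifelineSketch {
  readonly retryQueue: string[] = [];
  private readonly deps: LifelineDeps;

  constructor(deps: LifelineDeps) {
    this.deps = deps;
  }

  processUpdate(message: string): void {
    // Ungated and first: a slept server is unhealthy, so the wake trigger
    // must not sit behind the health check inside forwardToServer().
    requestWakeIfSlept(this.deps);
    this.forwardToServer(message);
  }

  private forwardToServer(message: string): void {
    if (!this.deps.supervisorHealthy()) {
      this.retryQueue.push(message); // held until the server is healthy again
      return;
    }
    this.deps.deliver(message);
  }

  // Replays held messages in arrival order once the server reports healthy.
  flushRetryQueue(): void {
    while (this.retryQueue.length > 0 && this.deps.supervisorHealthy()) {
      this.deps.deliver(this.retryQueue.shift() as string);
    }
  }
}
```

With the trigger inside `forwardToServer()` instead, the unhealthy branch would return before any wake request was written, which is the brick the review describes.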