fix: stop Windows Max-plan drain loop#2083
Conversation
…guard trip, tighten RestartGuard, catch OAuth expiry Discord report: Windows 11 user's Max-plan usage was being consumed in the background. Root cause is a regression introduced alongside the v12.3.3 RestartGuard: (1) when the guard trips in the HTTP-layer crash-recovery path, pending messages were left in 'pending' state and auto-replayed by processPendingQueues on the next daemon start; (2) any single successful message cleared the restart-window decay, so a 9-of-10 failure regime never tripped the guard; (3) OAuth-token expiry errors are not in the unrecoverable list and loop forever on Max-plan subscriptions. SessionRoutes.ts: on restart-guard trip, abandon pending messages and remove the session (mirrors WorkerService.terminateSession which is private). Pattern copied from the sibling block at SessionRoutes.ts:123. RestartGuard.ts: require 5 consecutive recordSuccess() calls before the window-decay path becomes eligible; add a terminal lifetime cap of 50 total restarts that never resets. Public API unchanged. worker-service.ts: extract unrecoverablePatterns into a testable helper module and add OAuth/OpenRouter-auth patterns: 'OAuth token expired', 'token has been revoked', 'Unauthorized', 'OpenRouter API error: 401', 'OpenRouter API error: 403'. Bare '401' deliberately not added (too broad). Investigation and phased plan docs included for reviewers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary by CodeRabbit
WalkthroughImplements fixes for a Windows regression by extracting unrecoverable error patterns into a dedicated module, enhancing RestartGuard with lifetime caps and improved decay behavior, and ensuring pending messages are abandoned when restart guard trips during crash recovery. Changes
Sequence DiagramsequenceDiagram
participant Session as Session<br/>(Generator)
participant RestartGuard as RestartGuard
participant CrashRecovery as Crash Recovery<br/>Handler
participant PendingStore as Pending Message<br/>Store
participant SessionMgmt as Session<br/>Management
Note over Session,SessionMgmt: Crash occurs during processing
Session->>RestartGuard: recordRestart()
RestartGuard->>RestartGuard: Increment totalRestarts<br/>Check lifetime cap
alt Lifetime Cap Exceeded
RestartGuard-->>Session: false (block restart)
Session->>CrashRecovery: Crash recovery triggered
else Within Cap
RestartGuard-->>Session: true (allow restart)
end
CrashRecovery->>PendingStore: Mark all pending<br/>messages as abandoned
PendingStore-->>CrashRecovery: Confirmed
CrashRecovery->>SessionMgmt: Remove session<br/>from memory
SessionMgmt-->>CrashRecovery: Removed
CrashRecovery->>SessionMgmt: Broadcast session<br/>completion
Note over PendingStore: Prevents respawn/replay<br/>of stale messages
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Code Review: fix: stop Windows Max-plan drain loopOverviewThis PR correctly identifies and fixes a real-money regression: a Windows user's Max-plan subscription was drained in the background by a crash loop that the What's done well
Issues1. No tests shipped (blocking for me)The Phase 4 test plan is detailed and the PR checklist is unchecked. Three test files were planned:
None appear in the diff. Given that the investigation notes "no existing tests for RestartGuard or unrecoverablePatterns", and this is a safety-critical guard against billing drain, shipping without coverage is risky. The 2.
|
Greptile SummaryThis PR closes a Windows Max-plan drain loop by addressing three root causes: (1) the restart guard's trip branch in
Confidence Score: 4/5Safe to merge after addressing the 'Unauthorized' pattern scope — all other changes are well-targeted and correct. There is one P1 finding: the bare 'Unauthorized' substring in the unrecoverable-patterns list is too broad and could prematurely kill sessions due to transient 401 errors from MCP-called APIs. All other changes (SessionRoutes guard-trip cleanup, RestartGuard tightening, module extraction) are logically sound and directly address the described drain loop. src/services/worker/unrecoverable-patterns.ts — the 'Unauthorized' pattern needs scoping before merge. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Generator exits with pending work] --> B{isUnrecoverableError?}
B -- Yes --> C[terminateSession\nmark abandoned + remove]
B -- No --> D[restartGuard.recordRestart]
D --> E{Windowed cap\nor lifetime cap exceeded?}
E -- No --> F[Restart generator\nwith backoff]
F --> A
E -- Yes --> G[handleRestartGuardTripped\nSessionRoutes path]
G --> H[abortController.abort]
H --> I[markAllSessionMessagesAbandoned]
I --> J[removeSessionImmediate]
J --> K[broadcastSessionCompleted]
K --> L[STOP — messages not replayed\non next daemon startup]
Reviews (1): Last reviewed commit: "fix: stop Windows Max-plan drain loop — ..." | Re-trigger Greptile |
| // is no longer valid. | ||
| 'OAuth token expired', | ||
| 'token has been revoked', | ||
| 'Unauthorized', |
There was a problem hiding this comment.
'Unauthorized' pattern too broad — may prematurely kill healthy sessions
'Unauthorized' is a plain-English word that can appear in error messages from MCP tools, external HTTP APIs called through MCP, database permission errors, or any third-party library. Because isUnrecoverableError is a substring match, a transient "Unauthorized" from an MCP-called REST endpoint would permanently terminate the session and abandon all pending messages — the same effect as an expired OAuth token.
The PR description explicitly rejected bare '401' as "too broad," but 'Unauthorized' carries the same risk. The OAuth-specific patterns already added ('OAuth token expired', 'token has been revoked') should be sufficient to catch Max-plan token expiry. If Claude SDK specifically emits an "Unauthorized" string for expired tokens, a more scoped prefix (e.g., 'claude: Unauthorized' or 'SDK error: Unauthorized') would be safer.
| if (this.totalRestartsAllTime > ABSOLUTE_LIFETIME_RESTART_CAP) { | ||
| return false; | ||
| } |
There was a problem hiding this comment.
Lifetime-cap check is off-by-one relative to the stated limit
totalRestartsAllTime is incremented before the > ABSOLUTE_LIFETIME_RESTART_CAP check, so restart #50 sets the counter to 50, 50 > 50 is false, and the restart is allowed. The cap therefore permits exactly 50 restarts (not "trips at 50"). This is consistent with the PR description ("50 total restarts"), but a brief comment clarifying that > CAP means "CAP restarts are allowed, CAP+1 is the first blocked one" would prevent future confusion.
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/services/worker-service.ts (1)
816-827:⚠️ Potential issue | 🟠 MajorEmit the session-complete event from
terminateSession()too.This branch now logs the richer guard state, but its terminal path still funnels through
this.terminateSession(...), which only abandons messages and removes the session.runFallbackForTerminatedSession()andSessionRoutes.handleRestartGuardTripped()already callbroadcastSessionCompleted(), so worker-service guard trips and unrecoverable-error shutdowns still miss that terminal event.🔧 Proposed fix
private terminateSession(sessionDbId: number, reason: string): void { const pendingStore = this.sessionManager.getPendingMessageStore(); const abandoned = pendingStore.markAllSessionMessagesAbandoned(sessionDbId); logger.info('SYSTEM', 'Session terminated', { sessionId: sessionDbId, reason, abandonedMessages: abandoned }); // removeSessionImmediate fires onSessionDeletedCallback → broadcastProcessingStatus() this.sessionManager.removeSessionImmediate(sessionDbId); + this.sessionEventBroadcaster.broadcastSessionCompleted(sessionDbId); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/services/worker-service.ts` around lines 816 - 827, The restart-guard branch ends by calling this.terminateSession(session.sessionDbId, 'max_restarts_exceeded') but terminateSession does not emit the session-complete event, so consumers miss the terminal event; update terminateSession(sessionId, reason) to also call broadcastSessionCompleted(sessionId, { reason }) (or the equivalent session-complete emitter used by runFallbackForTerminatedSession()/SessionRoutes.handleRestartGuardTripped) after it finishes abandoning messages/removing the session, ensuring all terminal paths (restart guard trips and unrecoverable shutdowns) emit the same session-complete event.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@PLAN-windows-max-plan-drain-fix.md`:
- Around line 385-390: Update Phase 5 verification to look in the new module
instead of the old file: replace the grep that targets
src/services/worker-service.ts with a search against
src/services/worker/unrecoverable-patterns.ts (e.g., grep -n "'OAuth token
expired'" src/services/worker/unrecoverable-patterns.ts) so the
OAuth/unrecoverable matcher check points to the extracted symbol in
unrecoverable-patterns.ts.
- Around line 241-245: The documentation and test plan disagree about the
lifetime-cap boundary for restarts (one place says the 50th restart should be
blocked, another says 50 are allowed and the 51st blocks); decide on the single
canonical behavior (either "max 50 allowed, 51st blocked" or "50th blocked") and
update both the test expectation in RestartGuard.test.ts (referenced as the
Phase 4 test) and the descriptive lines in this doc (the two bullets describing
49/50/51 behavior and the later lines 326-327) so they match the chosen rule and
the term "lifetime-cap" is used consistently throughout.
In `@src/services/worker/RestartGuard.ts`:
- Around line 31-50: In recordRestart(), after you set
this.consecutiveSuccessCount = 0 (which breaks the success streak), also clear
the decay flag by setting this.decayEligible = false so a broken streak cannot
later trigger the decay path; update the recordRestart() function (referencing
recordRestart(), consecutiveSuccessCount, decayEligible, restartTimestamps,
lastSuccessfulProcessing) to flip decayEligible off immediately when a restart
breaks the streak.
---
Outside diff comments:
In `@src/services/worker-service.ts`:
- Around line 816-827: The restart-guard branch ends by calling
this.terminateSession(session.sessionDbId, 'max_restarts_exceeded') but
terminateSession does not emit the session-complete event, so consumers miss the
terminal event; update terminateSession(sessionId, reason) to also call
broadcastSessionCompleted(sessionId, { reason }) (or the equivalent
session-complete emitter used by
runFallbackForTerminatedSession()/SessionRoutes.handleRestartGuardTripped) after
it finishes abandoning messages/removing the session, ensuring all terminal
paths (restart guard trips and unrecoverable shutdowns) emit the same
session-complete event.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 135d28be-068a-4f92-82a3-e7a6b7bff82b
📒 Files selected for processing (6)
INVESTIGATION-windows-max-plan-drain.mdPLAN-windows-max-plan-drain-fix.mdsrc/services/worker-service.tssrc/services/worker/RestartGuard.tssrc/services/worker/http/routes/SessionRoutes.tssrc/services/worker/unrecoverable-patterns.ts
| - `bun test tests/worker/RestartGuard.test.ts` (new file — see Phase 4). | ||
| - Behavior check: 49 failed restarts → still allowed; 50th → blocked; 51st → | ||
| blocked (lifetime cap persists). | ||
| - Behavior check: 4 successes then restart → full window still counted. | ||
| 5th success then restart 5 min later → window cleared. |
There was a problem hiding this comment.
Reconcile the lifetime-cap expectation.
This section says the 50th restart should already be blocked, but Lines 326-327 later say 50 restarts are still allowed and only the 51st blocks. Please pick one so the tests and the implementation don't drift apart.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@PLAN-windows-max-plan-drain-fix.md` around lines 241 - 245, The documentation
and test plan disagree about the lifetime-cap boundary for restarts (one place
says the 50th restart should be blocked, another says 50 are allowed and the
51st blocks); decide on the single canonical behavior (either "max 50 allowed,
51st blocked" or "50th blocked") and update both the test expectation in
RestartGuard.test.ts (referenced as the Phase 4 test) and the descriptive lines
in this doc (the two bullets describing 49/50/51 behavior and the later lines
326-327) so they match the chosen rule and the term "lifetime-cap" is used
consistently throughout.
| 1. `bun run build` — builds cleanly (no TS errors). | ||
| 2. `bun test` — full suite green. | ||
| 3. `grep -rn 'Messages remain in pending state' src/` — no matches (the phrase | ||
| is gone from the codebase). | ||
| 4. `grep -n "'OAuth token expired'" src/services/worker-service.ts` — matches. | ||
| 5. `grep -n 'ABSOLUTE_LIFETIME_RESTART_CAP\|REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY' src/services/worker/RestartGuard.ts` |
There was a problem hiding this comment.
Phase 5 still points verification at the old file.
The OAuth/unrecoverable matcher was extracted to src/services/worker/unrecoverable-patterns.ts, so this grep will now fail and send follow-up edits to the wrong place.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@PLAN-windows-max-plan-drain-fix.md` around lines 385 - 390, Update Phase 5
verification to look in the new module instead of the old file: replace the grep
that targets src/services/worker-service.ts with a search against
src/services/worker/unrecoverable-patterns.ts (e.g., grep -n "'OAuth token
expired'" src/services/worker/unrecoverable-patterns.ts) so the
OAuth/unrecoverable matcher check points to the extracted symbol in
unrecoverable-patterns.ts.
| recordRestart(): boolean { | ||
| this.totalRestartsAllTime += 1; | ||
| this.consecutiveSuccessCount = 0; // streak broken by any restart | ||
|
|
||
| // Terminal: lifetime cap reached — never resets, even if successes follow. | ||
| if (this.totalRestartsAllTime > ABSOLUTE_LIFETIME_RESTART_CAP) { | ||
| return false; | ||
| } | ||
|
|
||
| const now = Date.now(); | ||
|
|
||
| // Decay: clear history only after real success + 5min of uninterrupted success | ||
| if (this.lastSuccessfulProcessing !== null | ||
| // Decay: only fires if we accumulated the required consecutive successes | ||
| // AND 5min has elapsed since the last success. One-off successes cannot | ||
| // clear the windowed-restart history. | ||
| if (this.decayEligible | ||
| && this.lastSuccessfulProcessing !== null | ||
| && now - this.lastSuccessfulProcessing >= DECAY_AFTER_SUCCESS_MS) { | ||
| this.restartTimestamps = []; | ||
| this.lastSuccessfulProcessing = null; | ||
| this.decayEligible = false; |
There was a problem hiding this comment.
Clear decayEligible when a restart breaks the success streak.
recordRestart() zeroes consecutiveSuccessCount, but it leaves decayEligible armed. After 5 successes, one restart, and then a 5-minute gap, the next restart will still wipe restartTimestamps even though the streak was already broken. That reopens the slow-drip path this change is trying to close.
🔧 Proposed fix
recordRestart(): boolean {
this.totalRestartsAllTime += 1;
this.consecutiveSuccessCount = 0; // streak broken by any restart
+ this.decayEligible = false;
// Terminal: lifetime cap reached — never resets, even if successes follow.
if (this.totalRestartsAllTime > ABSOLUTE_LIFETIME_RESTART_CAP) {
return false;
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/services/worker/RestartGuard.ts` around lines 31 - 50, In
recordRestart(), after you set this.consecutiveSuccessCount = 0 (which breaks
the success streak), also clear the decay flag by setting this.decayEligible =
false so a broken streak cannot later trigger the decay path; update the
recordRestart() function (referencing recordRestart(), consecutiveSuccessCount,
decayEligible, restartTimestamps, lastSuccessfulProcessing) to flip
decayEligible off immediately when a restart breaks the streak.
Summary
RestartGuard. Full write-up inINVESTIGATION-windows-max-plan-drain.md; phased implementation plan inPLAN-windows-max-plan-drain-fix.md.SessionRoutes.ts:318-340) — when the guard trips, the code path previously left pending messages in'pending'state, and the next daemon startup'sprocessPendingQueues(50)re-replayed them. Now it callsmarkAllSessionMessagesAbandoned+removeSessionImmediate+broadcastSessionCompletedinline (mirroring the privateWorkerService.terminateSession, pattern copied from the sibling SDK-terminated branch atSessionRoutes.ts:123).recordSuccess()no longer clears the window on a single success; it requires 5 consecutive successes before the decay path becomes eligible. Adds a terminal lifetime cap of 50 total restarts that never resets. Public API unchanged (recordRestart(): boolean,recordSuccess(): void, existing getters preserved; addedtotalRestarts,lifetimeCap).src/services/worker/unrecoverable-patterns.ts(testable in isolation); added'OAuth token expired','token has been revoked','Unauthorized','OpenRouter API error: 401','OpenRouter API error: 403'. Deliberately not adding bare'401'— too broad.Why the loop drained Max-plan specifically
The Claude Agent SDK subprocess inherits
CLAUDE_CODE_OAUTH_TOKENfrom the parent Claude Code session (EnvManager.ts:255). Every worker-driven Claude call is therefore billed against the user's subscription. Combined with the 91.6% MCP-loopback failure regime reported in memory (obs 71051), the old guard never tripped,processPendingQueueskept re-feeding the loop, and nothing in the unrecoverable list recognized OAuth expiry. The fix closes each gap.Test plan
bun test tests/worker/agents/fallback-error-handler.test.ts— must stay green (adjacent logic, 25 tests passing locally after change).RestartGuard(window cap, consecutive-success requirement, lifetime-cap terminality).isUnrecoverableErroragainst each new pattern; negative case for bare'401'in an unrelated string.bun run buildsucceeds (pre-existing TS errors on unrelated lines are unaffected).🤖 Generated with Claude Code