diff --git a/INVESTIGATION-windows-max-plan-drain.md b/INVESTIGATION-windows-max-plan-drain.md new file mode 100644 index 0000000000..98642638e9 --- /dev/null +++ b/INVESTIGATION-windows-max-plan-drain.md @@ -0,0 +1,163 @@ +# Investigation: Windows 11 Max-Plan Usage Drain (v12.3.3+ regression) + +**Branch:** `investigate/windows-infinite-loop-usage-drain` +**Date:** 2026-04-20 +**Reporter:** Discord user (Win11 + Claude Code CLI + claude-mem latest) +**Symptoms:** +- "Going into a loop with Claude" +- "Consuming entire Max plan usage in the background" +- "Continuous failed Python hooks" in the morning after updating +- Issue stops only after uninstalling claude-mem + +## TL;DR + +Between v12.3.2 (last "safe") and v12.3.7 (current), the flat 3-restart cap on +SDK generator crashes was replaced with a time-windowed `RestartGuard` that +**resets its decay window on any successful message**. On Windows, where the +MCP loopback and Claude-executable resolution have additional failure modes +(observation 71051 — *"MCP loopback failure causes 91.6% session failure rate"*), +the worker enters a slow-drip crash loop that never trips the new guard, and +burns the user's Max-plan OAuth token on each restart's Claude Agent SDK call. + +## Root-cause chain + +1. **Auth path** — `src/shared/EnvManager.ts:215` builds an isolated env for + every SDK subprocess spawn. When the user has no `ANTHROPIC_API_KEY` in + `~/.claude-mem/.env`, line 255 passes through `CLAUDE_CODE_OAUTH_TOKEN` from + the parent Claude Code session. **Every worker-driven Claude call is billed + against the user's Max subscription.** That is by design, but it's the + blast-radius multiplier for every other bug in the chain. + +2. **Automatic replay on every worker start** — + `src/services/worker-service.ts:592` calls `processPendingQueues(50)` during + worker initialization. It re-spawns an SDK subprocess for every session + with `status IN ('pending', 'processing')` messages + (`PendingMessageStore.getSessionsWithPendingMessages()`, line 447). A single + accumulated backlog from a previous failed run becomes a fresh storm of + Claude calls on the next daemon restart. + +3. **RestartGuard is too permissive for slow-drip failures** — + `src/services/worker/RestartGuard.ts`: + - `MAX_WINDOWED_RESTARTS = 10` restarts per 60 s + - `DECAY_AFTER_SUCCESS_MS = 5 min` — on *any* `recordSuccess()` call, the + restart timestamp array is wiped + - `ResponseProcessor.ts:211` calls `recordSuccess()` after *any* batch where + messages were confirmed + If 1-of-11 SDK invocations succeeds, the window never fills, the decay + clears history, and restarts continue indefinitely. At the observed 91.6 % + MCP failure rate this is exactly the regime the user is in. + +4. **Two divergent crash-recovery paths** — + - `src/services/worker-service.ts:822-857` — on restart-guard trip calls + `terminateSession()` which calls `pendingStore.markAllSessionMessagesAbandoned()` + (PendingMessageStore.ts:293). Correct behavior. + - `src/services/worker/http/routes/SessionRoutes.ts:318-330` — on restart- + guard trip only `session.abortController.abort()`. **Messages remain in + `pending` state** (explicitly acknowledged in the log message on line 325). + Next worker startup's `processPendingQueues()` grabs them again, starting + the loop over. + +5. **Exponential-backoff ceiling amplifies damage** — + `SessionRoutes.ts:348` and `worker-service.ts` both cap backoff at 8 s after + 4+ restarts. Steady state is ~7 restarts/minute ≈ 10 000 SDK invocations/day + before the guard trips — if it ever does (see #3). + +6. **OAuth-expiry has no special handling** — the `unrecoverablePatterns` list + (`worker-service.ts:713-727`) matches on `'Invalid API key'`, `'API_KEY_INVALID'`, + `'API key expired'`, `'API key not valid'`. None of these match OAuth-token + failures. An expired/revoked `CLAUDE_CODE_OAUTH_TOKEN` produces errors that + the worker treats as transient and retries. Observation 55605 records PR + #1180 as a prior "OAuth Token Expiry Infinite Retry Loop" fix — the same + class of bug has re-surfaced against a new token type. + +## "Failed Python hooks" + +The user's wording is a misattribution. claude-mem's hooks are TypeScript +compiled to `plugin/scripts/*-hook.cjs` and run via Bun/Node. However: + +- `uv` (Python toolchain) is installed for ChromaDB, and the `ChromaMcpManager` + spawns a Python process for vector sync. +- When Chroma sync fails, errors surface in the hook's stderr alongside the + real Bun/Node hook failure. +- On Windows the message the user most likely saw was the Chroma uv/Python + subprocess failing, conflated with the hook wrapper's own failure output. + +It is **not** that hooks themselves are Python. The underlying bug is the +worker's SDK retry loop. + +## Windows-specific amplifiers + +- `SDKAgent.ts:466-473` — `where claude.cmd` resolution is tried first on Win32. + Any environment where `PATHEXT` or `where` behaves oddly (mingw, Git Bash + with stripped PATH) returns an error that is caught silently and falls + through to the generic "Claude executable not found". That string IS in + `unrecoverablePatterns`, so it should abort cleanly — but only if the SDK + surfaces it. In practice, transient subprocess spawn races on Windows surface + as generic errors that **don't** match the list, and then restart. +- `worker-spawner.ts:39` — the Windows spawn-cooldown lock (2 min) only + suppresses *daemon* spawn attempts, not the SDK subprocess spawn-storm + described here. +- `hook-constants.ts:30-34` — Windows gets a hook-timeout multiplier. Longer + hook windows = more time for the crash loop to run per session. + +## What changed vs v12.3.2 + +``` +src/services/worker/RestartGuard.ts | 70 ++++++ (NEW) +src/services/worker/http/routes/SessionRoutes.ts | 24 +/- (MAX=3 → RestartGuard) +src/services/worker-service.ts | 28 +/- (MAX=3 → RestartGuard) +src/services/sqlite/PendingMessageStore.ts | 19 ++ (clearFailed → clearFailedOlderThan(1h)) +``` + +The regression is squarely in the restart-guard swap. The old flat counter +would have tripped after 3 crashes and stopped the SDK spawn, regardless of +whether anything eventually succeeded. + +## Recommended fixes (in priority order) + +1. **SessionRoutes.ts restart-guard trip must call `terminateSession`** (or + `markAllSessionMessagesAbandoned`) — mirror the behavior in + `worker-service.ts:837`. Today it explicitly leaves messages pending, which + guarantees re-replay on daemon restart. + +2. **Tighten RestartGuard decay** — either + - require N consecutive successes before decay, not a single one, or + - track a separate failure rate; if fail-rate > 50 % over the window, trip + regardless of `recordSuccess()` calls. + +3. **Add OAuth-expiry to `unrecoverablePatterns`** — common SDK error strings + from expired OAuth tokens (`Unauthorized`, `OAuth token expired`, + `token has been revoked`, 401 responses) should be treated the same as + `'Invalid API key'`. + +4. **Cap absolute restart count per session-lifetime** — RestartGuard caps per + window but has no absolute ceiling. A hard cap (e.g. 50 restarts regardless + of window) protects users from the decay-loop regime. + +5. **Kill-switch** — a `CLAUDE_MEM_PAUSE_WORKER` setting the user can flip + without uninstalling, so the next Discord user isn't forced to uninstall + to stop the bleeding. Hook entrypoints would short-circuit if set. + +6. **Telemetry on worker startup** — emit a warning if + `processPendingQueues()` finds > N orphaned sessions or > M orphaned + messages. Today the auto-recovery is silent for backlogs of any size. + +## Files to touch for the fix + +- `src/services/worker/http/routes/SessionRoutes.ts:318-330` — call + `terminateSession` instead of bare `abort()`. +- `src/services/worker/RestartGuard.ts` — stricter decay semantics + absolute + cap. +- `src/services/worker-service.ts:713-727` — extend `unrecoverablePatterns`. +- `src/shared/SettingsDefaultsManager.ts` — `CLAUDE_MEM_PAUSE_WORKER` flag. + +## What the user should do *right now* + +Until a fix ships, the Discord user's mitigation (uninstall) is correct. As +a less-drastic workaround: + +1. Stop the worker: `curl -X POST http://localhost:37777/api/shutdown` (or + kill the `bun` process in Task Manager). +2. Delete `~/.claude-mem/claude-mem.db-wal` and empty the `pending_messages` + table via sqlite3 to break any stored replay queue. +3. Remove the plugin from `~/.claude.json` until the fix ships. diff --git a/PLAN-windows-max-plan-drain-fix.md b/PLAN-windows-max-plan-drain-fix.md new file mode 100644 index 0000000000..400b0b9fc9 --- /dev/null +++ b/PLAN-windows-max-plan-drain-fix.md @@ -0,0 +1,413 @@ +# Plan: Fix Windows Max-Plan Drain (v12.3.3+ regression) + +Context: `INVESTIGATION-windows-max-plan-drain.md` in repo root. Branch: +`investigate/windows-infinite-loop-usage-drain`. Test runner: `bun test`. + +## Phase 0 — Documentation Discovery (COMPLETED) + +Findings consolidated from three discovery subagents: + +### Codebase APIs available (verified) + +- **`SessionRoutes` already has** `this.sessionManager`, `this.eventBroadcaster`, + `this.workerService` (`src/services/worker/http/routes/SessionRoutes.ts:34-49`). +- **`WorkerService.terminateSession` is `private`** (`worker-service.ts:964`) — + cannot call from SessionRoutes. Must replicate the three-step pattern + inline. Reference implementation: `worker-service.ts:964-976`. +- **Abandonment pattern already used in SessionRoutes** at + `SessionRoutes.ts:123-125` — `pendingStore.markAllSessionMessagesAbandoned()` + + `sessionManager.removeSessionImmediate()`. Copy this pattern. +- **`SessionManager.removeSessionImmediate`** is public + (`SessionManager.ts:453-468`) and fires `onSessionDeletedCallback` → + `broadcastProcessingStatus()` automatically. +- **`SessionEventBroadcaster.broadcastSessionCompleted`** exists + (`events/SessionEventBroadcaster.ts:73-82`) — `worker-service.ts:951` + does call it after `terminateSession`, so mirror that for parity. + +### `RestartGuard` facts + +- Full impl is only 70 lines (`src/services/worker/RestartGuard.ts`). +- **No existing tests** for RestartGuard. +- Public surface: `recordRestart(): boolean`, `recordSuccess(): void`, + getters `restartsInWindow`, `windowMs`, `maxRestarts`. +- Call sites (must remain compatible): + - `SessionRoutes.ts:314-315, 322-324, 336-337` + - `worker-service.ts:824-825, 832-834, 854` + - `ResponseProcessor.ts:211` — `recordSuccess()` called only after a batch is + confirmed to storage. +- Field: `ActiveSession.restartGuard?: RestartGuard` (`worker-types.ts:39`). + +### `unrecoverablePatterns` facts + +- Location: `worker-service.ts:713-727`. Match is + `unrecoverablePatterns.some(p => errorMessage.includes(p))` + — simple substring on `(error as Error)?.message`. +- **DO NOT add bare `'401'`** — matches request IDs, log lines, etc. Use + agent-prefixed forms instead. +- Gap found: **OpenRouter 401 is not currently caught** + (`OpenRouterAgent.ts:458`). Mirror the Gemini pattern for consistency. +- Existing test file that covers transient/terminal classification: + `tests/worker/agents/fallback-error-handler.test.ts` (line 75-77 asserts + `'401 Unauthorized'` is terminal). No direct tests exist for + `unrecoverablePatterns` — add them. + +### Anti-patterns to avoid + +- ❌ Making `terminateSession` public and calling it from SessionRoutes — couples + the HTTP layer to WorkerService internals further; the three-step pattern is + already public API surface. +- ❌ Adding bare `'401'` to `unrecoverablePatterns` — too broad. +- ❌ Refactoring RestartGuard public API — call sites depend on getters. +- ❌ Creating new test framework config — project already uses `bun test`. + +--- + +## Phase 1 — Fix SessionRoutes restart-guard trip (stops the drip) + +**Goal:** When the windowed restart guard trips in the HTTP-layer crash-recovery +path, abandon pending messages instead of leaving them in `'pending'` state, so +the next worker startup's `processPendingQueues()` does not replay them. + +**Target file:** `src/services/worker/http/routes/SessionRoutes.ts` + +**Doc reference (pattern to copy, verbatim except for the reason string):** + +1. `SessionRoutes.ts:123-125` — existing abandonment pattern in the same class. +2. `worker-service.ts:964-976` — the private `terminateSession` body to + replicate (logger message + reason + broadcastSessionCompleted call). +3. `worker-service.ts:951` — confirms `broadcastSessionCompleted` is called + after abandonment in the sibling code path. + +### Task 1.1 — Replace `abort()`-only trip with full abandonment + +At `SessionRoutes.ts:326-329`, replace: + +```ts +// Don't restart - abort to prevent further API calls +session.abortController.abort(); +return; +``` + +With (copy-adapt from the pattern at SessionRoutes.ts:123-125 + the +log shape from worker-service.ts:968-972): + +```ts +// Restart guard tripped — abandon pending messages so the next worker +// startup's processPendingQueues() does NOT replay them, and kill the +// subprocess. Mirrors WorkerService.terminateSession() semantics. +// (Cannot call terminateSession directly — it's private in WorkerService.) +session.abortController.abort(); +const pendingStore = this.sessionManager.getPendingMessageStore(); +const abandoned = pendingStore.markAllSessionMessagesAbandoned(sessionDbId); +logger.info('SYSTEM', 'Session terminated', { + sessionId: sessionDbId, + reason: 'max_restarts_exceeded', + abandonedMessages: abandoned +}); +this.sessionManager.removeSessionImmediate(sessionDbId); +this.eventBroadcaster.broadcastSessionCompleted(sessionDbId); +return; +``` + +Also update the log at `SessionRoutes.ts:319-326`: change the `action:` string +from `'Messages remain in pending state.'` to +`'Pending messages abandoned to prevent replay loop.'`. + +### Verification + +- `grep -n 'Messages remain in pending state' src/` → no matches. +- `grep -n 'markAllSessionMessagesAbandoned' src/services/worker/http/routes/SessionRoutes.ts` + → should now find TWO call sites (existing L124 + new one in restart-trip). +- Manual trace: after trip, `pending_messages.status` for that session is + `'failed'` (or whatever `markAllSessionMessagesAbandoned` sets it to — + confirm by reading `PendingMessageStore.ts:293`). + +### Anti-pattern guard + +- Don't make `terminateSession` public. Replicate inline — the three calls are + already public API. +- Don't skip `abortController.abort()` before abandonment — the SDK subprocess + must be killed or it will keep consuming Claude. + +--- + +## Phase 2 — Tighten RestartGuard (stops the loop from getting far enough to drip) + +**Goal:** Require N=5 consecutive `recordSuccess()` calls before clearing the +restart history, and add an absolute lifetime cap of 50 restarts that, once +tripped, never resets. + +**Target file:** `src/services/worker/RestartGuard.ts` + +**Doc reference:** + +- Current impl quoted in full in Phase 0. Consumers at: + - `SessionRoutes.ts:314-315, 322-324, 336-337` + - `worker-service.ts:824-825, 832-834, 854` + - `ResponseProcessor.ts:211` +- Public surface must remain compatible: `recordRestart(): boolean`, + `recordSuccess(): void`, getters `restartsInWindow`, `windowMs`, + `maxRestarts`. + +### Task 2.1 — Add constants and private fields + +Add to top of file: + +```ts +const REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY = 5; +const ABSOLUTE_LIFETIME_RESTART_CAP = 50; +``` + +Add to class: + +```ts +private consecutiveSuccessCount = 0; +private totalRestartsAllTime = 0; +``` + +### Task 2.2 — Rewrite `recordRestart()` semantics + +Behavior: + +1. Increment `totalRestartsAllTime`. +2. If `totalRestartsAllTime > ABSOLUTE_LIFETIME_RESTART_CAP`, return `false` + immediately (never resets — terminal). +3. Any restart resets `consecutiveSuccessCount` to 0 (a failure interrupts the + success streak). +4. Existing decay check: only fire if + `lastSuccessfulProcessing !== null` **AND** we had + `consecutiveSuccessCount >= REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY` the + last time `recordSuccess` was called **AND** 5 min elapsed. To make the + logic clean: only gate decay on an existing flag `decayEligible` that is + set to `true` by `recordSuccess` once the streak threshold is reached. +5. Otherwise preserve current window-based logic. + +Simplest structure: + +```ts +recordRestart(): boolean { + this.totalRestartsAllTime += 1; + this.consecutiveSuccessCount = 0; // streak broken + if (this.totalRestartsAllTime > ABSOLUTE_LIFETIME_RESTART_CAP) { + return false; // terminal — lifetime cap reached + } + + const now = Date.now(); + if (this.decayEligible + && this.lastSuccessfulProcessing !== null + && now - this.lastSuccessfulProcessing >= DECAY_AFTER_SUCCESS_MS) { + this.restartTimestamps = []; + this.lastSuccessfulProcessing = null; + this.decayEligible = false; + } + + this.restartTimestamps = this.restartTimestamps.filter( + ts => now - ts < RESTART_WINDOW_MS + ); + this.restartTimestamps.push(now); + return this.restartTimestamps.length <= MAX_WINDOWED_RESTARTS; +} +``` + +### Task 2.3 — Update `recordSuccess()` to require N consecutive calls + +```ts +recordSuccess(): void { + this.consecutiveSuccessCount += 1; + this.lastSuccessfulProcessing = Date.now(); + if (this.consecutiveSuccessCount >= REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY) { + this.decayEligible = true; + } +} +``` + +Add private field `decayEligible = false`. + +### Task 2.4 — Add introspection getters for logging (optional but useful) + +```ts +get totalRestarts(): number { return this.totalRestartsAllTime; } +get lifetimeCap(): number { return ABSOLUTE_LIFETIME_RESTART_CAP; } +``` + +### Task 2.5 — Update log statements at trip sites to include new fields + +- `SessionRoutes.ts:319-326`: add `totalRestarts: session.restartGuard.totalRestarts` + and `lifetimeCap: session.restartGuard.lifetimeCap` to the error log payload. +- `worker-service.ts:828-835`: same additions. + +### Verification + +- `bun test tests/worker/RestartGuard.test.ts` (new file — see Phase 4). +- Behavior check: 49 failed restarts → still allowed; 50th → blocked; 51st → + blocked (lifetime cap persists). +- Behavior check: 4 successes then restart → full window still counted. + 5th success then restart 5 min later → window cleared. + +### Anti-pattern guard + +- Don't make `recordSuccess` reset `totalRestartsAllTime` — the lifetime cap is + meant to be terminal. +- Don't silently change the meaning of the existing getters — add new ones. + +--- + +## Phase 3 — Extend `unrecoverablePatterns` (handle OAuth expiry) + +**Goal:** When Max-plan OAuth tokens expire/are revoked, treat the SDK error as +unrecoverable so no restart is attempted. + +**Target file:** `src/services/worker-service.ts:713-727` + +**Doc reference:** + +- Current array + match logic quoted in Phase 0 discovery. +- `OpenRouterAgent.ts:458` — error throw format is + `` `OpenRouter API error: ${response.status} - ${errorText}` ``. +- `tests/worker/agents/fallback-error-handler.test.ts:75-77` — existing + assertion `'401 Unauthorized'` is terminal, confirming it is safe to treat + as unrecoverable. + +### Task 3.1 — Add OAuth / OpenRouter auth strings + +Add these entries (in a clear grouping comment) to the array at +`worker-service.ts:713-727`: + +```ts +// OAuth / subscription-token expiry (Max plan users) — matches SDK +// subprocess error messages when the inherited CLAUDE_CODE_OAUTH_TOKEN +// is no longer valid. +'OAuth token expired', +'token has been revoked', +'Unauthorized', +// Parallel to 'Gemini API error: 401' — catches OpenRouter OAuth failures. +'OpenRouter API error: 401', +'OpenRouter API error: 403', +``` + +**Do not add bare `'401'`** — too broad. Anti-pattern confirmed by discovery. + +### Verification + +- `grep -n "'Unauthorized'" src/services/worker-service.ts` → should match. +- Run `bun test tests/worker/worker-service-unrecoverable.test.ts` (new file — + see Phase 4) — assertions cover each new pattern. +- Run `bun test tests/worker/agents/fallback-error-handler.test.ts` — must + still pass (no regressions). + +### Anti-pattern guard + +- No regex or `startsWith` — keep substring match semantics to avoid breaking + existing entries. +- Don't unpack `error.code` or nested JSON bodies in this phase — discovery + confirmed all current agent error throws embed the status/message in + `.message`. + +--- + +## Phase 4 — Tests (no tests existed for RestartGuard or unrecoverablePatterns) + +**Goal:** Prevent regression on all three fixes. + +**Test framework:** `bun test` (see `package.json` scripts). + +### Task 4.1 — Create `tests/worker/RestartGuard.test.ts` + +Test suites (doc: `src/services/worker/RestartGuard.ts` as modified in Phase 2): + +- `recordRestart respects window`: push 10 restarts in <60s → allowed; 11th → + blocked. +- `recordSuccess requires N consecutive before decay`: 4 successes then + restart 6 min later → window still populated; 5 successes then restart 6 min + later → window cleared. +- `restart breaks success streak`: 3 successes → 1 restart → 4 more successes + (total 7 successes with gap) → streak counter is 4 at end, not 7 (decay + NOT yet eligible). +- `lifetime cap is terminal`: 50 restarts → OK; 51st → blocked even if window + is empty; recordSuccess cannot un-block it. +- `getters return expected values`: `totalRestarts`, `lifetimeCap`, + `restartsInWindow`, `maxRestarts`, `windowMs`. + +Use `Date.now` mocking via Bun's `spyOn(Date, 'now')` or an injected clock. +Check project-local patterns first — if other tests in `tests/worker/` already +use a specific clock helper, copy that pattern. Otherwise straight `spyOn`. + +### Task 4.2 — Create `tests/worker/worker-service-unrecoverable.test.ts` + +Focused unit test on the `.some(p => errorMessage.includes(p))` matcher for +each new pattern. If extracting `unrecoverablePatterns` into a standalone +exported helper makes testing easier, do that as part of this task (small +refactor — pull the array + predicate into a named export at the top of +`worker-service.ts` and import it in the `.catch` block). Verify: + +- Each of `'OAuth token expired'`, `'token has been revoked'`, + `'Unauthorized'`, `'OpenRouter API error: 401'`, + `'OpenRouter API error: 403'` matches a realistic error message. +- Bare `'401'` by itself (e.g., `"request-id-401xyz"`) does NOT match — this + test locks in the decision not to add bare `'401'`. +- All PRE-EXISTING patterns still match realistic messages (no regressions). + +### Task 4.3 — Add SessionRoutes restart-trip integration test + +If a test file already exists for `SessionRoutes` in `tests/worker/http/` or +`tests/server/`, extend it. Otherwise create +`tests/worker/http/SessionRoutes.restart-trip.test.ts`: + +- Set up a session with pending messages and a `RestartGuard` pre-loaded near + the window cap. +- Trigger the restart path (mock `pendingStore.getPendingCount` > 0). +- Assert: `pendingStore.markAllSessionMessagesAbandoned` was called with the + session ID; `sessionManager.removeSessionImmediate` was called; + `eventBroadcaster.broadcastSessionCompleted` was called. + +If SessionRoutes has no existing test harness, keep scope small: lift the +restart-trip branch into a private helper method first, export it for +testing, and assert on the helper. Don't build a full HTTP harness for this. + +### Verification + +- `bun test tests/worker/RestartGuard.test.ts` — green. +- `bun test tests/worker/worker-service-unrecoverable.test.ts` — green. +- `bun test tests/worker/http/SessionRoutes.restart-trip.test.ts` — green. +- `bun test` (full suite) — no regressions. + +### Anti-pattern guard + +- Don't stub Claude Agent SDK in these tests — they are pure unit tests. +- Don't test behavior that would require spinning up the real worker daemon. + +--- + +## Phase 5 — Final Verification + +Run the full validation sequence: + +1. `bun run build` — builds cleanly (no TS errors). +2. `bun test` — full suite green. +3. `grep -rn 'Messages remain in pending state' src/` — no matches (the phrase + is gone from the codebase). +4. `grep -n "'OAuth token expired'" src/services/worker-service.ts` — matches. +5. `grep -n 'ABSOLUTE_LIFETIME_RESTART_CAP\|REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY' src/services/worker/RestartGuard.ts` + — both match. +6. Sanity read of the diff — each of the three changes is isolated and local, + no refactor creep. +7. Update `CHANGELOG.md` — actually, per repo CLAUDE.md the changelog is + auto-generated. **Do not edit.** +8. Stage & commit on the existing branch + `investigate/windows-infinite-loop-usage-drain`. Commit message: + `fix: stop Max-plan drain loop on Windows (RestartGuard + SessionRoutes + OAuth patterns)`. + +### Anti-pattern guard + +- Do not touch `CHANGELOG.md` (auto-generated per CLAUDE.md). +- Do not bump version here — release flow handles it. +- Do not merge to main — the user wants a PR. + +--- + +## Execution note + +Each implementation phase (1, 2, 3, 4) is self-contained and cites the exact +files/lines discovered in Phase 0. Phases 1, 2, 3 can run in parallel; +Phase 4 depends on Phase 2 (RestartGuard API shape) and Phase 3 (optional +helper export from worker-service). Phase 5 depends on all prior phases. diff --git a/src/services/worker-service.ts b/src/services/worker-service.ts index 3b9d48a0d8..bea896c603 100644 --- a/src/services/worker-service.ts +++ b/src/services/worker-service.ts @@ -29,6 +29,7 @@ import { sanitizeEnv } from '../supervisor/env-sanitizer.js'; // transitively pulls in the SQLite database layer via ChromaSync/DatabaseManager. import { ensureWorkerStarted as ensureWorkerStartedShared } from './worker-spawner.js'; import { RestartGuard } from './worker/RestartGuard.js'; +import { isUnrecoverableError } from './worker/unrecoverable-patterns.js'; // Re-export for backward compatibility — canonical implementation in shared/plugin-state.ts export { isPluginDisabledInClaudeSettings } from '../shared/plugin-state.js'; @@ -709,23 +710,10 @@ export class WorkerService { const errorMessage = (error as Error)?.message || ''; // Detect unrecoverable errors that should NOT trigger restart - // These errors will fail immediately on retry, causing infinite loops - const unrecoverablePatterns = [ - 'Claude executable not found', - 'CLAUDE_CODE_PATH', - 'ENOENT', - 'spawn', - 'Invalid API key', - 'API_KEY_INVALID', - 'API key expired', - 'API key not valid', - 'PERMISSION_DENIED', - 'Gemini API error: 400', - 'Gemini API error: 401', - 'Gemini API error: 403', - 'FOREIGN KEY constraint failed', - ]; - if (unrecoverablePatterns.some(pattern => errorMessage.includes(pattern))) { + // These errors will fail immediately on retry, causing infinite loops. + // Pattern list lives in ./worker/unrecoverable-patterns.ts so the matcher + // is unit-testable without importing this full module. + if (isUnrecoverableError(errorMessage)) { hadUnrecoverableError = true; this.lastAiInteraction = { timestamp: Date.now(), @@ -831,7 +819,9 @@ export class WorkerService { pendingCount, restartsInWindow: session.restartGuard.restartsInWindow, windowMs: session.restartGuard.windowMs, - maxRestarts: session.restartGuard.maxRestarts + maxRestarts: session.restartGuard.maxRestarts, + totalRestarts: session.restartGuard.totalRestarts, + lifetimeCap: session.restartGuard.lifetimeCap }); session.consecutiveRestarts = 0; this.terminateSession(session.sessionDbId, 'max_restarts_exceeded'); diff --git a/src/services/worker/RestartGuard.ts b/src/services/worker/RestartGuard.ts index 39f5069eca..e973d6cd09 100644 --- a/src/services/worker/RestartGuard.ts +++ b/src/services/worker/RestartGuard.ts @@ -3,28 +3,51 @@ * Prevents tight-loop restarts (bug) while allowing legitimate occasional restarts * over long sessions. Replaces the flat consecutiveRestarts counter that stranded * pending messages after just 3 restarts over any timeframe (#2053). + * + * Additional guards (Phase 2 of windows-max-plan-drain-fix): + * - Decay of windowed history only fires after N consecutive successes, so a + * single fluky success cannot clear the runaway-loop detector. + * - Absolute lifetime cap: once total restarts cross the cap, the guard trips + * permanently and cannot be reset by later successes. */ const RESTART_WINDOW_MS = 60_000; // Only count restarts within last 60 seconds const MAX_WINDOWED_RESTARTS = 10; // 10 restarts in 60s = runaway loop const DECAY_AFTER_SUCCESS_MS = 5 * 60_000; // Clear history after 5min of uninterrupted success +const REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY = 5; +const ABSOLUTE_LIFETIME_RESTART_CAP = 50; export class RestartGuard { private restartTimestamps: number[] = []; private lastSuccessfulProcessing: number | null = null; + private consecutiveSuccessCount = 0; + private totalRestartsAllTime = 0; + private decayEligible = false; /** * Record a restart and check if the guard should trip. * @returns true if the restart is ALLOWED, false if it should be BLOCKED */ recordRestart(): boolean { + this.totalRestartsAllTime += 1; + this.consecutiveSuccessCount = 0; // streak broken by any restart + + // Terminal: lifetime cap reached — never resets, even if successes follow. + if (this.totalRestartsAllTime > ABSOLUTE_LIFETIME_RESTART_CAP) { + return false; + } + const now = Date.now(); - // Decay: clear history only after real success + 5min of uninterrupted success - if (this.lastSuccessfulProcessing !== null + // Decay: only fires if we accumulated the required consecutive successes + // AND 5min has elapsed since the last success. One-off successes cannot + // clear the windowed-restart history. + if (this.decayEligible + && this.lastSuccessfulProcessing !== null && now - this.lastSuccessfulProcessing >= DECAY_AFTER_SUCCESS_MS) { this.restartTimestamps = []; this.lastSuccessfulProcessing = null; + this.decayEligible = false; } // Prune old timestamps outside the window @@ -41,9 +64,15 @@ export class RestartGuard { /** * Call when a message is successfully processed to update the success timestamp. + * Requires REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY consecutive calls before + * the restart-window decay path can fire. */ recordSuccess(): void { + this.consecutiveSuccessCount += 1; this.lastSuccessfulProcessing = Date.now(); + if (this.consecutiveSuccessCount >= REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY) { + this.decayEligible = true; + } } /** @@ -67,4 +96,19 @@ export class RestartGuard { get maxRestarts(): number { return MAX_WINDOWED_RESTARTS; } + + /** + * Total restarts counted for the lifetime of this guard (for logging). + * Never decreases; compared against `lifetimeCap` to decide terminal trip. + */ + get totalRestarts(): number { + return this.totalRestartsAllTime; + } + + /** + * Absolute lifetime restart cap (for logging). + */ + get lifetimeCap(): number { + return ABSOLUTE_LIFETIME_RESTART_CAP; + } } diff --git a/src/services/worker/http/routes/SessionRoutes.ts b/src/services/worker/http/routes/SessionRoutes.ts index c633f8a99c..a399510d3b 100644 --- a/src/services/worker/http/routes/SessionRoutes.ts +++ b/src/services/worker/http/routes/SessionRoutes.ts @@ -97,6 +97,49 @@ export class SessionRoutes extends BaseRouteHandler { private static readonly STALE_GENERATOR_THRESHOLD_MS = 30_000; // 30 seconds (#1099) private static readonly MAX_SESSION_WALL_CLOCK_MS = 4 * 60 * 60 * 1000; // 4 hours (#1590) + /** + * Handle the restart-guard-tripped branch of crash recovery. + * + * When the windowed restart guard (or lifetime cap) trips, we must: + * 1. abort the SDK subprocess so it stops consuming Claude + * 2. mark pending messages abandoned so the next worker startup's + * processPendingQueues() does NOT replay them (drain loop fix) + * 3. remove the in-memory session and broadcast completion so the UI + * reflects the terminal state + * + * Mirrors WorkerService.terminateSession() semantics — cannot call that + * method directly because it's private. Extracted into a helper so it's + * unit-testable without standing up a full HTTP harness (see + * tests/worker/http/SessionRoutes.restart-trip.test.ts). + */ + private handleRestartGuardTripped( + sessionDbId: number, + pendingCount: number, + session: { abortController: AbortController; restartGuard?: RestartGuard } + ): void { + const restartGuard = session.restartGuard; + logger.error('SESSION', `CRITICAL: Restart guard tripped — too many restarts in window, stopping to prevent runaway costs`, { + sessionId: sessionDbId, + pendingCount, + restartsInWindow: restartGuard?.restartsInWindow, + windowMs: restartGuard?.windowMs, + maxRestarts: restartGuard?.maxRestarts, + totalRestarts: restartGuard?.totalRestarts, + lifetimeCap: restartGuard?.lifetimeCap, + action: 'Pending messages abandoned to prevent replay loop.' + }); + session.abortController.abort(); + const pendingStore = this.sessionManager.getPendingMessageStore(); + const abandoned = pendingStore.markAllSessionMessagesAbandoned(sessionDbId); + logger.info('SYSTEM', 'Session terminated', { + sessionId: sessionDbId, + reason: 'max_restarts_exceeded', + abandonedMessages: abandoned + }); + this.sessionManager.removeSessionImmediate(sessionDbId); + this.eventBroadcaster.broadcastSessionCompleted(sessionDbId); + } + private ensureGeneratorRunning(sessionDbId: number, source: string): void { const session = this.sessionManager.getSession(sessionDbId); if (!session) return; @@ -316,16 +359,7 @@ export class SessionRoutes extends BaseRouteHandler { session.consecutiveRestarts = (session.consecutiveRestarts || 0) + 1; // Keep for logging if (!restartAllowed) { - logger.error('SESSION', `CRITICAL: Restart guard tripped — too many restarts in window, stopping to prevent runaway costs`, { - sessionId: sessionDbId, - pendingCount, - restartsInWindow: session.restartGuard.restartsInWindow, - windowMs: session.restartGuard.windowMs, - maxRestarts: session.restartGuard.maxRestarts, - action: 'Generator will NOT restart. Check logs for root cause. Messages remain in pending state.' - }); - // Don't restart - abort to prevent further API calls - session.abortController.abort(); + this.handleRestartGuardTripped(sessionDbId, pendingCount, session); return; } @@ -334,7 +368,9 @@ export class SessionRoutes extends BaseRouteHandler { pendingCount, consecutiveRestarts: session.consecutiveRestarts, restartsInWindow: session.restartGuard!.restartsInWindow, - maxRestarts: session.restartGuard!.maxRestarts + maxRestarts: session.restartGuard!.maxRestarts, + totalRestarts: session.restartGuard!.totalRestarts, + lifetimeCap: session.restartGuard!.lifetimeCap }); // Abort OLD controller before replacing to prevent child process leaks diff --git a/src/services/worker/unrecoverable-patterns.ts b/src/services/worker/unrecoverable-patterns.ts new file mode 100644 index 0000000000..a7cc55c947 --- /dev/null +++ b/src/services/worker/unrecoverable-patterns.ts @@ -0,0 +1,45 @@ +/** + * Unrecoverable error patterns + * + * Error messages matching any of these substrings should NOT trigger a restart + * of the SDK generator. Retrying them will fail identically and drain API + * budget / Max-plan usage (see windows-max-plan-drain-fix plan, Phase 3). + * + * Matching is a plain substring check via `String.prototype.includes`, so + * patterns must be specific enough to avoid false positives (e.g. bare `'401'` + * would match request IDs and is forbidden — use agent-prefixed forms). + */ + +export const UNRECOVERABLE_ERROR_PATTERNS: readonly string[] = [ + 'Claude executable not found', + 'CLAUDE_CODE_PATH', + 'ENOENT', + 'spawn', + 'Invalid API key', + 'API_KEY_INVALID', + 'API key expired', + 'API key not valid', + 'PERMISSION_DENIED', + 'Gemini API error: 400', + 'Gemini API error: 401', + 'Gemini API error: 403', + // OAuth / subscription-token expiry (Max plan users) — matches SDK + // subprocess error messages when the inherited CLAUDE_CODE_OAUTH_TOKEN + // is no longer valid. + 'OAuth token expired', + 'token has been revoked', + 'Unauthorized', + // Parallel to 'Gemini API error: 401' — catches OpenRouter OAuth failures. + 'OpenRouter API error: 401', + 'OpenRouter API error: 403', + 'FOREIGN KEY constraint failed', +]; + +/** + * Returns true if the given error message matches any unrecoverable pattern. + * Accepts an empty / non-string input and returns false in that case. + */ +export function isUnrecoverableError(errorMessage: string | null | undefined): boolean { + if (!errorMessage) return false; + return UNRECOVERABLE_ERROR_PATTERNS.some(pattern => errorMessage.includes(pattern)); +}