# Investigation: Windows 11 Max-Plan Usage Drain (v12.3.3+ regression)

**Branch:** `investigate/windows-infinite-loop-usage-drain`
**Date:** 2026-04-20
**Reporter:** Discord user (Win11 + Claude Code CLI + claude-mem latest)
**Symptoms:**
- "Going into a loop with Claude"
- "Consuming entire Max plan usage in the background"
- "Continuous failed Python hooks" in the morning after updating
- Issue stops only after uninstalling claude-mem

## TL;DR

Between v12.3.2 (last "safe") and v12.3.7 (current), the flat 3-restart cap on
SDK generator crashes was replaced with a time-windowed `RestartGuard` that
**resets its decay window on any successful message**. On Windows, where the
MCP loopback and Claude-executable resolution have additional failure modes
(observation 71051 — *"MCP loopback failure causes 91.6% session failure rate"*),
the worker enters a slow-drip crash loop that never trips the new guard, and
each restart's Claude Agent SDK call is billed to the user's Max plan through
the passed-through `CLAUDE_CODE_OAUTH_TOKEN`.

## Root-cause chain

1. **Auth path** — `src/shared/EnvManager.ts:215` builds an isolated env for
every SDK subprocess spawn. When the user has no `ANTHROPIC_API_KEY` in
`~/.claude-mem/.env`, line 255 passes through `CLAUDE_CODE_OAUTH_TOKEN` from
the parent Claude Code session. **Every worker-driven Claude call is billed
against the user's Max subscription.** That is by design, but it's the
blast-radius multiplier for every other bug in the chain.

2. **Automatic replay on every worker start** —
`src/services/worker-service.ts:592` calls `processPendingQueues(50)` during
worker initialization. It re-spawns an SDK subprocess for every session
with `status IN ('pending', 'processing')` messages
(`PendingMessageStore.getSessionsWithPendingMessages()`, line 447). A single
accumulated backlog from a previous failed run becomes a fresh storm of
Claude calls on the next daemon restart.

3. **RestartGuard is too permissive for slow-drip failures** —
`src/services/worker/RestartGuard.ts`:
- `MAX_WINDOWED_RESTARTS = 10` restarts per 60 s
- `DECAY_AFTER_SUCCESS_MS = 5 min` — on *any* `recordSuccess()` call, the
restart timestamp array is wiped
- `ResponseProcessor.ts:211` calls `recordSuccess()` after *any* batch where
messages were confirmed
   If even 1 in 11 SDK invocations succeeds, the window never fills, the decay
   clears the history, and restarts continue indefinitely (see the sketch after
   this list). At the observed 91.6% MCP failure rate, this is exactly the
   regime the user is in.

4. **Two divergent crash-recovery paths** —
   - `src/services/worker-service.ts:822-857` — when the restart guard trips,
     calls `terminateSession()`, which calls `pendingStore.markAllSessionMessagesAbandoned()`
     (PendingMessageStore.ts:293). Correct behavior.
   - `src/services/worker/http/routes/SessionRoutes.ts:318-330` — when the
     restart guard trips, only calls `session.abortController.abort()`.
     **Messages remain in `pending` state** (explicitly acknowledged in the
     log message on line 325).
Next worker startup's `processPendingQueues()` grabs them again, starting
the loop over.

5. **Exponential-backoff ceiling amplifies damage** —
`SessionRoutes.ts:348` and `worker-service.ts` both cap backoff at 8 s after
4+ restarts. Steady state is ~7 restarts/minute ≈ 10 000 SDK invocations/day
before the guard trips — if it ever does (see #3).

6. **OAuth-expiry has no special handling** — the `unrecoverablePatterns` list
(`worker-service.ts:713-727`) matches on `'Invalid API key'`, `'API_KEY_INVALID'`,
`'API key expired'`, `'API key not valid'`. None of these match OAuth-token
failures. An expired/revoked `CLAUDE_CODE_OAUTH_TOKEN` produces errors that
the worker treats as transient and retries. Observation 55605 records PR
#1180 as a prior "OAuth Token Expiry Infinite Retry Loop" fix — the same
class of bug has re-surfaced against a new token type.
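
The slow-drip regime in #3 is easiest to see with a small simulation. The snippet
below is a hypothetical reconstruction, not the actual `RestartGuard` source:
`recordSuccess()` and `MAX_WINDOWED_RESTARTS` are named in the real file, while
`recordRestart()`, `shouldBlockRestart()`, and the loop driving them are invented
for illustration.

```typescript
// Hypothetical reconstruction of the guard's decay behavior, not the actual
// source. Constants mirror the values cited above (10 restarts per 60 s window,
// decay on success); method names other than recordSuccess() are assumptions.
class RestartGuardSketch {
  private restartTimestamps: number[] = [];
  private static readonly MAX_WINDOWED_RESTARTS = 10;
  private static readonly WINDOW_MS = 60_000;

  recordRestart(now = Date.now()): void {
    this.restartTimestamps.push(now);
  }

  recordSuccess(): void {
    // The regression: ANY confirmed batch wipes the whole window,
    // so one success per ~10 failures keeps the guard from ever tripping.
    this.restartTimestamps = [];
  }

  shouldBlockRestart(now = Date.now()): boolean {
    this.restartTimestamps = this.restartTimestamps.filter(
      (t) => now - t < RestartGuardSketch.WINDOW_MS
    );
    return this.restartTimestamps.length >= RestartGuardSketch.MAX_WINDOWED_RESTARTS;
  }
}

// Slow-drip regime: ~1 success per 11 attempts (the observed ~91.6% failure
// rate). The v12.3.2 flat counter would have stopped after 3 crashes; this
// guard never blocks because the window is reset before it can fill.
const guard = new RestartGuardSketch();
let blocked = 0;
for (let attempt = 0; attempt < 1_000; attempt++) {
  if (attempt % 11 === 10) {
    guard.recordSuccess();          // the rare confirmed batch
  } else if (guard.shouldBlockRestart()) {
    blocked++;                      // never reached in this regime
  } else {
    guard.recordRestart();          // another billed SDK spawn
  }
}
console.log({ blocked });           // => { blocked: 0 }
```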

## "Failed Python hooks"

The user's wording is a misattribution. claude-mem's hooks are TypeScript
compiled to `plugin/scripts/*-hook.cjs` and run via Bun/Node. However:

- `uv` (Python toolchain) is installed for ChromaDB, and the `ChromaMcpManager`
spawns a Python process for vector sync.
- When Chroma sync fails, errors surface in the hook's stderr alongside the
real Bun/Node hook failure.
- On Windows the message the user most likely saw was the Chroma uv/Python
subprocess failing, conflated with the hook wrapper's own failure output.

It is **not** that hooks themselves are Python. The underlying bug is the
worker's SDK retry loop.

## Windows-specific amplifiers

- `SDKAgent.ts:466-473` — `where claude.cmd` resolution is tried first on Win32.
  In any environment where `PATHEXT` or `where` behaves oddly (mingw, Git Bash
  with a stripped PATH), the lookup fails with an error that is caught silently
  and falls through to the generic "Claude executable not found". That string IS
  in `unrecoverablePatterns`, so it should abort cleanly — but only if the SDK
  surfaces it. In practice, transient subprocess spawn races on Windows surface
  as generic errors that **don't** match the list, so the worker restarts anyway
  (see the sketch after this list).
- `worker-spawner.ts:39` — the Windows spawn-cooldown lock (2 min) only
suppresses *daemon* spawn attempts, not the SDK subprocess spawn-storm
described here.
- `hook-constants.ts:30-34` — Windows gets a hook-timeout multiplier. Longer
hook windows = more time for the crash loop to run per session.
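
To make the first amplifier concrete, here is a sketch of the resolve-then-swallow
pattern described above. It is an illustration of the failure mode, not the actual
`SDKAgent.ts` code; only the `where claude.cmd` probe and the "Claude executable
not found" string come from the investigation, and `resolveClaudeExecutable()` is
an invented name.

```typescript
import { execSync } from "node:child_process";

// Illustration only: shows how a silently swallowed `where` failure collapses
// into a generic error the caller cannot distinguish from a real missing install.
function resolveClaudeExecutable(): string {
  if (process.platform === "win32") {
    try {
      // Git Bash / mingw environments with a stripped PATH or odd PATHEXT can
      // make `where` fail here even though claude.cmd is installed.
      const out = execSync("where claude.cmd", { encoding: "utf8" }).trim();
      if (out) return out.split(/\r?\n/)[0];
    } catch {
      // Swallowed: the caller never learns WHY resolution failed.
    }
  }
  // Falls through to the generic error. This exact string is in
  // unrecoverablePatterns, so IF the SDK surfaces it the worker aborts cleanly;
  // a transient spawn race instead produces an unrelated generic error that
  // matches nothing, and the restart loop continues.
  throw new Error("Claude executable not found");
}
```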

## What changed vs v12.3.2

```
src/services/worker/RestartGuard.ts | 70 ++++++ (NEW)
src/services/worker/http/routes/SessionRoutes.ts | 24 +/- (MAX=3 → RestartGuard)
src/services/worker-service.ts | 28 +/- (MAX=3 → RestartGuard)
src/services/sqlite/PendingMessageStore.ts | 19 ++ (clearFailed → clearFailedOlderThan(1h))
```

The regression is squarely in the restart-guard swap. The old flat counter
would have tripped after 3 crashes and stopped the SDK spawn, regardless of
whether anything eventually succeeded.

## Recommended fixes (in priority order)

1. **SessionRoutes.ts restart-guard trip must call `terminateSession`** (or
`markAllSessionMessagesAbandoned`) — mirror the behavior in
`worker-service.ts:837`. Today it explicitly leaves messages pending, which
guarantees re-replay on daemon restart.

2. **Tighten RestartGuard decay** (sketched after this list) — either
   - require N consecutive successes before decay, not a single one, or
   - track a separate failure rate; if the failure rate exceeds 50% over the
     window, trip regardless of `recordSuccess()` calls.

3. **Add OAuth-expiry to `unrecoverablePatterns`** — common SDK error strings
from expired OAuth tokens (`Unauthorized`, `OAuth token expired`,
`token has been revoked`, 401 responses) should be treated the same as
`'Invalid API key'`.

4. **Cap absolute restart count per session lifetime** — RestartGuard caps per
   window but has no absolute ceiling. A hard cap (e.g. 50 restarts regardless
   of window) protects users from the decay-loop regime; the sketch after this
   list includes one.

5. **Kill-switch** — a `CLAUDE_MEM_PAUSE_WORKER` setting the user can flip
without uninstalling, so the next Discord user isn't forced to uninstall
to stop the bleeding. Hook entrypoints would short-circuit if set.

6. **Telemetry on worker startup** — emit a warning if
`processPendingQueues()` finds > N orphaned sessions or > M orphaned
messages. Today the auto-recovery is silent for backlogs of any size.
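
A sketch of what fixes 2 and 4 could look like together, assuming the guard keeps
a timestamp array as it does today. Every constant value and the
`SUCCESSES_BEFORE_DECAY` threshold are placeholders to be tuned, not concrete
proposals.

```typescript
// Sketch of a stricter guard combining fixes 2 and 4. Names other than
// recordSuccess() and MAX_WINDOWED_RESTARTS are invented for illustration.
class StricterRestartGuard {
  private restartTimestamps: number[] = [];
  private lifetimeRestarts = 0;
  private consecutiveSuccesses = 0;

  private static readonly MAX_WINDOWED_RESTARTS = 10;
  private static readonly WINDOW_MS = 60_000;
  private static readonly MAX_LIFETIME_RESTARTS = 50;   // fix 4: absolute ceiling
  private static readonly SUCCESSES_BEFORE_DECAY = 3;   // fix 2: not a single one

  recordRestart(now = Date.now()): void {
    this.restartTimestamps.push(now);
    this.lifetimeRestarts++;
    this.consecutiveSuccesses = 0;
  }

  recordSuccess(): void {
    this.consecutiveSuccesses++;
    // Only wipe the window after several successes in a row, so a 1-in-11
    // success rate no longer resets the restart history.
    if (this.consecutiveSuccesses >= StricterRestartGuard.SUCCESSES_BEFORE_DECAY) {
      this.restartTimestamps = [];
      this.consecutiveSuccesses = 0;
    }
  }

  shouldBlockRestart(now = Date.now()): boolean {
    if (this.lifetimeRestarts >= StricterRestartGuard.MAX_LIFETIME_RESTARTS) {
      return true; // hard cap per session lifetime, regardless of the window
    }
    this.restartTimestamps = this.restartTimestamps.filter(
      (t) => now - t < StricterRestartGuard.WINDOW_MS
    );
    return this.restartTimestamps.length >= StricterRestartGuard.MAX_WINDOWED_RESTARTS;
  }
}
```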

## Files to touch for the fix

- `src/services/worker/http/routes/SessionRoutes.ts:318-330` — call
`terminateSession` instead of bare `abort()`.
- `src/services/worker/RestartGuard.ts` — stricter decay semantics + absolute
cap.
- `src/services/worker-service.ts:713-727` — extend `unrecoverablePatterns`.
- `src/shared/SettingsDefaultsManager.ts` — `CLAUDE_MEM_PAUSE_WORKER` flag.

## What the user should do *right now*

Until a fix ships, the Discord user's mitigation (uninstall) is correct. As
a less-drastic workaround:

1. Stop the worker: `curl -X POST http://localhost:37777/api/shutdown` (or
kill the `bun` process in Task Manager).
2. Delete `~/.claude-mem/claude-mem.db-wal` and empty the `pending_messages`
   table via sqlite3 to break any stored replay queue (a Bun alternative is
   sketched below).
3. Remove the plugin from `~/.claude.json` until the fix ships.
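
For step 2, a small Bun script is an alternative to hand-running sqlite3. It
assumes the database file is `~/.claude-mem/claude-mem.db` (implied by the `-wal`
file named above) and that the table is called `pending_messages` as referenced
earlier; run it only after the worker is stopped.

```typescript
// clear-pending.ts -- run with: bun clear-pending.ts
// Assumes the worker is already stopped (step 1) and that the table name
// matches the pending_messages table referenced in this document.
import { Database } from "bun:sqlite";
import { homedir } from "node:os";
import { join } from "node:path";

const dbPath = join(homedir(), ".claude-mem", "claude-mem.db");
const db = new Database(dbPath);

const { count } = db
  .query("SELECT COUNT(*) AS count FROM pending_messages")
  .get() as { count: number };

db.run("DELETE FROM pending_messages");
db.close();

console.log(`Cleared ${count} queued messages from ${dbPath}`);
```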