Skip to content

fix: stop Windows Max-plan drain loop#2083

Closed
thedotmack wants to merge 1 commit into
mainfrom
investigate/windows-infinite-loop-usage-drain
Closed

fix: stop Windows Max-plan drain loop#2083
thedotmack wants to merge 1 commit into
mainfrom
investigate/windows-infinite-loop-usage-drain

Conversation

@thedotmack
Copy link
Copy Markdown
Owner

Summary

  • Discord report: a Windows 11 user's Max-plan usage was being consumed in the background, with "continuous failed hooks" and a loop that only stopped after uninstalling claude-mem.
  • Investigation traced it to a regression chain introduced with the v12.3.3 RestartGuard. Full write-up in INVESTIGATION-windows-max-plan-drain.md; phased implementation plan in PLAN-windows-max-plan-drain-fix.md.
  • Three targeted fixes, each on one root-cause leg:
    1. SessionRoutes restart-trip (SessionRoutes.ts:318-340) — when the guard trips, the code path previously left pending messages in 'pending' state, and the next daemon startup's processPendingQueues(50) re-replayed them. Now it calls markAllSessionMessagesAbandoned + removeSessionImmediate + broadcastSessionCompleted inline (mirroring the private WorkerService.terminateSession, pattern copied from the sibling SDK-terminated branch at SessionRoutes.ts:123).
    2. RestartGuardrecordSuccess() no longer clears the window on a single success; it requires 5 consecutive successes before the decay path becomes eligible. Adds a terminal lifetime cap of 50 total restarts that never resets. Public API unchanged (recordRestart(): boolean, recordSuccess(): void, existing getters preserved; added totalRestarts, lifetimeCap).
    3. unrecoverablePatterns — extracted into src/services/worker/unrecoverable-patterns.ts (testable in isolation); added 'OAuth token expired', 'token has been revoked', 'Unauthorized', 'OpenRouter API error: 401', 'OpenRouter API error: 403'. Deliberately not adding bare '401' — too broad.

Why the loop drained Max-plan specifically

The Claude Agent SDK subprocess inherits CLAUDE_CODE_OAUTH_TOKEN from the parent Claude Code session (EnvManager.ts:255). Every worker-driven Claude call is therefore billed against the user's subscription. Combined with the 91.6% MCP-loopback failure regime reported in memory (obs 71051), the old guard never tripped, processPendingQueues kept re-feeding the loop, and nothing in the unrecoverable list recognized OAuth expiry. The fix closes each gap.

Test plan

  • bun test tests/worker/agents/fallback-error-handler.test.ts — must stay green (adjacent logic, 25 tests passing locally after change).
  • Unit tests for RestartGuard (window cap, consecutive-success requirement, lifetime-cap terminality).
  • Unit tests for isUnrecoverableError against each new pattern; negative case for bare '401' in an unrelated string.
  • Manual reproduction on Windows: induce SDK failure, confirm (a) pending messages are abandoned after guard trip, (b) daemon restart does not replay them, (c) expired OAuth token aborts cleanly.
  • Confirm bun run build succeeds (pre-existing TS errors on unrelated lines are unaffected).

🤖 Generated with Claude Code

…guard trip, tighten RestartGuard, catch OAuth expiry

Discord report: Windows 11 user's Max-plan usage was being consumed in the
background. Root cause is a regression introduced alongside the v12.3.3
RestartGuard: (1) when the guard trips in the HTTP-layer crash-recovery
path, pending messages were left in 'pending' state and auto-replayed by
processPendingQueues on the next daemon start; (2) any single successful
message cleared the restart-window decay, so a 9-of-10 failure regime
never tripped the guard; (3) OAuth-token expiry errors are not in the
unrecoverable list and loop forever on Max-plan subscriptions.

SessionRoutes.ts: on restart-guard trip, abandon pending messages and
remove the session (mirrors WorkerService.terminateSession which is
private). Pattern copied from the sibling block at SessionRoutes.ts:123.

RestartGuard.ts: require 5 consecutive recordSuccess() calls before the
window-decay path becomes eligible; add a terminal lifetime cap of 50
total restarts that never resets. Public API unchanged.

worker-service.ts: extract unrecoverablePatterns into a testable helper
module and add OAuth/OpenRouter-auth patterns: 'OAuth token expired',
'token has been revoked', 'Unauthorized', 'OpenRouter API error: 401',
'OpenRouter API error: 403'. Bare '401' deliberately not added (too broad).

Investigation and phased plan docs included for reviewers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 21, 2026

Summary by CodeRabbit

  • Bug Fixes
    • Resolved a Windows 11 regression introduced in v12.3.3 that caused indefinite crash loops and recovery cycles
    • Improved restart guard logic with lifetime caps to prevent excessive restart attempts
    • Enhanced error classification to properly identify unrecoverable errors
    • Improved session cleanup and pending message management during crash recovery scenarios

Walkthrough

Implements fixes for a Windows regression by extracting unrecoverable error patterns into a dedicated module, enhancing RestartGuard with lifetime caps and improved decay behavior, and ensuring pending messages are abandoned when restart guard trips during crash recovery.

Changes

Cohort / File(s) Summary
Documentation
INVESTIGATION-windows-max-plan-drain.md, PLAN-windows-max-plan-drain-fix.md
Added investigation report detailing Windows v12.3.3+ regression root causes (auth/token handling, pending session replay, restart decay) and comprehensive implementation plan with phased fixes.
Unrecoverable Error Patterns
src/services/worker/unrecoverable-patterns.ts, src/services/worker-service.ts
Extracted inline pattern matching into new reusable module isUnrecoverableError() with expanded OAuth/auth-expiry patterns; refactored worker-service to import and use the centralized matcher.
RestartGuard Enhancements
src/services/worker/RestartGuard.ts
Added lifetime restart cap (ABSOLUTE_LIFETIME_RESTART_CAP), consecutive-success decay eligibility tracking, and total-restarts observability getters; decay now requires both consecutive successes and elapsed time since last success.
Session Crash Recovery
src/services/worker/http/routes/SessionRoutes.ts
Introduced handleRestartGuardTripped() helper to abandon all pending messages via store, remove session, broadcast completion, and log terminal state; extended restart-guard metadata logging with totalRestarts and lifetimeCap.

Sequence Diagram

sequenceDiagram
    participant Session as Session<br/>(Generator)
    participant RestartGuard as RestartGuard
    participant CrashRecovery as Crash Recovery<br/>Handler
    participant PendingStore as Pending Message<br/>Store
    participant SessionMgmt as Session<br/>Management

    Note over Session,SessionMgmt: Crash occurs during processing

    Session->>RestartGuard: recordRestart()
    RestartGuard->>RestartGuard: Increment totalRestarts<br/>Check lifetime cap
    alt Lifetime Cap Exceeded
        RestartGuard-->>Session: false (block restart)
        Session->>CrashRecovery: Crash recovery triggered
    else Within Cap
        RestartGuard-->>Session: true (allow restart)
    end

    CrashRecovery->>PendingStore: Mark all pending<br/>messages as abandoned
    PendingStore-->>CrashRecovery: Confirmed
    CrashRecovery->>SessionMgmt: Remove session<br/>from memory
    SessionMgmt-->>CrashRecovery: Removed
    CrashRecovery->>SessionMgmt: Broadcast session<br/>completion
    Note over PendingStore: Prevents respawn/replay<br/>of stale messages
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Possibly related PRs

Poem

A Windows hop through guarded gates,
Where pending threads won't seal their fates—
Decay and caps now keep things tight,
No infinite drain in the dead of night! 🐰

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the primary fix: stopping a Windows Max-plan drain loop caused by a restart/replay regression.
Description check ✅ Passed The description is detailed and directly related to the changeset, explaining the Discord report, investigation findings, three targeted fixes, root causes, and a comprehensive test plan.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch investigate/windows-infinite-loop-usage-drain

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown

claude Bot commented Apr 21, 2026

Code Review: fix: stop Windows Max-plan drain loop

Overview

This PR correctly identifies and fixes a real-money regression: a Windows user's Max-plan subscription was drained in the background by a crash loop that the RestartGuard couldn't stop. The root-cause analysis in INVESTIGATION-windows-max-plan-drain.md is thorough and the three-legged fix is well-scoped.


What's done well

  • SessionRoutes fix (Phase 1) is exactly right. The old code explicitly acknowledged leaving messages in 'pending' state in the log action string — this PR closes that gap cleanly by mirroring the WorkerService.terminateSession() three-step pattern inline.
  • handleRestartGuardTripped extraction makes the formerly-inlined guard-trip logic unit-testable and removes the pre-existing split-brain between the two crash-recovery paths.
  • Extracting unrecoverable-patterns.ts is a good isolation move — the matcher is now testable without importing the full WorkerService.
  • Consecutive-success requirement in RestartGuard directly addresses the 91.6% failure rate regime. A single fluky success resetting the window was the exact mechanism that let the loop run indefinitely.
  • Absolute lifetime cap is a prudent backstop. The investigation is right that the windowed guard alone is insufficient.

Issues

1. No tests shipped (blocking for me)

The Phase 4 test plan is detailed and the PR checklist is unchecked. Three test files were planned:

  • tests/worker/RestartGuard.test.ts
  • tests/worker/worker-service-unrecoverable.test.ts
  • tests/worker/http/SessionRoutes.restart-trip.test.ts

None appear in the diff. Given that the investigation notes "no existing tests for RestartGuard or unrecoverablePatterns", and this is a safety-critical guard against billing drain, shipping without coverage is risky. The RestartGuard logic is purely algorithmic and easy to unit-test with a mocked clock.

2. 'Unauthorized' is too broad

// src/services/worker/unrecoverable-patterns.ts
'Unauthorized',

The PR justifiably avoids bare '401' (would match request IDs), but 'Unauthorized' is a common English word that appears in many non-auth error messages:

  • "User is not authorized to access this resource"
  • "Unauthorized access attempt blocked"
  • Any HTTP middleware or proxy that formats its own 403 message

The investigation references '401 Unauthorized' in fallback-error-handler.test.ts:75-77 as the terminal pattern — consider using '401 Unauthorized' (the standard HTTP status line) instead of bare 'Unauthorized'. That's still specific enough to catch OAuth errors while avoiding false positives on authorization-related application errors.

3. Markdown investigation/plan files in repo root

INVESTIGATION-windows-max-plan-drain.md and PLAN-windows-max-plan-drain-fix.md are committed to the repo root. These are internal working documents. The repo already has a private/context/ tree (referenced in CLAUDE.md for the exit-codes doc). These would be better placed in private/context/ or docs/ so the root stays clean for standard project files, or removed entirely if the investigation doc is the authoritative record.

4. Minor: optional-chaining guard in handleRestartGuardTripped

private handleRestartGuardTripped(
  sessionDbId: number,
  pendingCount: number,
  session: { abortController: AbortController; restartGuard?: RestartGuard }
): void {
  const restartGuard = session.restartGuard; // may be undefined
  ...
  restartsInWindow: restartGuard?.restartsInWindow,  // logs undefined silently

The parameter type marks restartGuard as optional, so the logger will silently emit undefined fields when called without one. In practice the only call site (SessionRoutes.ts:362) is guarded by if (!restartAllowed) which implies the guard was used — but the helper's signature doesn't enforce that. Either make restartGuard required in the parameter type (and add a null-check at the call site) or add an explicit log note when it's missing. Low-impact but worth tightening given this is diagnostic data for an active incident.


Summary

The fix logic itself is correct — all three legs of the drain loop are addressed and the approach matches the existing patterns in the codebase. The main blocker is the absence of the planned tests, particularly for RestartGuard (algorithmic, trivial to test with a fake clock) and the new isUnrecoverableError matcher. Would suggest holding merge until those land, given the blast radius of the bug being fixed.

🤖 Generated with Claude Code

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 21, 2026

Greptile Summary

This PR closes a Windows Max-plan drain loop by addressing three root causes: (1) the restart guard's trip branch in SessionRoutes now marks pending messages as abandoned and removes the session, preventing processPendingQueues from replaying them on next daemon startup; (2) RestartGuard now requires 5 consecutive successes before decay is eligible and adds a hard 50-restart lifetime cap; (3) OAuth/OpenRouter auth-failure patterns are added to the unrecoverable-error list and extracted into a testable module.

  • The bare 'Unauthorized' substring in unrecoverable-patterns.ts is too broad — it could match error text from MCP tools or external HTTP APIs invoked during a session, permanently terminating a healthy session that could have recovered. The already-added 'OAuth token expired' and 'token has been revoked' patterns should cover the Max-plan scenario without requiring this catch-all.

Confidence Score: 4/5

Safe to merge after addressing the 'Unauthorized' pattern scope — all other changes are well-targeted and correct.

There is one P1 finding: the bare 'Unauthorized' substring in the unrecoverable-patterns list is too broad and could prematurely kill sessions due to transient 401 errors from MCP-called APIs. All other changes (SessionRoutes guard-trip cleanup, RestartGuard tightening, module extraction) are logically sound and directly address the described drain loop.

src/services/worker/unrecoverable-patterns.ts — the 'Unauthorized' pattern needs scoping before merge.

Important Files Changed

Filename Overview
src/services/worker/unrecoverable-patterns.ts New file extracting error patterns from worker-service.ts; adds OAuth/OpenRouter patterns. The bare 'Unauthorized' substring is too broad and risks killing sessions on transient 3rd-party 401 errors.
src/services/worker/RestartGuard.ts Adds consecutive-success requirement (5) for decay eligibility and a 50-restart absolute lifetime cap. Logic is sound; minor off-by-one in documentation vs. implementation (50 allowed, 51st blocked).
src/services/worker/http/routes/SessionRoutes.ts New handleRestartGuardTripped() correctly aborts the controller, marks messages abandoned, removes the session, and broadcasts completion — closing the pending-message replay loop on guard trip.
src/services/worker-service.ts Refactored to import isUnrecoverableError from the new unrecoverable-patterns module; no behavioral change in this file's logic.
INVESTIGATION-windows-max-plan-drain.md Documentation file; records root-cause analysis of the Windows Max-plan drain loop. No code changes.
PLAN-windows-max-plan-drain-fix.md Documentation file; phased implementation plan for the drain-loop fix. No code changes.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Generator exits with pending work] --> B{isUnrecoverableError?}
    B -- Yes --> C[terminateSession\nmark abandoned + remove]
    B -- No --> D[restartGuard.recordRestart]
    D --> E{Windowed cap\nor lifetime cap exceeded?}
    E -- No --> F[Restart generator\nwith backoff]
    F --> A
    E -- Yes --> G[handleRestartGuardTripped\nSessionRoutes path]
    G --> H[abortController.abort]
    H --> I[markAllSessionMessagesAbandoned]
    I --> J[removeSessionImmediate]
    J --> K[broadcastSessionCompleted]
    K --> L[STOP — messages not replayed\non next daemon startup]
Loading

Fix All in Claude Code

Reviews (1): Last reviewed commit: "fix: stop Windows Max-plan drain loop — ..." | Re-trigger Greptile

// is no longer valid.
'OAuth token expired',
'token has been revoked',
'Unauthorized',
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 'Unauthorized' pattern too broad — may prematurely kill healthy sessions

'Unauthorized' is a plain-English word that can appear in error messages from MCP tools, external HTTP APIs called through MCP, database permission errors, or any third-party library. Because isUnrecoverableError is a substring match, a transient "Unauthorized" from an MCP-called REST endpoint would permanently terminate the session and abandon all pending messages — the same effect as an expired OAuth token.

The PR description explicitly rejected bare '401' as "too broad," but 'Unauthorized' carries the same risk. The OAuth-specific patterns already added ('OAuth token expired', 'token has been revoked') should be sufficient to catch Max-plan token expiry. If Claude SDK specifically emits an "Unauthorized" string for expired tokens, a more scoped prefix (e.g., 'claude: Unauthorized' or 'SDK error: Unauthorized') would be safer.

Fix in Claude Code

Comment on lines +36 to +38
if (this.totalRestartsAllTime > ABSOLUTE_LIFETIME_RESTART_CAP) {
return false;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Lifetime-cap check is off-by-one relative to the stated limit

totalRestartsAllTime is incremented before the > ABSOLUTE_LIFETIME_RESTART_CAP check, so restart #50 sets the counter to 50, 50 > 50 is false, and the restart is allowed. The cap therefore permits exactly 50 restarts (not "trips at 50"). This is consistent with the PR description ("50 total restarts"), but a brief comment clarifying that > CAP means "CAP restarts are allowed, CAP+1 is the first blocked one" would prevent future confusion.

Fix in Claude Code

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/services/worker-service.ts (1)

816-827: ⚠️ Potential issue | 🟠 Major

Emit the session-complete event from terminateSession() too.

This branch now logs the richer guard state, but its terminal path still funnels through this.terminateSession(...), which only abandons messages and removes the session. runFallbackForTerminatedSession() and SessionRoutes.handleRestartGuardTripped() already call broadcastSessionCompleted(), so worker-service guard trips and unrecoverable-error shutdowns still miss that terminal event.

🔧 Proposed fix
  private terminateSession(sessionDbId: number, reason: string): void {
    const pendingStore = this.sessionManager.getPendingMessageStore();
    const abandoned = pendingStore.markAllSessionMessagesAbandoned(sessionDbId);

    logger.info('SYSTEM', 'Session terminated', {
      sessionId: sessionDbId,
      reason,
      abandonedMessages: abandoned
    });

    // removeSessionImmediate fires onSessionDeletedCallback → broadcastProcessingStatus()
    this.sessionManager.removeSessionImmediate(sessionDbId);
+   this.sessionEventBroadcaster.broadcastSessionCompleted(sessionDbId);
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/services/worker-service.ts` around lines 816 - 827, The restart-guard
branch ends by calling this.terminateSession(session.sessionDbId,
'max_restarts_exceeded') but terminateSession does not emit the session-complete
event, so consumers miss the terminal event; update terminateSession(sessionId,
reason) to also call broadcastSessionCompleted(sessionId, { reason }) (or the
equivalent session-complete emitter used by
runFallbackForTerminatedSession()/SessionRoutes.handleRestartGuardTripped) after
it finishes abandoning messages/removing the session, ensuring all terminal
paths (restart guard trips and unrecoverable shutdowns) emit the same
session-complete event.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@PLAN-windows-max-plan-drain-fix.md`:
- Around line 385-390: Update Phase 5 verification to look in the new module
instead of the old file: replace the grep that targets
src/services/worker-service.ts with a search against
src/services/worker/unrecoverable-patterns.ts (e.g., grep -n "'OAuth token
expired'" src/services/worker/unrecoverable-patterns.ts) so the
OAuth/unrecoverable matcher check points to the extracted symbol in
unrecoverable-patterns.ts.
- Around line 241-245: The documentation and test plan disagree about the
lifetime-cap boundary for restarts (one place says the 50th restart should be
blocked, another says 50 are allowed and the 51st blocks); decide on the single
canonical behavior (either "max 50 allowed, 51st blocked" or "50th blocked") and
update both the test expectation in RestartGuard.test.ts (referenced as the
Phase 4 test) and the descriptive lines in this doc (the two bullets describing
49/50/51 behavior and the later lines 326-327) so they match the chosen rule and
the term "lifetime-cap" is used consistently throughout.

In `@src/services/worker/RestartGuard.ts`:
- Around line 31-50: In recordRestart(), after you set
this.consecutiveSuccessCount = 0 (which breaks the success streak), also clear
the decay flag by setting this.decayEligible = false so a broken streak cannot
later trigger the decay path; update the recordRestart() function (referencing
recordRestart(), consecutiveSuccessCount, decayEligible, restartTimestamps,
lastSuccessfulProcessing) to flip decayEligible off immediately when a restart
breaks the streak.

---

Outside diff comments:
In `@src/services/worker-service.ts`:
- Around line 816-827: The restart-guard branch ends by calling
this.terminateSession(session.sessionDbId, 'max_restarts_exceeded') but
terminateSession does not emit the session-complete event, so consumers miss the
terminal event; update terminateSession(sessionId, reason) to also call
broadcastSessionCompleted(sessionId, { reason }) (or the equivalent
session-complete emitter used by
runFallbackForTerminatedSession()/SessionRoutes.handleRestartGuardTripped) after
it finishes abandoning messages/removing the session, ensuring all terminal
paths (restart guard trips and unrecoverable shutdowns) emit the same
session-complete event.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 135d28be-068a-4f92-82a3-e7a6b7bff82b

📥 Commits

Reviewing files that changed from the base of the PR and between 49ab404 and a172c6e.

📒 Files selected for processing (6)
  • INVESTIGATION-windows-max-plan-drain.md
  • PLAN-windows-max-plan-drain-fix.md
  • src/services/worker-service.ts
  • src/services/worker/RestartGuard.ts
  • src/services/worker/http/routes/SessionRoutes.ts
  • src/services/worker/unrecoverable-patterns.ts

Comment on lines +241 to +245
- `bun test tests/worker/RestartGuard.test.ts` (new file — see Phase 4).
- Behavior check: 49 failed restarts → still allowed; 50th → blocked; 51st →
blocked (lifetime cap persists).
- Behavior check: 4 successes then restart → full window still counted.
5th success then restart 5 min later → window cleared.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Reconcile the lifetime-cap expectation.

This section says the 50th restart should already be blocked, but Lines 326-327 later say 50 restarts are still allowed and only the 51st blocks. Please pick one so the tests and the implementation don't drift apart.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@PLAN-windows-max-plan-drain-fix.md` around lines 241 - 245, The documentation
and test plan disagree about the lifetime-cap boundary for restarts (one place
says the 50th restart should be blocked, another says 50 are allowed and the
51st blocks); decide on the single canonical behavior (either "max 50 allowed,
51st blocked" or "50th blocked") and update both the test expectation in
RestartGuard.test.ts (referenced as the Phase 4 test) and the descriptive lines
in this doc (the two bullets describing 49/50/51 behavior and the later lines
326-327) so they match the chosen rule and the term "lifetime-cap" is used
consistently throughout.

Comment on lines +385 to +390
1. `bun run build` — builds cleanly (no TS errors).
2. `bun test` — full suite green.
3. `grep -rn 'Messages remain in pending state' src/` — no matches (the phrase
is gone from the codebase).
4. `grep -n "'OAuth token expired'" src/services/worker-service.ts` — matches.
5. `grep -n 'ABSOLUTE_LIFETIME_RESTART_CAP\|REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY' src/services/worker/RestartGuard.ts`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Phase 5 still points verification at the old file.

The OAuth/unrecoverable matcher was extracted to src/services/worker/unrecoverable-patterns.ts, so this grep will now fail and send follow-up edits to the wrong place.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@PLAN-windows-max-plan-drain-fix.md` around lines 385 - 390, Update Phase 5
verification to look in the new module instead of the old file: replace the grep
that targets src/services/worker-service.ts with a search against
src/services/worker/unrecoverable-patterns.ts (e.g., grep -n "'OAuth token
expired'" src/services/worker/unrecoverable-patterns.ts) so the
OAuth/unrecoverable matcher check points to the extracted symbol in
unrecoverable-patterns.ts.

Comment on lines 31 to +50
recordRestart(): boolean {
this.totalRestartsAllTime += 1;
this.consecutiveSuccessCount = 0; // streak broken by any restart

// Terminal: lifetime cap reached — never resets, even if successes follow.
if (this.totalRestartsAllTime > ABSOLUTE_LIFETIME_RESTART_CAP) {
return false;
}

const now = Date.now();

// Decay: clear history only after real success + 5min of uninterrupted success
if (this.lastSuccessfulProcessing !== null
// Decay: only fires if we accumulated the required consecutive successes
// AND 5min has elapsed since the last success. One-off successes cannot
// clear the windowed-restart history.
if (this.decayEligible
&& this.lastSuccessfulProcessing !== null
&& now - this.lastSuccessfulProcessing >= DECAY_AFTER_SUCCESS_MS) {
this.restartTimestamps = [];
this.lastSuccessfulProcessing = null;
this.decayEligible = false;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Clear decayEligible when a restart breaks the success streak.

recordRestart() zeroes consecutiveSuccessCount, but it leaves decayEligible armed. After 5 successes, one restart, and then a 5-minute gap, the next restart will still wipe restartTimestamps even though the streak was already broken. That reopens the slow-drip path this change is trying to close.

🔧 Proposed fix
  recordRestart(): boolean {
    this.totalRestartsAllTime += 1;
    this.consecutiveSuccessCount = 0; // streak broken by any restart
+   this.decayEligible = false;
 
    // Terminal: lifetime cap reached — never resets, even if successes follow.
    if (this.totalRestartsAllTime > ABSOLUTE_LIFETIME_RESTART_CAP) {
      return false;
    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/services/worker/RestartGuard.ts` around lines 31 - 50, In
recordRestart(), after you set this.consecutiveSuccessCount = 0 (which breaks
the success streak), also clear the decay flag by setting this.decayEligible =
false so a broken streak cannot later trigger the decay path; update the
recordRestart() function (referencing recordRestart(), consecutiveSuccessCount,
decayEligible, restartTimestamps, lastSuccessfulProcessing) to flip
decayEligible off immediately when a restart breaks the streak.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant