fix(client): self-healing for permanently stuck expired shape handles #4087

Open
KyleAMathews wants to merge 5 commits into main from pr-4085

@KyleAMathews

Summary

Expired shape handle entries in localStorage can get permanently stuck, preventing data from ever loading for affected shapes. This adds a self-healing retry mechanism that clears the poisoned entry and retries once, allowing automatic recovery even when a proxy strips cache-buster query parameters.

Based on #4085 by @evan-liveflow — refined with additional hardening from code review.

Root Cause

When a shape gets a 409 (handle rotation), the client stores the old handle in localStorage['electric_expired_shapes']. On future requests, if a response contains that handle, the client treats it as a stale cached response and retries up to 3 times with cache-buster params.
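
The detection rule described above can be sketched roughly as follows. This is a minimal illustration, not the client's actual code: the in-memory `Map` stands in for `localStorage['electric_expired_shapes']`, and `markExpired`, `isStaleResponse`, `decide`, and `MAX_STALE_RETRIES` are hypothetical names chosen for the example.

```typescript
// Hypothetical in-memory stand-in for localStorage['electric_expired_shapes'].
// Maps a shape key to the handle that a 409 response invalidated.
const expiredHandles = new Map<string, string>()

// Retry budget taken from the description (3 attempts); the name is illustrative.
const MAX_STALE_RETRIES = 3

// Record the old handle when the server rotates it (409).
function markExpired(shapeKey: string, oldHandle: string): void {
  expiredHandles.set(shapeKey, oldHandle)
}

// A response is treated as stale when it carries the handle already
// recorded as expired for this shape.
function isStaleResponse(shapeKey: string, responseHandle: string): boolean {
  return expiredHandles.get(shapeKey) === responseHandle
}

type Decision = 'accept' | 'retry-with-cache-buster' | 'give-up'

// Decide what to do with a response, given how many stale retries were already made.
function decide(shapeKey: string, responseHandle: string, retriesSoFar: number): Decision {
  if (!isStaleResponse(shapeKey, responseHandle)) return 'accept'
  return retriesSoFar < MAX_STALE_RETRIES ? 'retry-with-cache-buster' : 'give-up'
}

markExpired('shape-a', 'handle-1')
const freshDecision = decide('shape-a', 'handle-2', 0) // a different handle is accepted
const staleDecision = decide('shape-a', 'handle-1', 0) // the expired handle triggers a retry
```

Whether the real client gives up on the third or the fourth stale response is not stated here, so the boundary in `decide` is an assumption.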

The problem: if a proxy (e.g., phoenix_sync) strips query parameters, the cache busters are ineffective. All 3 retries fail, FetchError(502) is thrown to onError, and if onError doesn't retry, the stream dies. The expired entry persists in localStorage, so the next session hits the same wall — permanently.
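
To see why stripped query parameters defeat the cache busters, consider a toy model of a cache keyed by URL. Everything here is hypothetical (the `UrlKeyedCache` class, the `stripQuery` helper, the `/v1/shape` path, and the `cache_buster` parameter name); it only illustrates the failure mode, not how phoenix_sync or any CDN is implemented.

```typescript
// Hypothetical proxy step: drop everything from '?' onwards before forwarding.
function stripQuery(url: string): string {
  const queryStart = url.indexOf('?')
  return queryStart === -1 ? url : url.slice(0, queryStart)
}

// Hypothetical caching layer keyed by the exact URL it receives.
class UrlKeyedCache {
  private entries = new Map<string, string>()

  // Returns the cached handle for this URL, or asks the origin and caches the answer.
  lookup(url: string, fetchFromOrigin: () => string): string {
    const cached = this.entries.get(url)
    if (cached !== undefined) return cached
    const fresh = fetchFromOrigin()
    this.entries.set(url, fresh)
    return fresh
  }
}

let serverHandle = 'handle-old'
const fetchHandle = () => serverHandle

const edgeCache = new UrlKeyedCache()
edgeCache.lookup('/v1/shape', fetchHandle) // warms the cache with handle-old
serverHandle = 'handle-new' // the server rotates the handle

// Without the proxy, the cache buster changes the cache key, so this is a miss.
const directHandle = edgeCache.lookup('/v1/shape?cache_buster=1', fetchHandle)
// Through the stripping proxy, the key is unchanged, so the stale entry is served.
const proxiedHandle = edgeCache.lookup(stripQuery('/v1/shape?cache_buster=2'), fetchHandle)
```

In this model `directHandle` is the rotated handle while `proxiedHandle` is still the old one, which is the situation in which every retry looks stale.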

Since the server never reuses handles (now documented as SPEC.md S0), the expired entry becomes a false positive once the caching layer clears — but the client has no way to discover this.

Approach

Once the stale cache retries are exhausted (3 attempts), the client now:

  1. Always clears the expired entry from localStorage — if cache busters didn't work, keeping the entry only poisons future sessions
  2. Attempts one self-healing retry — resets the stream and retries without the expired_handle param. Since handles are never reused, the fresh response will have a new handle and won't trigger stale detection
  3. Guards against infinite loops via #expiredShapeRecoveryKey (once per shape key, reset on up-to-date)

if (transition.exceededMaxRetries) {
  if (shapeKey) {
    expiredShapesCache.delete(shapeKey)       // always clear
    if (this.#expiredShapeRecoveryKey !== shapeKey) {
      this.#expiredShapeRecoveryKey = shapeKey // remember we tried
      this.#reset()                            // fresh start
      throw new StaleCacheError(...)           // caught internally → retry
    }
  }
  throw new FetchError(502, ...)               // truly give up
}

Key Invariants

  • S0: Server handles are unique and never reused (phash2 + microsecond timestamp, SQLite UNIQUE INDEX, ETS insert_new)
  • Self-healing fires at most once per shape per retry cycle (#expiredShapeRecoveryKey guard)
  • Guard resets on up-to-date, so long-lived streams can self-heal again if CDN misbehaves later
  • Expired entry is cleared on every exhaustion, regardless of whether self-healing fires
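
The guard invariants can be exercised with a small stand-alone model. This is a sketch under assumptions, not the code in src/client.ts: `SelfHealingSketch`, `onRetriesExhausted`, `onUpToDate`, and the outcome strings are hypothetical, a `Map` stands in for the persisted expired-shapes cache, and return values replace the thrown `StaleCacheError` / `FetchError(502)` so the example stays synchronous.

```typescript
// Outcome of exhausting the stale-cache retries (illustrative names).
type ExhaustionOutcome = 'self-heal-retry' | 'fetch-error-502'

class SelfHealingSketch {
  // Stand-in for the persisted expired-shapes cache (shape key -> expired handle).
  readonly expiredShapes = new Map<string, string>()
  // Plays the role of #expiredShapeRecoveryKey.
  private expiredShapeRecoveryKey: string | null = null

  // Called when the stale-cache retries are exhausted.
  onRetriesExhausted(shapeKey: string | undefined): ExhaustionOutcome {
    if (shapeKey) {
      this.expiredShapes.delete(shapeKey) // always clear the poisoned entry
      if (this.expiredShapeRecoveryKey !== shapeKey) {
        this.expiredShapeRecoveryKey = shapeKey // remember we tried
        return 'self-heal-retry'
      }
    }
    return 'fetch-error-502' // already tried once in this cycle, or no shape key
  }

  // Called when the stream reaches up-to-date: self-healing becomes available again.
  onUpToDate(): void {
    this.expiredShapeRecoveryKey = null
  }
}

const sketch = new SelfHealingSketch()
sketch.expiredShapes.set('shape-a', 'handle-old')
const firstOutcome = sketch.onRetriesExhausted('shape-a') // self-heals once
sketch.expiredShapes.set('shape-a', 'handle-old')
const secondOutcome = sketch.onRetriesExhausted('shape-a') // guard blocks a second attempt
const clearedAgain = !sketch.expiredShapes.has('shape-a') // entry cleared regardless
sketch.onUpToDate()
const thirdOutcome = sketch.onRetriesExhausted('shape-a') // guard was reset
```

Returning an outcome instead of throwing is only a convenience for the example; the description above states that the real code throws and catches internally.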

Non-goals

  • TTL on expired cache entries — the self-healing mechanism handles the failure mode without added complexity
  • Changing onError contract — the fix works regardless of what the user's onError callback does

Verification

cd packages/typescript-client
pnpm vitest run --config vitest.unit.config.ts
# 312 tests pass
pnpm exec tsc --noEmit
# Clean

Files changed

| File | Change |
| --- | --- |
| src/client.ts | Self-healing logic in #onInitialResponse, recovery key cleared on up-to-date, updated catch block comment |
| test/expired-shapes-cache.test.ts | Updated 2 existing tests for self-healing flow, added test for CDN-always-stale scenario |
| SPEC.md | Added S0 (handle uniqueness guarantee), updated L3 loop-back entry and guard table |
| .changeset/fix-expired-shapes-self-healing.md | Changeset for patch release |

Based on #4085

evanob and others added 2 commits April 2, 2026 20:22
When stale cache retries exhaust (3 attempts), clear the expired entry
from localStorage and retry once without the expired_handle param.
Since handles are never reused (SPEC.md S0), the fresh response gets a
new handle and bypasses stale detection. This prevents shapes from being
permanently unloadable when a proxy strips cache-buster query params.

Also documents the server handle uniqueness guarantee (S0) in the spec,
updates the loop-back table for the new self-healing path, and resets
the recovery guard on up-to-date so self-healing remains available for
long-lived streams.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@netlify

netlify bot commented Apr 3, 2026

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit 8502389
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/69cfd8a0b8ddc90008691651
😎 Deploy Preview https://deploy-preview-4087--electric-next.netlify.app

@pkg-pr-new

pkg-pr-new bot commented Apr 3, 2026

npm i https://pkg.pr.new/@electric-sql/react@4087
npm i https://pkg.pr.new/@electric-sql/client@4087
npm i https://pkg.pr.new/@electric-sql/y-electric@4087

commit: 1f7a5f4

@codecov

codecov bot commented Apr 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.75%. Comparing base (2659598) to head (1f7a5f4).
⚠️ Report is 32 commits behind head on main.
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (2659598) and HEAD (1f7a5f4).

HEAD has 6 uploads less than BASE
| Flag | BASE (2659598) | HEAD (1f7a5f4) |
| --- | --- | --- |
| unit-tests | 6 | 4 |
| typescript | 5 | 4 |
| packages/typescript-client | 1 | 0 |
| elixir | 1 | 0 |
| electric-telemetry | 1 | 0 |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4087      +/-   ##
==========================================
- Coverage   84.90%   75.75%   -9.16%     
==========================================
  Files          39       11      -28     
  Lines        2869      693    -2176     
  Branches      609      174     -435     
==========================================
- Hits         2436      525    -1911     
+ Misses        431      167     -264     
+ Partials        2        1       -1     

| Flag | Coverage Δ |
| --- | --- |
| electric-telemetry | ? |
| elixir | ? |
| packages/experimental | 87.73% <ø> (ø) |
| packages/react-hooks | 86.48% <ø> (ø) |
| packages/start | 82.83% <ø> (ø) |
| packages/typescript-client | ? |
| packages/y-electric | 56.05% <ø> (ø) |
| typescript | 75.75% <ø> (-12.93%) ⬇️ |
| unit-tests | 75.75% <ø> (-9.16%) ⬇️ |

KyleAMathews and others added 3 commits April 3, 2026 09:16
…test

The test waited for fast-loop detection to error, but the exponential
backoff (100ms-5s across 5 detections) takes longer than the timeout
in CI. Simplified to verify self-healing fires and the entry is cleared
— the fast-loop error path is already tested in stream.test.ts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The #expiredShapeRecoveryKey guard was only cleared in #onMessages when
an up-to-date batch arrived. The 204 backward-compatibility path
transitions directly to LiveState without going through #onMessages
(empty body → batch.length === 0 → early return), leaving the guard
stuck. This prevented a second self-healing cycle on the same stream
instance.

Clear the guard in #onInitialResponse when the response transitions
directly to live (action=accepted, state=live), covering the 204 path.
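
The gap this commit closes can be sketched with a simplified model. The class and method names below (`GuardResetSketch`, `markRecoveryAttempted`, `guard`) and the message shape are hypothetical; only the two reset conditions follow the commit message above.

```typescript
// Hypothetical message shape: only the up-to-date flag matters here.
type SketchMessage = { isUpToDate: boolean }

class GuardResetSketch {
  // Plays the role of #expiredShapeRecoveryKey.
  private recoveryKey: string | null = null

  markRecoveryAttempted(shapeKey: string): void {
    this.recoveryKey = shapeKey
  }

  guard(): string | null {
    return this.recoveryKey
  }

  // Pre-existing reset: only a non-empty batch containing up-to-date clears the guard.
  onMessages(batch: SketchMessage[]): void {
    if (batch.length === 0) return // empty body (204 path): early return, guard untouched
    if (batch.some((m) => m.isUpToDate)) this.recoveryKey = null
  }

  // Added reset: a response that moves the stream straight to live also clears the guard.
  onInitialResponse(transition: { action: string; state: string }): void {
    if (transition.action === 'accepted' && transition.state === 'live') {
      this.recoveryKey = null
    }
  }
}

// Without the added reset, a 204 leaves the guard stuck.
const withoutFix = new GuardResetSketch()
withoutFix.markRecoveryAttempted('shape-a')
withoutFix.onMessages([])
const stuckGuard = withoutFix.guard() // still 'shape-a'

// With the added reset, the same 204 sequence clears it.
const withFix = new GuardResetSketch()
withFix.markRecoveryAttempted('shape-a')
withFix.onInitialResponse({ action: 'accepted', state: 'live' })
withFix.onMessages([])
const clearedGuard = withFix.guard() // null
```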

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…test

The caughtError===null assertion was environment-sensitive: the fast-loop
detector's 500ms window can catch more requests on slower machines,
firing a 502 that's orthogonal to the recovery guard bug being tested.

The precise signal is selfHealCount===2: if the guard is stuck, the
code throws 502 *before* incrementing, so selfHealCount stays at 1.
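
Why the counter is the more precise signal can be shown with a toy model of two exhaustion cycles. `countSelfHeals` and its parameter are hypothetical; the model only assumes, as the commit message says, that the 502 path is taken before the counter is incremented.

```typescript
// Returns how many times self-healing fires across two exhaustion cycles.
// guardResetsBetweenCycles models whether the guard is cleared when the
// stream goes live between the cycles.
function countSelfHeals(guardResetsBetweenCycles: boolean): number {
  const shapeKey = 'shape-a'
  let recoveryKey: string | null = null
  let selfHealCount = 0

  for (let cycle = 0; cycle < 2; cycle++) {
    if (recoveryKey === shapeKey) {
      // Guard is stuck: the 502 path is taken before any increment.
      continue
    }
    recoveryKey = shapeKey
    selfHealCount++
    if (guardResetsBetweenCycles) recoveryKey = null
  }
  return selfHealCount
}

const healsWithReset = countSelfHeals(true) // 2
const healsWithStuckGuard = countSelfHeals(false) // 1
```

The count depends only on the guard state, not on timing, which is what makes it robust on slower machines.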

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2 participants