Merged
Changes from 2 commits
1 change: 1 addition & 0 deletions website/.vitepress/config.mts
@@ -222,6 +222,7 @@ export default defineConfig({
link: '/docs/guides/postgres-permissions',
},
{ text: 'Deployment', link: '/docs/guides/deployment' },
{ text: 'Upgrading', link: '/docs/guides/upgrading' },
{ text: 'Sharding', link: '/docs/guides/sharding' },
{ text: 'Security', link: '/docs/guides/security' },
{ text: 'Troubleshooting', link: '/docs/guides/troubleshooting' },
29 changes: 29 additions & 0 deletions website/docs/api/config.md
@@ -128,6 +128,35 @@ Suffix for the logical replication publication and slot name.

</EnvVarConfig>

### CLEANUP_REPLICATION_SLOTS_ON_SHUTDOWN

<EnvVarConfig
name="CLEANUP_REPLICATION_SLOTS_ON_SHUTDOWN"
defaultValue="false"
example="true">

When set to `true`, Electric creates a [temporary replication slot](https://www.postgresql.org/docs/current/protocol-replication.html) that is automatically dropped when the database connection closes. This is useful for ephemeral deployments where each container has its own storage and replication slots don't need to persist across restarts.

> [!Warning] Unclean shutdowns cause shape rotations
> If Electric crashes or loses its database connection (e.g., during a network partition), the temporary slot is lost. The next instance starts with a fresh slot, and clients connected to old shapes will receive `409` (must-refetch) responses, requiring a full resync.

See the [Upgrading guide](/docs/guides/upgrading#option-a-temporary-replication-slots) for more context on using temporary slots.

</EnvVarConfig>

### ELECTRIC_TEMPORARY_REPLICATION_SLOT_USE_RANDOM_NAME

<EnvVarConfig
name="ELECTRIC_TEMPORARY_REPLICATION_SLOT_USE_RANDOM_NAME"
defaultValue="false"
example="true">

When used with [`CLEANUP_REPLICATION_SLOTS_ON_SHUTDOWN=true`](#cleanup-replication-slots-on-shutdown), Electric generates a random replication slot name instead of deriving a deterministic name from [`ELECTRIC_REPLICATION_STREAM_ID`](#electric-replication-stream-id). This avoids slot name conflicts when multiple instances run concurrently during rolling deploys.

Has no effect unless `CLEANUP_REPLICATION_SLOTS_ON_SHUTDOWN` is also set to `true`.

</EnvVarConfig>
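
For an ephemeral deployment, the two options above are typically set together. As a sketch, in Docker Compose form (the image tag and database credentials are illustrative placeholders, not recommendations):

```yaml
services:
  electric:
    image: electricsql/electric:latest  # illustrative tag
    environment:
      DATABASE_URL: postgresql://user:password@db/app  # placeholder credentials
      # Drop the replication slot when the database connection closes...
      CLEANUP_REPLICATION_SLOTS_ON_SHUTDOWN: "true"
      # ...and randomise its name so a replacement instance started during
      # a rolling deploy doesn't collide with the outgoing one.
      ELECTRIC_TEMPORARY_REPLICATION_SLOT_USE_RANDOM_NAME: "true"
```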

### ELECTRIC_REPLICATION_IDLE_TIMEOUT

<EnvVarConfig
7 changes: 7 additions & 0 deletions website/docs/guides/deployment.md
@@ -217,6 +217,9 @@ readinessProbe:

This ensures the pod is only marked ready when Electric is fully operational and ready to serve shape requests.

> [!Tip] Rolling upgrades need different readiness probes
> If you are performing rolling deployments with `maxSurge: 1`, the `exec` probe above will cause a deadlock &mdash; the new pod can never return `200` while the old pod holds the replication lock. Use an `httpGet` readiness probe instead: Kubernetes treats any response status from 200 to 399 as success. See the [Upgrading guide](/docs/guides/upgrading) for details.
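
As a sketch, such a probe might look like the following, assuming Electric's health endpoint is exposed at `/v1/health` on port 3000 (adjust the path, port, and timings to your deployment):

```yaml
readinessProbe:
  httpGet:
    path: /v1/health   # assumed health endpoint path
    port: 3000         # assumed Electric service port
  initialDelaySeconds: 5
  periodSeconds: 10
```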

### Observability

Electric supports [OpenTelemetry](https://opentelemetry.io/) for exporting traces, with built-in support for [Honeycomb.io](https://www.honeycomb.io/). Metrics are also available in StatsD and Prometheus formats.
@@ -229,6 +232,10 @@ Electric is designed to run behind a caching proxy, such as [Nginx](https://ngin

See the [Caching section](/docs/api/http#caching) of the HTTP API docs for more information.

### Upgrading

If you're running Electric behind an orchestrator that performs rolling updates (e.g., Kubernetes, AWS ECS), see the [Upgrading guide](/docs/guides/upgrading) for strategies to minimize disruption when deploying new versions.

## 3. Connecting your app

You can then connect your app to Electric [over HTTP](/docs/api/http). Typically you use a [Client library](/docs/api/clients/typescript) and configure the URL in the constructor, e.g.:
1 change: 1 addition & 0 deletions website/docs/guides/sharding.md
@@ -512,5 +512,6 @@ Switching shards is transparent at the API surface (same URL structure), but cli
## Next steps

- Review the [deployment guide](/docs/guides/deployment) for production configuration
- See the [upgrading guide](/docs/guides/upgrading) for rolling deployment strategies
- See [auth patterns](/docs/guides/auth) for securing your sharded deployment
- Check [benchmarks](/docs/reference/benchmarks) for performance expectations per shard
22 changes: 22 additions & 0 deletions website/docs/guides/troubleshooting.md
@@ -379,6 +379,28 @@ GRANT SELECT ON schema.tablename TO electric_user;
ALTER TABLE schema.tablename OWNER TO electric_user;
```

### SQLite corruption &mdash; why is my shape metadata database corrupt on NFS/EFS?

Electric uses SQLite for shape metadata. SQLite relies on file-level locking that can behave incorrectly on network filesystems like NFS or AWS EFS, potentially leading to database corruption when multiple processes access the same file.

##### Solution &mdash; configure exclusive mode or separate storage paths

**Option 1:** Set [`ELECTRIC_SHAPE_DB_EXCLUSIVE_MODE=true`](/docs/api/config#electric-shape-db-exclusive-mode) to force SQLite to use a single read-write connection instead of multiple reader connections. This avoids locking issues at the cost of reduced read throughput.

**Option 2:** Set [`ELECTRIC_SHAPE_DB_STORAGE_DIR`](/docs/api/config#electric-shape-db-storage-dir) to a local (non-shared) path. This keeps the SQLite database on local storage while shape logs remain on the shared network filesystem. The SQLite database will be rebuilt from the shape logs on startup.
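
Either option is a one-line environment change. A sketch (the storage path is an illustrative example, not a default):

```sh
# Option 1: force a single read-write SQLite connection
ELECTRIC_SHAPE_DB_EXCLUSIVE_MODE=true

# Option 2: keep SQLite metadata on local disk, shape logs stay on the share
ELECTRIC_SHAPE_DB_STORAGE_DIR=/var/lib/electric/shape-db  # illustrative local path
```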

### Replication slot recreation &mdash; why are all clients resyncing after a crash?

When Electric's replication slot is dropped or lost &mdash; whether due to a crash, use of [temporary replication slots](/docs/guides/upgrading#option-a-temporary-replication-slots), or Postgres invalidating it because [`max_slot_wal_keep_size`](#recommended-postgresql-settings) was exceeded &mdash; the new slot starts from the current WAL position with no history.

This means all existing shapes are invalidated. Clients will receive `409` (must-refetch) responses and must perform a full resync of their shapes. This is normal recovery behavior but results in a temporary spike in load as all clients resync simultaneously.

##### Solution &mdash; handle 409 responses and monitor slot health

- Ensure your clients handle `409` responses gracefully (the official [TypeScript client](/docs/api/clients/typescript) does this automatically)
- Monitor your replication slot health with the [diagnostic checklist](#quick-diagnostic-checklist) above
- Set `max_slot_wal_keep_size` conservatively to avoid unexpected slot invalidation
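
If you're implementing a custom client, the reaction to a `409` can be sketched as follows. The helper and type names here are hypothetical illustrations, not part of any Electric API; the official TypeScript client implements this behaviour internally.

```typescript
// Hypothetical sketch of client-side status handling.
type Action = { kind: "continue" } | { kind: "refetch" } | { kind: "retry" };

function handleStatus(status: number): Action {
  if (status === 409) {
    // The shape is gone (e.g. the replication slot was recreated):
    // discard local shape data and resync from scratch.
    return { kind: "refetch" };
  }
  if (status >= 500) {
    // Transient server error: back off and retry the same request.
    return { kind: "retry" };
  }
  return { kind: "continue" };
}
```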

### Vercel CDN caching &mdash; why are my shapes not updating on Vercel?

Vercel's CDN can cache responses when you proxy requests to an external Electric service using [rewrites](https://vercel.com/docs/edge-network/caching). Vercel's [cache keys are not configurable](https://vercel.com/docs/cdn-cache/purge#cache-keys) and may not differentiate between requests with different query parameters. Since Electric uses query parameters like `offset` and `handle` to track shape log position, this can result in stale or incorrect cached responses being served instead of reaching your Electric backend.
Expand Down