diff --git a/website/.vitepress/config.mts b/website/.vitepress/config.mts index abd8c87b84..25c02dd041 100644 --- a/website/.vitepress/config.mts +++ b/website/.vitepress/config.mts @@ -222,6 +222,7 @@ export default defineConfig({ link: '/docs/guides/postgres-permissions', }, { text: 'Deployment', link: '/docs/guides/deployment' }, + { text: 'Upgrading', link: '/docs/guides/upgrading' }, { text: 'Sharding', link: '/docs/guides/sharding' }, { text: 'Security', link: '/docs/guides/security' }, { text: 'Troubleshooting', link: '/docs/guides/troubleshooting' }, diff --git a/website/docs/api/config.md b/website/docs/api/config.md index 74fc527d5b..3226912189 100644 --- a/website/docs/api/config.md +++ b/website/docs/api/config.md @@ -128,6 +128,35 @@ Suffix for the logical replication publication and slot name. +### CLEANUP_REPLICATION_SLOTS_ON_SHUTDOWN + + + +When set to `true`, Electric creates a [temporary replication slot](https://www.postgresql.org/docs/current/protocol-replication.html) that is automatically dropped when the database connection closes. This is useful for ephemeral deployments where each container has its own storage and replication slots don't need to persist across restarts. + +> [!Warning] Unclean shutdowns cause shape rotations +> If Electric crashes or loses its database connection (e.g., during a network partition), the temporary slot is lost. The next instance starts with a fresh slot and clients connected to old shapes will receive `409` (must-refetch) responses, requiring a full resync. + +See the [Upgrading guide](/docs/guides/upgrading#temporary-replication-slots) for more context on using temporary slots. + + + +### ELECTRIC_TEMPORARY_REPLICATION_SLOT_USE_RANDOM_NAME + + + +When used with [`CLEANUP_REPLICATION_SLOTS_ON_SHUTDOWN=true`](#cleanup-replication-slots-on-shutdown), generates a random replication slot name instead of the deterministic name based on [`ELECTRIC_REPLICATION_STREAM_ID`](#electric-replication-stream-id). 
This avoids slot name conflicts when multiple instances run concurrently during rolling deploys. + +Has no effect unless `CLEANUP_REPLICATION_SLOTS_ON_SHUTDOWN` is also set to `true`. + + + ### ELECTRIC_REPLICATION_IDLE_TIMEOUT diff --git a/website/docs/guides/deployment.md b/website/docs/guides/deployment.md +> [!Tip] Rolling upgrades need different readiness probes +> If you are performing rolling deployments with `maxSurge: 1`, the `exec` probe above will cause a deadlock — the new pod can never return `200` while the old pod holds the replication lock. Use an `httpGet` readiness probe instead, which accepts any 2xx. See the [Upgrading guide](/docs/guides/upgrading) for details. + ### Observability Electric supports [OpenTelemetry](https://opentelemetry.io/) for exporting traces, with built-in support for [Honeycomb.io](https://www.honeycomb.io/). Metrics are also available in StatsD and Prometheus formats. @@ -229,6 +232,10 @@ Electric is designed to run behind a caching proxy, such as [Nginx](https://ngin See the [Caching section](/docs/api/http#caching) of the HTTP API docs for more information. + +### Upgrading + +If you're running Electric behind an orchestrator that performs rolling updates (e.g., Kubernetes, AWS ECS), see the [Upgrading guide](/docs/guides/upgrading) for strategies to minimize disruption when deploying new versions. + +## 3. Connecting your app You can then connect your app to Electric [over HTTP](/docs/api/http). 
Typically you use a [Client library](/docs/api/clients/typescript) and configure the URL in the constructor, e.g.: diff --git a/website/docs/guides/sharding.md b/website/docs/guides/sharding.md index d415a94736..410cafa018 100644 --- a/website/docs/guides/sharding.md +++ b/website/docs/guides/sharding.md @@ -512,5 +512,6 @@ Switching shards is transparent at the API surface (same URL structure), but cli ## Next steps - Review the [deployment guide](/docs/guides/deployment) for production configuration +- See the [upgrading guide](/docs/guides/upgrading) for rolling deployment strategies - See [auth patterns](/docs/guides/auth) for securing your sharded deployment - Check [benchmarks](/docs/reference/benchmarks) for performance expectations per shard diff --git a/website/docs/guides/troubleshooting.md b/website/docs/guides/troubleshooting.md index c63e3c14b7..0c0d0bf354 100644 --- a/website/docs/guides/troubleshooting.md +++ b/website/docs/guides/troubleshooting.md @@ -379,6 +379,36 @@ GRANT SELECT ON schema.tablename TO electric_user; ALTER TABLE schema.tablename OWNER TO electric_user; ``` +### SQLite corruption — why is my shape metadata database corrupt on NFS/EFS? + +Electric uses SQLite for shape metadata. SQLite relies on file-level locking that can behave incorrectly on network filesystems like NFS or AWS EFS, potentially leading to database corruption when multiple processes access the same file. + +##### Solution — configure exclusive mode or separate storage paths + +**Option 1:** Set [`ELECTRIC_SHAPE_DB_EXCLUSIVE_MODE=true`](/docs/api/config#electric-shape-db-exclusive-mode) to force SQLite to use a single read-write connection instead of multiple reader connections. This avoids locking issues at the cost of reduced read throughput. + +**Option 2:** Set [`ELECTRIC_SHAPE_DB_STORAGE_DIR`](/docs/api/config#electric-shape-db-storage-dir) to a local (non-shared) path. 
This keeps the SQLite database on local storage while shape logs remain on the shared network filesystem. The SQLite database will be rebuilt from the shape logs on startup. + +### Replication slot recreation — why are all clients resyncing after a crash? + +When Electric's replication slot is dropped or lost — whether due to a crash, use of [temporary replication slots](/docs/guides/upgrading#temporary-replication-slots), or Postgres invalidating it because [`max_slot_wal_keep_size`](#recommended-postgresql-settings) was exceeded — the new slot starts from the current WAL position with no history. + +This means all existing shapes are invalidated. Clients will receive `409` (must-refetch) responses and must perform a full resync of their shapes. This is normal recovery behavior but results in a temporary spike in load as all clients resync simultaneously. + +##### Solution — handle 409 responses and monitor slot health + +- Ensure your clients handle `409` responses gracefully (the official [TypeScript client](/docs/api/clients/typescript) does this automatically) +- Monitor your replication slot health with the [diagnostic checklist](#quick-diagnostic-checklist) above +- Set `max_slot_wal_keep_size` conservatively to avoid unexpected slot invalidation + +### Rolling upgrades — why is my second instance stuck in 'waiting' state? + +This is expected behavior during a [rolling upgrade](/docs/guides/upgrading). The second instance has loaded shape metadata and is serving existing shapes in read-only mode while waiting for the first instance to release the advisory lock. Check `/v1/health` to confirm — a `202` response with `{"status": "waiting"}` indicates the instance is healthy and serving reads. + +During the advisory lock handover, there is a brief window (typically under a minute) where no instance returns `200`. During this window, existing shapes continue to be served in read-only mode. New shape creation returns `503` with a `Retry-After` header. 
This is expected and handled gracefully by clients. + +If you're seeing `409` errors during deploys with shared storage, check that both instances are pointing to the same `ELECTRIC_STORAGE_DIR` on a shared filesystem. With [separate storage](/docs/guides/upgrading#separate-storage-ephemeral), `409`s during deploys are expected since each instance has its own shape handles. + ### Vercel CDN caching — why are my shapes not updating on Vercel? Vercel's CDN can cache responses when you proxy requests to an external Electric service using [rewrites](https://vercel.com/docs/edge-network/caching). Vercel's [cache keys are not configurable](https://vercel.com/docs/cdn-cache/purge#cache-keys) and may not differentiate between requests with different query parameters. Since Electric uses query parameters like `offset` and `handle` to track shape log position, this can result in stale or incorrect cached responses being served instead of reaching your Electric backend. diff --git a/website/docs/guides/upgrading.md b/website/docs/guides/upgrading.md new file mode 100644 index 0000000000..c474e3a022 --- /dev/null +++ b/website/docs/guides/upgrading.md @@ -0,0 +1,375 @@ +--- +title: Upgrading - Guide +description: >- + How to upgrade the Electric sync service with minimal disruption. +outline: [2, 3] +--- + +# Upgrading + +How to upgrade the [Electric sync engine](/primitives/postgres-sync) with minimal disruption using rolling deployments. This guide covers two deployment scenarios: [shared storage](#shared-storage-recommended) (recommended) and [separate storage](#separate-storage-ephemeral) for ephemeral environments. + +Before reading this guide, make sure you're familiar with the [Deployment guide](/docs/guides/deployment) for general setup. + +## Overview + +Electric is designed to run as a **single active instance** per replication stream. 
It uses a PostgreSQL advisory lock — a cooperative lock used for application-level coordination that does not lock any tables or rows — to ensure only one instance actively replicates from Postgres at a time. + +When you deploy a new version: + +1. The **new instance** starts and loads shape metadata from storage +2. While the old instance holds the lock, the new instance enters **read-only mode** — it can serve requests for existing shapes but cannot create new ones +3. Once the old instance shuts down, its database connection drops and the lock is released +4. The **new instance** acquires the lock and becomes fully active + +``` +Time ────────────────────────────────────────────► + +Old [==== active (200) ====]--shutdown--X + lock released─┐ +New [starting][waiting (202)]───────────┴─[== active ==] + │ │ │ + loading serves existing fully operational + metadata shapes (read-only) +``` + +The read-only window is typically brief — a few seconds to under a minute, depending on how quickly your orchestrator terminates the old instance. During this window, existing shapes continue to be served. Requests for new shapes return `503` with a `Retry-After` header until the new instance becomes active. The official [TypeScript client](/docs/api/clients/typescript) handles both of these automatically. + +> [!Tip] Version compatibility +> Shape handle stability across deploys depends on Electric's internal shape identity computation not changing between versions. If a new version changes how shapes are identified or changes the storage schema, even shared-storage upgrades may trigger `409` (must-refetch) responses. Check the release notes for any such breaking changes before upgrading. 
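The client-side behavior described above can be sketched as a small dispatch — a hypothetical helper for custom clients (the official TypeScript client implements equivalent logic internally; the function name and action strings here are illustrative, not part of any Electric API):

```shell
# Hypothetical sketch of how a custom client might react to status
# codes seen during a rolling deploy. Action names are illustrative.
handle_deploy_status() {
  case "$1" in
    503) echo "backoff" ;;  # new-shape request in the read-only window: wait per Retry-After, then retry
    409) echo "refetch" ;;  # must-refetch: resync the shape from scratch
    2??) echo "ok" ;;       # 200/202/204: normal shape response
    *)   echo "error" ;;    # anything else: surface to the application
  esac
}
```

For example, `handle_deploy_status 503` prints `backoff`, signalling the retry path rather than a hard error.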
+ +### Choosing a strategy + +| | Shared storage | Separate storage | +|---|---|---| +| Client disruption | Minimal (new shapes briefly delayed) | 409s (clients must refetch shapes) | +| Sticky sessions required | No | Yes | +| Postgres overhead | Single slot | One slot per instance | +| Best for | [Most deployments](#shared-storage-recommended) | [Ephemeral environments](#separate-storage-ephemeral) | + +## How the advisory lock works + +The advisory lock is tied to the replication slot name: + +```sql +SELECT pg_advisory_lock(hashtext('electric_slot_{stream_id}')) +``` + +This lock is scoped to Electric's replication slot name and does not conflict with any other advisory locks or table-level locks in your database. + +- Only one instance can hold the lock per [`ELECTRIC_REPLICATION_STREAM_ID`](/docs/api/config#electric-replication-stream-id) +- The lock is held on the replication database connection — if the connection drops (e.g., instance shutdown), the lock is automatically released + +> [!Tip] Lock breaker +> Electric includes a lock breaker mechanism that checks every 10 seconds whether the replication slot associated with the lock is inactive in Postgres. If the slot is inactive but a backend still holds the advisory lock, Electric terminates that backend. This only affects connections where the replication stream has already stopped, so it will not interfere with a healthy instance during a normal rolling deploy. 
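During a deploy you can confirm which backend currently holds the lock directly from Postgres. This is a diagnostic sketch using the standard `pg_locks` and `pg_stat_activity` views; it lists all advisory locks in the database, of which Electric's replication lock will be one:

```sql
-- List advisory locks and the backends holding them.
SELECT l.pid, l.granted, a.application_name, a.state
FROM pg_locks l
JOIN pg_stat_activity a USING (pid)
WHERE l.locktype = 'advisory';
```

During a handover you should see the old instance's backend holding the granted lock and, once it disconnects, the new instance's backend take over.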
+ +## Health check behavior during upgrades + +The [`/v1/health`](/docs/guides/deployment#health-checks) endpoint reflects the instance's current state: + +| HTTP Status | Response | Meaning | +|-------------|----------|---------| +| `200` | `{"status": "active"}` | The instance is active — it holds the advisory lock and is fully operational | +| `202` | `{"status": "waiting"}` | The instance is ready — it can serve existing shapes in read-only mode but is not yet active | +| `202` | `{"status": "starting"}` | The instance is starting up and not yet ready to serve any requests | + +During the `waiting` state: + +- Requests for **existing shapes** are served normally (read-only mode) +- Requests that require **creating new shapes** return `503` with a `Retry-After: 5` header +- **Shape deletion** also requires active mode and returns `503` while waiting + +For orchestrator probe configuration, see the [health check section](#health-checks-must-accept-http-202) below. + +## Shared storage (recommended) + +When instances share the same filesystem (e.g., a persistent volume), they share shape data and metadata. This is the recommended approach because shape handles remain stable across deploys — clients don't need sticky sessions and experience minimal disruption. + +### When to use + +- Kubernetes with [ReadWriteMany](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes) PersistentVolumeClaims +- AWS ECS on EC2 with shared host volumes (use [placement constraints](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-placement-constraints.html) to keep tasks on the same host) +- Any platform where both instances can access the same filesystem + +> [!Warning] Network filesystems and performance +> Electric is IO-intensive — it reads and writes shape logs and metadata frequently. 
Network filesystems like [EFS](https://aws.amazon.com/efs/) or NFS add significant latency compared to local storage and may not perform well for large deployments. Prefer local volumes (e.g., NVMe SSDs on EC2 with host bind mounts) where possible. If you must use a network filesystem, see the [troubleshooting guide](/docs/guides/troubleshooting#sqlite-corruption-mdash-why-is-my-shape-metadata-database-corrupt-on-nfs-efs) for important SQLite configuration. + +### Configuration + +Both instances use identical configuration. The key requirement is that `ELECTRIC_STORAGE_DIR` points to a shared filesystem: + +```shell +DATABASE_URL=postgresql://user:password@host:5432/mydb +ELECTRIC_STORAGE_DIR=/shared/electric/data +ELECTRIC_SECRET=your-secret +``` + +> [!Warning] `ELECTRIC_SHAPE_DB_EXCLUSIVE_MODE` for shared storage +> When using a **network filesystem** (NFS, EFS) for shared storage, you **must** set [`ELECTRIC_SHAPE_DB_EXCLUSIVE_MODE=true`](/docs/api/config#electric-shape-db-exclusive-mode). This configures SQLite to use a single read-write connection, preventing corruption from concurrent access — SQLite's default WAL mode relies on shared-memory locking that does not work correctly on network filesystems. For **local shared volumes** (e.g., a K8s PVC backed by local SSD), this setting is not strictly required but is recommended as a safe default. It is included in all shared-storage examples below. + +### Docker Compose example + +This example demonstrates the shared-storage setup. In practice, your orchestrator handles starting and stopping instances during an upgrade. 
+ +```yaml +services: + electric: + image: electricsql/electric:0.9 # pin to a specific version + environment: + DATABASE_URL: ${DATABASE_URL} + ELECTRIC_STORAGE_DIR: /var/lib/electric/data + ELECTRIC_SHAPE_DB_EXCLUSIVE_MODE: "true" + ELECTRIC_SECRET: ${ELECTRIC_SECRET} + volumes: + - electric_data:/var/lib/electric/data + healthcheck: + test: ["CMD", "curl", "-sf", "http://localhost:3000/v1/health"] + interval: 10s + timeout: 2s + retries: 3 + # ...ports, networks, etc. + +volumes: + electric_data: +``` + +> [!Tip] Simulating a rolling deploy +> To test the lock handover locally, start a second container pointing at the same volume, then stop the first. Note that `docker compose --scale` requires removing static port mappings or using a port range to avoid conflicts. + +### Kubernetes example + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: electric +spec: + replicas: 1 + strategy: + type: RollingUpdate + rollingUpdate: + maxSurge: 1 + maxUnavailable: 0 + template: + # ...labels, selectors + spec: + terminationGracePeriodSeconds: 60 + containers: + - name: electric + image: electricsql/electric:0.9 # pin to a specific version + env: + - name: DATABASE_URL + valueFrom: + secretKeyRef: + name: electric-secrets + key: database-url + - name: ELECTRIC_STORAGE_DIR + value: "/var/lib/electric/data" + - name: ELECTRIC_SHAPE_DB_EXCLUSIVE_MODE + value: "true" + - name: ELECTRIC_SECRET + valueFrom: + secretKeyRef: + name: electric-secrets + key: electric-secret + volumeMounts: + - name: electric-storage + mountPath: /var/lib/electric/data + resources: + requests: + cpu: "500m" + memory: "512Mi" + # ...limits + livenessProbe: + httpGet: + path: /v1/health + port: 3000 + initialDelaySeconds: 10 + periodSeconds: 10 + timeoutSeconds: 2 + failureThreshold: 6 + readinessProbe: + httpGet: + path: /v1/health + port: 3000 + initialDelaySeconds: 5 + periodSeconds: 10 + timeoutSeconds: 2 + failureThreshold: 3 + volumes: + - name: electric-storage + 
persistentVolumeClaim: + claimName: electric-shared-pvc +--- +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: electric-shared-pvc +spec: + accessModes: + - ReadWriteMany + # storageClassName: efs-sc # use a storage class that supports RWX + resources: + requests: + storage: 10Gi +``` + +With `maxSurge: 1` and `maxUnavailable: 0`, Kubernetes will: + +1. Start a new pod alongside the existing one +2. The new pod enters read-only mode (`202` "waiting") and passes the readiness probe (any 2xx) +3. Kubernetes terminates the old pod +4. The old pod shuts down, releasing the advisory lock +5. The new pod acquires the lock and becomes fully active (`200`) + +### AWS ECS example + +This example uses EC2 launch type with a host bind mount for shared storage. Both old and new tasks share the same directory on the EC2 host. + +> [!Warning] Same-host placement +> ECS does not guarantee that the new task lands on the same host as the old one. To ensure both tasks share the same host volume, your ECS cluster must have exactly one EC2 instance matching your placement constraint, or use a custom instance attribute to pin tasks to a specific host. + +```json +{ + "family": "electric", + "networkMode": "awsvpc", + "cpu": "1024", + "memory": "2048", + "executionRoleArn": "arn:aws:iam::...:role/ecsTaskExecutionRole", + "containerDefinitions": [ + { + "name": "electric", + "image": "electricsql/electric:0.9", + "portMappings": [ + { "containerPort": 3000, "protocol": "tcp" } + ], + "environment": [ + { "name": "ELECTRIC_STORAGE_DIR", "value": "/var/lib/electric/data" }, + { "name": "ELECTRIC_SHAPE_DB_EXCLUSIVE_MODE", "value": "true" } + ], + "secrets": [ + { + "name": "DATABASE_URL", + "valueFrom": "arn:aws:secretsmanager:..." + }, + { + "name": "ELECTRIC_SECRET", + "valueFrom": "arn:aws:secretsmanager:..." 
+ } + ], + "mountPoints": [ + { "sourceVolume": "electric-data", "containerPath": "/var/lib/electric/data" } + ], + "healthCheck": { + "command": ["CMD-SHELL", "curl -sf http://localhost:3000/v1/health || exit 1"], + "interval": 10, + "timeout": 2, + "retries": 3, + "startPeriod": 60 + } + } + ], + "volumes": [ + { + "name": "electric-data", + "host": { "sourcePath": "/var/lib/electric/data" } + } + ] +} +``` + +Configure your ECS service for rolling upgrades: + +```json +{ + "deploymentConfiguration": { + "minimumHealthyPercent": 100, + "maximumPercent": 200 + } +} +``` + +This ensures ECS starts the new task before stopping the old one, allowing the advisory lock handover to occur. Set the health check grace period on your ECS service to 60–90 seconds to allow time for the new task to acquire the advisory lock. + +### Health checks must accept HTTP 202 + +Your orchestrator's health or readiness check must accept `202` responses during upgrades. If it only considers `200` as healthy, the new instance can never become ready while the old instance holds the lock — creating a deadlock where the orchestrator waits for the new instance before terminating the old one. + +Both Kubernetes `httpGet` probes and ECS health checks using `curl -sf` accept any 2xx by default, which is the correct behavior for rolling upgrades. + +> [!Warning] Single-instance readiness probes +> The [Deployment guide](/docs/guides/deployment#kubernetes-probes) recommends an `exec` readiness probe that checks for exactly HTTP `200`. That approach is correct for single-instance deployments where you don't want a starting instance to receive traffic, but it will deadlock during rolling upgrades. If you are performing rolling upgrades, use `httpGet` readiness probes as shown in the examples above. 
+ +## Separate storage (ephemeral) + +When shared storage is not available (e.g., ECS with ephemeral block storage, containers with local-only disks), each instance must have its own replication slot and maintains its own shape data independently. This means each instance has **different shape handles** for the same shape definitions, so clients **must** use sticky sessions and will receive `409` (must-refetch) responses when they switch between instances during a deploy. + +The platform examples from the [shared storage](#shared-storage-recommended) section above apply — just remove the shared volume mount and use the configuration shown here. + +There are two ways to manage the per-instance replication slots: + +### Temporary replication slots + +Use temporary replication slots that are automatically cleaned up when the connection closes. This is the simplest approach for ephemeral storage and avoids accumulating orphaned slots. + +```shell +CLEANUP_REPLICATION_SLOTS_ON_SHUTDOWN=true +ELECTRIC_TEMPORARY_REPLICATION_SLOT_USE_RANDOM_NAME=true +ELECTRIC_STORAGE_DIR=/local/electric/data +``` + +The random name option avoids replication slot name conflicts when old and new instances briefly overlap during a rolling upgrade. + +With this configuration: + +- Electric creates a `TEMPORARY` replication slot on the database connection +- The slot is automatically dropped by Postgres when the connection closes (on clean shutdown or crash) +- The new instance creates a fresh temporary slot and starts replicating + +> [!Warning] Network partitions cause shape rotations +> If Electric crashes or loses its database connection unexpectedly, the temporary slot is eventually cleaned up by Postgres once it detects the dead connection (which depends on TCP keepalive settings and may take minutes). When the new instance starts with a fresh slot, all existing shapes are invalidated and clients receive `409` (must-refetch) responses requiring a full resync. 
See [Replication slot recreation](/docs/guides/troubleshooting#replication-slot-recreation-mdash-why-are-all-clients-resyncing-after-a-crash) in the troubleshooting guide for more details.

See the config reference for [`CLEANUP_REPLICATION_SLOTS_ON_SHUTDOWN`](/docs/api/config#cleanup-replication-slots-on-shutdown) and [`ELECTRIC_TEMPORARY_REPLICATION_SLOT_USE_RANDOM_NAME`](/docs/api/config#electric-temporary-replication-slot-use-random-name).

### Separate replication stream IDs

Alternatively, give each concurrent instance its own [`ELECTRIC_REPLICATION_STREAM_ID`](/docs/api/config#electric-replication-stream-id). This creates named replication slots that persist, giving you more explicit control. This is different from [sharding](/docs/guides/sharding), where separate stream IDs are used for instances connecting to different databases — here, both instances connect to the same database. Note that Postgres replication slot names may only contain lower-case letters, numbers and underscores, so stick to underscores in stream IDs.

```shell
# Instance A (e.g., blue deployment)
ELECTRIC_REPLICATION_STREAM_ID=deploy_blue
ELECTRIC_STORAGE_DIR=/local/electric/data

# Instance B (e.g., green deployment)
ELECTRIC_REPLICATION_STREAM_ID=deploy_green
ELECTRIC_STORAGE_DIR=/local/electric/data
```

> [!Warning] Postgres resource overhead
> Each replication stream ID creates its own replication slot and publication. Multiple replication slots increase WAL retention on Postgres since each slot independently prevents WAL from being cleaned up.
>
> Monitor your replication slots as described in the [Troubleshooting guide](/docs/guides/troubleshooting#wal-growth-mdash-why-is-my-postgres-database-storage-filling-up). Clean up unused slots promptly when old instances are fully decommissioned.

When the old deployment is fully stopped, clean up its replication slot and publication in Postgres. 
The names follow the pattern `electric_slot_{stream_id}` and `electric_publication_{stream_id}`: + +```sql +SELECT pg_drop_replication_slot('electric_slot_deploy_blue'); +DROP PUBLICATION IF EXISTS electric_publication_deploy_blue; +``` + +## Client behavior during deploys + +The official [TypeScript client](/docs/api/clients/typescript) handles deploy transitions automatically: + +- **`503` with `Retry-After` header**: The client backs off and retries. This happens when requesting new shapes during the read-only window. +- **`409` (must-refetch)**: The client refetches the shape from scratch. This happens with separate-storage strategies or when shapes are rotated. +- **Long-poll connections**: Existing long-poll connections on active shapes continue working normally during the read-only window. + +If you're using a custom client, ensure it handles these response codes. See the [HTTP API docs](/docs/api/http) for details on the protocol. + +## Next steps + +- [Deployment guide](/docs/guides/deployment) for general deployment setup +- [Sharding guide](/docs/guides/sharding) for multi-database deployment patterns +- [Config reference](/docs/api/config) for all configuration options +- [Troubleshooting guide](/docs/guides/troubleshooting#rolling-upgrades-mdash-why-is-my-second-instance-stuck-in-waiting-state) for common upgrade issues