diff --git a/docs/distributed_sequences.md b/docs/distributed_sequences.md new file mode 100644 index 00000000..0271b757 --- /dev/null +++ b/docs/distributed_sequences.md @@ -0,0 +1,369 @@ +# Distributed Sequences + +PostgreSQL sequences are node-local. In a multi-master Spock cluster, two +nodes executing `INSERT INTO orders DEFAULT VALUES` concurrently will each +call `nextval('orders_id_seq')` and may produce the same value; the second +insert violates the primary key constraint on replication and is +discarded. The Spock distributed sequence framework intercepts +`nextval()` and produces cluster-unique values, removing the conflict +without changing the application's schema. + +This release ships the **Snowflake** generation method. + +## Concepts + +A sequence may be assigned a *kind*: `local` (stock PostgreSQL behaviour, +the default) or `snowflake`. Assignments are stored in +`spock.sequence_kind` and replicated to all nodes via Spock's +`user_catalog_table` machinery, so every node converges on the same view +of which sequences are managed. + +When a managed sequence is advanced (`nextval()`, a `serial`/`bigserial` +default, `GENERATED AS IDENTITY`, a trigger that calls `nextval()`, +etc.) Spock's `nextval_hook` dispatches to the configured method and +returns a generated value. The on-disk `last_value` of the sequence is +*not* advanced for managed sequences; `currval()` and `lastval()` are +updated and return the method-generated value, as they would for stock +sequences. + +## Configuration + +### Required core patch + +The framework requires a patched PostgreSQL binary that exposes +`nextval_hook`. The patch lives in +`patches//pg-050-nextval-hook.diff` and is applied as part +of the Spock build. On an unpatched server the Spock shared library +fails to load with an unresolved-symbol error. + +### GUCs + +| GUC | Type | Scope | Default | Purpose | +|--------------------------------|---------|----------------|-------------|---------| +| `spock.default_sequence_kind` | enum | superuser | `local` | Method applied to sequences that have no explicit row in `spock.sequence_kind`. Valid values: `local`, `snowflake`. | + +There is **no fixed cap** on the number of sequences that may be +managed. Per-sequence state lives in the sequence's own heap tuple +(no extension-private shared memory). The earlier +`spock.max_managed_sequences` and `spock.snowflake_node_id` GUCs have +been removed -- if you have them in `postgresql.conf` you will see +"unrecognized configuration parameter" at startup; remove the lines. + +The Snowflake node id is derived from the local `spock.node` (the same +node identity Spock already uses for replication). `spock.node.id` is +a 16-bit hash of the node name (`hash_any(name) & 0xffff`); the +snowflake field takes the low 10 bits of that. **Two node names whose +hashes collide on the low 10 bits silently emit colliding snowflakes +cluster-wide.** Operators must pick node names whose derived ids are +distinct and non-zero across the cluster. + +The derivation `spock.node.id & 1023` runs at `spock.node_create()` +time -- there is no preview helper, so the operator workflow is: +create the node, inspect `SELECT id & 1023 FROM spock.node;`, and if +the result is 0 or collides with a peer's value, drop the node and +recreate it with a different name. The first `nextval()` against a +snowflake sequence ERRORS when the local Spock node hasn't been +created yet or when its derived id is 0. + +## SQL surface + +### `spock.alter_sequence_set_kind(seqname regclass, kind text) → void` + +Assigns a sequence to a kind. The caller must own the sequence (the +function rejects non-owners). Valid `kind` values: `'local'`, +`'snowflake'`. + +```sql +SELECT spock.alter_sequence_set_kind('public.orders_id_seq', 'snowflake'); +``` + +Effect is **node-local** in Spock 6.0. The call writes the +`spock.sequence_kind` row on the local node only and invalidates every +local backend's dispatcher cache via relcache; subsequent `nextval()` +calls on this node pick up the new kind. The row does **not** replicate +to peer nodes — `spock.sequence_kind` is intentionally not a +`user_catalog_table`, and Spock's autoddl machinery replicates only +LOGSTMT_DDL utility statements (CREATE/ALTER/DROP), not SQL function +invocations. + +**Operators must run `spock.alter_sequence_set_kind()` (or +`spock.convert_all_sequences()`) on every node in the cluster.** +Running it on only some nodes is supported but inconsistent: the +unconverted nodes continue to emit stock sequence values while +converted nodes emit snowflakes, so cluster-wide uniqueness is no +longer guaranteed until every node has the same kind for a given +sequence. + +Autoddl-driven propagation is a planned follow-up; see the technical +spec's "Known gaps and follow-up work" section. + +### `spock.convert_all_sequences(method text DEFAULT 'snowflake', force bool DEFAULT false) → integer` + +Assigns *every* sequence in the current database to the given method. +Sequences that already have an entry are skipped unless `force` is +true. Returns the number of sequences whose row was inserted or +updated. + +```sql +-- One-shot conversion at cluster setup +SELECT spock.convert_all_sequences('snowflake'); +``` + +Restricted to superuser, because it touches sequences owned by other +roles. Holds a database-scoped advisory lock for the duration to +serialise concurrent invocations. + +### `spock.sequence_hook_available() → boolean` + +Returns true when the patched PostgreSQL binary exports `nextval_hook` +and Spock has successfully attached to it. Intended for monitoring; +test code can use it to skip cleanly on unpatched servers. + +### `spock.sequence_info` view + +Per-sequence summary of managed sequences: + +```sql +SELECT * FROM spock.sequence_info; + sequence_name | kind | hook_status +--------------------+-----------+------------- + public.orders_id | snowflake | active + public.shipments_id| snowflake | active +``` + +## Snowflake method + +### Bit layout + +A Snowflake value is a non-negative `int8` (bigint) with the layout: + +``` + bit 63 : reserved, always 0 (sign bit, keeps values non-negative) + bits 62..22 : 41-bit timestamp in milliseconds since the Spock epoch + (2023-01-01 00:00:00 UTC; matches the standalone `snowflake` extension) + bits 21..12 : 10-bit node id (0..1023) + bits 11..0 : 12-bit counter (0..4095) +``` + +### Guarantees + +- **Cluster-wide uniqueness**: provided every node's Spock node name + derives to a distinct non-zero 10-bit id (i.e. `spock.node.id & 1023` + is distinct and non-zero on every peer), no two values produced + anywhere in the cluster can collide. +- **Monotonicity per node**: every value emitted by a given node is + strictly greater than the previous value emitted by that node on the + same sequence, even across backend restarts and crashes. Per-sequence + state is persisted in the sequence relation's heap tuple (the + `last_value` column) and protected by a WAL pre-log reservation — + not in extension-private shared memory. On backend restart or crash + recovery the next `nextval()` reads `last_value` from the heap tuple + and continues monotonic issuance from there (possibly skipping ahead + by up to the reservation window). +- **Throughput**: up to 4096 distinct values per millisecond per node + (~4M / s). If the rate is exceeded the implementation advances the + timestamp field one millisecond ahead of the wall clock and resets + the counter; the wall clock catches up when the workload subsides. +- **Clock-skew tolerance**: if the wall clock regresses (NTP step, VM + migration), the implementation continues from the last-observed + timestamp rather than producing duplicates. A WARNING is logged + (rate-limited to one per backend per minute). + +### Caveats + +- **`CACHE > 1` is ignored** for managed sequences. The in-core cache + is bypassed; the per-call cost is one CAS plus a hash lookup. +- **`ALTER SEQUENCE ... RESTART`** has no effect on a Snowflake-managed + sequence. The same applies to `INCREMENT`, `MINVALUE`, `MAXVALUE`, + `CYCLE`, and `CACHE`: all are ignored. To change any of these, revert + to `local` first, run the `ALTER SEQUENCE`, then convert back to + `snowflake` if desired. +- **`setval()`** does not propagate to the Snowflake state. The + function silently succeeds but the next `nextval()` returns the + Snowflake value, not the value set. +- **Sequence type and MAXVALUE**: `spock.alter_sequence_set_kind(seq, + 'snowflake')` refuses sequences whose declared type is not `bigint` + and sequences whose `MAXVALUE` is below `1 << 22` (the lowest + possible Snowflake value). Snowflake values are 64-bit; smaller + column types cannot store them. +- **Epoch limit**: the 41-bit ms timestamp field exhausts in 2092. +- **Pre-epoch clock**: Snowflake sequences cannot produce values when + the system clock indicates a time earlier than 2023-01-01 UTC. + +### Interaction with built-in logical replication + +Managed sequences **do not produce sequence-advance WAL records**. The +in-core code only writes `xl_seq_rec` records when it updates the +heap-stored `last_value`, and a hook-handled `nextval()` skips that +write entirely. As a consequence: + +- Built-in logical replication (`pgoutput`) configured with + `sequences = true`, and any third-party output plugin that consumes + the `sequence_cb` callback, **will not receive sequence-advance + events** for sequences assigned to `snowflake`. Replication of the + rows that *use* those sequence values still works as normal; only + the `last_value` state of the sequence object itself is skipped. +- This is intentional. Snowflake-managed sequences derive uniqueness + from the embedded node id, not from cross-node replication of + `last_value`; replicating the state would be pointless and + occasionally wrong. +- If you have a non-Spock subscriber on the same publication that + expects to receive sequence updates, revert the affected sequence + to `local` before adding it to that publication. + +### Security model + +`spock.alter_sequence_set_kind` is `REVOKE`'d from `PUBLIC` and +performs a sequence-owner check on every call: only the owner of a +sequence (or a member of the owning role) may change its kind, even +after `EXECUTE` has been granted on the function. `spock.convert_all_sequences` +is superuser-only. + +Each `alter_sequence_set_kind` call writes a row to `spock.sequence_kind`, +which is a `user_catalog_table` replicated to peer nodes via Spock's +existing machinery. **The apply worker on a peer node writes the +replicated row without re-checking the originating user's privileges**, +because peer Spock nodes are in the same trust domain by design. The +implication is: + +- Any role you grant `EXECUTE` on `spock.alter_sequence_set_kind` is + effectively a role that can change sequence behaviour on *every* + node in the cluster. +- Grant `EXECUTE` only to roles you would also trust on every other + node. + +The read-only diagnostic function `spock.sequence_hook_available()` is +also `REVOKE`'d from `PUBLIC`; grant explicitly to monitoring or +support roles as needed. + +## Operations + +### Converting an existing cluster + +On any one node (the change replicates to the rest): + +```sql +-- After all nodes have been upgraded to the Spock release that ships +-- distributed sequences +SELECT spock.convert_all_sequences('snowflake'); +``` + +Verify on each node: + +```sql +SELECT count(*) FROM spock.sequence_info WHERE kind = 'snowflake'; +``` + +### Reverting a sequence to local + +```sql +SELECT spock.alter_sequence_set_kind('public.orders_id_seq', 'local'); +``` + +After revert, `nextval()` reads the heap-stored `last_value`. Because +Snowflake never wrote it, the next value is whatever the sequence's +`START WITH` was (1 by default). **Pause writes on all but one node +during the transition** if downstream consumers cannot tolerate id +discontinuity. + +### Adding a new node + +When a new node joins the Spock cluster: + +1. Run `spock.node_create('newname', ...)` to create the local Spock + node; that fixes the snowflake `node_id` for the lifetime of the + node. +2. Verify that `SELECT id & 1023 FROM spock.node;` is non-zero and not + already used by an existing peer. If it isn't, drop the node and + retry with a different name -- the snowflake `node_id` is frozen + at `node_create()` time. +3. Bring the node up; Spock replicates `spock.sequence_kind` rows + automatically. +4. The new node generates Snowflake values immediately on first + `nextval()` call. + +No coordination across the cluster is needed for Snowflake (unlike the +planned `galloc` method, which uses chunk allocation). + +### Removing a node + +No action required. The removed node's id is never reused (Spock +tracks node ids in `spock.node`). Any Snowflake values it produced +remain valid; orphaned values are harmless. + +### `pg_dump` and `pg_upgrade` + +**`pg_upgrade`** preserves sequence OIDs and `spock.sequence_kind` rows +survive unchanged in both `--link` and `--copy` modes. No action +required. + +**Logical `pg_dump | pg_restore`**: kind assignments survive the +round-trip. `spock.sequence_kind` is keyed on `(nspname, relname)`, +which are stable across restore (unlike sequence OIDs), and the table +is registered with `pg_extension_config_dump` so its rows are included +in the dump output. The restored sequences pick up their kind on the +destination automatically. + +The `seqoid` cache column in `spock.sequence_kind` will hold stale +values from the source cluster immediately after restore — sequence +OIDs are reassigned on the destination. The dispatcher does not use +this column; it looks up rows by name. The column self-refreshes on +the next `spock.alter_sequence_set_kind()` or +`spock.convert_all_sequences()` call against the affected sequences. +The column is consulted only by the `ALTER SEQUENCE ... RENAME` hook; +a rename of a freshly-restored sequence before its seqoid has been +refreshed will not find the row, and the kind assignment will appear +to vanish until the operator runs `spock.convert_all_sequences()` to +re-establish (by name) and refresh the seqoid cache. + +Migrating a Spock database to a server that does **not** have Spock +installed: use `pg_dump --no-extension=spock` (PostgreSQL 17+) or its +older equivalents. The dump then excludes the `CREATE EXTENSION spock` +and all `spock.*` content, and restores cleanly on a vanilla +PostgreSQL cluster. + +### `ALTER SEQUENCE ... RENAME` and `... SET SCHEMA` + +Both are handled transparently. An object-access hook on `OAT_POST_ALTER` +for `RELKIND_SEQUENCE` rewrites the matching `spock.sequence_kind` row +to the new `(nspname, relname)`. The kind assignment survives the +rename without operator intervention. + +## Monitoring + +```sql +-- Active managed sequences and their methods +SELECT * FROM spock.sequence_info; + +-- Verify the hook is attached +SELECT spock.sequence_hook_available(); +``` + +Clock-skew events appear as `WARNING: wall clock regressed by N ms on +a Spock snowflake sequence` in the server log; the WARNING is +rate-limited to one per backend per minute. + +## Forward compatibility + +When PostgreSQL's upstream Sequence Access Method patch is committed +(targeting PG 19 or 20), the local `nextval_hook` patch is dropped and +Spock's methods become loadable sequence AMs registered via +`pg_seqam`. The SQL surface (`spock.sequence_kind`, +`spock.alter_sequence_set_kind`, `spock.convert_all_sequences`) maps +cleanly to `ALTER SEQUENCE ... USING ` and +`pg_class.relam`. A migration function `spock.migrate_to_seqam()` +will be provided. + +## Limitations + +- Only the `snowflake` method is implemented in this release. `galloc` + (range-allocated by Raft-style consensus) and `step_offset` + (disjoint arithmetic progressions) are planned for follow-up + releases. +- Snowflake values are `bigint`. Sequences feeding `int2` or `int4` + columns cannot use Snowflake; either widen the column or wait for + the `galloc` method. +- `pg_dump` of a database with Snowflake-managed sequences preserves + the per-sequence kind assignments, but the column type of each + sequence is whatever was declared at `CREATE SEQUENCE` time -- + including the `data_type` setting. Snowflake assumes `bigint`. diff --git a/include/spock_node.h b/include/spock_node.h index 0df5096a..98bcb902 100644 --- a/include/spock_node.h +++ b/include/spock_node.h @@ -67,6 +67,8 @@ extern const char *const skip_extension[]; extern void create_node(SpockNode *node); extern void drop_node(Oid nodeid); +extern Oid spock_node_id_for_name(const char *name); + extern SpockNode *get_node(Oid nodeid, bool missing_ok); extern SpockNode *get_node_by_name(const char *name, bool missing_ok); diff --git a/include/spock_seqam.h b/include/spock_seqam.h new file mode 100644 index 00000000..ecf4b3e5 --- /dev/null +++ b/include/spock_seqam.h @@ -0,0 +1,183 @@ +/*------------------------------------------------------------------------- + * + * spock_seqam.h + * Distributed sequence access methods for Spock. + * + * Copyright (c) 2022-2026, pgEdge, Inc. + * + *------------------------------------------------------------------------- + */ +#ifndef SPOCK_SEQAM_H +#define SPOCK_SEQAM_H + +#include "postgres.h" +#include "catalog/pg_sequence.h" +#include "commands/sequence.h" /* SeqTable, nextval_hook_type */ +#include "utils/relcache.h" + +/* + * Catalog name and column positions for spock.sequence_kind. + * + * Keep these in sync with sql/spock--*.sql. The table is keyed on + * (nspname, relname); seqoid is a non-key cache column. + */ +#define CATALOG_SEQUENCE_KIND "sequence_kind" +#define CATALOG_SEQUENCE_KIND_OID_IDX "spock_sequence_kind_seqoid_idx" +#define Natts_sequence_kind 4 +#define Anum_sequence_kind_nspname 1 +#define Anum_sequence_kind_relname 2 +#define Anum_sequence_kind_kind 3 +#define Anum_sequence_kind_seqoid 4 + +/* + * Hook entry for ALTER SEQUENCE ... RENAME and SET SCHEMA. Called from + * src/spock_executor.c:spock_object_access on OAT_POST_ALTER for a + * RELKIND_SEQUENCE relation (subId == 0). Resolves the row by seqoid via + * the secondary index and updates (nspname, relname) if the live sequence + * has been renamed or moved. + */ +extern void spock_seqam_relocate_sequence_record(Oid seqoid); + +/* + * Registered methods. Identifiers are stable across versions and serve as + * the on-disk encoding of the `kind` column when we eventually move it from + * text to a more compact form. + * + * SPOCK_SEQAM_NMETHODS is a sentinel: it must remain one past the last real + * kind so the method-table array sizes correctly when a new kind is added. + */ +typedef enum SpockSeqAmKind +{ + SPOCK_SEQAM_LOCAL = 0, /* fall through to in-core nextval */ + SPOCK_SEQAM_SNOWFLAKE = 1, + SPOCK_SEQAM_NMETHODS /* leave last */ +} SpockSeqAmKind; + +/* + * Per-method dispatch table. Filled in by spock_seqam_register_methods() + * at extension load and never mutated after that. + */ +typedef struct SpockSeqAmMethod +{ + const char *name; /* "snowflake", "local", ... */ + SpockSeqAmKind kind; + + /* + * Hot-path nextval. Signature mirrors the SeqAM `nextval` callback + * (the in-flight Paquier patch's SequenceAmRoutine.nextval): receives + * the open sequence relation, the unpacked catalog options, and an + * IN/OUT *last for prefetch reporting. Returns the int64 value to be + * used as nextval()'s result. + * + * Methods that do not prefetch (Snowflake) must set *last to the + * returned value so the in-core CACHE fast path stays neutral. + */ + int64 (*nextval) (Relation rel, + int64 incby, int64 maxv, int64 minv, + int64 cache, bool cycle, + int64 *last); +} SpockSeqAmMethod; + +/* + * GUC-controlled defaults. Defined in spock_seqam.c. + */ +extern int spock_seqam_default_kind; /* enum SpockSeqAmKind */ + +/* + * Snowflake layout constants. + * + * Bit allocation is fixed. PGD chose the same split for SnowflakeId and the + * mathematics of (4096 values / node / ms) are sufficient for any OLTP + * workload; making it configurable is a footgun. + */ +#define SPOCK_SNOWFLAKE_TIMESTAMP_BITS 41 +#define SPOCK_SNOWFLAKE_NODE_BITS 10 +#define SPOCK_SNOWFLAKE_COUNTER_BITS 12 + +#define SPOCK_SNOWFLAKE_NODE_SHIFT SPOCK_SNOWFLAKE_COUNTER_BITS +#define SPOCK_SNOWFLAKE_TIMESTAMP_SHIFT (SPOCK_SNOWFLAKE_NODE_BITS + SPOCK_SNOWFLAKE_COUNTER_BITS) + +#define SPOCK_SNOWFLAKE_COUNTER_MASK ((UINT64CONST(1) << SPOCK_SNOWFLAKE_COUNTER_BITS) - 1) +#define SPOCK_SNOWFLAKE_NODE_MASK (((UINT64CONST(1) << SPOCK_SNOWFLAKE_NODE_BITS) - 1) \ + << SPOCK_SNOWFLAKE_NODE_SHIFT) + +#define SPOCK_SNOWFLAKE_MAX_NODE_ID ((1 << SPOCK_SNOWFLAKE_NODE_BITS) - 1) + +/* + * 2023-01-01 00:00:00 UTC as Unix milliseconds. The 41-bit ms timestamp + * field thus runs out around 2092. Chosen to match the standalone + * `snowflake` extension's epoch so values from both extensions decode + * identically (same bit layout, same epoch, same node-id derivation). + * + * Do NOT change this constant after a cluster has generated any + * snowflake values: every previously emitted value would appear to be + * from a different era and could compare incorrectly with newly + * generated ones. + */ +#define SPOCK_SNOWFLAKE_EPOCH_MS INT64CONST(1672531200000) + +/* + * Magic word stamped in the special area of a sequence's heap page by + * the in-core sequence machinery. Hardcoded -- core declares the type + * privately in src/backend/commands/sequence.c. Validate on every read + * to catch page-layout drift across PG major versions at buildfarm- + * assert level. + */ +#define SPOCK_SEQUENCE_PAGE_MAGIC 0x1717 + +typedef struct SpockSequencePageMagic +{ + uint32 magic; +} SpockSequencePageMagic; + +/* + * Pre-log window for WAL batching, in milliseconds. On every nextval we + * compare the emitted value against last_logged_threshold (stored in + * seq->log_cnt). When we cross the threshold, we WAL-log a reservation + * for the next SPOCK_SNOWFLAKE_LOG_INTERVAL_MS of generation. Crash + * recovery jumps last_value forward by at most this interval -- never + * produces duplicates. Same mechanism core PG uses for stock sequence + * cache, expressed in time instead of count. + */ +#define SPOCK_SNOWFLAKE_LOG_INTERVAL_MS 30 + +/* + * Module entry point. Called from _PG_init() to register the nextval hook, + * install GUCs and the relcache invalidation callback. + */ +extern void spock_seqam_init(void); + +/* + * Drop hook callback. Invoked from src/spock_executor.c:spock_object_access + * on OAT_DROP of a RELKIND_SEQUENCE relation. Deletes the spock.sequence_kind + * row if any. + */ +extern void spock_seqam_drop_sequence_record(Oid seqoid); + +/* + * Lookup the catalog kind for a sequence by (nspname, relname). Returns + * false if the catalog is missing (extension not yet installed) or if no + * row exists for the sequence. Used by replication paths to skip + * managed-sequence values that must generate independently per node. + */ +extern bool spock_seqam_lookup_kind_by_name(const char *nspname, + const char *relname, + SpockSeqAmKind *kind_out); + +/* + * SQL-callable, registered from sql/spock--*.sql. + */ +extern Datum spock_alter_sequence_set_kind(PG_FUNCTION_ARGS); +extern Datum spock_convert_all_sequences(PG_FUNCTION_ARGS); +extern Datum spock_sequence_hook_available(PG_FUNCTION_ARGS); + +/* + * Forward declaration of the snowflake method. Defined in + * src/spock_seqam_snowflake.c. + */ +extern int64 spock_seqam_snowflake_nextval(Relation rel, + int64 incby, int64 maxv, int64 minv, + int64 cache, bool cycle, + int64 *last); + +#endif /* SPOCK_SEQAM_H */ diff --git a/mkdocs.yml b/mkdocs.yml index e14c0d92..21ca221d 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -65,6 +65,7 @@ nav: - Using Spock in Read-Only Mode: managing/read_only.md - Using a Trigger to Manage Replication Set Membership: managing/repset_trigger.md - Using Snowflake Sequences: managing/snowflake.md + - Distributed Sequences (Snowflake via nextval hook): distributed_sequences.md - Using Lolor to Manage Large Objects: managing/lolor.md - Using Automatic DDL Replication: managing/spock_autoddl.md - Adding or Removing Nodes: diff --git a/patches/15/pg15-050-nextval-hook.diff b/patches/15/pg15-050-nextval-hook.diff new file mode 100644 index 00000000..abe840e6 --- /dev/null +++ b/patches/15/pg15-050-nextval-hook.diff @@ -0,0 +1,172 @@ +diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c +index fbcfcddb59e..b2430ea44b8 100644 +--- a/src/backend/commands/sequence.c ++++ b/src/backend/commands/sequence.c +@@ -95,6 +95,9 @@ static HTAB *seqhashtab = NULL; /* hash table for SeqTable items */ + */ + static SeqTableData *last_used_seq = NULL; + ++/* Spock: nextval generator hook; see sequence.h for contract. */ ++nextval_hook_type nextval_hook = NULL; ++ + static void fill_seq_with_data(Relation rel, HeapTuple tuple); + static void fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum); + static Relation lock_and_open_sequence(SeqTable seq); +@@ -637,24 +640,15 @@ nextval_internal(Oid relid, bool check_permissions) + { + SeqTable elm; + Relation seqrel; +- Buffer buf; +- Page page; + HeapTuple pgstuple; + Form_pg_sequence pgsform; +- HeapTupleData seqdatatuple; +- Form_pg_sequence_data seq; + int64 incby, + maxv, + minv, + cache, +- log, +- fetch, + last; +- int64 result, +- next, +- rescnt = 0; ++ int64 result; + bool cycle; +- bool logit = false; + + /* open and lock sequence */ + init_sequence(relid, &elm, &seqrel); +@@ -699,12 +693,51 @@ nextval_internal(Oid relid, bool check_permissions) + cycle = pgsform->seqcycle; + ReleaseSysCache(pgstuple); + ++ /* Spock: hook overrides; swap for rel->rd_sequenceam->nextval at SeqAM merge. */ ++ if (nextval_hook != NULL) ++ result = (*nextval_hook) (seqrel, incby, maxv, minv, cache, cycle, &last); ++ else ++ result = sequence_am_local_nextval(seqrel, incby, maxv, minv, ++ cache, cycle, &last); ++ ++ /* save info in local cache */ ++ elm->increment = incby; ++ elm->last = result; /* last returned number */ ++ elm->cached = last; /* last fetched number */ ++ elm->last_valid = true; ++ ++ relation_close(seqrel, NoLock); ++ last_used_seq = elm; ++ return result; ++} ++ ++/* ++ * Spock: stock per-call generator, factored out of nextval_internal so the ++ * nextval_hook can delegate to it. Caller owns seqrel and the per-session ++ * cache state; *last receives the largest value reserved for the session. ++ */ ++int64 ++sequence_am_local_nextval(Relation seqrel, ++ int64 incby, int64 maxv, int64 minv, ++ int64 cache, bool cycle, ++ int64 *last) ++{ ++ Buffer buf; ++ Page page; ++ HeapTupleData seqdatatuple; ++ Form_pg_sequence_data seq; ++ int64 log, ++ fetch; ++ int64 result, ++ next, ++ rescnt = 0; ++ bool logit = false; ++ + /* lock page' buffer and read tuple */ + seq = read_seq_tuple(seqrel, &buf, &seqdatatuple); + page = BufferGetPage(buf); + +- elm->increment = incby; +- last = next = result = seq->last_value; ++ *last = next = result = seq->last_value; + fetch = cache; + log = seq->log_cnt; + +@@ -791,7 +824,7 @@ nextval_internal(Oid relid, bool check_permissions) + { + log--; + rescnt++; +- last = next; ++ *last = next; + if (rescnt == 1) /* if it's first result - */ + result = next; /* it's what to return */ + } +@@ -800,13 +833,6 @@ nextval_internal(Oid relid, bool check_permissions) + log -= fetch; /* adjust for any unfetched numbers */ + Assert(log >= 0); + +- /* save info in local cache */ +- elm->last = result; /* last returned number */ +- elm->cached = last; /* last fetched number */ +- elm->last_valid = true; +- +- last_used_seq = elm; +- + /* + * If something needs to be WAL logged, acquire an xid, so this + * transaction's commit will trigger a WAL flush and wait for syncrep. +@@ -862,7 +888,7 @@ nextval_internal(Oid relid, bool check_permissions) + } + + /* Now update sequence tuple to the intended final state */ +- seq->last_value = last; /* last fetched number */ ++ seq->last_value = *last; /* last fetched number */ + seq->is_called = true; + seq->log_cnt = log; /* how much is logged */ + +@@ -870,8 +896,6 @@ nextval_internal(Oid relid, bool check_permissions) + + UnlockReleaseBuffer(buf); + +- relation_close(seqrel, NoLock); +- + return result; + } + +diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h +index 9da23008101..0fe86199306 100644 +--- a/src/include/commands/sequence.h ++++ b/src/include/commands/sequence.h +@@ -20,6 +20,7 @@ + #include "nodes/parsenodes.h" + #include "parser/parse_node.h" + #include "storage/relfilenode.h" ++#include "utils/relcache.h" + + + typedef struct FormData_pg_sequence_data +@@ -55,6 +56,24 @@ extern int64 nextval_internal(Oid relid, bool check_permissions); + extern Datum nextval(PG_FUNCTION_ARGS); + extern List *sequence_options(Oid relid); + ++/* ++ * Spock: per-call value-generation hook. Signature matches Paquier's ++ * upstream SeqAM nextval callback; the forward-port replaces the call ++ * site with rel->rd_sequenceam->nextval(...). Single consumer; extension ++ * must reject install when nextval_hook != NULL. ++ */ ++typedef int64 (*nextval_hook_type) (Relation rel, ++ int64 incby, int64 maxv, int64 minv, ++ int64 cache, bool cycle, ++ int64 *last); ++extern PGDLLIMPORT nextval_hook_type nextval_hook; ++ ++/* Spock: stock per-call generator, factored out so hooks can delegate. */ ++extern int64 sequence_am_local_nextval(Relation rel, ++ int64 incby, int64 maxv, int64 minv, ++ int64 cache, bool cycle, ++ int64 *last); ++ + extern ObjectAddress DefineSequence(ParseState *pstate, CreateSeqStmt *stmt); + extern ObjectAddress AlterSequence(ParseState *pstate, AlterSeqStmt *stmt); + extern void SequenceChangePersistence(Oid relid, char newrelpersistence); diff --git a/patches/16/pg16-050-nextval-hook.diff b/patches/16/pg16-050-nextval-hook.diff new file mode 100644 index 00000000..ae9baf22 --- /dev/null +++ b/patches/16/pg16-050-nextval-hook.diff @@ -0,0 +1,172 @@ +diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c +index e0af32075d1..9d7622e3a98 100644 +--- a/src/backend/commands/sequence.c ++++ b/src/backend/commands/sequence.c +@@ -95,6 +95,9 @@ static HTAB *seqhashtab = NULL; /* hash table for SeqTable items */ + */ + static SeqTableData *last_used_seq = NULL; + ++/* Spock: nextval generator hook; see sequence.h for contract. */ ++nextval_hook_type nextval_hook = NULL; ++ + static void fill_seq_with_data(Relation rel, HeapTuple tuple); + static void fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum); + static Relation lock_and_open_sequence(SeqTable seq); +@@ -636,24 +639,15 @@ nextval_internal(Oid relid, bool check_permissions) + { + SeqTable elm; + Relation seqrel; +- Buffer buf; +- Page page; + HeapTuple pgstuple; + Form_pg_sequence pgsform; +- HeapTupleData seqdatatuple; +- Form_pg_sequence_data seq; + int64 incby, + maxv, + minv, + cache, +- log, +- fetch, + last; +- int64 result, +- next, +- rescnt = 0; ++ int64 result; + bool cycle; +- bool logit = false; + + /* open and lock sequence */ + init_sequence(relid, &elm, &seqrel); +@@ -698,12 +692,51 @@ nextval_internal(Oid relid, bool check_permissions) + cycle = pgsform->seqcycle; + ReleaseSysCache(pgstuple); + ++ /* Spock: hook overrides; swap for rel->rd_sequenceam->nextval at SeqAM merge. */ ++ if (nextval_hook != NULL) ++ result = (*nextval_hook) (seqrel, incby, maxv, minv, cache, cycle, &last); ++ else ++ result = sequence_am_local_nextval(seqrel, incby, maxv, minv, ++ cache, cycle, &last); ++ ++ /* save info in local cache */ ++ elm->increment = incby; ++ elm->last = result; /* last returned number */ ++ elm->cached = last; /* last fetched number */ ++ elm->last_valid = true; ++ ++ relation_close(seqrel, NoLock); ++ last_used_seq = elm; ++ return result; ++} ++ ++/* ++ * Spock: stock per-call generator, factored out of nextval_internal so the ++ * nextval_hook can delegate to it. Caller owns seqrel and the per-session ++ * cache state; *last receives the largest value reserved for the session. ++ */ ++int64 ++sequence_am_local_nextval(Relation seqrel, ++ int64 incby, int64 maxv, int64 minv, ++ int64 cache, bool cycle, ++ int64 *last) ++{ ++ Buffer buf; ++ Page page; ++ HeapTupleData seqdatatuple; ++ Form_pg_sequence_data seq; ++ int64 log, ++ fetch; ++ int64 result, ++ next, ++ rescnt = 0; ++ bool logit = false; ++ + /* lock page' buffer and read tuple */ + seq = read_seq_tuple(seqrel, &buf, &seqdatatuple); + page = BufferGetPage(buf); + +- elm->increment = incby; +- last = next = result = seq->last_value; ++ *last = next = result = seq->last_value; + fetch = cache; + log = seq->log_cnt; + +@@ -790,7 +823,7 @@ nextval_internal(Oid relid, bool check_permissions) + { + log--; + rescnt++; +- last = next; ++ *last = next; + if (rescnt == 1) /* if it's first result - */ + result = next; /* it's what to return */ + } +@@ -799,13 +832,6 @@ nextval_internal(Oid relid, bool check_permissions) + log -= fetch; /* adjust for any unfetched numbers */ + Assert(log >= 0); + +- /* save info in local cache */ +- elm->last = result; /* last returned number */ +- elm->cached = last; /* last fetched number */ +- elm->last_valid = true; +- +- last_used_seq = elm; +- + /* + * If something needs to be WAL logged, acquire an xid, so this + * transaction's commit will trigger a WAL flush and wait for syncrep. +@@ -861,7 +887,7 @@ nextval_internal(Oid relid, bool check_permissions) + } + + /* Now update sequence tuple to the intended final state */ +- seq->last_value = last; /* last fetched number */ ++ seq->last_value = *last; /* last fetched number */ + seq->is_called = true; + seq->log_cnt = log; /* how much is logged */ + +@@ -869,8 +895,6 @@ nextval_internal(Oid relid, bool check_permissions) + + UnlockReleaseBuffer(buf); + +- relation_close(seqrel, NoLock); +- + return result; + } + +diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h +index 7db7b3da7bc..a1a9c040402 100644 +--- a/src/include/commands/sequence.h ++++ b/src/include/commands/sequence.h +@@ -20,6 +20,7 @@ + #include "nodes/parsenodes.h" + #include "parser/parse_node.h" + #include "storage/relfilelocator.h" ++#include "utils/relcache.h" + + + typedef struct FormData_pg_sequence_data +@@ -55,6 +56,24 @@ extern int64 nextval_internal(Oid relid, bool check_permissions); + extern Datum nextval(PG_FUNCTION_ARGS); + extern List *sequence_options(Oid relid); + ++/* ++ * Spock: per-call value-generation hook. Signature matches Paquier's ++ * upstream SeqAM nextval callback; the forward-port replaces the call ++ * site with rel->rd_sequenceam->nextval(...). Single consumer; extension ++ * must reject install when nextval_hook != NULL. ++ */ ++typedef int64 (*nextval_hook_type) (Relation rel, ++ int64 incby, int64 maxv, int64 minv, ++ int64 cache, bool cycle, ++ int64 *last); ++extern PGDLLIMPORT nextval_hook_type nextval_hook; ++ ++/* Spock: stock per-call generator, factored out so hooks can delegate. */ ++extern int64 sequence_am_local_nextval(Relation rel, ++ int64 incby, int64 maxv, int64 minv, ++ int64 cache, bool cycle, ++ int64 *last); ++ + extern ObjectAddress DefineSequence(ParseState *pstate, CreateSeqStmt *seq); + extern ObjectAddress AlterSequence(ParseState *pstate, AlterSeqStmt *stmt); + extern void SequenceChangePersistence(Oid relid, char newrelpersistence); diff --git a/patches/17/pg17-050-nextval-hook.diff b/patches/17/pg17-050-nextval-hook.diff new file mode 100644 index 00000000..636dbb79 --- /dev/null +++ b/patches/17/pg17-050-nextval-hook.diff @@ -0,0 +1,172 @@ +diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c +index 2b64a480e16..9d68339ed2c 100644 +--- a/src/backend/commands/sequence.c ++++ b/src/backend/commands/sequence.c +@@ -96,6 +96,9 @@ static HTAB *seqhashtab = NULL; /* hash table for SeqTable items */ + */ + static SeqTableData *last_used_seq = NULL; + ++/* Spock: nextval generator hook; see sequence.h for contract. */ ++nextval_hook_type nextval_hook = NULL; ++ + static void fill_seq_with_data(Relation rel, HeapTuple tuple); + static void fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum); + static Relation lock_and_open_sequence(SeqTable seq); +@@ -624,24 +627,15 @@ nextval_internal(Oid relid, bool check_permissions) + { + SeqTable elm; + Relation seqrel; +- Buffer buf; +- Page page; + HeapTuple pgstuple; + Form_pg_sequence pgsform; +- HeapTupleData seqdatatuple; +- Form_pg_sequence_data seq; + int64 incby, + maxv, + minv, + cache, +- log, +- fetch, + last; +- int64 result, +- next, +- rescnt = 0; ++ int64 result; + bool cycle; +- bool logit = false; + + /* open and lock sequence */ + init_sequence(relid, &elm, &seqrel); +@@ -686,11 +680,51 @@ nextval_internal(Oid relid, bool check_permissions) + cycle = pgsform->seqcycle; + ReleaseSysCache(pgstuple); + ++ /* Spock: hook overrides; swap for rel->rd_sequenceam->nextval at SeqAM merge. */ ++ if (nextval_hook != NULL) ++ result = (*nextval_hook) (seqrel, incby, maxv, minv, cache, cycle, &last); ++ else ++ result = sequence_am_local_nextval(seqrel, incby, maxv, minv, ++ cache, cycle, &last); ++ ++ /* save info in local cache */ ++ elm->increment = incby; ++ elm->last = result; /* last returned number */ ++ elm->cached = last; /* last fetched number */ ++ elm->last_valid = true; ++ ++ sequence_close(seqrel, NoLock); ++ last_used_seq = elm; ++ return result; ++} ++ ++/* ++ * Spock: stock per-call generator, factored out of nextval_internal so the ++ * nextval_hook can delegate to it. Caller owns seqrel and the per-session ++ * cache state; *last receives the largest value reserved for the session. ++ */ ++int64 ++sequence_am_local_nextval(Relation seqrel, ++ int64 incby, int64 maxv, int64 minv, ++ int64 cache, bool cycle, ++ int64 *last) ++{ ++ Buffer buf; ++ Page page; ++ HeapTupleData seqdatatuple; ++ Form_pg_sequence_data seq; ++ int64 log, ++ fetch; ++ int64 result, ++ next, ++ rescnt = 0; ++ bool logit = false; ++ + /* lock page buffer and read tuple */ + seq = read_seq_tuple(seqrel, &buf, &seqdatatuple); + page = BufferGetPage(buf); + +- last = next = result = seq->last_value; ++ *last = next = result = seq->last_value; + fetch = cache; + log = seq->log_cnt; + +@@ -777,7 +811,7 @@ nextval_internal(Oid relid, bool check_permissions) + { + log--; + rescnt++; +- last = next; ++ *last = next; + if (rescnt == 1) /* if it's first result - */ + result = next; /* it's what to return */ + } +@@ -786,14 +820,6 @@ nextval_internal(Oid relid, bool check_permissions) + log -= fetch; /* adjust for any unfetched numbers */ + Assert(log >= 0); + +- /* save info in local cache */ +- elm->increment = incby; +- elm->last = result; /* last returned number */ +- elm->cached = last; /* last fetched number */ +- elm->last_valid = true; +- +- last_used_seq = elm; +- + /* + * If something needs to be WAL logged, acquire an xid, so this + * transaction's commit will trigger a WAL flush and wait for syncrep. +@@ -849,7 +875,7 @@ nextval_internal(Oid relid, bool check_permissions) + } + + /* Now update sequence tuple to the intended final state */ +- seq->last_value = last; /* last fetched number */ ++ seq->last_value = *last; /* last fetched number */ + seq->is_called = true; + seq->log_cnt = log; /* how much is logged */ + +@@ -857,8 +883,6 @@ nextval_internal(Oid relid, bool check_permissions) + + UnlockReleaseBuffer(buf); + +- sequence_close(seqrel, NoLock); +- + return result; + } + +diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h +index e88cbee3b56..14b93aeff3b 100644 +--- a/src/include/commands/sequence.h ++++ b/src/include/commands/sequence.h +@@ -20,6 +20,7 @@ + #include "nodes/parsenodes.h" + #include "parser/parse_node.h" + #include "storage/relfilelocator.h" ++#include "utils/relcache.h" + + + typedef struct FormData_pg_sequence_data +@@ -55,6 +56,24 @@ extern int64 nextval_internal(Oid relid, bool check_permissions); + extern Datum nextval(PG_FUNCTION_ARGS); + extern List *sequence_options(Oid relid); + ++/* ++ * Spock: per-call value-generation hook. Signature matches Paquier's ++ * upstream SeqAM nextval callback; the forward-port replaces the call ++ * site with rel->rd_sequenceam->nextval(...). Single consumer; extension ++ * must reject install when nextval_hook != NULL. ++ */ ++typedef int64 (*nextval_hook_type) (Relation rel, ++ int64 incby, int64 maxv, int64 minv, ++ int64 cache, bool cycle, ++ int64 *last); ++extern PGDLLIMPORT nextval_hook_type nextval_hook; ++ ++/* Spock: stock per-call generator, factored out so hooks can delegate. */ ++extern int64 sequence_am_local_nextval(Relation rel, ++ int64 incby, int64 maxv, int64 minv, ++ int64 cache, bool cycle, ++ int64 *last); ++ + extern ObjectAddress DefineSequence(ParseState *pstate, CreateSeqStmt *seq); + extern ObjectAddress AlterSequence(ParseState *pstate, AlterSeqStmt *stmt); + extern void SequenceChangePersistence(Oid relid, char newrelpersistence); diff --git a/patches/18/pg18-050-nextval-hook.diff b/patches/18/pg18-050-nextval-hook.diff new file mode 100644 index 00000000..45f8f1de --- /dev/null +++ b/patches/18/pg18-050-nextval-hook.diff @@ -0,0 +1,172 @@ +diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c +index a79ef0651a9..a182b53e2e7 100644 +--- a/src/backend/commands/sequence.c ++++ b/src/backend/commands/sequence.c +@@ -96,6 +96,9 @@ static HTAB *seqhashtab = NULL; /* hash table for SeqTable items */ + */ + static SeqTableData *last_used_seq = NULL; + ++/* Spock: nextval generator hook; see sequence.h for contract. */ ++nextval_hook_type nextval_hook = NULL; ++ + static void fill_seq_with_data(Relation rel, HeapTuple tuple); + static void fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum); + static Relation lock_and_open_sequence(SeqTable seq); +@@ -624,24 +627,15 @@ nextval_internal(Oid relid, bool check_permissions) + { + SeqTable elm; + Relation seqrel; +- Buffer buf; +- Page page; + HeapTuple pgstuple; + Form_pg_sequence pgsform; +- HeapTupleData seqdatatuple; +- Form_pg_sequence_data seq; + int64 incby, + maxv, + minv, + cache, +- log, +- fetch, + last; +- int64 result, +- next, +- rescnt = 0; ++ int64 result; + bool cycle; +- bool logit = false; + + /* open and lock sequence */ + init_sequence(relid, &elm, &seqrel); +@@ -686,11 +680,51 @@ nextval_internal(Oid relid, bool check_permissions) + cycle = pgsform->seqcycle; + ReleaseSysCache(pgstuple); + ++ /* Spock: hook overrides; swap for rel->rd_sequenceam->nextval at SeqAM merge. */ ++ if (nextval_hook != NULL) ++ result = (*nextval_hook) (seqrel, incby, maxv, minv, cache, cycle, &last); ++ else ++ result = sequence_am_local_nextval(seqrel, incby, maxv, minv, ++ cache, cycle, &last); ++ ++ /* save info in local cache */ ++ elm->increment = incby; ++ elm->last = result; /* last returned number */ ++ elm->cached = last; /* last fetched number */ ++ elm->last_valid = true; ++ ++ sequence_close(seqrel, NoLock); ++ last_used_seq = elm; ++ return result; ++} ++ ++/* ++ * Spock: stock per-call generator, factored out of nextval_internal so the ++ * nextval_hook can delegate to it. Caller owns seqrel and the per-session ++ * cache state; *last receives the largest value reserved for the session. ++ */ ++int64 ++sequence_am_local_nextval(Relation seqrel, ++ int64 incby, int64 maxv, int64 minv, ++ int64 cache, bool cycle, ++ int64 *last) ++{ ++ Buffer buf; ++ Page page; ++ HeapTupleData seqdatatuple; ++ Form_pg_sequence_data seq; ++ int64 log, ++ fetch; ++ int64 result, ++ next, ++ rescnt = 0; ++ bool logit = false; ++ + /* lock page buffer and read tuple */ + seq = read_seq_tuple(seqrel, &buf, &seqdatatuple); + page = BufferGetPage(buf); + +- last = next = result = seq->last_value; ++ *last = next = result = seq->last_value; + fetch = cache; + log = seq->log_cnt; + +@@ -777,7 +811,7 @@ nextval_internal(Oid relid, bool check_permissions) + { + log--; + rescnt++; +- last = next; ++ *last = next; + if (rescnt == 1) /* if it's first result - */ + result = next; /* it's what to return */ + } +@@ -786,14 +820,6 @@ nextval_internal(Oid relid, bool check_permissions) + log -= fetch; /* adjust for any unfetched numbers */ + Assert(log >= 0); + +- /* save info in local cache */ +- elm->increment = incby; +- elm->last = result; /* last returned number */ +- elm->cached = last; /* last fetched number */ +- elm->last_valid = true; +- +- last_used_seq = elm; +- + /* + * If something needs to be WAL logged, acquire an xid, so this + * transaction's commit will trigger a WAL flush and wait for syncrep. +@@ -849,7 +875,7 @@ nextval_internal(Oid relid, bool check_permissions) + } + + /* Now update sequence tuple to the intended final state */ +- seq->last_value = last; /* last fetched number */ ++ seq->last_value = *last; /* last fetched number */ + seq->is_called = true; + seq->log_cnt = log; /* how much is logged */ + +@@ -857,8 +883,6 @@ nextval_internal(Oid relid, bool check_permissions) + + UnlockReleaseBuffer(buf); + +- sequence_close(seqrel, NoLock); +- + return result; + } + +diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h +index 9ac0b67683d..d1d41ef2a0b 100644 +--- a/src/include/commands/sequence.h ++++ b/src/include/commands/sequence.h +@@ -20,6 +20,7 @@ + #include "nodes/parsenodes.h" + #include "parser/parse_node.h" + #include "storage/relfilelocator.h" ++#include "utils/relcache.h" + + + typedef struct FormData_pg_sequence_data +@@ -55,6 +56,24 @@ extern int64 nextval_internal(Oid relid, bool check_permissions); + extern Datum nextval(PG_FUNCTION_ARGS); + extern List *sequence_options(Oid relid); + ++/* ++ * Spock: per-call value-generation hook. Signature matches Paquier's ++ * upstream SeqAM nextval callback; the forward-port replaces the call ++ * site with rel->rd_sequenceam->nextval(...). Single consumer; extension ++ * must reject install when nextval_hook != NULL. ++ */ ++typedef int64 (*nextval_hook_type) (Relation rel, ++ int64 incby, int64 maxv, int64 minv, ++ int64 cache, bool cycle, ++ int64 *last); ++extern PGDLLIMPORT nextval_hook_type nextval_hook; ++ ++/* Spock: stock per-call generator, factored out so hooks can delegate. */ ++extern int64 sequence_am_local_nextval(Relation rel, ++ int64 incby, int64 maxv, int64 minv, ++ int64 cache, bool cycle, ++ int64 *last); ++ + extern ObjectAddress DefineSequence(ParseState *pstate, CreateSeqStmt *seq); + extern ObjectAddress AlterSequence(ParseState *pstate, AlterSeqStmt *stmt); + extern void SequenceChangePersistence(Oid relid, char newrelpersistence); diff --git a/sql/spock--6.0.0.sql b/sql/spock--6.0.0.sql index d96a1b9f..622f311b 100644 --- a/sql/spock--6.0.0.sql +++ b/sql/spock--6.0.0.sql @@ -315,6 +315,87 @@ CREATE TABLE spock.sequence_state ( last_value bigint NOT NULL ) WITH (user_catalog_table=true); +-- Distributed sequence access method registry. +-- +-- Maps a sequence (schema, name) to a "kind" identifying which generation +-- method nextval() should use. Sequences without a row use the +-- spock.default_sequence_kind GUC (defaults to 'local'). +-- +-- Keyed on (nspname, relname) rather than seqoid. Names round-trip cleanly +-- through pg_dump / pg_restore; OIDs do not. seqoid is carried as a +-- non-key cache column so the OAT_POST_ALTER hook can find a row by the +-- (rename-stable) OID when ALTER SEQUENCE ... RENAME fires. After a +-- logical restore the seqoid values reference the source cluster and are +-- stale; the next spock.alter_sequence_set_kind() or +-- spock.convert_all_sequences() call on a given sequence refreshes them. +-- +-- Not marked user_catalog_table: this is a node-local registry, by +-- design. Cluster-wide propagation is NOT automatic in Spock 6.0: +-- spock_autoddl_process only replicates LOGSTMT_DDL utility statements +-- (CREATE/ALTER/DROP), not SQL function invocations, so a call to +-- spock.alter_sequence_set_kind() or spock.convert_all_sequences() +-- affects only the local node. Operators must run the same call on +-- every node in the cluster. Running it on only some nodes is +-- supported but inconsistent: the unconverted nodes continue to emit +-- stock sequence values while the converted ones emit snowflakes, so +-- cluster-wide uniqueness is no longer guaranteed. Autoddl-driven +-- propagation is a planned follow-up (see docs/specs/distributed_sequences.md +-- §14 "Known gaps and follow-up work"). +CREATE TABLE spock.sequence_kind ( + nspname name NOT NULL, + relname name NOT NULL, + kind text NOT NULL + CHECK (kind IN ('local','snowflake')), + seqoid oid NOT NULL, + PRIMARY KEY (nspname, relname) +); +CREATE INDEX spock_sequence_kind_seqoid_idx + ON spock.sequence_kind (seqoid); + +SELECT pg_catalog.pg_extension_config_dump('spock.sequence_kind', ''); + +CREATE FUNCTION spock.alter_sequence_set_kind(seqname regclass, kind text) +RETURNS void +AS 'MODULE_PATHNAME', 'spock_alter_sequence_set_kind' +LANGUAGE C VOLATILE; + +REVOKE ALL ON FUNCTION spock.alter_sequence_set_kind(regclass, text) FROM PUBLIC; + +CREATE FUNCTION spock.convert_all_sequences( + method text DEFAULT 'snowflake', + force bool DEFAULT false +) RETURNS integer +AS 'MODULE_PATHNAME', 'spock_convert_all_sequences' +LANGUAGE C VOLATILE; + +REVOKE ALL ON FUNCTION spock.convert_all_sequences(text, bool) FROM PUBLIC; + +CREATE FUNCTION spock.sequence_hook_available() +RETURNS boolean +AS 'MODULE_PATHNAME', 'spock_sequence_hook_available' +LANGUAGE C STABLE STRICT; + +REVOKE ALL ON FUNCTION spock.sequence_hook_available() FROM PUBLIC; + +-- Per-sequence summary. +-- +-- hook_status is 'active' iff Spock's dispatcher is the *root* of the +-- nextval_hook chain. An extension loaded after Spock that chains on +-- top will read as 'inactive' even though Spock is still reachable via +-- the chain; the column is therefore best read as "is Spock the +-- outermost hook?", not "is Spock managing my sequences?". +-- +-- On unpatched PostgreSQL the spock shared library fails to load, so +-- this view doesn't exist there at all -- the column never reads +-- 'inactive' for the patch-absent reason in practice. +CREATE VIEW spock.sequence_info AS +SELECT + sk.seqoid::regclass AS sequence_name, + sk.kind, + CASE WHEN spock.sequence_hook_available() THEN 'active' + ELSE 'inactive' END AS hook_status +FROM spock.sequence_kind sk; + CREATE TABLE spock.depend ( classid oid NOT NULL, objid oid NOT NULL, diff --git a/src/compat/15/spock_compat.h b/src/compat/15/spock_compat.h index 0ac3207b..715ddb73 100644 --- a/src/compat/15/spock_compat.h +++ b/src/compat/15/spock_compat.h @@ -61,8 +61,6 @@ #define ExecBRUpdateTriggers(estate, epqstate, relinfo, tupleid, fdw_trigtuple, slot) \ ExecBRUpdateTriggers(estate, epqstate, relinfo, tupleid, fdw_trigtuple, slot, NULL) -#define Form_pg_sequence Form_pg_sequence_data - #define ExecARUpdateTriggers(estate, relinfo, tupleid, fdw_trigtuple, newslot, recheckIndexes) \ ExecARUpdateTriggers(estate, relinfo, NULL, NULL, tupleid, fdw_trigtuple, newslot, recheckIndexes, NULL, false) diff --git a/src/compat/16/spock_compat.h b/src/compat/16/spock_compat.h index 06a2bef1..afcebbcf 100644 --- a/src/compat/16/spock_compat.h +++ b/src/compat/16/spock_compat.h @@ -67,8 +67,6 @@ #define ExecBRUpdateTriggers(estate, epqstate, relinfo, tupleid, fdw_trigtuple, slot) \ ExecBRUpdateTriggers(estate, epqstate, relinfo, tupleid, fdw_trigtuple, slot, NULL, NULL) -#define Form_pg_sequence Form_pg_sequence_data - #define ExecARUpdateTriggers(estate, relinfo, tupleid, fdw_trigtuple, newslot, recheckIndexes) \ ExecARUpdateTriggers(estate, relinfo, NULL, NULL, tupleid, fdw_trigtuple, newslot, recheckIndexes, NULL, false) diff --git a/src/compat/17/spock_compat.h b/src/compat/17/spock_compat.h index b7b83596..0f895a07 100644 --- a/src/compat/17/spock_compat.h +++ b/src/compat/17/spock_compat.h @@ -67,8 +67,6 @@ #define ExecBRUpdateTriggers(estate, epqstate, relinfo, tupleid, fdw_trigtuple, slot) \ ExecBRUpdateTriggers(estate, epqstate, relinfo, tupleid, fdw_trigtuple, slot, NULL, NULL) -#define Form_pg_sequence Form_pg_sequence_data - #define ExecARUpdateTriggers(estate, relinfo, tupleid, fdw_trigtuple, newslot, recheckIndexes) \ ExecARUpdateTriggers(estate, relinfo, NULL, NULL, tupleid, fdw_trigtuple, newslot, recheckIndexes, NULL, false) diff --git a/src/compat/18/spock_compat.h b/src/compat/18/spock_compat.h index cf632271..2cedbc5c 100644 --- a/src/compat/18/spock_compat.h +++ b/src/compat/18/spock_compat.h @@ -64,8 +64,6 @@ #define ExecBRUpdateTriggers(estate, epqstate, relinfo, tupleid, fdw_trigtuple, slot) \ ExecBRUpdateTriggers(estate, epqstate, relinfo, tupleid, fdw_trigtuple, slot, NULL, NULL, false) -#define Form_pg_sequence Form_pg_sequence_data - #define ExecARUpdateTriggers(estate, relinfo, tupleid, fdw_trigtuple, newslot, recheckIndexes) \ ExecARUpdateTriggers(estate, relinfo, NULL, NULL, tupleid, fdw_trigtuple, newslot, recheckIndexes, NULL, false) diff --git a/src/spock.c b/src/spock.c index 828c5d13..3d26a734 100644 --- a/src/spock.c +++ b/src/spock.c @@ -65,6 +65,7 @@ #include "spock_output_plugin.h" #include "spock_exception_handler.h" #include "spock_readonly.h" +#include "spock_seqam.h" #include "spock_shmem.h" #include "spock.h" @@ -1285,6 +1286,14 @@ _PG_init(void) NULL, NULL); + /* + * Distributed sequence access methods. Registers the + * spock.default_sequence_kind GUC and attaches the in-core + * nextval_hook. Called before the IsBinaryUpgrade early-return so + * pg_upgrade clusters still recognise the GUC in postgresql.conf. + */ + spock_seqam_init(); + if (IsBinaryUpgrade) return; diff --git a/src/spock_apply.c b/src/spock_apply.c index 9e41b3d7..1a9da4b3 100644 --- a/src/spock_apply.c +++ b/src/spock_apply.c @@ -84,6 +84,7 @@ #include "spock_exception_handler.h" #include "spock_common.h" #include "spock_readonly.h" +#include "spock_seqam.h" #include "spock.h" #include "spock_injection.h" @@ -2282,6 +2283,21 @@ handle_sequence(QueuedMessage *queued_message) reloid = get_relname_relid(relname, nspoid); last_value = pg_strtoint64(last_value_raw); + /* + * Defence in depth: if our local catalog says this sequence is managed + * (snowflake, ...) refuse to apply a setval from the publisher. The + * publisher should not be sending these in the first place (see the + * filter in spock_sequences.c), but a kind-table desync between source + * and dest could otherwise overwrite our locally-generated state. + */ + { + SpockSeqAmKind kind; + + if (spock_seqam_lookup_kind_by_name(nspname, relname, &kind) && + kind != SPOCK_SEQAM_LOCAL) + return; + } + DirectFunctionCall2(setval_oid, ObjectIdGetDatum(reloid), Int64GetDatum(last_value)); } diff --git a/src/spock_executor.c b/src/spock_executor.c index ad469abb..29d4793c 100644 --- a/src/spock_executor.c +++ b/src/spock_executor.c @@ -58,6 +58,7 @@ #include "spock_repset.h" #include "spock_queue.h" #include "spock_dependency.h" +#include "spock_seqam.h" #include "spock.h" @@ -234,6 +235,28 @@ spock_object_access(ObjectAccessType access, if (spknspoid == relnspoid) dropping_spock_obj = true; + + /* + * Sequence drop: clean up any spock.sequence_kind row and the + * shared-memory slot. Without this, dropping a managed + * sequence orphans the catalog row, and a future relation + * landing on the same OID (or even a CREATE SEQUENCE that + * reuses the OID) inherits the stale assignment and the + * stale Snowflake high-water-mark timestamp. + */ + if (get_rel_relkind(objectId) == RELKIND_SEQUENCE && + !dropping_spock_obj) + { + Oid save_userid; + int save_sec_context; + + GetUserIdAndSecContext(&save_userid, &save_sec_context); + SetUserIdAndSecContext(BOOTSTRAP_SUPERUSERID, + save_sec_context | + SECURITY_LOCAL_USERID_CHANGE); + spock_seqam_drop_sequence_record(objectId); + SetUserIdAndSecContext(save_userid, save_sec_context); + } } /* @@ -269,6 +292,28 @@ spock_object_access(ObjectAccessType access, /* Restore previous session privileges */ SetUserIdAndSecContext(save_userid, save_sec_context); } + /* + * Relation-level ALTER on a sequence: keep the spock.sequence_kind row + * in sync with ALTER SEQUENCE ... RENAME TO and ... SET SCHEMA. Both + * commands fire OAT_POST_ALTER with subId == 0 from RenameRelationInternal + * / AlterObjectNamespace_internal in src/backend/commands/. The hook + * runs under superuser like the OAT_DROP arm above: the operator may + * have rename rights without spock.sequence_kind write rights. + */ + else if (access == OAT_POST_ALTER && subId == 0 && + classId == RelationRelationId && + get_rel_relkind(objectId) == RELKIND_SEQUENCE) + { + Oid saved_userid; + int saved_sec_context; + + GetUserIdAndSecContext(&saved_userid, &saved_sec_context); + SetUserIdAndSecContext(BOOTSTRAP_SUPERUSERID, + saved_sec_context | + SECURITY_LOCAL_USERID_CHANGE); + spock_seqam_relocate_sequence_record(objectId); + SetUserIdAndSecContext(saved_userid, saved_sec_context); + } } void diff --git a/src/spock_node.c b/src/spock_node.c index e78c4710..9535deeb 100644 --- a/src/spock_node.c +++ b/src/spock_node.c @@ -165,6 +165,21 @@ validate_subscription_name(const char *name) } } +/* + * Derive a Spock node id from a node name. Truncated to 16 bits so the + * value fits cleanly inside an Oid column and leaves room for downstream + * uses (snowflake sequences mask off the low 10 bits as their node_id). + * Same input -> same output across nodes; callers that need cluster-wide + * uniqueness must verify by inspecting peers. + */ +Oid +spock_node_id_for_name(const char *name) +{ + return (Oid) + (DatumGetUInt32(hash_any((const unsigned char *) name, + strlen(name))) & 0xffff); +} + /* * Add new node to catalog. */ @@ -184,9 +199,7 @@ create_node(SpockNode *node) /* Generate new id unless one was already specified. */ if (node->id == InvalidOid) - node->id = - DatumGetUInt32(hash_any((const unsigned char *) node->name, - strlen(node->name))) & 0xffff; + node->id = spock_node_id_for_name(node->name); rv = makeRangeVar(EXTENSION_NAME, CATALOG_NODE, -1); rel = table_openrv(rv, RowExclusiveLock); diff --git a/src/spock_seqam.c b/src/spock_seqam.c new file mode 100644 index 00000000..81d27832 --- /dev/null +++ b/src/spock_seqam.c @@ -0,0 +1,1406 @@ +/*------------------------------------------------------------------------- + * + * spock_seqam.c + * Distributed sequence access method dispatcher for Spock. + * + * Attaches to the in-core nextval_hook (patches//pgN-050-nextval-hook.diff). + * The dispatcher maps a sequence OID to a kind (via the spock.sequence_kind + * catalog, cached per backend) and dispatches to a per-kind method. Managed + * kinds keep their state in the sequence relation's own heap tuple; the + * dispatcher owns no shared memory. + * + * Per-backend cache invalidation is relcache-driven. Mutators of + * spock.sequence_kind must call CacheInvalidateRelcacheByRelid(seqoid) + * explicitly: CatalogTupleUpdate on a non-system catalog does not fire + * relcache invalidations on its own. + * + * SQL surface: spock.alter_sequence_set_kind, spock.convert_all_sequences, + * spock.sequence_hook_available. + * + * Copyright (c) 2022-2026, pgEdge, Inc. + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/genam.h" +#include "access/heapam.h" +#include "access/htup_details.h" +#include "access/parallel.h" +#include "access/xact.h" +#include "access/xlog.h" +#include "access/xloginsert.h" + +#include "catalog/catalog.h" /* IsCatalogNamespace */ +#include "catalog/dependency.h" /* getExtensionOfObject */ +#include "catalog/indexing.h" +#include "catalog/namespace.h" +#include "catalog/pg_class.h" +#include "catalog/pg_index.h" +#include "catalog/pg_sequence.h" +#include "catalog/pg_type.h" /* INT8OID */ + +#include "commands/extension.h" +#include "commands/sequence.h" +#include "commands/trigger.h" /* SessionReplicationRole */ + +#include "utils/syscache.h" + +#include "miscadmin.h" + +#include "funcapi.h" + +#include "nodes/makefuncs.h" + +#include "storage/bufmgr.h" +#include "storage/bufpage.h" +#include "storage/lmgr.h" + +#include "utils/acl.h" +#include "utils/builtins.h" +#include "utils/fmgroids.h" +#include "utils/guc.h" +#include "utils/hsearch.h" +#include "utils/inval.h" +#include "utils/lsyscache.h" +#include "utils/memutils.h" +#include "utils/rel.h" +#include "utils/snapmgr.h" +#include "utils/syscache.h" + +#include "spock.h" +#include "spock_compat.h" +#include "spock_node.h" +#include "spock_seqam.h" + + +/* ----- GUCs ------------------------------------------------------------- */ + +int spock_seqam_default_kind = SPOCK_SEQAM_LOCAL; + +static const struct config_enum_entry SpockSeqAmKindOptions[] = { + {"local", SPOCK_SEQAM_LOCAL, false}, + {"snowflake", SPOCK_SEQAM_SNOWFLAKE, false}, + {NULL, 0, false} +}; + + +/* ----- Per-backend lookup cache ----------------------------------------- */ + +/* + * A backend-local hash entry mapping a sequence OID to the method assignment + * we discovered in spock.sequence_kind (or to "local" if no entry exists). + */ +typedef struct SeqAmCacheEntry +{ + Oid seqoid; /* hash key */ + SpockSeqAmKind kind; +} SeqAmCacheEntry; + +static HTAB *SeqAmCache = NULL; +static Oid SeqAmCatalogRelid = InvalidOid; +static Oid SeqAmCatalogPKIdxRelid = InvalidOid; /* (nspname,relname) PK */ +static Oid SeqAmCatalogOidIdxRelid = InvalidOid; /* secondary index on seqoid */ + +/* + * Resolve both spock.sequence_kind index OIDs (PK on (nspname, relname); + * secondary on seqoid) in a single index-list scan. Cached globally; the + * relcache invalidation callback clears the cache when the catalog or its + * indexes change. Either OID may remain InvalidOid if the corresponding + * index is missing -- callers fall back to a sequential scan. + */ +static void +seqam_catalog_resolve_indexes(Relation rel) +{ + List *idxlist; + ListCell *lc; + + if (OidIsValid(SeqAmCatalogPKIdxRelid) && + OidIsValid(SeqAmCatalogOidIdxRelid)) + return; + + idxlist = RelationGetIndexList(rel); + foreach(lc, idxlist) + { + Oid idxoid = lfirst_oid(lc); + HeapTuple tup; + bool is_pk = false; + + tup = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(idxoid)); + if (HeapTupleIsValid(tup)) + { + is_pk = ((Form_pg_index) GETSTRUCT(tup))->indisprimary; + ReleaseSysCache(tup); + } + + if (is_pk) + SeqAmCatalogPKIdxRelid = idxoid; + else if (!OidIsValid(SeqAmCatalogOidIdxRelid)) + { + char *nm = get_rel_name(idxoid); + + if (nm != NULL) + { + if (strcmp(nm, CATALOG_SEQUENCE_KIND_OID_IDX) == 0) + SeqAmCatalogOidIdxRelid = idxoid; + pfree(nm); + } + } + } + list_free(idxlist); +} + +static inline Oid +seqam_catalog_pkidx_oid(Relation rel) +{ + seqam_catalog_resolve_indexes(rel); + return SeqAmCatalogPKIdxRelid; +} + +static inline Oid +seqam_catalog_oididx_oid(Relation rel) +{ + seqam_catalog_resolve_indexes(rel); + return SeqAmCatalogOidIdxRelid; +} + +/* Method registry, indexed by SpockSeqAmKind. Sized by the enum sentinel. */ +static const SpockSeqAmMethod *SeqAmMethods[SPOCK_SEQAM_NMETHODS]; + + +/* + * Seed the sequence's heap tuple to (last_value = 0, is_called = false, + * log_cnt = 0). Used when a sequence is assigned to a managed kind, so + * the method's first nextval() reads an unambiguous "fresh" state rather + * than inheriting whatever counter the stock generator had emitted. + * + * Mirrors the heap+WAL pattern of sequence_am_local_nextval. Caller must + * hold a sufficient lock on seqrel (alter_sequence_set_kind already holds + * AccessShareLock via LockRelationOid). + */ +static void +seqam_reset_sequence_data(Relation seqrel) +{ + Buffer buf; + Page page; + ItemId lp; + HeapTupleData tupdata; + Form_pg_sequence_data seq; + + buf = ReadBuffer(seqrel, 0); + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); + + page = BufferGetPage(buf); + + /* Validate page magic; catches layout drift across PG major versions. */ + { + SpockSequencePageMagic *sm; + + sm = (SpockSequencePageMagic *) PageGetSpecialPointer(page); + if (sm->magic != SPOCK_SEQUENCE_PAGE_MAGIC) + elog(ERROR, "bad magic number in sequence \"%s\": %08X", + RelationGetRelationName(seqrel), sm->magic); + } + + /* A sequence page carries exactly one tuple at FirstOffsetNumber. */ + Assert(PageGetMaxOffsetNumber(page) == FirstOffsetNumber); + lp = PageGetItemId(page, FirstOffsetNumber); + Assert(ItemIdIsNormal(lp)); + + tupdata.t_data = (HeapTupleHeader) PageGetItem(page, lp); + tupdata.t_len = ItemIdGetLength(lp); + seq = (Form_pg_sequence_data) GETSTRUCT(&tupdata); + + if (RelationNeedsWAL(seqrel)) + GetTopTransactionId(); + + START_CRIT_SECTION(); + + MarkBufferDirty(buf); + + if (RelationNeedsWAL(seqrel)) + { + xl_seq_rec xlrec; + XLogRecPtr recptr; + + XLogBeginInsert(); + XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT); + + seq->last_value = 0; + seq->is_called = false; + seq->log_cnt = 0; + +#if PG_VERSION_NUM >= 160000 + xlrec.locator = seqrel->rd_locator; +#else + xlrec.node = seqrel->rd_node; +#endif + + XLogRegisterData(&xlrec, sizeof(xl_seq_rec)); + XLogRegisterData(tupdata.t_data, tupdata.t_len); + + recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG); + + PageSetLSN(page, recptr); + } + + seq->last_value = 0; + seq->is_called = false; + seq->log_cnt = 0; + + END_CRIT_SECTION(); + + UnlockReleaseBuffer(buf); +} + +static void +spock_seqam_register_methods(void) +{ + static const SpockSeqAmMethod snowflake_method = { + .name = "snowflake", + .kind = SPOCK_SEQAM_SNOWFLAKE, + .nextval = spock_seqam_snowflake_nextval, + }; + + /* LOCAL has no method entry -- absence means fall through. */ + SeqAmMethods[SPOCK_SEQAM_LOCAL] = NULL; + SeqAmMethods[SPOCK_SEQAM_SNOWFLAKE] = &snowflake_method; +} + + +/* + * Open spock.sequence_kind. Returns NULL when the relation does not exist + * (shared_preload_libraries loaded spock before CREATE EXTENSION). + * Caller closes with the matching lockmode. + */ +static Relation +seqam_catalog_open(LOCKMODE lockmode) +{ + RangeVar *rv; + Oid catrelid; + Relation rel; + + if (OidIsValid(SeqAmCatalogRelid)) + return table_open(SeqAmCatalogRelid, lockmode); + + rv = makeRangeVar(EXTENSION_NAME, CATALOG_SEQUENCE_KIND, -1); + catrelid = RangeVarGetRelid(rv, lockmode, true); + if (!OidIsValid(catrelid)) + return NULL; + rel = table_open(catrelid, NoLock); + SeqAmCatalogRelid = catrelid; + return rel; +} + +/* + * Resolve seqoid -> (nspname, relname) via the syscaches, palloc'ing the + * result strings in CurrentMemoryContext. Both out-parameters are set to + * NULL if the relation no longer exists; callers must check. + */ +static void +seqam_resolve_seq_name(Oid seqoid, char **nspname_out, char **relname_out) +{ + HeapTuple tup; + Form_pg_class classform; + + *nspname_out = NULL; + *relname_out = NULL; + + tup = SearchSysCache1(RELOID, ObjectIdGetDatum(seqoid)); + if (!HeapTupleIsValid(tup)) + return; + classform = (Form_pg_class) GETSTRUCT(tup); + *nspname_out = get_namespace_name(classform->relnamespace); + *relname_out = pstrdup(NameStr(classform->relname)); + ReleaseSysCache(tup); +} + +/* + * Parse the catalog row's kind text into the enum. ereports on garbage. + */ +static SpockSeqAmKind +seqam_parse_kind_datum(Datum kind_datum, const char *context_relname) +{ + text *t = DatumGetTextPP(kind_datum); + char *kindstr = text_to_cstring(t); + SpockSeqAmKind kind; + + if (strcmp(kindstr, "local") == 0) + kind = SPOCK_SEQAM_LOCAL; + else if (strcmp(kindstr, "snowflake") == 0) + kind = SPOCK_SEQAM_SNOWFLAKE; + else + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("unrecognised spock.sequence_kind value \"%s\" for sequence \"%s\"", + kindstr, + context_relname ? context_relname : "?"))); + + pfree(kindstr); + return kind; +} + +/* + * Look up the spock.sequence_kind row by (nspname, relname). Returns the + * parsed kind via *kind_out and true if a row exists; false if the catalog + * does not exist (early extension lifecycle) or no row matches. + * + * Caller must be inside a transaction. + */ +bool +spock_seqam_lookup_kind_by_name(const char *nspname, const char *relname, + SpockSeqAmKind *kind_out) +{ + Relation rel; + SysScanDesc scan; + ScanKeyData key[2]; + HeapTuple tup; + bool found = false; + Oid idxoid; + + Assert(nspname != NULL); + Assert(relname != NULL); + Assert(IsTransactionState()); /* systable_beginscan needs a xact */ + + rel = seqam_catalog_open(AccessShareLock); + if (rel == NULL) + return false; + + idxoid = seqam_catalog_pkidx_oid(rel); + + ScanKeyInit(&key[0], + Anum_sequence_kind_nspname, + BTEqualStrategyNumber, F_NAMEEQ, + CStringGetDatum(nspname)); + ScanKeyInit(&key[1], + Anum_sequence_kind_relname, + BTEqualStrategyNumber, F_NAMEEQ, + CStringGetDatum(relname)); + + scan = systable_beginscan(rel, idxoid, OidIsValid(idxoid), NULL, 2, key); + tup = systable_getnext(scan); + if (HeapTupleIsValid(tup)) + { + bool isnull; + Datum d; + + d = heap_getattr(tup, Anum_sequence_kind_kind, + RelationGetDescr(rel), &isnull); + if (isnull) + elog(ERROR, + "spock.sequence_kind row for %s.%s has NULL kind", + nspname, relname); + + *kind_out = seqam_parse_kind_datum(d, relname); + found = true; + } + + systable_endscan(scan); + table_close(rel, AccessShareLock); + + return found; +} + +/* + * Look up the spock.sequence_kind row for seqoid. Returns the parsed kind + * via *kind_out and true if a row exists; false if no row or if the + * sequence's relation no longer resolves to a name (concurrent drop). + * + * The lookup is keyed on (nspname, relname) -- names round-trip across + * pg_dump/restore where OIDs do not. seqoid -> name resolution is via the + * RELOID syscache, which is hot on the same code path that brought us + * here from nextval_internal(). + * + * Caller must be inside a transaction. + */ +static bool +seqam_catalog_lookup(Oid seqoid, SpockSeqAmKind *kind_out) +{ + char *nspname; + char *relname; + bool found; + + seqam_resolve_seq_name(seqoid, &nspname, &relname); + if (nspname == NULL || relname == NULL) + return false; + + found = spock_seqam_lookup_kind_by_name(nspname, relname, kind_out); + + pfree(nspname); + pfree(relname); + + return found; +} + + +/* + * Sinval-driven invalidation callback. Two paths: + * + * - Catalog or wildcard invalidation: destroy the whole cache. + * - Sequence-relation invalidation: remove that one entry. + * + * Safe to free outright because the dispatcher never holds an entry + * pointer across code that can fire callbacks (probe-then-populate + * pattern; see spock_seqam_nextval). + */ +static void +seqam_relcache_invalidate(Datum arg, Oid relid) +{ + (void) arg; + + if (SeqAmCatalogRelid == InvalidOid || relid == InvalidOid || + relid == SeqAmCatalogRelid) + { + HTAB *cache = SeqAmCache; + + /* NULL the pointer first so any nested callback is idempotent. */ + SeqAmCache = NULL; + SeqAmCatalogPKIdxRelid = InvalidOid; + SeqAmCatalogOidIdxRelid = InvalidOid; + SeqAmCatalogRelid = InvalidOid; + if (cache != NULL) + hash_destroy(cache); + } + else if (SeqAmCache != NULL) + (void) hash_search(SeqAmCache, &relid, HASH_REMOVE, NULL); +} + +static void +seqam_init_cache(void) +{ + HASHCTL ctl; + + if (SeqAmCache != NULL) + return; + + MemSet(&ctl, 0, sizeof(ctl)); + ctl.keysize = sizeof(Oid); + ctl.entrysize = sizeof(SeqAmCacheEntry); + ctl.hcxt = AllocSetContextCreate(TopMemoryContext, + "spock seqam cache", + ALLOCSET_SMALL_SIZES); + + SeqAmCache = hash_create("spock seqam backend cache", + 128, &ctl, + HASH_ELEM | HASH_BLOBS | HASH_CONTEXT); + + /* + * NOTE: the relcache callback itself is registered once at module load + * in spock_seqam_init(), not here, so that backends which never call a + * managed nextval() still receive DROP SEQUENCE invalidations. + */ +} + +/* ----- The hook --------------------------------------------------------- */ + +/* + * spock_seqam_nextval + * Single-consumer implementation of nextval_hook with the SeqAM- + * compatible signature (mirrors Paquier's upstream + * SequenceAmRoutine.nextval). Returns the value to use as the + * result of nextval(), and sets *last to the prefetch frontier + * so the in-core CACHE fast path is informed. + * + * For sequences Spock has chosen not to manage, delegates to + * sequence_am_local_nextval with the same arguments. For managed + * sequences, dispatches to the per-kind method via SeqAmMethods. + */ +static int64 +spock_seqam_nextval(Relation rel, + int64 incby, int64 maxv, int64 minv, + int64 cache, bool cycle, + int64 *last) +{ + Oid seqoid = RelationGetRelid(rel); + SeqAmCacheEntry *entry; + SpockSeqAmKind kind; + const SpockSeqAmMethod *method; + + /* init_sequence() upstream already opened the relation. */ + Assert(IsTransactionState()); + + if (SeqAmCache == NULL) + seqam_init_cache(); + + /* + * Probe the cache. On a hit we read kind into a local immediately + * and stop referring to `entry`, so the relcache callback is free to + * destroy or modify the hash later without worrying about live + * pointers from our stack frame. + */ + entry = (SeqAmCacheEntry *) hash_search(SeqAmCache, &seqoid, + HASH_FIND, NULL); + if (entry != NULL) + { + kind = entry->kind; + } + else + { + bool found; + + /* + * Catalog lookup acquires a lock and processes pending sinval, + * which fires our callback. Hold no entry pointer across this + * call -- the callback may HASH_REMOVE this seqoid (individual + * inval) or hash_destroy the whole cache (catalog inval). + */ + if (!seqam_catalog_lookup(seqoid, &kind)) + kind = spock_seqam_default_kind; + + /* Re-create if a callback nuked the cache during the lookup. */ + if (SeqAmCache == NULL) + seqam_init_cache(); + + entry = (SeqAmCacheEntry *) hash_search(SeqAmCache, &seqoid, + HASH_ENTER, &found); + entry->kind = kind; + } + + /* Unmanaged sequence: stock semantics. */ + if (kind == SPOCK_SEQAM_LOCAL) + return sequence_am_local_nextval(rel, incby, maxv, minv, + cache, cycle, last); + + /* Managed dispatch is unsupported in these states. */ + Assert(!RecoveryInProgress()); + Assert(!creating_extension); + Assert(!IsBinaryUpgrade); + Assert(kind > SPOCK_SEQAM_LOCAL && kind < SPOCK_SEQAM_NMETHODS); + + method = SeqAmMethods[kind]; + Assert(method != NULL); + Assert(method->nextval != NULL); + + return method->nextval(rel, incby, maxv, minv, cache, cycle, last); +} + + +/* ----- _PG_init entry point -------------------------------------------- */ + +void +spock_seqam_init(void) +{ + /* + * Managed-sequence state lives in the sequence's own heap tuple + * (last_value), so there is no per-sequence shared-memory subsystem to + * configure here. spock.max_managed_sequences from earlier revisions + * is gone; an entry in postgresql.conf for it produces a + * "unrecognized configuration parameter" message at startup. + */ + + DefineCustomEnumVariable("spock.default_sequence_kind", + "Method used for sequences not explicitly registered", + "Determines the dispatch behaviour when a " + "sequence has no row in spock.sequence_kind. " + "'local' means stock PostgreSQL semantics. " + "'snowflake' assigns the Snowflake method to " + "every nextval() call cluster-wide.", + &spock_seqam_default_kind, + SPOCK_SEQAM_LOCAL, + SpockSeqAmKindOptions, + PGC_SUSET, + 0, + NULL, NULL, NULL); + + spock_seqam_register_methods(); + + /* + * Register the relcache invalidation callback once at module load. + * Registering it lazily on first hook firing meant that a backend + * which never called a managed nextval() (e.g. one that ran nothing + * but DDL) would never see invalidations and could leak orphan + * entries in the per-backend cache. + */ + CacheRegisterRelcacheCallback(seqam_relcache_invalidate, (Datum) 0); + + /* + * Install the nextval_hook. Single-consumer: refuse to load if + * another extension already installed the hook, since the hook is + * the full value-generation path and chaining is not supported (see + * contract in commands/sequence.h). + */ + if (nextval_hook != NULL) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("nextval_hook is already installed by another extension"), + errhint("Only one extension may install nextval_hook. " + "Remove the conflicting extension from " + "shared_preload_libraries, or unload it before " + "loading Spock."))); + nextval_hook = spock_seqam_nextval; +} + + +/* ----- DROP SEQUENCE cleanup ------------------------------------------- */ + +/* + * Delete the spock.sequence_kind row (if any) and free the per-sequence + * shared-memory slot for the given sequence OID. Called from the + * object_access OAT_DROP path for RELKIND_SEQUENCE relations. + * + * Cross-node propagation: under autoddl, the DROP SEQUENCE itself + * replicates and fires this hook locally on each node, so no special + * apply-context handling is needed -- every node runs its own delete. + * + * The row is keyed on (nspname, relname). We can still resolve the + * sequence's name from its OID at OAT_DROP time: pg_class hasn't been + * removed yet. If the resolution fails (extension dropping, sequence + * already removed from pg_class), the row -- if any -- becomes orphaned + * and is cleaned up by the next spock.convert_all_sequences() run; this + * is acceptable best-effort behaviour. + */ +void +spock_seqam_drop_sequence_record(Oid seqoid) +{ + Relation rel; + SysScanDesc scan; + ScanKeyData key[2]; + HeapTuple tup; + char *nspname; + char *relname; + + /* + * If the extension itself is being dropped, the catalog table is about + * to disappear too -- nothing to clean up. Cleanup runs under + * superuser privileges because OAT_DROP can fire under a caller who + * has no rights on spock.sequence_kind -- see spock_executor.c for + * the SetUserIdAndSecContext shuffle. + */ + if (get_extension_oid(EXTENSION_NAME, true) == InvalidOid) + return; + + seqam_resolve_seq_name(seqoid, &nspname, &relname); + if (nspname == NULL || relname == NULL) + return; + + rel = seqam_catalog_open(RowExclusiveLock); + if (rel == NULL) + return; + + ScanKeyInit(&key[0], + Anum_sequence_kind_nspname, + BTEqualStrategyNumber, F_NAMEEQ, + CStringGetDatum(nspname)); + ScanKeyInit(&key[1], + Anum_sequence_kind_relname, + BTEqualStrategyNumber, F_NAMEEQ, + CStringGetDatum(relname)); + scan = systable_beginscan(rel, seqam_catalog_pkidx_oid(rel), true, + NULL, 2, key); + tup = systable_getnext(scan); + if (HeapTupleIsValid(tup)) + simple_heap_delete(rel, &tup->t_self); + systable_endscan(scan); + table_close(rel, RowExclusiveLock); +} + + +/* ----- ALTER SEQUENCE RENAME / SET SCHEMA cleanup ---------------------- */ + +/* + * Update the (nspname, relname) of a spock.sequence_kind row when ALTER + * SEQUENCE ... RENAME TO ... or ALTER SEQUENCE ... SET SCHEMA fires. + * + * Called from spock_object_access on OAT_POST_ALTER for RELKIND_SEQUENCE + * with subId == 0. At this point pg_class already reflects the new name; + * we look up our row by the (rename-stable) seqoid via the secondary + * index and rewrite (nspname, relname) if they differ. + * + * Best-effort: if no row matches seqoid (no kind assigned, or seqoid + * column is stale from a logical restore), nothing to do. The next + * spock.alter_sequence_set_kind() or spock.convert_all_sequences() on + * this sequence refreshes the seqoid column. + */ +void +spock_seqam_relocate_sequence_record(Oid seqoid) +{ + Relation rel; + SysScanDesc scan; + ScanKeyData key[1]; + HeapTuple tup; + Oid idxoid; + char *new_nsp; + char *new_rel; + + if (get_extension_oid(EXTENSION_NAME, true) == InvalidOid) + return; + + /* + * OAT_POST_ALTER fires after the catalog tuple update but before the + * CommandCounterIncrement that would publish it to the syscaches. + * Without an explicit CCI here, the syscache lookup below returns + * the pre-rename name and we conclude (wrongly) that nothing changed. + */ + CommandCounterIncrement(); + + seqam_resolve_seq_name(seqoid, &new_nsp, &new_rel); + if (new_nsp == NULL || new_rel == NULL) + return; + + rel = seqam_catalog_open(RowExclusiveLock); + if (rel == NULL) + return; + + idxoid = seqam_catalog_oididx_oid(rel); + + ScanKeyInit(&key[0], + Anum_sequence_kind_seqoid, + BTEqualStrategyNumber, F_OIDEQ, + ObjectIdGetDatum(seqoid)); + scan = systable_beginscan(rel, idxoid, OidIsValid(idxoid), NULL, 1, key); + tup = systable_getnext(scan); + if (HeapTupleIsValid(tup)) + { + Datum values[Natts_sequence_kind]; + bool nulls[Natts_sequence_kind]; + bool replaces[Natts_sequence_kind]; + bool row_nsp_isnull; + bool row_rel_isnull; + Datum row_nsp_d; + Datum row_rel_d; + + row_nsp_d = heap_getattr(tup, Anum_sequence_kind_nspname, + RelationGetDescr(rel), &row_nsp_isnull); + row_rel_d = heap_getattr(tup, Anum_sequence_kind_relname, + RelationGetDescr(rel), &row_rel_isnull); + + /* + * Compare with the live name. If unchanged (the typical case for + * non-rename ALTER SEQUENCE: OWNER TO, SET LOGGED/UNLOGGED, ...) + * skip the catalog rewrite. Saves an unnecessary WAL record per + * ALTER on every managed sequence. + */ + if (!row_nsp_isnull && !row_rel_isnull && + strcmp(NameStr(*DatumGetName(row_nsp_d)), new_nsp) == 0 && + strcmp(NameStr(*DatumGetName(row_rel_d)), new_rel) == 0) + { + systable_endscan(scan); + table_close(rel, RowExclusiveLock); + return; + } + + MemSet(nulls, false, sizeof(nulls)); + MemSet(replaces, false, sizeof(replaces)); + + values[Anum_sequence_kind_nspname - 1] = CStringGetDatum(new_nsp); + values[Anum_sequence_kind_relname - 1] = CStringGetDatum(new_rel); + replaces[Anum_sequence_kind_nspname - 1] = true; + replaces[Anum_sequence_kind_relname - 1] = true; + + { + HeapTuple newtup = heap_modify_tuple(tup, RelationGetDescr(rel), + values, nulls, replaces); + + CatalogTupleUpdate(rel, &tup->t_self, newtup); + } + } + systable_endscan(scan); + table_close(rel, RowExclusiveLock); + + /* + * Invalidate per-backend cache entries on this node so subsequent + * nextval() calls re-resolve under the new name. The seqoid hasn't + * changed but the kind lookup for it depends on the (now updated) + * (nspname, relname) row. + */ + CacheInvalidateRelcacheByRelid(seqoid); +} + + +/* ----- SQL-callable functions ------------------------------------------ */ + +PG_FUNCTION_INFO_V1(spock_alter_sequence_set_kind); +PG_FUNCTION_INFO_V1(spock_convert_all_sequences); +PG_FUNCTION_INFO_V1(spock_sequence_hook_available); + +/* + * Parse a "kind" string into a SpockSeqAmKind, ereporting on bad input. + * + * Centralised so the same validation/errmsg fires from every entry point + * (spock_alter_sequence_set_kind, spock_convert_all_sequences). + */ +static SpockSeqAmKind +parse_sequence_kind(const char *kindstr) +{ + if (strcmp(kindstr, "local") == 0) + return SPOCK_SEQAM_LOCAL; + if (strcmp(kindstr, "snowflake") == 0) + return SPOCK_SEQAM_SNOWFLAKE; + + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("invalid sequence kind \"%s\"", kindstr), + errhint("Valid kinds are: local, snowflake."))); + pg_unreachable(); +} + +/* + * Validate that a sequence is a legitimate target for a managed kind. + * + * Caller holds a lock that conflicts with ALTER SEQUENCE, so seqtypid / + * seqmax / seqcache are stable for the duration of this call. + * + * SPOCK_SEQAM_LOCAL is always valid. SPOCK_SEQAM_SNOWFLAKE refuses: + * - seqtypid != INT8OID (a snowflake doesn't fit in a smaller type), + * - seqmax < 1 << 22 (any non-zero snowflake timestamp shifts left + * 22 bits; smaller MAXVALUE excludes the snowflake range), + * - seqcache > 1 (the in-core CACHE fast path would emit (cache-1) + * stock-arithmetic values between hook calls). + * + * Note: snowflake otherwise ignores seqincrement, seqmin, seqcycle. The + * SQL standard treats them as monotonicity contracts on nextval(); managed + * kinds opt out of that contract by construction. We do not currently + * reject non-default values for those clauses, but operators should be + * aware they are advisory once kind != local. + */ +static void +validate_sequence_target(Oid seqoid, SpockSeqAmKind kind) +{ + Form_pg_sequence seqform; + HeapTuple tup; + Oid typid; + int64 maxv; + int64 cachev; + int64 incby; + const char *relname; + + if (kind != SPOCK_SEQAM_SNOWFLAKE) + return; + + tup = SearchSysCache1(SEQRELID, ObjectIdGetDatum(seqoid)); + if (!HeapTupleIsValid(tup)) + elog(ERROR, "cache lookup failed for sequence %u", seqoid); + seqform = (Form_pg_sequence) GETSTRUCT(tup); + typid = seqform->seqtypid; + maxv = seqform->seqmax; + cachev = seqform->seqcache; + incby = seqform->seqincrement; + ReleaseSysCache(tup); + + /* Caller holds a lock on the sequence; the name is stable. */ + relname = get_rel_name(seqoid); + if (relname == NULL) + relname = "?"; + + if (typid != INT8OID) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("sequence \"%s\" cannot be set to kind \"snowflake\"", + relname), + errdetail("Snowflake values are bigint; the sequence is declared with a smaller type."), + errhint("Use ALTER SEQUENCE %s AS bigint, then retry.", + relname))); + + /* + * Snowflake values use the full non-negative int64 range: as the + * 41-bit timestamp field advances toward its year-2092 ceiling, the + * emitted bigint approaches PG_INT64_MAX. Any user-imposed MAXVALUE + * less than PG_INT64_MAX would eventually be violated by some valid + * emission, so require NO MAXVALUE (which sets seqmax to + * PG_INT64_MAX on an ascending sequence). + */ + if (maxv < PG_INT64_MAX) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("sequence \"%s\" cannot be set to kind \"snowflake\"", + relname), + errdetail("Sequence MAXVALUE is " INT64_FORMAT + "; snowflake emissions use the full int64 range.", + maxv), + errhint("Use ALTER SEQUENCE %s NO MAXVALUE, then retry.", + relname))); + + /* + * CACHE > 1 would let the in-core CACHE fast path emit (cache-1) + * values via stock += increment before re-entering the hook; those + * are not valid snowflakes. + */ + if (cachev > 1) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("sequence \"%s\" cannot be set to kind \"snowflake\"", + relname), + errdetail("Sequence CACHE is " INT64_FORMAT + "; managed kinds require CACHE 1.", cachev), + errhint("Use ALTER SEQUENCE %s CACHE 1, then retry.", + relname))); + + /* + * INCREMENT must be in 1..SPOCK_SNOWFLAKE_COUNTER_MASK. + * + * Lower bound: non-positive increments would either collide on the + * counter portion (0) or emit decreasing counters (< 0), violating + * per-node monotonicity. + * + * Upper bound: the snowflake counter is 12 bits; an increment greater + * than COUNTER_MASK (4095) overflows the counter every call, + * degenerating to "one millisecond per nextval". Beyond being useless + * for batch-allocation patterns it costs throughput, and a large + * positive int64 would overflow when cast to int32 in the hot path. + * Reject up front rather than silently degrade or invoke undefined + * behaviour. + * + * Useful range: 1 covers single-id allocation; 32..1000 covers the + * Hibernate pooled-hi/lo patterns (each nextval skips that many + * counter slots so the application can derive the intermediate IDs + * locally without colliding with another node). + */ + if (incby <= 0 || incby > (int64) SPOCK_SNOWFLAKE_COUNTER_MASK) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("sequence \"%s\" cannot be set to kind \"snowflake\"", + relname), + errdetail("Sequence INCREMENT is " INT64_FORMAT + "; managed kinds require an increment in 1..%u.", + incby, (unsigned) SPOCK_SNOWFLAKE_COUNTER_MASK), + errhint("Use ALTER SEQUENCE %s INCREMENT 1, then retry.", + relname))); +} + +/* + * spock.alter_sequence_set_kind(seqname regclass, kind text) + * + * Insert or replace the spock.sequence_kind row for the given sequence. + * 'kind' must be one of: 'local', 'snowflake'. + * + * The row is keyed on (nspname, relname) and also stores the local seqoid + * for the OAT_POST_ALTER hook's by-OID lookup. Effect is node-local: + * autoddl replicates only LOGSTMT_DDL utility statements, not SQL function + * invocations, so cross-node propagation requires the operator to run this + * call on every node (each node then writes its own row with its own + * local seqoid). See sql/spock--6.0.0.sql at spock.sequence_kind for the + * full caveat. + */ +Datum +spock_alter_sequence_set_kind(PG_FUNCTION_ARGS) +{ + Oid seqoid = PG_GETARG_OID(0); + text *kind_in = PG_GETARG_TEXT_PP(1); + char *kindstr = text_to_cstring(kind_in); + SpockSeqAmKind kind; + SpockSeqAmKind row_kind = SPOCK_SEQAM_LOCAL; /* unused unless found */ + Relation rel; + SysScanDesc scan; + ScanKeyData key[2]; + HeapTuple tup; + bool found; + + kind = parse_sequence_kind(kindstr); + + /* + * Lock the sequence before any property check. ShareUpdateExclusive + * conflicts with ALTER SEQUENCE (ShareRowExclusive) so the seqcache + * / seqmax / seqtypid we validate below cannot drift between the + * check and our catalog write, and is self-conflicting so two + * concurrent alter_sequence_set_kind callers serialise. Does not + * conflict with nextval (RowExclusive). + */ + LockRelationOid(seqoid, ShareUpdateExclusiveLock); + + /* Concurrent DROP that committed before our lock: report cleanly. */ + if (get_rel_relkind(seqoid) != RELKIND_SEQUENCE) + { + UnlockRelationOid(seqoid, ShareUpdateExclusiveLock); + ereport(ERROR, + (errcode(ERRCODE_WRONG_OBJECT_TYPE), + errmsg("relation with OID %u is not a sequence", seqoid))); + } + + /* + * Stash the name now: get_rel_name() can return NULL under syscache + * pressure even while we hold the lock, which aclcheck_error() would + * pass straight to %s. + */ + { + char *seqrelname = get_rel_name(seqoid); + + if (seqrelname == NULL) + elog(ERROR, "cache lookup failed for sequence %u", seqoid); + + /* pg_class_ownercheck was renamed in PG 16; gate via PG_VERSION_NUM. */ +#if PG_VERSION_NUM >= 160000 + if (!object_ownercheck(RelationRelationId, seqoid, GetUserId())) + aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_SEQUENCE, seqrelname); +#else + if (!pg_class_ownercheck(seqoid, GetUserId())) + aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_SEQUENCE, seqrelname); +#endif + pfree(seqrelname); + } + + validate_sequence_target(seqoid, kind); + + { + char *nspname; + char *relname; + + seqam_resolve_seq_name(seqoid, &nspname, &relname); + if (nspname == NULL || relname == NULL) + elog(ERROR, "cache lookup failed for sequence %u", seqoid); + + rel = seqam_catalog_open(RowExclusiveLock); + if (rel == NULL) + ereport(ERROR, + (errcode(ERRCODE_UNDEFINED_TABLE), + errmsg("spock.sequence_kind catalog not found"), + errhint("The library is loaded but the extension is not " + "yet created. Run CREATE EXTENSION spock first."))); + + ScanKeyInit(&key[0], + Anum_sequence_kind_nspname, + BTEqualStrategyNumber, F_NAMEEQ, + CStringGetDatum(nspname)); + ScanKeyInit(&key[1], + Anum_sequence_kind_relname, + BTEqualStrategyNumber, F_NAMEEQ, + CStringGetDatum(relname)); + scan = systable_beginscan(rel, seqam_catalog_pkidx_oid(rel), true, + NULL, 2, key); + tup = systable_getnext(scan); + found = HeapTupleIsValid(tup); + + /* + * If a row already exists, capture its current kind so we can tell + * whether this call is a true kind change (which must reset the + * sequence's heap-tuple state) or a no-op same-kind re-assertion + * (which must NOT wipe accumulated snowflake state -- doing so + * risks emitting a value that was already issued earlier in the + * same millisecond). + */ + if (found) + { + Datum d; + bool isnull; + char *existing_kindstr; + + d = heap_getattr(tup, Anum_sequence_kind_kind, + RelationGetDescr(rel), &isnull); + Assert(!isnull); + existing_kindstr = TextDatumGetCString(d); + row_kind = parse_sequence_kind(existing_kindstr); + pfree(existing_kindstr); + } + + { + Datum values[Natts_sequence_kind]; + bool nulls[Natts_sequence_kind]; + bool replaces[Natts_sequence_kind]; + HeapTuple newtup; + + MemSet(nulls, false, sizeof(nulls)); + MemSet(replaces, false, sizeof(replaces)); + + values[Anum_sequence_kind_nspname - 1] = CStringGetDatum(nspname); + values[Anum_sequence_kind_relname - 1] = CStringGetDatum(relname); + values[Anum_sequence_kind_kind - 1] = CStringGetTextDatum(kindstr); + values[Anum_sequence_kind_seqoid - 1] = ObjectIdGetDatum(seqoid); + + if (found) + { + /* Update kind+seqoid in place; nspname/relname are the PK. */ + replaces[Anum_sequence_kind_kind - 1] = true; + replaces[Anum_sequence_kind_seqoid - 1] = true; + newtup = heap_modify_tuple(tup, RelationGetDescr(rel), + values, nulls, replaces); + CatalogTupleUpdate(rel, &tup->t_self, newtup); + } + else + { + newtup = heap_form_tuple(RelationGetDescr(rel), values, nulls); + CatalogTupleInsert(rel, newtup); + } + } + + systable_endscan(scan); + table_close(rel, RowExclusiveLock); + } + + /* + * Seed last_value = 0 so the method's first nextval doesn't misread + * stale stock counts as packed state. LOCAL needs no seed. + * + * Only reset on a real transition: a fresh insert (!found) or an + * actual kind change (row_kind != kind). Resetting a sequence + * already in the target kind would clobber its accumulated + * (ts, ctr) state and could cause a same-millisecond duplicate. + */ + if (kind != SPOCK_SEQAM_LOCAL && (!found || row_kind != kind)) + { + Relation seqrel; + + seqrel = table_open(seqoid, RowExclusiveLock); + seqam_reset_sequence_data(seqrel); + table_close(seqrel, NoLock); + } + + /* + * CatalogTupleUpdate on a non-system catalog does not fire relcache + * invalidations on its own; force one keyed on the sequence so other + * backends ON THIS NODE rebuild their cached (seqoid -> kind) entry. + * Cross-node propagation is the operator's job (this function runs + * separately on every node); when autoddl learns to replicate SQL + * function calls, each node's local invocation will fire its own + * invalidation locally. + */ + CacheInvalidateRelcacheByRelid(seqoid); + + CommandCounterIncrement(); + + pfree(kindstr); + PG_RETURN_VOID(); +} + +/* + * spock.convert_all_sequences(method text default 'snowflake', + * force bool default false) + * returns integer + * + * For every user-visible sequence in the current database, insert a row + * into spock.sequence_kind assigning it to the given method. Sequences + * that already have a row are skipped unless force is true. + * + * Returns the number of sequences converted (or re-converted). + * + * NOTE for non-Snowflake methods: a non-trivial implementation would also + * pre-set the sequence's starting value above the max(col) value across + * the cluster. Snowflake values are well above any realistic local + * value (the timestamp field starts at the Spock epoch shifted left by + * 22 bits, so even the first value after epoch is around 2^53 to 2^54); + * they cannot collide with realistic local values, so no such + * pre-seeding is needed here. + */ +Datum +spock_convert_all_sequences(PG_FUNCTION_ARGS) +{ + text *kind_in = PG_GETARG_TEXT_PP(0); + bool force = PG_GETARG_BOOL(1); + char *kindstr = text_to_cstring(kind_in); + SpockSeqAmKind kind; + Relation pgclass; + Relation skrel; + RangeVar *skrv; + SysScanDesc scan; + HeapTuple tup; + int n_converted = 0; + + /* Touches sequences owned by other roles; restrict to superuser. */ + if (!superuser()) + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("must be superuser to run spock.convert_all_sequences"), + errhint("Use spock.alter_sequence_set_kind() per sequence to " + "convert sequences you own."))); + + kind = parse_sequence_kind(kindstr); + + /* Database-scoped advisory xact lock: serialise concurrent conversions. */ + { + LOCKTAG tag; + + SET_LOCKTAG_ADVISORY(tag, MyDatabaseId, + (uint32) 0x53504F43, /* "SPOC" */ + (uint32) 0x4B534551, /* "KSEQ" */ + 1); + (void) LockAcquire(&tag, ExclusiveLock, false, false); + } + + skrv = makeRangeVar(EXTENSION_NAME, CATALOG_SEQUENCE_KIND, -1); + skrel = table_openrv(skrv, RowExclusiveLock); + + pgclass = table_open(RelationRelationId, AccessShareLock); + scan = systable_beginscan(pgclass, InvalidOid, false, NULL, 0, NULL); + + while (HeapTupleIsValid(tup = systable_getnext(scan))) + { + Form_pg_class classform = (Form_pg_class) GETSTRUCT(tup); + Oid seqoid; + SysScanDesc sscan; + ScanKeyData key[2]; + HeapTuple srow; + bool exists; + + if (classform->relkind != RELKIND_SEQUENCE) + continue; + + /* + * Skip system / temp schemas. `relnamespace < FirstNormalObjectId` + * would also skip "public" (OID 2200), which is wrong; filter by + * IsCatalogNamespace + isAnyTempNamespace + "pg_*"/"information_schema". + */ + { + Oid nspoid = classform->relnamespace; + char *nspname; + + if (IsCatalogNamespace(nspoid) || isAnyTempNamespace(nspoid)) + continue; + + nspname = get_namespace_name(nspoid); + if (nspname == NULL) + continue; + if (strncmp(nspname, "pg_", 3) == 0 || + strcmp(nspname, "information_schema") == 0) + { + pfree(nspname); + continue; + } + pfree(nspname); + } + + seqoid = classform->oid; + + /* Extension-owned sequences (e.g. Spock's own sub_id_generator). */ + if (OidIsValid(getExtensionOfObject(RelationRelationId, seqoid))) + continue; + + /* + * ShareUpdateExclusive blocks concurrent ALTER SEQUENCE; held to + * commit, so seqcache / seqmax / seqtypid we validate below cannot + * drift before our catalog write. Self-conflicting; doesn't block + * nextval. + */ + LockRelationOid(seqoid, ShareUpdateExclusiveLock); + if (get_rel_relkind(seqoid) != RELKIND_SEQUENCE) + { + UnlockRelationOid(seqoid, ShareUpdateExclusiveLock); + continue; + } + + /* + * Fail the whole conversion on any incompatible target -- partial + * conversion across the cluster is worse than none. + */ + validate_sequence_target(seqoid, kind); + + { + char *nspname = get_namespace_name(classform->relnamespace); + char *relname = NameStr(classform->relname); + + ScanKeyInit(&key[0], + Anum_sequence_kind_nspname, + BTEqualStrategyNumber, F_NAMEEQ, + CStringGetDatum(nspname)); + ScanKeyInit(&key[1], + Anum_sequence_kind_relname, + BTEqualStrategyNumber, F_NAMEEQ, + CStringGetDatum(relname)); + sscan = systable_beginscan(skrel, seqam_catalog_pkidx_oid(skrel), + true, NULL, 2, key); + srow = systable_getnext(sscan); + exists = HeapTupleIsValid(srow); + + /* + * (a) no row -> INSERT; (b) row exists, force or kind change -> + * UPDATE kind+seqoid; (c) row exists, no kind change, seqoid + * drifted (logical restore) -> refresh seqoid only. + */ + { + Datum values[Natts_sequence_kind]; + bool nulls[Natts_sequence_kind]; + bool replaces[Natts_sequence_kind]; + HeapTuple newtup; + bool did_write = false; + bool needs_reset = false; + + MemSet(nulls, false, sizeof(nulls)); + MemSet(replaces, false, sizeof(replaces)); + + values[Anum_sequence_kind_nspname - 1] = CStringGetDatum(nspname); + values[Anum_sequence_kind_relname - 1] = CStringGetDatum(relname); + values[Anum_sequence_kind_kind - 1] = CStringGetTextDatum(kindstr); + values[Anum_sequence_kind_seqoid - 1] = ObjectIdGetDatum(seqoid); + + if (!exists) + { + newtup = heap_form_tuple(RelationGetDescr(skrel), + values, nulls); + CatalogTupleInsert(skrel, newtup); + did_write = true; + needs_reset = (kind != SPOCK_SEQAM_LOCAL); + } + else + { + Datum row_kind_d; + Datum row_seqoid_d; + bool isnull; + SpockSeqAmKind row_kind; + Oid row_seqoid; + + row_kind_d = heap_getattr(srow, Anum_sequence_kind_kind, + RelationGetDescr(skrel), &isnull); + if (isnull) + elog(ERROR, + "spock.sequence_kind row for %s.%s has NULL kind", + nspname, relname); + row_kind = seqam_parse_kind_datum(row_kind_d, relname); + + row_seqoid_d = heap_getattr(srow, Anum_sequence_kind_seqoid, + RelationGetDescr(skrel), &isnull); + row_seqoid = isnull ? InvalidOid : DatumGetObjectId(row_seqoid_d); + + if (force || row_kind != kind) + { + replaces[Anum_sequence_kind_kind - 1] = true; + replaces[Anum_sequence_kind_seqoid - 1] = true; + newtup = heap_modify_tuple(srow, + RelationGetDescr(skrel), + values, nulls, replaces); + CatalogTupleUpdate(skrel, &srow->t_self, newtup); + did_write = true; + /* Reset only on actual transition; force-rewrite + * of the same kind keeps in-flight (ts, ctr). */ + needs_reset = (kind != SPOCK_SEQAM_LOCAL && + row_kind != kind); + } + else if (row_seqoid != seqoid) + { + /* Refresh stale seqoid only. */ + replaces[Anum_sequence_kind_seqoid - 1] = true; + newtup = heap_modify_tuple(srow, + RelationGetDescr(skrel), + values, nulls, replaces); + CatalogTupleUpdate(skrel, &srow->t_self, newtup); + did_write = true; + } + } + + if (needs_reset) + { + Relation seqrel; + + seqrel = table_open(seqoid, RowExclusiveLock); + seqam_reset_sequence_data(seqrel); + table_close(seqrel, NoLock); + } + + if (did_write) + { + CacheInvalidateRelcacheByRelid(seqoid); + n_converted++; + } + } + + systable_endscan(sscan); + } + } + + systable_endscan(scan); + table_close(pgclass, AccessShareLock); + table_close(skrel, RowExclusiveLock); + + CommandCounterIncrement(); + pfree(kindstr); + + PG_RETURN_INT32(n_converted); +} + +/* + * True when Spock is attached to the in-core nextval_hook. The library + * fails to load on unpatched PG (PGDLLIMPORT reference), so this is + * effectively always true once SQL can call it; retained for monitoring + * tools and forward compatibility with a runtime-feature-probe build. + */ +Datum +spock_sequence_hook_available(PG_FUNCTION_ARGS) +{ + PG_RETURN_BOOL((void *) nextval_hook == (void *) spock_seqam_nextval); +} diff --git a/src/spock_seqam_snowflake.c b/src/spock_seqam_snowflake.c new file mode 100644 index 00000000..2f62bba5 --- /dev/null +++ b/src/spock_seqam_snowflake.c @@ -0,0 +1,519 @@ +/*------------------------------------------------------------------------- + * + * spock_seqam_snowflake.c + * Snowflake distributed sequence access method for Spock. + * + * Generates int8 sequence values with the layout: + * + * bit 63 : reserved, always 0 (sign bit, keeps values non-negative) + * bits 62..22 : 41-bit timestamp in milliseconds since + * SPOCK_SNOWFLAKE_EPOCH_MS + * bits 21..12 : 10-bit node id (0..1023) + * bits 11..0 : 12-bit per-millisecond counter (0..4095) + * + * Uniqueness across a Spock cluster is guaranteed by the embedded node id, + * derived from spock.node.id (10 low bits). Operators must pick node + * names whose hashes don't collide on the low 10 bits. Within a node, + * uniqueness is guaranteed by the buffer lock on the sequence's heap + * page held while we read (ts, ctr) from last_value and write the new + * value back. + * + * Persistence: the (ts, ctr) state lives in the sequence's heap tuple + * (last_value), so crash recovery is the same as for a stock PostgreSQL + * sequence: replay of XLOG_SEQ_LOG restores last_value, and the next + * nextval() reads it. No DSA, no shmem slot. The cost is one WAL record + * and one exclusive buffer lock per call -- the explicit trade for + * crash-safety and ~1000 lines of subsystem code. + * + * Clock-skew handling: we keep the *most recent* (ts, ctr) pair in + * last_value (with the node_id bits masked off for interpretation). If + * the wall clock regresses (NTP step, VM migration), we keep emitting + * from the last observed timestamp and advance only the counter. We + * never emit a value whose timestamp is lower than one we have already + * emitted. A WARNING is logged on the first regression per backend per + * minute. + * + * Counter exhaustion (4096 values per millisecond per node): we advance + * to ts+1 and reset the counter. Under sustained > 4M nextval()/s per + * node this would push our notion of time ahead of the wall clock + * indefinitely, which is fine; if the workload subsides, the clock + * catches up and the timestamp field re-syncs. + * + * Copyright (c) 2022-2026, pgEdge, Inc. + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/htup_details.h" +#include "access/xact.h" +#include "access/xlog.h" +#include "access/xloginsert.h" +#include "catalog/pg_sequence.h" +#include "commands/sequence.h" /* xl_seq_rec, XLOG_SEQ_LOG */ +#include "miscadmin.h" +#include "storage/bufmgr.h" +#include "storage/bufpage.h" +#include "storage/lmgr.h" +#include "utils/rel.h" +#include "utils/timestamp.h" + +#include "spock_node.h" +#include "spock_seqam.h" + + +/* + * Per-backend node id cache. + * + * Resolved on first call from spock.node (the local Spock node) and + * masked to the snowflake field width. Storing the result lets the hot + * path skip the catalog lookup on every nextval() call. + * + * -1 is the "unresolved" sentinel; once we resolve, the cached value is + * 1..1023 (zero is reserved). + */ +static int32 SnowflakeBackendNodeId = -1; + +/* + * Rate-limit clock-skew WARNING emission: one per backend per minute. + */ +static TimestampTz SnowflakeLastSkewWarning = 0; +#define SNOWFLAKE_SKEW_WARN_INTERVAL_US INT64CONST(60000000) /* 60 s */ + + +/* + * Resolve the node id once per backend and cache it. + * + * Derives from the local Spock node: spock.node.id is a 16-bit value + * (hash_any(name) & 0xffff at node_create); we mask it to the snowflake + * field's 10 bits. Two nodes whose names hash to the same low-10-bit + * value silently collide cluster-wide -- operators are responsible for + * picking names with distinct low-10-bit hashes across peer nodes. + * + * Errors if Spock has not been initialised (no spock.node row) or if + * the derived id is 0 (reserved). + */ +static int32 +snowflake_resolve_node_id(void) +{ + SpockLocalNode *ln; + + if (SnowflakeBackendNodeId >= 0) + return SnowflakeBackendNodeId; + + ln = get_local_node(false, true); + if (ln == NULL || ln->node == NULL || ln->node->id == InvalidOid) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("snowflake sequences require a configured Spock node"), + errhint("Run spock.node_create() before using snowflake-managed sequences."))); + + SnowflakeBackendNodeId = + (int32) (ln->node->id & SPOCK_SNOWFLAKE_MAX_NODE_ID); + + if (SnowflakeBackendNodeId == 0) + ereport(ERROR, + (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED), + errmsg("Spock node \"%s\" hashes to snowflake node 0", ln->node->name), + errhint("Rename the Spock node; node id 0 is reserved."))); + + return SnowflakeBackendNodeId; +} + + +/* + * Current wall-clock time in milliseconds since SPOCK_SNOWFLAKE_EPOCH_MS. + * + * GetCurrentTimestamp() returns TimestampTz, microseconds since + * 2000-01-01 00:00:00 UTC. Unix-epoch ms = us/1000 + 946684800000. + */ +static int64 +snowflake_now_ms_since_epoch(void) +{ + TimestampTz now = GetCurrentTimestamp(); + int64 unix_ms; + + unix_ms = (now / INT64CONST(1000)) + INT64CONST(946684800000); + return unix_ms - SPOCK_SNOWFLAKE_EPOCH_MS; +} + + +/* + * Bit-unpack helper for the (ts_ms, counter) pair we recover from + * last_value. Layout matches the snowflake output but with the 10 + * node_id bits masked off by the caller: the node_id is a per-backend + * constant and is composed back into the *output* value at emit time. + */ +#define SF_STATE_TS_SHIFT SPOCK_SNOWFLAKE_COUNTER_BITS +#define SF_STATE_CTR_MASK SPOCK_SNOWFLAKE_COUNTER_MASK +#define SF_STATE_TS_MASK (((UINT64CONST(1) << SPOCK_SNOWFLAKE_TIMESTAMP_BITS) - 1) \ + << SF_STATE_TS_SHIFT) + +static inline void +sf_unpack(uint64 packed, int64 *ts, int32 *ctr) +{ + *ctr = (int32) (packed & SF_STATE_CTR_MASK); + *ts = (int64) ((packed & SF_STATE_TS_MASK) >> SF_STATE_TS_SHIFT); +} + + +/* + * Locate the sequence tuple in the relation's first (and only) block and + * return a pointer into the pinned, exclusive-locked buffer. Mirrors the + * static read_seq_tuple() helper in src/backend/commands/sequence.c, which + * is not exposed to extensions. + * + * TODO(seqam-merge): drop in favour of the public helper when Paquier's + * SeqAM patch lands. + * + * Caller releases the buffer via UnlockReleaseBuffer(). + */ +static Form_pg_sequence_data +snowflake_lock_seq_tuple(Relation rel, Buffer *buf, HeapTupleData *tupout) +{ + Page page; + ItemId lp; + + *buf = ReadBuffer(rel, 0); + LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE); + + page = BufferGetPage(*buf); + + /* + * Validate the page's special-area magic. Catches a future core + * change to sequence page layout that would otherwise let us write + * over the wrong tuple. + */ + { + SpockSequencePageMagic *sm; + + sm = (SpockSequencePageMagic *) PageGetSpecialPointer(page); + if (sm->magic != SPOCK_SEQUENCE_PAGE_MAGIC) + elog(ERROR, "bad magic number in sequence \"%s\": %08X", + RelationGetRelationName(rel), sm->magic); + } + + /* A sequence page carries exactly one tuple at FirstOffsetNumber. */ + Assert(PageGetMaxOffsetNumber(page) == FirstOffsetNumber); + + lp = PageGetItemId(page, FirstOffsetNumber); + Assert(ItemIdIsNormal(lp)); + + tupout->t_data = (HeapTupleHeader) PageGetItem(page, lp); + tupout->t_len = ItemIdGetLength(lp); + + /* + * Pre-PG-12 SELECT FOR UPDATE on a sequence could leave a non-frozen + * xmax behind. Modern core declines such locks but the bug class is + * cheap to defend against; if a tuple carries one, clear it. Hint- + * level update -- no WAL needed. + */ + Assert(!(tupout->t_data->t_infomask & HEAP_XMAX_IS_MULTI)); + if (HeapTupleHeaderGetRawXmax(tupout->t_data) != InvalidTransactionId) + { + HeapTupleHeaderSetXmax(tupout->t_data, InvalidTransactionId); + tupout->t_data->t_infomask &= ~HEAP_XMAX_COMMITTED; + tupout->t_data->t_infomask |= HEAP_XMAX_INVALID; + MarkBufferDirtyHint(*buf, true); + } + + return (Form_pg_sequence_data) GETSTRUCT(tupout); +} + + +/* + * The hot path. Reads the current (ts, ctr) from last_value, composes the + * new (ts, ctr) using the same clock-skew / counter-exhaustion logic as the + * shmem-CAS variant did, writes it back, and emits one XLOG_SEQ_LOG record. + * + * Called from nextval_internal() via Spock's dispatcher. Parallel-mode is + * impossible here -- nextval_internal() PreventCommandIfParallelMode's + * before our hook fires. + * + * Two timestamp-overflow checks: one against the wall clock (top of + * function), one against the synthesised timestamp inside the loop (the + * synthesised value can drift past the wall clock under counter + * exhaustion). Both ereport(ERROR) -- a wrapped 41-bit field would + * silently break monotonicity. + */ +int64 +spock_seqam_snowflake_nextval(Relation rel, + int64 incby, int64 maxv, int64 minv, + int64 cache, bool cycle, + int64 *last) +{ + int32 node_id = snowflake_resolve_node_id(); + int64 value; + int64 now_ms; + int64 cur_ts; + int32 cur_ctr; + int64 new_ts; + int32 new_ctr; + uint64 cur_packed; + Buffer buf; + Page page; + HeapTupleData seqdatatuple; + Form_pg_sequence_data seq; + + (void) maxv; + (void) minv; + (void) cache; + (void) cycle; + + Assert(node_id >= 1 && node_id <= SPOCK_SNOWFLAKE_MAX_NODE_ID); + + now_ms = snowflake_now_ms_since_epoch(); + if (now_ms < 0) + ereport(ERROR, + (errcode(ERRCODE_DATA_EXCEPTION), + errmsg("system clock is before the Spock snowflake epoch"), + errdetail("System clock indicates a time earlier than " + "2023-01-01 UTC. Snowflake sequences cannot " + "produce values until the clock is correct."))); + + if (((uint64) now_ms) >> SPOCK_SNOWFLAKE_TIMESTAMP_BITS) + ereport(ERROR, + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), + errmsg("snowflake timestamp field exhausted"), + errdetail("The 41-bit timestamp field exhausted around " + "the year 2092. Snowflake sequences can no " + "longer produce monotonically increasing " + "values; a new epoch is required."))); + + /* + * Read current state from the sequence tuple. The buffer lock + * serialises us against any other backend doing the same on this + * sequence, so we do not need a CAS retry loop. + */ + seq = snowflake_lock_seq_tuple(rel, &buf, &seqdatatuple); + page = BufferGetPage(buf); + + /* + * Strip the 10 node_id bits so we interpret last_value as our packed + * (ts, ctr). A stock-PG last_value (small counter, e.g. 1..N) decodes + * to (ts=0, ctr=N), which is harmless: the first now_ms > 0 path below + * resets the counter. alter_sequence_set_kind() also rewrites + * last_value to 0 explicitly when assigning snowflake, so we don't + * rely on this property. + */ + cur_packed = ((uint64) seq->last_value) & ~SPOCK_SNOWFLAKE_NODE_MASK; + sf_unpack(cur_packed, &cur_ts, &cur_ctr); + + if (now_ms > cur_ts) + { + /* Normal forward path: time advanced, reset counter. */ + new_ts = now_ms; + new_ctr = 0; + } + else + { + int64 candidate_ctr; + + /* + * Either we are within the same millisecond as the last emit + * (very high call rate) or the wall clock regressed. Either + * way: do not regress; use cur_ts and bump the counter by the + * sequence's INCREMENT. + * + * Do the arithmetic in int64 so a user who ALTERs INCREMENT to + * a large value after the kind was set (bypassing + * validate_sequence_target) cannot trigger int32 overflow here. + * validate_sequence_target enforces 1..COUNTER_MASK at SET KIND + * time; this is belt-and-braces. + */ + new_ts = cur_ts; + candidate_ctr = (int64) cur_ctr + incby; + + if (candidate_ctr > (int64) SPOCK_SNOWFLAKE_COUNTER_MASK) + { + /* + * Counter exhausted in this millisecond. Advance the + * timestamp by one ms and reset the counter. We are now + * synthesising a future timestamp; the wall clock will + * catch up. + */ + new_ts = cur_ts + 1; + new_ctr = 0; + + /* + * Sustained > 4M nextval()/s/sequence/node could push the + * synthesised timestamp arbitrarily far into the future and + * ultimately overflow the 41-bit field. Refuse the write + * here rather than silently wrapping. + */ + if (((uint64) new_ts) >> SPOCK_SNOWFLAKE_TIMESTAMP_BITS) + { + UnlockReleaseBuffer(buf); + ereport(ERROR, + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), + errmsg("snowflake timestamp field exhausted under saturation"), + errdetail("Per-millisecond counter exhaustion has pushed the " + "synthesised timestamp beyond the 41-bit field " + "(year ~2092). Reduce sustained nextval() throughput " + "or assign additional snowflake-managed sequences " + "to spread the load."))); + } + } + else + { + /* Safe: candidate_ctr is in [0, COUNTER_MASK], fits int32. */ + new_ctr = (int32) candidate_ctr; + } + + if (now_ms < cur_ts) + { + TimestampTz now_us = GetCurrentTimestamp(); + + if (now_us - SnowflakeLastSkewWarning >= + SNOWFLAKE_SKEW_WARN_INTERVAL_US) + { + ereport(WARNING, + (errmsg("wall clock regressed by " INT64_FORMAT " ms on a Spock snowflake sequence", + cur_ts - now_ms), + errdetail("Continuing from the last observed " + "timestamp. No duplicate values " + "will be emitted."))); + SnowflakeLastSkewWarning = now_us; + } + } + } + + /* + * Compose the output value with the node id slotted in. + */ + value = (int64) + (((uint64) new_ts << SPOCK_SNOWFLAKE_TIMESTAMP_SHIFT) | + ((uint64) node_id << SPOCK_SNOWFLAKE_NODE_SHIFT) | + (uint64) new_ctr); + + Assert(value >= 0); + + /* + * WAL batching: we pre-log a reservation covering the next + * SPOCK_SNOWFLAKE_LOG_INTERVAL_MS milliseconds. seq->log_cnt holds + * the next-reservation threshold (encoded as a snowflake). Until + * the current value crosses it, we skip the WAL record. On crash, + * replay restores the reservation, so the next nextval after + * recovery emits at >= the reserved value -- never a duplicate. + * + * Same idea core PG uses for stock sequence caching, in the time + * domain instead of count domain. + */ + { + /* + * Compare timestamp portions only; node_id bits are nominally + * identical between consecutive log writes (node_id is stable + * per Spock node) but stripping them makes the decision immune + * to any future per-backend or per-something-else identity + * scheme. + */ + bool logit; + + logit = (((uint64) value) >> SPOCK_SNOWFLAKE_TIMESTAMP_SHIFT) >= + (((uint64) seq->log_cnt) >> SPOCK_SNOWFLAKE_TIMESTAMP_SHIFT); + + /* Force a log after a checkpoint so crash recovery sees a fresh + * reservation; otherwise replay would restore a pre-checkpoint + * state we have since advanced past. */ + if (!logit) + { + XLogRecPtr redoptr = GetRedoRecPtr(); + + if (PageGetLSN(page) <= redoptr) + logit = true; + } + + /* + * Compute the new threshold before entering the critical section + * so we can ereport(ERROR) cleanly if it would overflow the + * 41-bit field. Threshold encodes (new_ts + interval, 0). + */ + if (logit && RelationNeedsWAL(rel)) + { + int64 future_ts = new_ts + SPOCK_SNOWFLAKE_LOG_INTERVAL_MS; + + if (((uint64) future_ts) >> SPOCK_SNOWFLAKE_TIMESTAMP_BITS) + { + UnlockReleaseBuffer(buf); + ereport(ERROR, + (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), + errmsg("snowflake timestamp field exhausted"), + errdetail("The pre-log reservation window would " + "push the timestamp past the 41-bit " + "field (year ~2092)."))); + } + + /* Acquire xid for syncrep outside the critical section. */ + GetTopTransactionId(); + + START_CRIT_SECTION(); + MarkBufferDirty(buf); + + { + xl_seq_rec xlrec; + XLogRecPtr recptr; + int64 future_value; + + future_value = (int64) + (((uint64) future_ts << SPOCK_SNOWFLAKE_TIMESTAMP_SHIFT) | + ((uint64) node_id << SPOCK_SNOWFLAKE_NODE_SHIFT)); + + XLogBeginInsert(); + XLogRegisterBuffer(0, buf, REGBUF_WILL_INIT); + + /* + * Write the FUTURE state into the tuple before + * XLogRegisterData captures it, so crash recovery + * restores the reservation. + */ + seq->last_value = future_value; + seq->is_called = true; + seq->log_cnt = future_value; + +#if PG_VERSION_NUM >= 160000 + xlrec.locator = rel->rd_locator; +#else + xlrec.node = rel->rd_node; +#endif + XLogRegisterData(&xlrec, sizeof(xl_seq_rec)); + XLogRegisterData(seqdatatuple.t_data, seqdatatuple.t_len); + + recptr = XLogInsert(RM_SEQ_ID, XLOG_SEQ_LOG); + PageSetLSN(page, recptr); + } + + /* Now overwrite with the ACTUAL state for currval/lastval. */ + seq->last_value = value; + seq->is_called = true; + /* seq->log_cnt stays at the threshold we just logged. */ + + END_CRIT_SECTION(); + } + else + { + /* + * No WAL this call: either still within the previous + * reservation window, or the relation does not need WAL + * (temp / unlogged). Write the actual value directly. + */ + START_CRIT_SECTION(); + MarkBufferDirty(buf); + + seq->last_value = value; + seq->is_called = true; + + END_CRIT_SECTION(); + } + } + + UnlockReleaseBuffer(buf); + + /* + * Snowflake does not prefetch: every nextval() must round-trip through + * us to compose a fresh (timestamp, counter, node_id). Setting *last + * to the returned value keeps elm->last == elm->cached so the in-core + * CACHE fast path does not fire on the next call. + */ + *last = value; + return value; +} diff --git a/src/spock_sequences.c b/src/spock_sequences.c index e9496501..a4f2a5bc 100644 --- a/src/spock_sequences.c +++ b/src/spock_sequences.c @@ -35,6 +35,7 @@ #include "spock_compat.h" #include "spock_queue.h" #include "spock_repset.h" +#include "spock_seqam.h" #define CATALOG_SEQUENCE_STATE "sequence_state" @@ -62,13 +63,13 @@ sequence_get_last_value(Oid seqoid) SysScanDesc scan; HeapTuple tup; int64 last_value; - Form_pg_sequence seq; + Form_pg_sequence_data seq; seqrel = table_open(seqoid, AccessShareLock); scan = systable_beginscan(seqrel, 0, false, NULL, 0, NULL); tup = systable_getnext(scan); Assert(HeapTupleIsValid(tup)); - seq = (Form_pg_sequence) GETSTRUCT(tup); + seq = (Form_pg_sequence_data) GETSTRUCT(tup); last_value = seq->last_value; systable_endscan(scan); table_close(seqrel, AccessShareLock); @@ -117,9 +118,22 @@ synchronize_sequences(void) char *nspname; char *relname; StringInfoData json; + SpockSeqAmKind kind; CHECK_FOR_INTERRUPTS(); + /* + * Managed sequences generate independently on every node and must + * not be cluster-propagated. Skip without disturbing their state + * row -- the row may still serve other monitoring purposes. + */ + nspname = get_namespace_name(get_rel_namespace(oldseq->seqoid)); + relname = get_rel_name(oldseq->seqoid); + if (nspname != NULL && relname != NULL && + spock_seqam_lookup_kind_by_name(nspname, relname, &kind) && + kind != SPOCK_SEQAM_LOCAL) + continue; + last_value = sequence_get_last_value(oldseq->seqoid); /* Not enough of the sequence was consumed yet for us to care. */ @@ -197,6 +211,7 @@ synchronize_sequence(Oid seqoid) char *nspname; char *relname; StringInfoData json; + SpockSeqAmKind kind; SpockLocalNode *local_node = get_local_node(true, false); /* Check if the oid points to actual sequence. */ @@ -208,6 +223,22 @@ synchronize_sequence(Oid seqoid) errmsg("\"%s\" is not a sequence", RelationGetRelationName(seqrel)))); + /* + * Managed sequences (snowflake, ...) generate independently on every + * node and must not be cluster-propagated. Resolve by (nspname, + * relname); the catalog may not exist yet (early extension lifecycle) + * in which case spock_seqam_lookup_kind_by_name returns false and we + * proceed as for a local sequence. + */ + nspname = get_namespace_name(RelationGetNamespace(seqrel)); + relname = RelationGetRelationName(seqrel); + if (spock_seqam_lookup_kind_by_name(nspname, relname, &kind) && + kind != SPOCK_SEQAM_LOCAL) + { + table_close(seqrel, AccessShareLock); + return; + } + /* Now search for it in our tracking table. */ rv = makeRangeVar(EXTENSION_NAME, CATALOG_SEQUENCE_STATE, -1); rel = table_openrv(rv, RowExclusiveLock); diff --git a/tests/regress/expected/init.out b/tests/regress/expected/init.out index 4cb7d4b9..8fef60c9 100644 --- a/tests/regress/expected/init.out +++ b/tests/regress/expected/init.out @@ -21,7 +21,12 @@ CREATE USER super SUPERUSER; GRANT ALL ON SCHEMA public TO nonsuper; DO $$ BEGIN - IF (SELECT setting::integer/100 FROM pg_settings WHERE name = 'server_version_num') >= 1000 THEN + -- The pg_current_xlog_location compat shim is for PG < 10 only and lives + -- in the spock schema, which is created later by CREATE EXTENSION spock. + -- Guard on schema presence so this DO block is a no-op on a fresh + -- provider_dsn where spock has not yet been installed. + IF (SELECT setting::integer/100 FROM pg_settings WHERE name = 'server_version_num') >= 1000 + AND EXISTS (SELECT 1 FROM pg_namespace WHERE nspname = 'spock') THEN CREATE OR REPLACE FUNCTION spock.pg_current_xlog_location() RETURNS pg_lsn LANGUAGE SQL AS 'SELECT pg_current_wal_lsn()'; ALTER FUNCTION spock.pg_current_xlog_location() OWNER TO super; @@ -39,17 +44,19 @@ SELECT E'\'' || current_database() || E'\'' AS subdb; \c :provider_dsn SET client_min_messages = 'warning'; CREATE EXTENSION IF NOT EXISTS spock; --- Check default settings of installed Spock extension +-- Check default settings of installed Spock extension. +-- extconfig is cast to regclass[] because the OIDs in it vary with +-- installation order; the names are stable. SELECT extnamespace::regnamespace, extrelocatable, - extconfig, + extconfig::regclass[], extcondition, obj_description(oid) AS comment FROM pg_extension WHERE extname = 'spock'; - extnamespace | extrelocatable | extconfig | extcondition | comment ---------------+----------------+-----------+--------------+-------------------------------- - spock | f | | | PostgreSQL Logical Replication + extnamespace | extrelocatable | extconfig | extcondition | comment +--------------+----------------+-----------------------+--------------+-------------------------------- + spock | f | {spock.sequence_kind} | {""} | PostgreSQL Logical Replication (1 row) SELECT * FROM spock.node_create(node_name := 'test_provider', dsn := (SELECT provider_dsn FROM spock_regress_variables()) || ' user=super'); diff --git a/tests/regress/sql/init.sql b/tests/regress/sql/init.sql index 6698e490..cdf1fc2c 100644 --- a/tests/regress/sql/init.sql +++ b/tests/regress/sql/init.sql @@ -28,7 +28,12 @@ GRANT ALL ON SCHEMA public TO nonsuper; DO $$ BEGIN - IF (SELECT setting::integer/100 FROM pg_settings WHERE name = 'server_version_num') >= 1000 THEN + -- The pg_current_xlog_location compat shim is for PG < 10 only and lives + -- in the spock schema, which is created later by CREATE EXTENSION spock. + -- Guard on schema presence so this DO block is a no-op on a fresh + -- provider_dsn where spock has not yet been installed. + IF (SELECT setting::integer/100 FROM pg_settings WHERE name = 'server_version_num') >= 1000 + AND EXISTS (SELECT 1 FROM pg_namespace WHERE nspname = 'spock') THEN CREATE OR REPLACE FUNCTION spock.pg_current_xlog_location() RETURNS pg_lsn LANGUAGE SQL AS 'SELECT pg_current_wal_lsn()'; ALTER FUNCTION spock.pg_current_xlog_location() OWNER TO super; @@ -45,11 +50,13 @@ SET client_min_messages = 'warning'; CREATE EXTENSION IF NOT EXISTS spock; --- Check default settings of installed Spock extension +-- Check default settings of installed Spock extension. +-- extconfig is cast to regclass[] because the OIDs in it vary with +-- installation order; the names are stable. SELECT extnamespace::regnamespace, extrelocatable, - extconfig, + extconfig::regclass[], extcondition, obj_description(oid) AS comment FROM pg_extension WHERE extname = 'spock'; diff --git a/tests/tap/schedule b/tests/tap/schedule index 66ed7b5e..c3fc7f66 100644 --- a/tests/tap/schedule +++ b/tests/tap/schedule @@ -45,3 +45,4 @@ test: 019_stale_fd_epoll_after_conn_death test: 022_rmgr_progress_post_checkpoint_crash # Upgrade schema match test (builds from source, slow): #test: 018_upgrade_schema_match +