Restructure heap apply retry into two phases for two distinct races#488
Restructure heap apply retry into two phases for two distinct races#488ibrarahmad wants to merge 1 commit into
Conversation
spock_apply_heap_update() and spock_apply_heap_delete() shared a single retry loop that conflated a tight retry for a brief local visibility window with a CV wait for the predecessor insert commit. One iteration count covered both, so under high lag five CV waits could block apply for tens of seconds. Split the loop: phase 1 retries tight up to spock.read_retry_count times with no wait; phase 2 calls the new wait_for_previous_transaction_timeout() once, capped by the new spock.read_retry_wait_ms GUC (default 100 ms).
📝 WalkthroughWalkthroughThis PR adds a two-phase retry mechanism for missing rows encountered during heap UPDATE and DELETE apply operations in replication, implemented via a new ChangesTwo-Phase Heap Apply Retry
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Duplication | 0 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/tap/t/030_read_retry_count_guc.pl (1)
131-221: ⚡ Quick winConsider extracting a GUC test helper to reduce duplication.
The test block for
spock.read_retry_wait_ms(lines 131-221) mirrors the structure forspock.read_retry_count(lines 38-129) with ~90 lines of near-identical test logic. Both blocks verify default value, pg_settings metadata, ALTER SYSTEM behavior, and range validation—only the GUC name, expected values, and variable names differ.Refactoring into a helper function (e.g.,
test_guc_properties($name, $default, $min, $max, $unit)) would reduce duplication and simplify future GUC test additions.Since this follows the pre-existing test pattern, this refactor is optional and can be deferred.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/tap/t/030_read_retry_count_guc.pl` around lines 131 - 221, Extract a reusable helper (e.g., test_guc_properties) that encapsulates the repeated checks (default via scalar_query/SHOW, pg_settings metadata checks for context/unit/min_val/max_val via scalar_query, ALTER SYSTEM SET/RESET with psql_or_bail and sleep, and out-of-range/ boundary checks using system/psql) and call it for both spock.read_retry_count and spock.read_retry_wait_ms with appropriate arguments (name, expected default, min, max, unit); replace the duplicated blocks with calls to test_guc_properties and keep existing helper calls like scalar_query and psql_or_bail inside the new function to preserve behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/tap/t/030_read_retry_count_guc.pl`:
- Around line 131-221: Extract a reusable helper (e.g., test_guc_properties)
that encapsulates the repeated checks (default via scalar_query/SHOW,
pg_settings metadata checks for context/unit/min_val/max_val via scalar_query,
ALTER SYSTEM SET/RESET with psql_or_bail and sleep, and out-of-range/ boundary
checks using system/psql) and call it for both spock.read_retry_count and
spock.read_retry_wait_ms with appropriate arguments (name, expected default,
min, max, unit); replace the duplicated blocks with calls to test_guc_properties
and keep existing helper calls like scalar_query and psql_or_bail inside the new
function to preserve behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9c17f0d4-9ad5-4f4e-b5bb-6d6af74a4664
📒 Files selected for processing (6)
include/spock.hinclude/spock_apply.hsrc/spock.csrc/spock_apply.csrc/spock_apply_heap.ctests/tap/t/030_read_retry_count_guc.pl
Restructure heap apply retry into two phases for two distinct races
Background
When a remote UPDATE or DELETE arrives and
FindReplTupleInLocalReldoes not see the target row,spock_apply_heap_update()andspock_apply_heap_delete()enter a retry loop. That loop is the result of two separate fixes that got layered on top of each other:Commit the earlier fix (Feb 2024), "Retry up to 5 times if tuple not found", said in its message: "For some reason a tuple may not be found for update or delete by primary key under extreme load and concurrency. This is a very rare race condition and a simple retry a few times seems to work." The fix was a tight back-to-back loop, no wait between iterations. It handled a transient local visibility window that resolved itself in microseconds.
Commit the later fix (Sep 2024), "In case of an update or a delete, if we don't find the local tuple, we should wait for the predecessor insert commit", added a
wait_for_previous_transaction()call inside that loop. The wait blocks on the apply group'sprev_processed_cvcondition variable until the predecessor transaction's commit-ts shows up in shared memory.Problem
The two fixes are doing different things, but they share one iteration count and the wait sits inside the tight-retry loop. That creates two problems:
The visibility-window case (the the earlier fix problem) is no longer handled the way the original commit intended, because every iteration now blocks on a CV instead of spinning.
The predecessor-insert case (the the later fix problem) is limited to 5 iterations. With high replication lag, each iteration can block for the full predecessor commit latency, so one missing-row event can stall apply for 5 times that latency. At seconds-scale lag that becomes tens of seconds, and throughput collapses.
A customer reported "aggressive retries in high-lag scenarios significantly slowed replication throughput", which matches the second case. Separate internal feedback flagged "tuple updated concurrently" errors from local concurrency, which matches the first case, where the original tight-spin would have helped and the current CV-gated loop does not.
What this commit changes
The loop is split into two phases. Each phase uses the primitive that fits its race.
Phase 1 is the tight retry. It calls
FindReplTupleInLocalRelup tospock.read_retry_counttimes (default 5), with no wait in between. This is sized for microsecond-scale visibility windows. When the row is present, phase 1 finds it and the function exits.Phase 2 is the bounded predecessor wait. It runs only if phase 1 did not find the row. It calls a new
wait_for_previous_transaction_timeout(int timeout_ms)once, capped atspock.read_retry_wait_ms(default 100 ms). The wait is the same CV-driven one as before, but with a hard deadline. After the wait, we recheck the relation once. If the row is still missing, the function falls through to the existingupdate_missing/delete_missingconflict path. That path already exists and already handles this case.The new
wait_for_previous_transaction_timeout()lives inspock_apply.cand returnsbool(true if the predecessor commit-ts was observed, false on timeout). The existingwait_for_previous_transaction()becomes a one-line wrapper that passestimeout_ms = 0, which means "no deadline". Existing callers keep their current behavior.Behaviour summary
read_retry_wait_ms(100 ms), then fall throughread_retry_wait_ms, then fall throughNew GUC surface
spock.read_retry_countspock.read_retry_wait_msspock.read_retry_countalready existed; its description was updated to call out phase 1 explicitly.spock.read_retry_wait_msis new and caps phase 2. Setting it to 0 disables phase 2 entirely (the loop returns to the conflict path right after phase 1). Both are SIGHUP, so they can be retuned without restarting the apply worker.Alternatives considered
A single wall-clock cap on the existing combined loop. This bounds the worst case but keeps the two races mixed together. It does not restore tight spinning for the visibility-window case.
An adaptive policy keyed off replication lag (lag thresholds, peer comparison, TPS scaling, exponential backoff). This is what the customer initially asked for. It adds shared-memory lag caches, threshold GUCs, peer-comparison logic, and a tuning matrix that nobody will get right. The CV is already event-driven, so there is nothing to back off against. The signal that actually matters is which race we are in, not how lagged the cluster is, and the phased structure answers that statically.
A per-table policy. The apply worker is per-subscription, not per-table, and missing-row events are uniform across tables.
Risk
The existing callers of
wait_for_previous_transaction()inspock_apply.care unaffected. They go through the wrapper, which passestimeout_ms = 0and takes the same code path the function took before.For healthy workloads the default behavior is "tight retry up to 5 times, then wait up to 100 ms". Before this commit the same 5 retries could each wait seconds. The observable difference is faster fall-through to the conflict handler when the row is genuinely missing. The conflict handler is unchanged.
Setting
spock.read_retry_wait_ms = 0reproduces the original the earlier fix behavior (tight spin, no CV wait). Available as a rollback knob if a workload turns out to depend on the previous combined behavior.Tests
tests/tap/t/030_read_retry_count_guc.plcovers both GUCs (28 subtests, passing on PG18): default value,pg_settingsmetadata (context, unit, min, max),ALTER SYSTEM SETpluspg_reload_conf()picks up the new value,ALTER SYSTEM RESETreturns to the default, out-of-range rejected, boundary values accepted.t/001_basic.plalso still passes, which confirms thewait_for_previous_transaction()wrapper has not changed behavior for its existing callers.