syncer: trigger full sync on history gap by okJiang · Pull Request #10670 · tikv/pd

okJiang · 2026-05-12T11:50:33Z

What problem does this PR solve?

Issue Number: Close #10668

When a PD follower requests a region sync index that is no longer available in the leader's history buffer, the leader used to return no records and keep the stream alive. The follower could then keep stale region metadata in memory and in local region storage.

What is changed and how does it work?

- Make the leader fall back to full region synchronization when the follower's requested index is zero, older than the history buffer, or ahead of the leader.
- Keep the stream unbound until the full snapshot and any catch-up history records are sent, so updates generated during the snapshot are not missed.
- Make the follower clear local region cache and persisted region storage when a full sync is triggered after it already had a non-zero history index.
- Persist the reset history index so follower restarts do not reload an old index after a full-sync reset.
- Apply the same full-sync reset handling to the router region syncer cache path.
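
The fallback rule in the first item can be illustrated with a small sketch. This is not the code from the PR: `needFullSync`, `firstIndex`, and `nextIndex` are hypothetical names, and treating a request equal to `nextIndex` as "up to date" is an assumption.

```go
package main

import "fmt"

// needFullSync reports whether a leader should give up on incremental
// history replay and send a full region snapshot instead.
// firstIndex is the oldest index still retained in the history buffer and
// nextIndex is the index the leader will assign to its next record.
// All names here are hypothetical; they are not taken from the PR.
func needFullSync(requested, firstIndex, nextIndex uint64) bool {
	switch {
	case requested == 0: // follower has no history at all
		return true
	case requested < firstIndex: // requested records were already evicted
		return true
	case requested > nextIndex: // follower claims to be ahead of the leader
		return true
	default:
		return false
	}
}

func main() {
	// Assume the leader retains indexes [100, 200).
	for _, req := range []uint64{0, 50, 150, 200, 250} {
		fmt.Printf("requested=%d fullSync=%v\n", req, needFullSync(req, 100, 200))
	}
}
```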

Check List

Tests

Unit test

Release note

Fix PD follower region sync recovery when the follower falls behind the leader history buffer.

Summary by CodeRabbit

Bug Fixes
- Improved region synchronization state handling during history→full-sync transitions
- Ensured stale region cache and storage are reset to maintain consistency when sync mode changes
- Fixed history catch-up and restart behavior to detect completion and avoid stale streams
Refactor
- Centralized region-sync response processing into a stateful handler for clearer control flow
- Stream binding now occurs only after appropriate sync phase completion
Tests
- Added tests covering history catch-up, full-sync transitions, cache/storage reset, and server sync scenarios

ti-chi-bot · 2026-05-12T11:50:37Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

ti-chi-bot · 2026-05-12T11:50:41Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign siddontang for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-05-12T11:52:15Z

📝 Walkthrough

Walkthrough

Leader detects unrecoverable history gaps and falls back to streaming a full region snapshot; followers track history vs full-sync phases, reset cache/storage on transition, persist rebuilt regions, and resume streaming with consistent region state.

Changes

Region syncer history-overflow recovery

Layer / File(s)	Summary
History buffer persistence & accessor `pkg/syncer/history_buffer.go`	`resetWithIndex` now persists immediately; `getFirstIndex()` added to read the buffer's first index under lock.
Server: history vs full-sync routing & full-sync implementation `pkg/syncer/server.go`	`syncHistoryRegion` → `syncFullRegions` routing added; full-sync streams full region batches, catches up history with overflow detection/restart, and binds stream only when appropriate.
Client: stateful response handling & cache/storage reset `pkg/syncer/client.go`, `pkg/syncer/client_test.go`	Introduced `regionSyncState`, extracted `handleRegionSyncResponse`, added `resetRegionCacheAndStorage`, and updated receive loop; tests validate reset and full-sync transitions.
Router: sync-mode flags for leader responses `pkg/mcs/router/server/sync.go`	Per-stream `syncingHistory` and `fullSyncing` flags gate interpretation of leader responses and trigger router-side cache reset on history→full transition.
Tests: server & client validations `pkg/syncer/server_test.go`, `pkg/syncer/client_test.go`	New server tests for fallback/catch-up scenarios and client tests for cache/storage reset and state transitions; includes `testSyncRegionsServer` test double.

sequenceDiagram
  participant Leader as Leader (history buffer)
  participant Server as PD Leader / sync server
  participant Client as PD Follower / RegionSyncer
  participant Storage as Follower region storage & cache

  Leader->>Server: follower StartIndex request
  alt StartIndex within retained history
    Server->>Client: send incremental history batch (StartIndex>0)
    Client->>Storage: apply incremental updates
  else StartIndex outside history or StartIndex==0
    Server->>Client: send full-sync initial batch (StartIndex=0)
    Client->>Client: detect history→full transition (regionSyncState)
    Client->>Storage: reset cache and flush/clear region storage
    Server->>Client: stream full snapshot and remaining regions
    Client->>Storage: persist snapshot and replay catch-up history
    Server->>Client: send completion (StartIndex = catchUpIndex)
    Client->>Client: clear fullSyncing/syncingHistory and enable streaming
  end
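
The follower side of the diagram above can be sketched as a small state machine. This is a simplified model, not the PR's `regionSyncState`: the `syncState` and `store` types are hypothetical, and it assumes, as the walkthrough suggests, that a response with `StartIndex == 0` marks a full-sync batch while a non-zero `StartIndex` marks incremental history or completion.

```go
package main

import "fmt"

// syncState is a hypothetical stand-in for the follower's sync state.
type syncState struct {
	lastIndex   uint64 // next history index the follower expects
	fullSyncing bool   // true while a full snapshot is being received
}

// store is a toy in-memory stand-in for the follower's cache and storage.
type store struct {
	regions map[uint64]string
}

func (s *store) reset() { s.regions = make(map[uint64]string) }

// handle applies one response and reports whether the store was reset.
// Assumption: startIndex == 0 marks a full-sync batch.
func (st *syncState) handle(s *store, startIndex uint64, regions map[uint64]string) (reset bool) {
	if startIndex == 0 {
		// First full-sync batch after the follower already had history:
		// drop everything so stale regions cannot survive the snapshot.
		if !st.fullSyncing && st.lastIndex > 0 {
			s.reset()
			st.lastIndex = 0
			reset = true
		}
		st.fullSyncing = true
	} else {
		st.fullSyncing = false
		st.lastIndex = startIndex + uint64(len(regions))
	}
	for id, meta := range regions {
		s.regions[id] = meta
	}
	return reset
}

func main() {
	st := &syncState{lastIndex: 42}
	s := &store{regions: map[uint64]string{1: "stale"}}

	fmt.Println(st.handle(s, 0, map[uint64]string{2: "fresh"})) // history -> full: reset
	fmt.Println(st.handle(s, 0, map[uint64]string{3: "fresh"})) // still full sync: no reset
	st.handle(s, 120, nil)                                      // completion message
	fmt.Println(st.fullSyncing, st.lastIndex, len(s.regions))
}
```

Resetting only on the first full-sync batch matters: clearing on every batch would discard regions delivered earlier in the same snapshot.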

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

region syncer: follower may serve stale region cache forever after history buffer overflow #10666: Implements leader detection of history-buffer overflow and follower-side reset + full-sync fallback described in the issue.

Possibly related PRs

tikv/pd#10672: Related refactor of client/server sync handling and stateful response processing.
tikv/pd#10685: Related changes to RegionSyncer.Sync binding/unbinding and stream termination behavior.

Suggested labels

lgtm, approved

Suggested reviewers

rleungx
bufferflies

Poem

🐰 I hopped through history, then found a big gap,
A full-sync was summoned — I emptied my map.
Caches cleared, stores flushed, regions rebuilt in a whirl,
Now follower and leader both see the same world. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'syncer: trigger full sync on history gap' clearly summarizes the main change: implementing full sync triggering when a history gap is detected in region synchronization.
Description check	✅ Passed	The PR description follows the template structure with a clear problem statement, implementation details, checklist items, and release notes. All required sections are completed.
Linked Issues check	✅ Passed	The code changes comprehensively address all requirements from `#10668`: leader detects history gaps and triggers full sync, follower clears cache/storage on full sync reset, history index is persisted, and stream binding is delayed until snapshot is sent.
Out of Scope Changes check	✅ Passed	All changes are directly related to implementing full sync on history gaps. No unrelated refactoring, unrelated bug fixes, or scope creep detected in the modified files.

coderabbitai

🧹 Nitpick comments (1)

pkg/syncer/server_test.go (1)

40-45: ⚡ Quick win

Snapshot responses in the test double to avoid pointer aliasing across sends.

Send appends the original pointer; if the caller reuses/mutates the same proto between sends, earlier assertions can read mutated data. Store a cloned message in responses.

Proposed patch

 import (
 	"context"
 	"testing"
 
 	"github.com/stretchr/testify/require"
 	"google.golang.org/grpc"
+	"google.golang.org/protobuf/proto"
@@
 func (s *testSyncRegionsServer) Send(resp *pdpb.SyncRegionResponse) error {
-	s.responses = append(s.responses, resp)
+	if resp == nil {
+		s.responses = append(s.responses, nil)
+		return nil
+	}
+	s.responses = append(s.responses, proto.Clone(resp).(*pdpb.SyncRegionResponse))
 	if s.onSend != nil {
 		s.onSend(resp)
 	}
 	return nil
 }
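
The aliasing problem this comment describes can be reproduced without protobuf. The sketch below uses a hypothetical `response` struct and a hand-written `clone` method as stand-ins; for a real `pdpb.SyncRegionResponse` the clone utility would have to match the protobuf package the message is generated with.

```go
package main

import "fmt"

// response is a hypothetical stand-in for a protobuf message.
type response struct {
	StartIndex uint64
	Regions    []string
}

// clone returns a deep copy so later mutations of r are not visible.
func (r *response) clone() *response {
	if r == nil {
		return nil
	}
	return &response{
		StartIndex: r.StartIndex,
		Regions:    append([]string(nil), r.Regions...),
	}
}

// recorder keeps both the raw pointer and a clone for comparison.
type recorder struct {
	aliased []*response
	cloned  []*response
}

func (rec *recorder) send(r *response) {
	rec.aliased = append(rec.aliased, r)
	rec.cloned = append(rec.cloned, r.clone())
}

func main() {
	rec := &recorder{}
	msg := &response{StartIndex: 1, Regions: []string{"a"}}
	rec.send(msg)

	// The caller reuses the same message for the next send.
	msg.StartIndex = 2
	msg.Regions[0] = "b"
	rec.send(msg)

	// The aliased entry now shows the mutated value; the clone does not.
	fmt.Println(rec.aliased[0].StartIndex, rec.cloned[0].StartIndex) // prints "2 1"
}
```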

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/syncer/server_test.go` around lines 40 - 45, testSyncRegionsServer.Send
currently appends the incoming *pdpb.SyncRegionResponse pointer directly into
responses which allows later mutations to affect earlier snapshots; instead,
deep-copy (clone) the resp before storing it and invoke onSend with the cloned
copy as well. Update testSyncRegionsServer.Send to create a clone of the
*pdpb.SyncRegionResponse (using the appropriate protobuf clone utility for your
proto package), append that clone to the responses slice, and pass the clone to
s.onSend if set.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 32ad935d-e2a3-46b9-94c9-7e8f0ad8eb7d

📥 Commits

Reviewing files that changed from the base of the PR and between 87710779c704a48e87a472e1031844823c7d3475 and 457178c.

📒 Files selected for processing (6)

pkg/mcs/router/server/sync.go
pkg/syncer/client.go
pkg/syncer/client_test.go
pkg/syncer/history_buffer.go
pkg/syncer/server.go
pkg/syncer/server_test.go

🚧 Files skipped from review as they are similar to previous changes (5)

pkg/syncer/history_buffer.go
pkg/mcs/router/server/sync.go
pkg/syncer/client_test.go
pkg/syncer/server.go
pkg/syncer/client.go

Signed-off-by: okjiang <[email protected]>

ti-chi-bot · 2026-05-20T04:15:31Z

@okJiang: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-error-log-review	`6b56372`	link	true	`/test pull-error-log-review`
pull-unit-test-next-gen-1	`6b56372`	link	true	`/test pull-unit-test-next-gen-1`
pull-unit-test-next-gen-3	`6b56372`	link	true	`/test pull-unit-test-next-gen-3`

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/syncer/server.go (1)
612-625: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

closeAllClient can still hang shutdown on a blocked send.

Line 623 still does a synchronous gRPC Send after sender.close(). Closing done wakes Sync, but it does not interrupt a blocked transport write, so RunServer can wait forever in closeAllClient. Please reuse the bounded-send path here, or make the close notification best-effort and non-blocking.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/syncer/server.go` around lines 612 - 625, The loop in closeAllClient
still performs a blocking gRPC send (sender.stream.Send) after calling
sender.close(), which can hang shutdown; replace the direct synchronous Send
with the existing bounded/non-blocking send path used elsewhere (e.g., the
sender.enqueue/sendCh/trySend helper or the sender.asyncSend method) so the
close notification is best-effort and does not block RunServer; specifically,
stop calling sender.stream.Send directly in closeAllClient and instead push the
close response into the sender's bounded channel or use its non-blocking
try-send helper (or spawn a goroutine with a select+timeout fallback) to ensure
shutdown cannot be blocked by a stuck transport write.
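
One of the options named above, a goroutine with a select and timeout fallback, could look roughly like this. It is a generic sketch: `sendWithTimeout`, `errSendTimeout`, and `demoTimeout` are hypothetical and are not helpers from `pkg/syncer`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errSendTimeout and demoTimeout are hypothetical names for this sketch.
var errSendTimeout = errors.New("send timed out")

const demoTimeout = 50 * time.Millisecond

// sendWithTimeout runs send in its own goroutine and stops waiting after d.
// If send never returns, its goroutine stays blocked, but the caller
// (for example a shutdown path) is free to continue.
func sendWithTimeout(send func() error, d time.Duration) error {
	done := make(chan error, 1) // buffered so the goroutine never blocks on the result
	go func() { done <- send() }()

	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case err := <-done:
		return err
	case <-timer.C:
		return errSendTimeout
	}
}

func main() {
	block := make(chan struct{})
	stuck := func() error { <-block; return nil } // simulates a blocked transport write

	fmt.Println(sendWithTimeout(stuck, demoTimeout))
	fmt.Println(sendWithTimeout(func() error { return nil }, demoTimeout))
	close(block) // let the stuck goroutine finish
}
```

The trade-off is that a stuck send leaks its goroutine until the transport is torn down, which is usually acceptable for a best-effort close notification.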

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e8bc49a1-7d9d-48db-8068-ee60f4d46585

📥 Commits

Reviewing files that changed from the base of the PR and between 457178c and 6b56372.

📒 Files selected for processing (6)

pkg/mcs/router/server/sync.go
pkg/syncer/client.go
pkg/syncer/client_test.go
pkg/syncer/history_buffer.go
pkg/syncer/server.go
pkg/syncer/server_test.go

coderabbitai · 2026-05-20T04:21:06Z

+	for {
+		records := s.history.recordsFrom(catchUpIndex)
+		if len(records) == 0 {
+			if catchUpIndex < s.history.getFirstIndex() {
+				log.Warn("region history buffer overflow during full synchronization, restart full synchronization",
+					zap.String("requested-server", name),
+					zap.String("server", s.server.Name()),
+					zap.Uint64("catch-up-index", catchUpIndex),
+					zap.Uint64("first-index", s.history.getFirstIndex()))
+				return s.syncFullRegions(ctx, name, stream)
+			}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Replace recursive full-sync retries with an outer loop.

Line 442 re-enters syncFullRegions recursively when catch-up overflows. Under sustained churn this path can repeat many times, so recovery starts growing the goroutine stack instead of simply retrying.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/syncer/server.go` around lines 433 - 443, The code currently recurses by calling s.syncFullRegions(ctx, name, stream) when catchUpIndex < s.history.getFirstIndex(), which can cause unbounded goroutine/stack growth under churn; instead refactor syncFullRegions to use an outer retry loop: remove the recursive call and continue the outer for/while loop that surrounds the chunk-fetching logic (the loop that uses s.history.recordsFrom(catchUpIndex)), resetting any per-attempt state (e.g. catchUpIndex and any temp buffers) as needed so the function retries from the top without recursion; ensure symbols mentioned (syncFullRegions, catchUpIndex, s.history.recordsFrom, s.history.getFirstIndex, stream, ctx, name) are used to locate and update the logic.
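
The loop-based retry suggested here can be sketched generically. `syncWithRetry` and `errHistoryOverflow` are hypothetical names, and the `maxAttempts` bound is an addition for illustration that the review comment does not ask for.

```go
package main

import (
	"errors"
	"fmt"
)

// errHistoryOverflow is a hypothetical sentinel returned by one attempt
// when the history buffer overflowed while catching up.
var errHistoryOverflow = errors.New("history overflow during catch-up")

// syncWithRetry runs attempt in a loop instead of recursing. Each iteration
// starts from scratch, so the call stack stays flat however often the
// buffer overflows. It returns the number of the last attempt made.
func syncWithRetry(maxAttempts int, attempt func(n int) error) (int, error) {
	for n := 1; n <= maxAttempts; n++ {
		err := attempt(n)
		if errors.Is(err, errHistoryOverflow) {
			continue // restart the full sync from a fresh snapshot
		}
		return n, err // success or a non-retryable error
	}
	return maxAttempts, fmt.Errorf("gave up after %d attempts: %w", maxAttempts, errHistoryOverflow)
}

func main() {
	// Simulate two overflows followed by a successful attempt.
	n, err := syncWithRetry(5, func(n int) error {
		if n < 3 {
			return errHistoryOverflow
		}
		return nil
	})
	fmt.Println(n, err)
}
```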

coderabbitai · 2026-05-20T04:21:07Z

+	s.mu.Lock()
+	defer s.mu.Unlock()
+	for {
+		records := s.history.recordsFrom(catchUpIndex)
+		if len(records) == 0 {
+			if catchUpIndex < s.history.getFirstIndex() {
+				return nil, errors.Errorf("region history buffer overflow during full sync catch-up, catch-up-index %d, first-index %d", catchUpIndex, s.history.getFirstIndex())
+			}
+			break
+		}
+		if err := s.syncHistoryRecords(catchUpIndex, records, stream); err != nil {
+			return nil, err
+		}
+		catchUpIndex += uint64(len(records))
+	}
+	resp := &pdpb.SyncRegionResponse{
+		Header:     &pdpb.ResponseHeader{ClusterId: keypath.ClusterID()},
+		StartIndex: catchUpIndex,
+	}
+	if err := stream.Send(resp); err != nil {
+		log.Error("failed to send sync region response", errs.ZapError(errs.ErrGRPCSend, err))
+		return nil, err
+	}


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid doing gRPC sends while holding s.mu.

From Line 455, the global stream lock stays held while syncHistoryRecords and the completion stream.Send perform network I/O. If that follower stalls here, broadcast() blocks on RLock() and every other downstream stops receiving updates.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/syncer/server.go` around lines 455 - 477, You are holding s.mu across network I/O (syncHistoryRecords and stream.Send), which blocks broadcast() and other readers; change the loop so you only lock to read/validate history and make a local copy of the records, then unlock before calling syncHistoryRecords or stream.Send. Specifically, in the loop around s.history.recordsFrom(catchUpIndex) only hold s.mu to call recordsFrom, check the overflow condition against s.history.getFirstIndex(), and make a copy of the returned slice; then release s.mu and call s.syncHistoryRecords(catchUpIndex, copiedRecords, stream); after that, re-acquire the lock to advance/inspect shared state as needed and continue. Also move the final stream.Send of resp outside the s.mu lock so no gRPC send occurs while s.mu is held (this will avoid blocking broadcast() and RLock() callers).
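
The copy-under-lock pattern described here can be sketched as follows. The `history` type, `snapshotFrom`, and `catchUp` are hypothetical stand-ins, not the PR's history buffer or server code.

```go
package main

import (
	"fmt"
	"sync"
)

// history is a toy stand-in for a leader-side history buffer.
type history struct {
	mu      sync.RWMutex
	first   uint64 // index of records[0]
	records []string
}

// snapshotFrom copies the records starting at index while holding the lock.
// ok is false when index has already been evicted from the buffer.
func (h *history) snapshotFrom(index uint64) (out []string, ok bool) {
	h.mu.RLock()
	defer h.mu.RUnlock()
	if index < h.first {
		return nil, false
	}
	off := index - h.first
	if off >= uint64(len(h.records)) {
		return nil, true
	}
	return append([]string(nil), h.records[off:]...), true
}

func (h *history) add(r string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.records = append(h.records, r)
}

// catchUp sends batches until none remain. The lock is held only inside
// snapshotFrom, never while send (the network I/O) is running.
func catchUp(h *history, index uint64, send func([]string) error) (uint64, error) {
	for {
		batch, ok := h.snapshotFrom(index)
		if !ok {
			return index, fmt.Errorf("index %d evicted from history", index)
		}
		if len(batch) == 0 {
			return index, nil
		}
		if err := send(batch); err != nil {
			return index, err
		}
		index += uint64(len(batch))
	}
}

func main() {
	h := &history{first: 10, records: []string{"a", "b", "c"}}
	var sent [][]string
	added := false
	next, err := catchUp(h, 10, func(b []string) error {
		sent = append(sent, b)
		if !added { // a writer can take the lock while a send is in flight
			added = true
			h.add("d")
		}
		return nil
	})
	fmt.Println(next, err, sent)
}
```

This sketch does not bind the stream. In a real implementation the final "no more records" check and the binding would presumably still need to happen under a single lock acquisition, so that no record slips in between them.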

ti-chi-bot Bot added labels May 12, 2026:
- do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
- release-note: Denotes a PR that will be considered when it comes time to generate release notes.
- dco-signoff: yes: Indicates the PR's author has signed the dco.

ti-chi-bot Bot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) May 12, 2026

okJiang force-pushed the codex/region-syncer-full-sync branch from 8fb3d34 to 8771077 on May 13, 2026 04:08

okJiang marked this pull request as ready for review May 13, 2026 08:11

ti-chi-bot Bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) May 13, 2026

okJiang force-pushed the codex/region-syncer-full-sync branch from 8771077 to 457178c on May 15, 2026 04:10

coderabbitai Bot reviewed May 15, 2026

syncer: trigger full sync on history gap

6b56372

Signed-off-by: okjiang <[email protected]>

okJiang force-pushed the codex/region-syncer-full-sync branch from 457178c to 6b56372 on May 20, 2026 04:11

coderabbitai Bot reviewed May 20, 2026

okJiang wants to merge 1 commit into tikv:master from okJiang:codex/region-syncer-full-sync
