
syncer: trigger full sync on history gap#10670

Open
okJiang wants to merge 1 commit into tikv:master from okJiang:codex/region-syncer-full-sync

Conversation

okJiang (Member) commented May 12, 2026

What problem does this PR solve?

Issue Number: Close #10668

When a PD follower requests a region sync index that is no longer available in the leader's history buffer, the leader used to return no records and keep the stream alive. The follower could then keep stale region metadata in memory and in local region storage.

What is changed and how does it work?

  • Make the leader fall back to full region synchronization when the follower's requested index is zero, older than the history buffer, or ahead of the leader.
  • Keep the stream unbound until the full snapshot and any catch-up history records are sent, so updates generated during the snapshot are not missed.
  • Make the follower clear local region cache and persisted region storage when a full sync is triggered after it already had a non-zero history index.
  • Persist the reset history index so follower restarts do not reload an old index after a full-sync reset.
  • Apply the same full-sync reset handling to the router region syncer cache path.
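
The fallback conditions listed above can be sketched as a small decision function. This is a minimal illustration, not the PR's code: `chooseSyncMode`, `syncMode`, `firstIndex`, and `nextIndex` are hypothetical names, and treating a request equal to `nextIndex` as an up-to-date follower is an assumption.

```go
package main

import "fmt"

// syncMode is a hypothetical name for the leader's choice of how to serve a follower.
type syncMode int

const (
	syncHistory syncMode = iota // replay records from the history buffer
	syncFull                    // stream a full region snapshot
)

// chooseSyncMode decides how to serve a follower that requests startIndex.
// firstIndex is the oldest index still retained in the history buffer and
// nextIndex is the index the leader will assign to its next record.
// Both parameter names are assumptions made for this sketch.
func chooseSyncMode(startIndex, firstIndex, nextIndex uint64) syncMode {
	switch {
	case startIndex == 0: // follower has no history at all
		return syncFull
	case startIndex < firstIndex: // requested records were already evicted
		return syncFull
	case startIndex > nextIndex: // follower claims to be ahead of the leader
		return syncFull
	default:
		return syncHistory
	}
}

func main() {
	// Assume the leader retains indexes [100, 200).
	for _, start := range []uint64{0, 50, 150, 200, 250} {
		full := chooseSyncMode(start, 100, 200) == syncFull
		fmt.Printf("start=%d full=%v\n", start, full)
	}
}
```

Keeping the three fallback cases in one switch makes it easy to see that any request outside the retained window ends in a full sync instead of an empty, silently stale stream.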

Check List

Tests

  • Unit test

Release note

Fix PD follower region sync recovery when the follower falls behind the leader history buffer.

Summary by CodeRabbit

  • Bug Fixes

    • Improved region synchronization state handling during history→full-sync transitions
    • Ensured stale region cache and storage are reset to maintain consistency when sync mode changes
    • Fixed history catch-up and restart behavior to detect completion and avoid stale streams
  • Refactor

    • Centralized region-sync response processing into a stateful handler for clearer control flow
    • Stream binding now occurs only after appropriate sync phase completion
  • Tests

    • Added tests covering history catch-up, full-sync transitions, cache/storage reset, and server sync scenarios


ti-chi-bot Bot (Contributor) commented May 12, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

ti-chi-bot Bot added the do-not-merge/work-in-progress, release-note, and dco-signoff: yes labels May 12, 2026
ti-chi-bot Bot (Contributor) commented May 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign siddontang for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot Bot added the size/XXL label (a PR that changes 1000+ lines, ignoring generated files) May 12, 2026
coderabbitai Bot commented May 12, 2026

Walkthrough

Leader detects unrecoverable history gaps and falls back to streaming a full region snapshot; followers track history vs full-sync phases, reset cache/storage on transition, persist rebuilt regions, and resume streaming with consistent region state.

Changes

Region syncer history-overflow recovery

| Layer / File(s) | Summary |
|---|---|
| History buffer persistence & accessor (pkg/syncer/history_buffer.go) | resetWithIndex now persists immediately; getFirstIndex() added to read the buffer's first index under lock. |
| Server: history vs full-sync routing & full-sync implementation (pkg/syncer/server.go) | syncHistoryRegion → syncFullRegions routing added; full-sync streams full region batches, catches up history with overflow detection/restart, and binds stream only when appropriate. |
| Client: stateful response handling & cache/storage reset (pkg/syncer/client.go, pkg/syncer/client_test.go) | Introduced regionSyncState, extracted handleRegionSyncResponse, added resetRegionCacheAndStorage, and updated receive loop; tests validate reset and full-sync transitions. |
| Router: sync-mode flags for leader responses (pkg/mcs/router/server/sync.go) | Per-stream syncingHistory and fullSyncing flags gate interpretation of leader responses and trigger router-side cache reset on history→full transition. |
| Tests: server & client validations (pkg/syncer/server_test.go, pkg/syncer/client_test.go) | New server tests for fallback/catch-up scenarios and client tests for cache/storage reset and state transitions; includes testSyncRegionsServer test double. |

sequenceDiagram
  participant Leader as Leader (history buffer)
  participant Server as PD Leader / sync server
  participant Client as PD Follower / RegionSyncer
  participant Storage as Follower region storage & cache

  Leader->>Server: follower StartIndex request
  alt StartIndex within retained history
    Server->>Client: send incremental history batch (StartIndex>0)
    Client->>Storage: apply incremental updates
  else StartIndex outside history or StartIndex==0
    Server->>Client: send full-sync initial batch (StartIndex=0)
    Client->>Client: detect history→full transition (regionSyncState)
    Client->>Storage: reset cache and flush/clear region storage
    Server->>Client: stream full snapshot and remaining regions
    Client->>Storage: persist snapshot and replay catch-up history
    Server->>Client: send completion (StartIndex = catchUpIndex)
    Client->>Client: clear fullSyncing/syncingHistory and enable streaming
  end
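
The follower side of the diagram above can be sketched as a tiny state machine. This is a simplified illustration under stated assumptions, not the PR's `regionSyncState`: the `followerState` and `response` types are hypothetical, a batch with `startIndex == 0` is taken to mean "full-sync batch", and an empty batch with `startIndex > 0` is taken to mean "completion".

```go
package main

import "fmt"

// response is a simplified leader message (hypothetical type for this sketch).
type response struct {
	startIndex uint64
	regions    map[uint64]string // region ID -> metadata
}

// followerState is a simplified stand-in for the follower's sync state.
type followerState struct {
	lastIndex   uint64            // next history index the follower expects
	fullSyncing bool              // true while a full snapshot is being received
	regions     map[uint64]string // in-memory region cache (storage omitted)
	resets      int               // how many times cache/storage was reset
}

func newFollowerState(lastIndex uint64, regions map[uint64]string) *followerState {
	if regions == nil {
		regions = map[uint64]string{}
	}
	return &followerState{lastIndex: lastIndex, regions: regions}
}

// handle applies one leader response and updates the sync phase.
func (f *followerState) handle(resp response) {
	if resp.startIndex == 0 {
		// First full-sync batch after the follower already had history:
		// drop stale regions exactly once before applying the snapshot.
		if !f.fullSyncing && f.lastIndex > 0 {
			f.regions = map[uint64]string{}
			f.lastIndex = 0
			f.resets++
		}
		f.fullSyncing = true
	}
	for id, meta := range resp.regions {
		f.regions[id] = meta
	}
	if resp.startIndex > 0 {
		f.lastIndex = resp.startIndex + uint64(len(resp.regions))
		if len(resp.regions) == 0 {
			f.fullSyncing = false // completion marker: resume normal streaming
		}
	}
}

func main() {
	f := newFollowerState(40, map[uint64]string{1: "stale", 2: "stale"})
	f.handle(response{startIndex: 0, regions: map[uint64]string{1: "fresh"}})
	f.handle(response{startIndex: 0, regions: map[uint64]string{3: "fresh"}})
	f.handle(response{startIndex: 90}) // completion with catch-up index 90
	fmt.Println(len(f.regions), f.resets, f.lastIndex, f.fullSyncing)
}
```

The point of the `fullSyncing` flag in this sketch is that the reset happens once per transition, so later snapshot batches add to the rebuilt cache instead of wiping it again.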

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

  • tikv/pd#10672: Related refactor of client/server sync handling and stateful response processing.
  • tikv/pd#10685: Related changes to RegionSyncer.Sync binding/unbinding and stream termination behavior.

Suggested labels

lgtm, approved

Suggested reviewers

  • rleungx
  • bufferflies

Poem

🐰 I hopped through history, then found a big gap,
A full-sync was summoned — I emptied my map.
Caches cleared, stores flushed, regions rebuilt in a whirl,
Now follower and leader both see the same world. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'syncer: trigger full sync on history gap' clearly summarizes the main change: implementing full sync triggering when a history gap is detected in region synchronization. |
| Description check | ✅ Passed | The PR description follows the template structure with a clear problem statement, implementation details, checklist items, and release notes. All required sections are completed. |
| Linked Issues check | ✅ Passed | The code changes comprehensively address all requirements from #10668: leader detects history gaps and triggers full sync, follower clears cache/storage on full sync reset, history index is persisted, and stream binding is delayed until snapshot is sent. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to implementing full sync on history gaps. No unrelated refactoring, unrelated bug fixes, or scope creep detected in the modified files. |


@okJiang okJiang force-pushed the codex/region-syncer-full-sync branch from 8fb3d34 to 8771077 Compare May 13, 2026 04:08
@okJiang okJiang marked this pull request as ready for review May 13, 2026 08:11
ti-chi-bot Bot removed the do-not-merge/work-in-progress label May 13, 2026
@okJiang okJiang force-pushed the codex/region-syncer-full-sync branch from 8771077 to 457178c Compare May 15, 2026 04:10
coderabbitai Bot left a comment

🧹 Nitpick comments (1)
pkg/syncer/server_test.go (1)

40-45: ⚡ Quick win

Snapshot responses in the test double to avoid pointer aliasing across sends.

Send appends the original pointer; if the caller reuses/mutates the same proto between sends, earlier assertions can read mutated data. Store a cloned message in responses.

Proposed patch
 import (
 	"context"
 	"testing"
 
 	"github.com/stretchr/testify/require"
 	"google.golang.org/grpc"
+	"google.golang.org/protobuf/proto"
@@
 func (s *testSyncRegionsServer) Send(resp *pdpb.SyncRegionResponse) error {
-	s.responses = append(s.responses, resp)
+	if resp == nil {
+		s.responses = append(s.responses, nil)
+		return nil
+	}
+	s.responses = append(s.responses, proto.Clone(resp).(*pdpb.SyncRegionResponse))
 	if s.onSend != nil {
 		s.onSend(resp)
 	}
 	return nil
 }
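
The aliasing hazard behind this nitpick can be reproduced without protobuf. The sketch below is a stand-in, not the PR's test double: `message` and `recorder` are hypothetical types, and a plain struct copy plays the role that `proto.Clone` plays for real protobuf messages (which contain nested pointers and need a deep copy).

```go
package main

import "fmt"

// message is a hypothetical stand-in for a protobuf response.
type message struct {
	StartIndex uint64
}

// recorder mimics a test double that records every message passed to Send.
type recorder struct {
	aliased []*message // stores the caller's pointer as-is
	cloned  []*message // stores an independent copy
}

// Send records the message both ways so the difference can be observed.
func (r *recorder) Send(m *message) {
	r.aliased = append(r.aliased, m)
	c := *m // a struct copy is a full copy here because message has no pointer fields
	r.cloned = append(r.cloned, &c)
}

func main() {
	r := &recorder{}
	m := &message{StartIndex: 1}
	r.Send(m)
	m.StartIndex = 2 // the caller reuses and mutates the same message
	r.Send(m)
	// The aliased snapshot of the first send now reports the mutated value.
	fmt.Println(r.aliased[0].StartIndex, r.cloned[0].StartIndex) // prints "2 1"
}
```

An assertion written against `aliased[0]` would see the second send's value, which is the failure mode the proposed patch avoids by cloning before storing.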
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/syncer/server_test.go` around lines 40 - 45, testSyncRegionsServer.Send
currently appends the incoming *pdpb.SyncRegionResponse pointer directly into
responses which allows later mutations to affect earlier snapshots; instead,
deep-copy (clone) the resp before storing it and invoke onSend with the cloned
copy as well. Update testSyncRegionsServer.Send to create a clone of the
*pdpb.SyncRegionResponse (using the appropriate protobuf clone utility for your
proto package), append that clone to the responses slice, and pass the clone to
s.onSend if set.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 32ad935d-e2a3-46b9-94c9-7e8f0ad8eb7d

📥 Commits

Reviewing files that changed from the base of the PR and between 87710779c704a48e87a472e1031844823c7d3475 and 457178c.

📒 Files selected for processing (6)
  • pkg/mcs/router/server/sync.go
  • pkg/syncer/client.go
  • pkg/syncer/client_test.go
  • pkg/syncer/history_buffer.go
  • pkg/syncer/server.go
  • pkg/syncer/server_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • pkg/syncer/history_buffer.go
  • pkg/mcs/router/server/sync.go
  • pkg/syncer/client_test.go
  • pkg/syncer/server.go
  • pkg/syncer/client.go

@okJiang okJiang force-pushed the codex/region-syncer-full-sync branch from 457178c to 6b56372 Compare May 20, 2026 04:11
ti-chi-bot Bot (Contributor) commented May 20, 2026

@okJiang: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-error-log-review | 6b56372 | link | true | /test pull-error-log-review |
| pull-unit-test-next-gen-1 | 6b56372 | link | true | /test pull-unit-test-next-gen-1 |
| pull-unit-test-next-gen-3 | 6b56372 | link | true | /test pull-unit-test-next-gen-3 |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/syncer/server.go (1)

612-625: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

closeAllClient can still hang shutdown on a blocked send.

Line 623 still does a synchronous gRPC Send after sender.close(). Closing done wakes Sync, but it does not interrupt a blocked transport write, so RunServer can wait forever in closeAllClient. Please reuse the bounded-send path here, or make the close notification best-effort and non-blocking.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/syncer/server.go` around lines 612 - 625, The loop in closeAllClient
still performs a blocking gRPC send (sender.stream.Send) after calling
sender.close(), which can hang shutdown; replace the direct synchronous Send
with the existing bounded/non-blocking send path used elsewhere (e.g., the
sender.enqueue/sendCh/trySend helper or the sender.asyncSend method) so the
close notification is best-effort and does not block RunServer; specifically,
stop calling sender.stream.Send directly in closeAllClient and instead push the
close response into the sender's bounded channel or use its non-blocking
try-send helper (or spawn a goroutine with a select+timeout fallback) to ensure
shutdown cannot be blocked by a stuck transport write.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e8bc49a1-7d9d-48db-8068-ee60f4d46585

📥 Commits

Reviewing files that changed from the base of the PR and between 457178c and 6b56372.

📒 Files selected for processing (6)
  • pkg/mcs/router/server/sync.go
  • pkg/syncer/client.go
  • pkg/syncer/client_test.go
  • pkg/syncer/history_buffer.go
  • pkg/syncer/server.go
  • pkg/syncer/server_test.go

Comment thread pkg/syncer/server.go
Comment on lines +433 to +443
for {
	records := s.history.recordsFrom(catchUpIndex)
	if len(records) == 0 {
		if catchUpIndex < s.history.getFirstIndex() {
			log.Warn("region history buffer overflow during full synchronization, restart full synchronization",
				zap.String("requested-server", name),
				zap.String("server", s.server.Name()),
				zap.Uint64("catch-up-index", catchUpIndex),
				zap.Uint64("first-index", s.history.getFirstIndex()))
			return s.syncFullRegions(ctx, name, stream)
		}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Replace recursive full-sync retries with an outer loop.

Line 442 re-enters syncFullRegions recursively when catch-up overflows. Under sustained churn this path can repeat many times, so recovery starts growing the goroutine stack instead of simply retrying.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/syncer/server.go` around lines 433 - 443, The code currently recurses by
calling s.syncFullRegions(ctx, name, stream) when catchUpIndex <
s.history.getFirstIndex(), which can cause unbounded goroutine/stack growth
under churn; instead refactor syncFullRegions to use an outer retry loop: remove
the recursive call and continue the outer for/while loop that surrounds the
chunk-fetching logic (the loop that uses s.history.recordsFrom(catchUpIndex)),
resetting any per-attempt state (e.g. catchUpIndex and any temp buffers) as
needed so the function retries from the top without recursion; ensure symbols
mentioned (syncFullRegions, catchUpIndex, s.history.recordsFrom,
s.history.getFirstIndex, stream, ctx, name) are used to locate and update the
logic.

Comment thread pkg/syncer/server.go
Comment on lines +455 to +477
s.mu.Lock()
defer s.mu.Unlock()
for {
	records := s.history.recordsFrom(catchUpIndex)
	if len(records) == 0 {
		if catchUpIndex < s.history.getFirstIndex() {
			return nil, errors.Errorf("region history buffer overflow during full sync catch-up, catch-up-index %d, first-index %d", catchUpIndex, s.history.getFirstIndex())
		}
		break
	}
	if err := s.syncHistoryRecords(catchUpIndex, records, stream); err != nil {
		return nil, err
	}
	catchUpIndex += uint64(len(records))
}
resp := &pdpb.SyncRegionResponse{
	Header:     &pdpb.ResponseHeader{ClusterId: keypath.ClusterID()},
	StartIndex: catchUpIndex,
}
if err := stream.Send(resp); err != nil {
	log.Error("failed to send sync region response", errs.ZapError(errs.ErrGRPCSend, err))
	return nil, err
}

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid doing gRPC sends while holding s.mu.

From Line 455, the global stream lock stays held while syncHistoryRecords and the completion stream.Send perform network I/O. If that follower stalls here, broadcast() blocks on RLock() and every other downstream stops receiving updates.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/syncer/server.go` around lines 455 - 477, You are holding s.mu across
network I/O (syncHistoryRecords and stream.Send), which blocks broadcast() and
other readers; change the loop so you only lock to read/validate history and
make a local copy of the records, then unlock before calling syncHistoryRecords
or stream.Send. Specifically, in the loop around
s.history.recordsFrom(catchUpIndex) only hold s.mu to call recordsFrom, check
the overflow condition against s.history.getFirstIndex(), and make a copy of the
returned slice; then release s.mu and call s.syncHistoryRecords(catchUpIndex,
copiedRecords, stream); after that, re-acquire the lock to advance/inspect
shared state as needed and continue. Also move the final stream.Send of resp
outside the s.mu lock so no gRPC send occurs while s.mu is held (this will avoid
blocking broadcast() and RLock() callers).


Labels

  • dco-signoff: yes (indicates the PR's author has signed the dco)
  • release-note (denotes a PR that will be considered when it comes time to generate release notes)
  • size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

region syncer: trigger full region sync when follower index gap exceeds history buffer size

1 participant