server: resync follower region cache reset by okJiang · Pull Request #10689 · tikv/pd

okJiang · 2026-05-19T07:34:01Z

What problem does this PR solve?

Issue Number: Close #10667

Stacked on #10682. Please review the new commits on this branch until #10682 is merged.

What is changed and how does it work?

Allow PD followers to handle the existing region cache reset admin APIs when PD-Allow-Follower-Handle is set.

For follower reset requests, the follower stops its region syncer, removes stale local region cache and region storage entries, resets the sync index to 0, and reconnects to the leader. This forces a full region sync so the follower rebuilds local state from the leader.

The leader-side reset cache behavior is unchanged.

The region syncer now treats startIndex == 0 as full sync, falls back to full sync when requested history is missing, sends history generated during the snapshot phase before binding the stream, retries full sync if the history buffer overflows during catch-up, and marks the follower syncer running only after the full sync completion response.
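
The dispatch rules in the paragraph above can be sketched as one small decision function. This is a minimal illustration, not the code in pkg/syncer/server.go: `syncMode`, `chooseSyncMode`, and the `firstIndex`/`nextIndex` parameters are names assumed for the example, and the leader's history is assumed to hold records for indexes in [firstIndex, nextIndex).

```go
package main

import "fmt"

// syncMode is a hypothetical label for what the leader sends to a follower.
type syncMode string

const (
	fullSync    syncMode = "full"        // send a complete region snapshot
	incremental syncMode = "incremental" // replay history records only
	upToDate    syncMode = "up-to-date"  // nothing to replay
)

// chooseSyncMode assumes the leader keeps history for [firstIndex, nextIndex).
func chooseSyncMode(startIndex, firstIndex, nextIndex uint64) syncMode {
	switch {
	case startIndex == 0:
		// A reset follower reports index 0 and always gets a full sync.
		return fullSync
	case startIndex == nextIndex:
		// The follower already holds everything the leader has recorded.
		return upToDate
	case startIndex < firstIndex || startIndex > nextIndex:
		// The requested history is missing, so fall back to a full sync.
		return fullSync
	default:
		return incremental
	}
}

func main() {
	fmt.Println(chooseSyncMode(0, 10, 20))  // reset follower
	fmt.Println(chooseSyncMode(15, 10, 20)) // history still available
	fmt.Println(chooseSyncMode(5, 10, 20))  // history already evicted
	fmt.Println(chooseSyncMode(20, 10, 20)) // follower is caught up
}
```

Checking `startIndex == 0` first means a freshly reset follower never takes the incremental path, even when the leader's history happens to start at index 0.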

Check List

Tests

Unit test
Integration test

Code changes

Has HTTP APIs changed (Don't forget to add the declarative for the new API)
Has persistent data change

Release note

PD followers can reset stale local region cache through region cache admin APIs and automatically resync regions from the leader.

ti-chi-bot · 2026-05-19T07:34:07Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lhy1024 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-05-19T07:34:16Z

Walkthrough

Adds follower-local HTTP handling for region reads and cache reset: refactors syncer history reset, adjusts syncer full-sync behavior, implements Server.ResetFollowerRegionCache with deletion/storage helpers, wires middleware/router/admin for follower-scoped requests, and adds integration and unit tests.

Changes

Follower Region Cache and Reset Implementation

| Layer | File(s) | Summary |
| --- | --- | --- |
| Syncer History Index Reset | `pkg/syncer/history_buffer.go`, `pkg/syncer/client.go` | Centralizes buffer reset in `resetWithIndexLocked()`. `resetWithIndexAndPersist()` persists while holding the lock. Exposes `RegionSyncer.ResetHistoryIndex()` to reset and persist the next history index. |
| Region Syncer Server History Handling | `pkg/syncer/server.go` | `syncHistoryRegion` early-returns to `syncFullRegions` when `startIndex == 0`; falls back to `syncFullRegions` when history records are missing; `syncFullRegions` short-circuits when `GetRegions()` is empty. |
| Server-Side Region Cache Reset Implementation | `server/server.go` | Adds `ResetFollowerRegionCache(regionIDs ...uint64)` to stop sync, flush storage, delete follower region meta (single/all), reset sync history index to 0, and restart sync. Includes deletion and storage-scan helpers and a mutex to serialize resets. |
| Middleware, Router, and Admin Endpoint Wiring | `server/api/middleware.go`, `server/api/router.go`, `server/api/admin.go` | Extends `clusterMiddleware` to optionally serve follower-synced clusters, adds `withFollowerRegionReset()` middleware option and `regionResetRouter`, and routes admin cache deletion endpoints to call `ResetFollowerRegionCache()` for follower-synced requests. |
| Integration & Unit Tests | `tests/server/api/api_test.go`, `pkg/syncer/server_test.go` | Adds `TestFollowerRegionResetCacheWithNoForward` and helpers to inject stale follower regions, perform follower-scoped admin deletes (including concurrent), wait for convergence, and assert storage state. Adds `TestSyncFallsBackToFullSyncWhenHistoryMissing` for syncer fallback behavior. |

Sequence Diagram

sequenceDiagram
  participant Server
  participant PDLeaderResolver
  participant RegionSyncer
  participant RegionStorage
  participant BasicCluster
  Server->>PDLeaderResolver: Resolve current PD leader client URL
  Server->>RegionSyncer: Stop sync with leader
  Server->>RegionStorage: Flush local region storage
  Server->>BasicCluster: Delete region metadata (one/all)
  Server->>RegionSyncer: ResetHistoryIndex(0)
  Server->>RegionSyncer: Restart sync with leader client URL

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

region syncer: trigger full region sync when follower index gap exceeds history buffer size #10668: Implements history-buffer reset and reset API aligned with leader/follower history-gap handling.
client: only send follower region requests after follower full region sync #10690: Adds server-side follower-region cache reset and syncer history index control relevant to follower readiness.

Possibly related PRs

tikv/pd#10682: Related middleware/router changes for follower-synced request handling.
tikv/pd#10672: Related edits to pkg/syncer/server.go around full-sync handling and sync flow.

Suggested labels

size/L, lgtm, approved

Suggested reviewers

lhy1024
rleungx
bufferflies

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 4.76% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'server: resync follower region cache reset' clearly describes the main change: enabling followers to reset their local region cache and resync from the leader. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue `#10667` objectives: decouples follower-servable region paths from leader-only middleware, makes region cache admin endpoints addressable on followers via the opt-in header, ensures follower-local paths skip redirector and middleware checks, and includes both unit and integration tests validating the implementation. |
| Out of Scope Changes check | ✅ Passed | All code changes are directly related to enabling followers to handle region cache reset via HTTP APIs and resync from the leader, staying within the scope of issue `#10667` objectives. |
| Description check | ✅ Passed | The PR description comprehensively addresses the problem statement, implementation details, and testing; it follows the required template structure with all essential sections completed. |

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/syncer/server.go (1)

234-251: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing-history path should force full sync, not silently continue.

When startIndex != 0 and no history is available, returning nil binds the stream without backfilling. After leader restart/history loss, followers can remain permanently incomplete for unchanged regions.

🔧 Suggested fix

 	if len(records) == 0 {
 		if s.history.getNextIndex() == startIndex {
 			...
 			return stream.Send(resp)
 		}
 		log.Warn("no history regions from index, the leader may be restarted", zap.Uint64("index", startIndex))
-		return nil
+		return s.syncFullRegions(ctx, name, stream)
 	}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/syncer/server.go` around lines 234 - 251, When len(records)==0 and
s.history.getNextIndex() != startIndex (i.e. history is missing after a restart)
do not return nil; instead force a full sync by replying to the follower with a
SyncRegionResponse that indicates StartIndex = 0 (so the follower knows to
request a full backfill). Update the branch that currently logs "no history
regions from index..." to construct and send a pdpb.SyncRegionResponse (same
shape as the other response) with StartIndex set to 0 via stream.Send(...)
instead of returning nil; use the same symbols in this file (records,
startIndex, s.history.getNextIndex(), stream.Send, pdpb.SyncRegionResponse) to
locate and change the behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/server.go`:
- Around line 944-952: The code currently calls leader.GetClientUrls() and then
uses leaderURLs with syncer.StartSyncWithLeader, but StartSyncWithLeader expects
gRPC listen URLs; replace the call to leader.GetClientUrls() with
leader.GetListenUrls() (update the leaderURLs variable accordingly) and change
the error message to reflect "no listen url" -> e.g., "pd leader has no listen
url from GetListenUrls"; keep the existing
StopSyncWithLeader()/StartSyncWithLeader(leaderURLs[0]) usage so the syncer is
restarted with the proper listen URL.


📥 Commits

Reviewing files that changed from the base of the PR and between 94df6da and 33328e2.

📒 Files selected for processing (8)

pkg/syncer/client.go
pkg/syncer/history_buffer.go
pkg/syncer/server.go
server/api/admin.go
server/api/middleware.go
server/api/router.go
server/server.go
tests/server/api/api_test.go

codecov · 2026-05-19T07:45:24Z

Codecov Report

❌ Patch coverage is 79.55556% with 46 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.12%. Comparing base (94df6da) to head (b47a9eb).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10689      +/-   ##
==========================================
+ Coverage   79.03%   79.12%   +0.09%     
==========================================
  Files         536      536              
  Lines       73103    73410     +307     
==========================================
+ Hits        57777    58089     +312     
+ Misses      11234    11225       -9     
- Partials     4092     4096       +4

Flag	Coverage Δ
unittests	`79.12% <79.55%> (+0.09%)`	⬆️

rleungx · 2026-05-20T01:50:30Z

+		return errors.New("pd leader has no listen url")
+	}
+
+	syncer := s.cluster.GetRegionSyncer()


This reset path needs to be serialized. Two concurrent follower reset requests can both call StopSyncWithLeader and then defer StartSyncWithLeader; StartSyncWithLeader overwrites clientCancel each time, so one of the newly started syncer goroutines can become unreachable by future stops and later resets or leader changes may hang in wg.Wait(). Please guard the whole stop/delete/reset/start sequence with a mutex, or provide an atomic RegionSyncer restart method, and add a concurrent DELETE test.

Fixed in the latest push. ResetFollowerRegionCache now serializes the stop/delete/reset/start sequence with a mutex, and TestFollowerRegionResetCacheWithNoForward covers concurrent DELETE requests for /pd/api/v1/admin/cache/regions.
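
The serialization requested in this thread can be sketched as follows. This is a simplified model under stated assumptions, not the code in server/server.go: `fakeSyncer`, `follower`, `resetRegionCache`, and `runConcurrentResets` are hypothetical names, and the real syncer, storage, and leader lookup are replaced by in-memory fields.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeSyncer is a hypothetical stand-in for the follower's region syncer client.
type fakeSyncer struct {
	running bool
	starts  int
}

func (s *fakeSyncer) stop()  { s.running = false }
func (s *fakeSyncer) start() { s.running = true; s.starts++ }

// follower models only the state touched by a reset; all names are illustrative.
type follower struct {
	resetMu   sync.Mutex // serializes the whole stop/delete/reset/start sequence
	syncer    *fakeSyncer
	regions   map[uint64]struct{}
	nextIndex uint64
}

// resetRegionCache removes the given regions (or all regions when none are
// given), resets the sync index to 0, and restarts the syncer.
func (f *follower) resetRegionCache(regionIDs ...uint64) {
	f.resetMu.Lock()
	defer f.resetMu.Unlock()

	f.syncer.stop()
	defer f.syncer.start() // deferred last, so it runs before the mutex is released

	if len(regionIDs) == 0 {
		f.regions = map[uint64]struct{}{}
	}
	for _, id := range regionIDs {
		delete(f.regions, id)
	}
	f.nextIndex = 0
}

// runConcurrentResets fires n full resets at once and returns the final state.
func runConcurrentResets(n int) *follower {
	f := &follower{
		syncer:    &fakeSyncer{running: true},
		regions:   map[uint64]struct{}{1: {}, 2: {}, 3: {}},
		nextIndex: 42,
	}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			f.resetRegionCache()
		}()
	}
	wg.Wait()
	return f
}

func main() {
	f := runConcurrentResets(8)
	fmt.Println(f.syncer.running, f.syncer.starts, len(f.regions), f.nextIndex)
}
```

Because the mutex covers both the stop and the deferred start, each reset sees a stopped syncer for its whole critical section, and no start call can overwrite state belonging to a reset that is still in progress.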

rleungx · 2026-05-20T01:50:39Z

-			return s.syncFullRegions(ctx, name, stream)
-		}
 		log.Warn("no history regions from index, the leader may be restarted", zap.Uint64("index", startIndex))
 		return nil


This branch should trigger a full sync instead of binding the stream and continuing. If the leader has lost the requested history range after restart, the follower can remain permanently incomplete for regions that do not change again, while later keepalives/broadcasts can still make the follower look synced to the local-read API. Please call syncFullRegions here, or otherwise send an explicit full-sync signal, so the follower rebuilds its cache.

Fixed in the latest push. When the requested history range is missing and the follower is not already at the leader next index, the leader now falls back to full region sync. Added TestSyncFallsBackToFullSyncWhenHistoryMissing for this case.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/server.go`:
- Around line 957-968: The reset logic currently drives follower-storage cleanup
from in-memory BasicCluster membership (s.basicCluster.GetRegions()), which
leaves storage-only stale regions undeleted; change both paths so deletion is
driven by on-disk storage: for the regionIDs loop continue to call
deleteFollowerRegion(regionID) (which must perform storage lookups) and for the
"all regions" path replace the GetRegions() iteration with a storage-backed
enumeration that lists all stored follower regions and calls
deleteFollowerRegionStorage/deleteFollowerRegion for each entry, then call
s.basicCluster.ResetRegionCache(); also update the duplicate block (the similar
code around the 978-991 area) the same way so both full-reset and
selective-reset use storage enumeration/lookup instead of BasicCluster
membership, ensuring StartSyncWithLeader cannot reload stale disk-only regions.

📥 Commits

Reviewing files that changed from the base of the PR and between 33328e2 and 8036fae.

okJiang · 2026-05-20T07:48:40Z

+		}
+		return stream.Send(resp)
+	}
 	lastIndex := 0


When a follower begins a region cache reset and the server calls Sync4Regions, LastIndex becomes inconsistent with the actual index in the server's history buffer.

How should we handle this?

This is the broader recovery problem covered by #10670. To make LastIndex fully consistent after a full snapshot, the server needs to catch up history generated during full sync and then send a final empty response with the catch-up index; the client also needs to keep streamingRunning false while it is still in full-sync mode. I would keep that full index-recovery flow in #10670 rather than expanding this reset-cache PR further.

If we don't fix this issue, won't there be a problem once this PR is merged?

Even after the region cache is reset, subsequent region synchronization will still have issues. Isn't this effectively a bug?

Fixed in 2a3f13d. The leader now records the history index before full sync, sends any region history generated while the snapshot is being sent, and only completes the full sync after the catch-up records are sent and the downstream stream is bound.

Verified with:

git diff --check

make gotest GOTEST_ARGS="./pkg/syncer -count=1"

What happens if the local History Buffer overflows during a full data synchronization? Specifically, what should be done if the number of changes during this period exceeds the length of the local buffer?

Covered in b47a9eb.

If the leader loses the catch-up history that was recorded at the start of a full sync, it must not bind the follower stream with a partial cache. The current path detects catchUpIndex < history.firstIndex() and restarts full sync from a fresh snapshot, so the follower only marks local reads as running after it receives the final completion response from a complete sync attempt.

I added TestFullSyncRestartsWhenHistoryBufferOverflowsDuringCatchUp with a one-entry history buffer. The test forces history overflow while the first full-sync response is blocked, then verifies that the leader sends a restarted full snapshot and only then sends the completion response.
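
The overflow check described here can be illustrated with a small bounded buffer. This is a sketch under assumptions, not the history buffer in pkg/syncer: the `history` type and its `record`, `firstIndex`, and `catchUp` methods are hypothetical, and records are reduced to region IDs.

```go
package main

import "fmt"

// history is a hypothetical bounded buffer that keeps only the newest records.
type history struct {
	capacity  int
	nextIndex uint64   // index the next record will receive
	records   []uint64 // region IDs, oldest first
}

// record appends one change and drops the oldest entry on overflow.
func (h *history) record(regionID uint64) {
	h.records = append(h.records, regionID)
	if len(h.records) > h.capacity {
		h.records = h.records[1:]
	}
	h.nextIndex++
}

// firstIndex is the oldest index still held in the buffer.
func (h *history) firstIndex() uint64 {
	return h.nextIndex - uint64(len(h.records))
}

// catchUp returns the records written since catchUpIndex, or ok=false when
// some of them were already evicted and the full sync has to start over.
// It assumes catchUpIndex was read from nextIndex earlier, so it never exceeds it.
func (h *history) catchUp(catchUpIndex uint64) (records []uint64, ok bool) {
	if catchUpIndex < h.firstIndex() {
		return nil, false
	}
	return h.records[catchUpIndex-h.firstIndex():], true
}

func main() {
	h := &history{capacity: 1, nextIndex: 5}
	catchUpIndex := h.nextIndex // remembered before the snapshot is sent

	h.record(100) // one change during the snapshot still fits
	recs, ok := h.catchUp(catchUpIndex)
	fmt.Println(recs, ok)

	h.record(101) // a second change evicts the first one
	recs, ok = h.catchUp(catchUpIndex)
	fmt.Println(recs, ok) // ok is false: restart from a fresh snapshot
}
```

With a one-entry buffer, the second change pushes `firstIndex()` past the remembered index, which is the condition `catchUpIndex < history.firstIndex()` named in the comment above.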

Verified with:

git diff --check

make gotest GOTEST_ARGS='./pkg/syncer -run "TestSyncFallsBackToFullSyncWhenHistoryMissing|TestFullSyncRestartsWhenHistoryBufferOverflowsDuringCatchUp|TestClientWaitsForFullSyncCompletionBeforeRunning" -count=1'

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/syncer/server_test.go`:
- Around line 362-382: The test creates a cancellable context with ctx, cancel
:= context.WithCancel(...) but only calls cancel() on the success path, risking
a leaked goroutine running syncer.Sync; fix by immediately deferring cancel()
right after context creation (i.e., add defer cancel() immediately after the
context.WithCancel call) so the sync goroutine started with syncer.Sync will
always be signaled to stop on test exit or failure.

📥 Commits

Reviewing files that changed from the base of the PR and between 1a8a89c and ead9946.

📒 Files selected for processing (9)

pkg/syncer/client.go
pkg/syncer/history_buffer.go
pkg/syncer/server.go
pkg/syncer/server_test.go
server/api/admin.go
server/api/middleware.go
server/api/router.go
server/server.go
tests/server/api/api_test.go

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

server/api/middleware.go (1)

132-153: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't gate follower reset requests on RegionSyncer.IsRunning().

The new DELETE endpoints are the recovery path when follower sync is stale, reconnecting, or already stopped. Requiring IsRunning() here blocks them during exactly that window and can turn concurrent/retried DELETEs into 500s before they ever reach followerRegionResetMu.

🔧 Narrow fix

 	rc := m.s.DirectlyGetRaftCluster()
-	if rc == nil || !rc.GetRegionSyncer().IsRunning() {
+	if rc == nil {
 		return nil
 	}
+	if r.Method == http.MethodGet && !rc.GetRegionSyncer().IsRunning() {
+		return nil
+	}
 	return rc
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/api/middleware.go` around lines 132 - 153, The
getFollowerSyncedCluster currently returns nil when
rc.GetRegionSyncer().IsRunning() is false, which incorrectly blocks DELETE
recovery endpoints; change the logic so that for DELETE (when
m.allowFollowerRegionReset is used) you do NOT require RegionSyncer.IsRunning()
— still return nil if rc == nil, but skip the IsRunning() check for the DELETE
branch (only enforce IsRunning() for the GET/allowFollowerSyncedRegion path).
Update getFollowerSyncedCluster to branch on r.Method after retrieving rc: if rc
== nil return nil; if r.Method == http.MethodGet require
rc.GetRegionSyncer().IsRunning(); otherwise (DELETE) return rc so the follower
reset can reach followerRegionResetMu.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/api/admin.go`:
- Around line 72-79: The Swagger response annotations for the follower reset
path need updating to reflect the new behavior in isFollowerSyncedClusterRequest
handling: change the success response description to the follower-specific
message ("The region is removed from follower cache and the follower starts to
resync regions from leader.") and add a 500 response for failures from
ResetFollowerRegionCache (returning the error). Update the corresponding
annotations for the other handler referenced (lines around the
ResetLeaderRegionCache/ResetFollowerRegionCache pairing) so both follower and
leader cases have correct success/500 docs, then regenerate the OpenAPI spec
with make swagger-spec (SWAGGER=1).

In `@server/server.go`:
- Around line 942-956: The reset path races with leaderLoop because this method
only locks followerRegionResetMu while leaderLoop can call
StopSyncWithLeader/StartSyncWithLeader concurrently; either (A) protect
leaderLoop's restart calls with the same followerRegionResetMu (add Lock/Unlock
around leaderLoop's Stop/Start sequence) or (B) implement an atomic syncer API
on the RegionSyncer (e.g., ResetSyncWithLeader([]string) or
RestartWithLeader(url string) that internally does StopSyncWithLeader and
StartSyncWithLeader under its own mutex) and replace the StopSyncWithLeader +
deferred StartSyncWithLeader sequence here (and calls from leaderLoop) with that
single atomic call to avoid overwritten clientCancel/orphaned goroutine and
stale leader URL issues.
- Around line 959-980: The flush is happening before deletions so restarted
syncer (StartSyncWithLeader) may reload the old entries; move the
s.storage.Flush() call to after the deletion branch (after
deleteFollowerRegionStorage / deleteFollowerRegion loop) and before
syncer.ResetHistoryIndex(0), preserving the same error handling/wrapping logic
(update resetErr if Flush returns an error) so storage is persisted only once
deletions are applied.

📥 Commits

Reviewing files that changed from the base of the PR and between ead9946 and 825d496.

okJiang · 2026-05-20T08:43:30Z

Addressed the CodeRabbit outside-diff comment in #10689 (review).

DELETE follower reset requests no longer require RegionSyncer.IsRunning(), while GET follower reads still do. I also added coverage in TestFollowerRegionResetCacheWithNoForward for reset while the follower syncer is stopped.

Verified with go test ./tests/server/api -tags without_dashboard -run TestFollowerRegionResetCacheWithNoForward -count=1 and git diff --check.

okJiang · 2026-05-20T08:50:54Z

/test pull-unit-test-next-gen-2

okJiang · 2026-05-20T09:02:16Z

/test pull-unit-test-next-gen-2
/test pull-unit-test-next-gen-3

okJiang · 2026-05-20T09:13:35Z

/test pull-unit-test-next-gen-2
/test pull-unit-test-next-gen-3

okJiang · 2026-05-20T09:24:36Z

/test pull-unit-test-next-gen-2
/test pull-unit-test-next-gen-3

okJiang · 2026-05-20T09:34:51Z

/test pull-unit-test-next-gen-2
/test pull-unit-test-next-gen-3

rleungx · 2026-05-20T10:03:39Z

-		log.Warn("no history regions from index, the leader may be restarted", zap.Uint64("index", startIndex))
-		return nil
+		log.Warn("no history regions from index, fall back to full sync", zap.Uint64("index", startIndex))
+		return s.syncFullRegions(ctx, name, stream)


When this falls back to full sync, the follower can mark the syncer as running after the first response batch. For large clusters, follower reads may see a partial region cache before full sync finishes. Can we add an explicit completion signal or keep follower local reads disabled until the full sync is done?

Fixed in b3c130e. Full sync now sends an empty completion response after all full-sync batches, and the follower client keeps local follower reads disabled until that completion response is received.
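
The completion handshake can be sketched from the follower's side. This is a simplified model, not the client in pkg/syncer/client.go: `syncResponse`, `followerClient`, and `handle` are hypothetical names, and the sketch assumes that an empty response received during a full sync is the completion signal.

```go
package main

import "fmt"

// syncResponse is a hypothetical, trimmed-down view of one streamed response.
type syncResponse struct {
	regionIDs []uint64
}

// followerClient tracks whether local follower reads may be served.
type followerClient struct {
	inFullSync bool // true while a full sync is still being received
	running    bool // local reads are allowed only while this is true
	regions    map[uint64]struct{}
}

func newFollowerClient(fullSync bool) *followerClient {
	return &followerClient{inFullSync: fullSync, regions: map[uint64]struct{}{}}
}

// handle applies one response and reports whether local reads are now allowed.
func (c *followerClient) handle(resp syncResponse) bool {
	for _, id := range resp.regionIDs {
		c.regions[id] = struct{}{}
	}
	switch {
	case !c.inFullSync:
		// Incremental sync: the cache was already complete before this batch.
		c.running = true
	case len(resp.regionIDs) == 0:
		// Empty response during a full sync: the snapshot is complete.
		c.inFullSync = false
		c.running = true
	}
	return c.running
}

func main() {
	c := newFollowerClient(true)
	fmt.Println(c.handle(syncResponse{regionIDs: []uint64{1, 2}})) // first batch
	fmt.Println(c.handle(syncResponse{regionIDs: []uint64{3}}))    // second batch
	fmt.Println(c.handle(syncResponse{}))                          // completion
}
```

Keeping `running` false until the completion response arrives is what prevents follower reads from seeing a partially rebuilt cache on large clusters.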

Coverage added in TestClientWaitsForFullSyncCompletionBeforeRunning and the existing fallback full sync test. Verified with:

make gotest GOTEST_ARGS="./pkg/syncer -run 'TestSyncFallsBackToFullSyncWhenHistoryMissing|TestClientWaitsForFullSyncCompletionBeforeRunning' -count=1"

make gotest GOTEST_ARGS="./tests/server/api -tags without_dashboard -run TestFollowerRegionResetCacheWithNoForward -count=1"

rleungx · 2026-05-20T10:03:46Z

+		if !m.allowFollowerSyncedRegion {
+			return nil
+		}
+	case http.MethodDelete:


This reuses PD-Allow-Follower-Handle for a DELETE operation, but the existing header is read-oriented. Existing clients or proxies that already attach it could unexpectedly reset follower region cache. Could we use a dedicated opt-in signal or add compatibility coverage for this behavior?

Fixed in b3c130e. The follower reset path now requires an explicit PD-Follower-Region-Reset header in addition to PD-Allow-Follower-Handle, so existing read-oriented callers/proxies that only carry the old header will not trigger a follower reset.
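
The two-header opt-in can be sketched with the standard net/http package. This is an illustration, not the middleware in server/api/middleware.go: `followerMayHandle` and `newRequest` are hypothetical helpers, the `pd.example` host is a placeholder, and treating any non-empty header value as "set" is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"net/http"
)

const (
	allowFollowerHandleHeader = "PD-Allow-Follower-Handle"
	followerRegionResetHeader = "PD-Follower-Region-Reset"
)

// followerMayHandle reports whether a follower should serve r locally.
// Treating any non-empty header value as "set" is an assumption of this sketch.
func followerMayHandle(r *http.Request) bool {
	if r.Header.Get(allowFollowerHandleHeader) == "" {
		return false
	}
	switch r.Method {
	case http.MethodGet:
		// Read-oriented requests need only the existing header.
		return true
	case http.MethodDelete:
		// Resets need the dedicated opt-in header as well.
		return r.Header.Get(followerRegionResetHeader) != ""
	default:
		return false
	}
}

// newRequest builds a request to a placeholder host with each header set to "true".
func newRequest(method string, headers ...string) *http.Request {
	r, err := http.NewRequest(method, "http://pd.example/pd/api/v1/admin/cache/regions", nil)
	if err != nil {
		panic(err)
	}
	for _, h := range headers {
		r.Header.Set(h, "true")
	}
	return r
}

func main() {
	oldClient := newRequest(http.MethodDelete, allowFollowerHandleHeader)
	optedIn := newRequest(http.MethodDelete, allowFollowerHandleHeader, followerRegionResetHeader)
	fmt.Println(followerMayHandle(oldClient), followerMayHandle(optedIn))
}
```

A caller or proxy that attaches only the read-oriented header is rejected for DELETE in this sketch, which is the compatibility concern raised in the review comment.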

Coverage was added in the existing follower reset test. Verified with:

make gotest GOTEST_ARGS="./tests/server/api -tags without_dashboard -run TestFollowerRegionResetCacheWithNoForward -count=1"

rleungx · 2026-05-20T10:03:54Z

+		resetErr = errors.Wrap(err, "flush follower region storage")
+	}
+	if len(regionIDs) == 0 {
+		regions, err := s.loadFollowerRegionStorage()


This loads all local regions into memory before deleting them. On large clusters, reset may use a lot of memory or take too long. Can we delete page by page while scanning LoadRange instead?

Fixed in b3c130e. The all-region follower reset now scans local region storage page by page with LoadRange and deletes each loaded page immediately, instead of loading all local regions into memory first.
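
The page-by-page scan can be sketched as a loop that deletes each page before loading the next. This is a simplified model, not the storage code in the PR: `regionStore`, `loadRange`, and `deleteAllPaged` are hypothetical, the store is an in-memory map, and this `loadRange` signature is assumed for the example rather than taken from PD's storage interface.

```go
package main

import (
	"fmt"
	"sort"
)

// regionStore is a hypothetical in-memory stand-in for local region storage.
type regionStore struct {
	regions map[uint64]struct{}
}

// loadRange returns up to limit region IDs >= start in ascending order.
// A real store would seek by key instead of sorting everything in memory.
func (s *regionStore) loadRange(start uint64, limit int) []uint64 {
	ids := make([]uint64, 0, len(s.regions))
	for id := range s.regions {
		if id >= start {
			ids = append(ids, id)
		}
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	if len(ids) > limit {
		ids = ids[:limit]
	}
	return ids
}

// deleteAllPaged scans the store page by page and deletes each page before
// loading the next, so the caller holds at most pageSize IDs at a time.
// pageSize must be positive. It returns the number of pages processed.
func deleteAllPaged(s *regionStore, pageSize int) int {
	pages := 0
	var start uint64
	for {
		page := s.loadRange(start, pageSize)
		if len(page) == 0 {
			return pages
		}
		for _, id := range page {
			delete(s.regions, id)
		}
		pages++
		start = page[len(page)-1] + 1 // resume after the last deleted ID
	}
}

func main() {
	s := &regionStore{regions: map[uint64]struct{}{}}
	for id := uint64(1); id <= 10; id++ {
		s.regions[id] = struct{}{}
	}
	pages := deleteAllPaged(s, 4)
	fmt.Println(pages, len(s.regions))
}
```

The trade-off is more round trips to storage in exchange for a memory footprint bounded by the page size rather than by the number of regions.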

Verified with:

make gotest GOTEST_ARGS="./tests/server/api -tags without_dashboard -run TestFollowerRegionResetCacheWithNoForward -count=1"

okJiang · 2026-05-20T15:04:19Z

Checked the latest pull-unit-test-next-gen-3 failure from #10689 (comment). It failed in tests/integrations/client TestTSOFollowerProxy: the test timed out waiting for a "server not started" error, and then the failpoint callback was blocked on an unbuffered channel during cleanup. This is outside the follower region reset path changed by this PR, and the PR diff does not touch tests/integrations/client/client_test.go or the delayStartServer failpoint path.

Rerunning the failed required job.

/test pull-unit-test-next-gen-3

okJiang · 2026-05-20T15:14:15Z

Checked the latest pull-unit-test-next-gen-3 failure from #10689 (comment). It failed in client/resource_group/controller TestQPS, with 1310 greater than the expected upper bound 1000 in the concurrency=1000,reserveN=10,limit=400000 case. This is outside the follower region reset path changed by this PR, and the PR diff does not touch client/resource_group/controller.

Rerunning the failed required job.

/test pull-unit-test-next-gen-3

okJiang · 2026-05-20T15:39:46Z

Checked the latest pull-unit-test-next-gen-3 failure from #10689 (comment). It failed in tests/integrations/tso TestLegacyTSOClientSuite/TestTSONotLeaderWhenRebaseErr: ResignLeaderWithRetry returned [PD:etcd:ErrEtcdMoveLeader] etcdserver: request timed out, leader transfer took too long. The PR diff does not touch tests/integrations/tso/client_test.go or the TSO client/rebase path; the only leader-loop change here adds an uncontended lock around region syncer start/stop, and this test does not exercise follower region reset.

Rerunning the failed required job.

/test pull-unit-test-next-gen-3

okJiang · 2026-05-20T16:13:46Z

Checked the latest pull-unit-test-next-gen-3 failure from #10689 (comment). It failed in tests/integrations/mcs/resourcemanager TestResourceManagerClientTestSuite, specifically standalone-resource-manager-with-client-discovery/TestResourceGroupRUConsumption. This is outside the follower region reset path changed by this PR, and the PR diff does not touch tests/integrations/mcs/resourcemanager or the resource manager client/controller code.

Rerunning the failed required job.

/test pull-unit-test-next-gen-3

okJiang · 2026-05-20T16:37:06Z

Checked the latest pull-unit-test-next-gen-3 failure from #10689 (comment). It failed in tests/integrations/client TestHTTPClientTestSuite/TestRedirectWithMetrics with a data race between client/metrics.InitAndRegisterMetrics() and client/servicediscovery.getClusterInfo(). This is outside the follower region reset path changed by this PR, and the PR diff does not touch tests/integrations/client, client/metrics, or client/servicediscovery.

Rerunning the failed required job.

/test pull-unit-test-next-gen-3

okJiang · 2026-05-20T16:54:44Z

Checked the latest pull-unit-test-next-gen-3 failure from #10689 (comment). It failed in tests/integrations/client TestClientLeaderChange: the test timed out after 5 minutes while the embedded etcd member repeatedly failed to publish member attributes with etcdserver: request timed out. This is outside the follower region reset path changed by this PR, and the PR diff does not touch tests/integrations/client or the client leader-change test path.

Also checked the outside-diff CodeRabbit comment in #10689 (review). The current code already allows DELETE follower reset requests through without requiring RegionSyncer.IsRunning(); the IsRunning() check is only applied to GET follower local-read requests, so no code change is needed for that comment.
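The gate described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `followerMayHandle` and its `syncerRunning` parameter are hypothetical names, and rejecting every other method is an assumption made only to keep the example closed.

```go
package main

import (
	"fmt"
	"net/http"
)

// followerMayHandle reports whether a follower serves the request locally.
// Hypothetical helper for illustration; the real check in PD may differ.
func followerMayHandle(method string, syncerRunning bool) bool {
	switch method {
	case http.MethodDelete:
		// Reset requests are let through even when the syncer is not
		// running, so a stalled follower can still be reset.
		return true
	case http.MethodGet:
		// Local reads are only served while the region syncer is running.
		return syncerRunning
	default:
		// Assumption: other methods are not handled by followers here.
		return false
	}
}

func main() {
	fmt.Println(followerMayHandle(http.MethodDelete, false)) // true
	fmt.Println(followerMayHandle(http.MethodGet, false))    // false
	fmt.Println(followerMayHandle(http.MethodGet, true))     // true
}
```

With this shape, a DELETE reaches the reset path regardless of syncer state, while a GET is refused locally until the syncer is running again.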

Rerunning the failed required job.

/test pull-unit-test-next-gen-3

okJiang · 2026-05-20T17:07:48Z

Checked the latest pull-unit-test-next-gen-3 failure from #10689 (comment). It failed in tests/integrations/mcs/keyspace TestKeyspaceGroupTestSuite/TestUpdateMemberWhenRecovery because the wait condition in tso_keyspace_group_test.go:745 was never satisfied, with repeated ErrKeyspaceGroupModRevisionStale / TSO service discovery errors.

This is outside the follower region reset path changed by this PR, and the PR diff does not touch tests/integrations/mcs/keyspace, TSO keyspace group manager, or TSO service discovery code.

Rerunning the failed required job.

/test pull-unit-test-next-gen-3

Signed-off-by: okjiang <819421878@qq.com>

okJiang · 2026-05-21T02:58:53Z

Checked the latest pull-error-log-review failure from #10689 (comment). It failed because 2a3f13d added a new log.Error("failed to send sync region response", ...), which requires log approver approval.

Fixed in b778b7a by lowering that new full-sync send-failure log to log.Warn, matching the existing non-critical send-failure handling in this path.

Verified with:

git diff --check
make gotest GOTEST_ARGS="./pkg/syncer -count=1"

okJiang · 2026-05-21T03:13:24Z

Checked the latest pull-unit-test-next-gen-3 failure from #10689 (comment). It failed in pkg/mcs/tso/server TestTSOKeyspaceGroupManagerSuite/TestWatchFailed with repeated etcd context-canceled errors during test shutdown.

Also checked the current GitHub Actions failure in chunks (9, Microservice Integration(!TSO)): it failed in tests/integrations/mcs/scheduling TestMeta/TestStoreWatch because the wait condition at tests/integrations/mcs/scheduling/meta_test.go:106 was never satisfied. The report-coverage failure is caused by the missing covprofile_9 artifact from that failed chunk.

Both failures are outside the follower region reset / syncer path changed by this PR; the PR diff does not touch pkg/mcs/tso/server, tests/integrations/mcs/scheduling, or the scheduling metadata watch path.

Rerunning the failed required Prow job and the failed GitHub Actions jobs.

/test pull-unit-test-next-gen-3

rleungx · 2026-05-21T03:13:35Z

+		resetErr = errors.Wrap(err, "flush follower region storage")
+	}
+	if len(regionIDs) == 0 {
+		if err := s.deleteFollowerRegionStorage(); resetErr == nil && err != nil {


Do we need to delete it? If we have 10M regions, how long will it take?

Fixed in d095867.

The all-region reset still needs to delete local region storage so storage-only stale regions cannot be reloaded later. To reduce the 10M-region cost, it now scans region keys and deletes by parsed region ID, without loading and decoding every region value before deletion.

Verified with:

git diff --check

make gotest GOTEST_ARGS='./server -run "TestDeleteFollowerRegionStorage|TestDeleteFollowerRegionStorageReturnsStorageErrors|TestDeleteFollowerRegion|TestResetFollowerRegionCacheRequiresRegionStorage|TestParseRegionIDFromStorageKey" -count=1'

@okJiang can we test it locally and report some test results here?

Local scale test result with the in-memory storage backend:

100,000 regions: old path delete cost 74.67 ms, new path delete cost 41.45 ms, about 1.80x faster.

1,000,000 regions: old path delete cost 811.82 ms, new path delete cost 466.39 ms, about 1.74x faster.

Command used with a temporary local-only test, not committed:

PD_FOLLOWER_REGION_DELETE_SCALE=100000 go test ./server -run '^TestFollowerRegionStorageDeleteScaleLocalOnly$' -count=1 -v

PD_FOLLOWER_REGION_DELETE_SCALE=1000000 go test ./server -run '^TestFollowerRegionStorageDeleteScaleLocalOnly$' -count=1 -v

This is not a real disk/etcd latency benchmark. It only verifies the changed path: the reset is still O(N), but it no longer does one LoadRegion read and protobuf unmarshal per region before deletion. For 10M regions it is still a linear admin recovery operation, but the avoidable per-region value load is removed.
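As a rough illustration of deleting by parsed region ID, the sketch below recovers an ID from a storage key without reading the stored value. The `raft/r/` prefix with a zero-padded decimal ID and the `parseRegionID` helper are assumptions for this example; the actual key layout and parser in server/server.go may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// regionKeyPrefix is an assumed key layout, used for illustration only.
const regionKeyPrefix = "raft/r/"

// parseRegionID extracts the region ID from a storage key.
// It never loads or unmarshals the region value behind the key.
func parseRegionID(key string) (uint64, error) {
	if !strings.HasPrefix(key, regionKeyPrefix) {
		return 0, fmt.Errorf("not a region key: %q", key)
	}
	suffix := strings.TrimPrefix(key, regionKeyPrefix)
	// Base 10 is explicit, so leading zeros are plain padding.
	return strconv.ParseUint(suffix, 10, 64)
}

func main() {
	id, err := parseRegionID("raft/r/00000000000000000042")
	fmt.Println(id, err) // 42 <nil>

	_, err = parseRegionID("raft/s/00000000000000000001")
	fmt.Println(err != nil) // true
}
```

The cost per key is a prefix check and an integer parse, which is the per-region work that replaces a value load plus protobuf unmarshal in the description above.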

okJiang · 2026-05-21T03:14:06Z

Correction on the GitHub Actions rerun: the Prow rerun command above was posted, but gh run rerun --failed for the failed GitHub Actions run was rejected because this token does not have repository admin rights. No code change is needed for that GHA failure based on the current logs; it is the same unrelated tests/integrations/mcs/scheduling TestMeta/TestStoreWatch wait-condition failure described above.

Signed-off-by: okjiang <819421878@qq.com>

okJiang · 2026-05-21T03:32:54Z

Checked the latest Prow failures in #10689 (comment).

pull-unit-test-next-gen-2 failed in tests/server/apiv2/handlers TestAffinityHandlerTestSuite/TestAffinityListWithEmptyID: the microservice run returned an empty affinity group map for ids=&ids=group-2.
pull-unit-test-next-gen-3 failed in tests/integrations/client TestFollowerForwardAndHandleTestSuite/TestGetTsoByFollowerForwarding2: ResignLeader failed because etcd leader transfer timed out.

The latest PR commit only changed server/server.go and server/server_test.go to avoid loading region values during follower storage reset. These failures are outside that path, so rerunning the two failed Prow jobs.

/test pull-unit-test-next-gen-2
/test pull-unit-test-next-gen-3

okJiang · 2026-05-21T03:55:00Z

Checked the latest pull-unit-test-next-gen-3 failure from #10689 (comment). It failed in tests/integrations/mcs/tso TestTSOKeyspaceGroupManagerSuite/TestWatchFailed: the assertion at tests/integrations/mcs/tso/keyspace_group_manager_test.go:181 expected keyspace group 0x9d but got 0x0.

The latest PR commit only changed server/server.go and server/server_test.go for follower region storage reset deletion. This TSO keyspace group watch failure is outside that path, so rerunning the failed Prow job.

/test pull-unit-test-next-gen-3

okJiang · 2026-05-21T04:19:29Z

+				return err
+			}
+			meta := &metapb.Region{Id: regionID}
+			if err := s.deleteFollowerRegionMeta(meta); err != nil && firstErr == nil {


Deleting regions one by one feels a bit slow. Is there a way to delete everything at once, or perhaps just recreate a new one from scratch?

Fixed in 776055e.

The full reset path still scans local region storage page by page, but each page is now deleted in one storage transaction instead of issuing one DeleteRegion write per region. This keeps memory bounded and reduces the write cost to one batch per LoadRange page.

I did not recreate the storage from scratch because current region storage is shared by the running server/syncer and does not expose a safe swap/recreate path. There is also no generic range-delete API on the storage interface, so batching the existing page scan is the minimal safe change here.

Verified with:

git diff --check

make gotest for the follower-region reset/delete server tests

Signed-off-by: okjiang <819421878@qq.com>

okJiang · 2026-05-21T04:47:17Z

Checked the latest Prow failures in #10689 (comment).

pull-unit-test-next-gen-2 failed in tests/server/api TestFollowerRegionAPIWithNoForward by timing out while starting the test cluster. The test did not reach the follower-region API assertions. I reran the same test locally with go test -tags nextgen,without_dashboard ./tests/server/api -run TestFollowerRegionAPIWithNoForward -count=1, and it passed.
pull-unit-test-next-gen-3 failed in tests/integrations/client TestConfigTTLAfterTransferLeader by timing out while starting the test cluster / transferring leadership. I reran the same test locally under tests/integrations with go test -tags nextgen,without_dashboard ./client -run TestConfigTTLAfterTransferLeader -count=1, and it passed.

The latest PR commit only changed the follower region storage reset path in server/server.go and its unit tests. These two failures are startup/leader-transfer timeouts outside that changed path, so no code change is needed here. Rerunning the failed Prow jobs.

/test pull-unit-test-next-gen-2
/test pull-unit-test-next-gen-3

okJiang · 2026-05-21T05:02:59Z

Checked the latest pull-unit-test-next-gen-3 failure from #10689 (comment).

It failed in tests/integrations/tso TestMicroserviceTSOServer because the test timed out while starting the microservice TSO test cluster. The test did not reach follower region reset or syncer code. I reran the same test locally under tests/integrations with go test -tags nextgen,without_dashboard ./tso -run TestMicroserviceTSOServer -count=1, and it passed.

The latest PR commit only changed server/server.go and server/server_test.go for follower region storage reset deletion. This TSO microservice startup timeout is outside that changed path, so no code change is needed here. Rerunning the failed Prow job.

/test pull-unit-test-next-gen-3

okJiang · 2026-05-21T05:22:55Z

Checked the latest pull-unit-test-next-gen-3 failure from #10689 (comment).

It failed in tests/integrations/mcs/resourcemanager TestResourceManagerClientTestSuite/standalone-resource-manager-with-client-discovery/TestResourceGroupController: one timing assertion observed 492.485574ms, above the 300ms limit. I reran the same subtest locally under tests/integrations with go test -tags nextgen,without_dashboard ./mcs/resourcemanager -run 'TestResourceManagerClientTestSuite/standalone-resource-manager-with-client-discovery/TestResourceGroupController$' -count=1, and it passed.

The latest PR commit only changed server/server.go and server/server_test.go for follower region storage reset deletion. This resource manager timing failure is outside that changed path, so no code change is needed here. Rerunning the failed Prow job.

/test pull-unit-test-next-gen-3

okJiang · 2026-05-21T05:44:39Z

Checked the latest pull-unit-test-next-gen-3 failure from #10689 (comment).

It failed in tests/integrations/client TestUnavailableTimeAfterLeaderIsReady because the test timed out while starting the test cluster. The test did not reach follower region reset or syncer code. I reran the same test locally under tests/integrations with go test -tags nextgen,without_dashboard ./client -run TestUnavailableTimeAfterLeaderIsReady -count=1, and it passed.

The latest PR commit only changed server/server.go and server/server_test.go for follower region storage reset deletion. This client cluster-startup timeout is outside that changed path, so no code change is needed here. Rerunning the failed Prow job.

/test pull-unit-test-next-gen-3

okJiang · 2026-05-21T06:04:01Z

Checked the latest pull-unit-test-next-gen-3 failure from #10689 (comment).

It failed again in tests/integrations/mcs/tso TestTSOKeyspaceGroupManagerSuite/TestWatchFailed: the assertion at tests/integrations/mcs/tso/keyspace_group_manager_test.go:181 expected keyspace group 0x90 but got 0x0, with ErrKeyspaceGroupModRevisionStale in the log. I reran the same subtest locally under tests/integrations with go test -tags nextgen,without_dashboard ./mcs/tso -run 'TestTSOKeyspaceGroupManagerSuite/TestWatchFailed$' -count=1, and it passed.

The latest PR commit only changed server/server.go and server/server_test.go for follower region storage reset deletion. This TSO keyspace-group watch failure is outside that changed path, so no code change is needed here. Rerunning the failed Prow job.

/test pull-unit-test-next-gen-3

okJiang · 2026-05-21T06:26:15Z

Checked the latest pull-unit-test-next-gen-3 failure from #10689 (comment).

It failed in tests/integrations/mcs/resourcemanager TestSwitchModeDuringWorkload/pd-to-standalone: the wait condition at tests/integrations/mcs/resourcemanager/resource_manager_test.go:492 was never satisfied after switching resource-manager mode. I reran the same subtest locally under tests/integrations with go test -tags nextgen,without_dashboard ./mcs/resourcemanager -run 'TestSwitchModeDuringWorkload/pd-to-standalone$' -count=1, and it passed.

The latest PR commit only changed server/server.go and server/server_test.go for follower region storage reset deletion. This resource-manager mode-switch workload failure is outside that changed path, so no code change is needed here. Rerunning the failed Prow job.

/test pull-unit-test-next-gen-3

okJiang · 2026-05-21T06:45:26Z

Checked the latest pull-unit-test-next-gen-3 failure from #10689 (comment).

It failed in tests/integrations/client TestRouterClientEnabledSuite because the test timed out while starting the initial PD test cluster; the log shows repeated etcd peer connection failures and etcdserver: request timed out before the suite setup finished. I reran the same suite locally under tests/integrations with go test -tags nextgen,without_dashboard ./client -run TestRouterClientEnabledSuite -count=1, and it passed.

The latest PR commit only changed server/server.go and server/server_test.go for follower region storage reset deletion. This client test-cluster startup timeout is outside that changed path, so no code change is needed here. Rerunning the failed Prow job.

/test pull-unit-test-next-gen-3

Signed-off-by: okjiang <819421878@qq.com>

okJiang · 2026-05-21T07:19:53Z

Checked the latest pull-unit-test-next-gen-3 failure from #10689 (comment).

It failed in client/resource_group/controller TestQPS: the concurrency=1000,reserveN=10,limit=400000 case observed RU 450430, above the test tolerance. I reran the same test locally under client with go test ./resource_group/controller -run TestQPS -count=1, and it passed.

This QPS limiter test is outside the follower region reset / syncer full-sync path changed by this PR. The new commit b47a9ebe0 only adds syncer test coverage for the history-buffer overflow review question. Rerunning the failed Prow job.

/test pull-unit-test-next-gen-3

Signed-off-by: okjiang <819421878@qq.com>

ti-chi-bot · 2026-05-22T03:47:25Z

@okJiang: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-unit-test-next-gen-2	`7e4ffd0`	link	true	`/test pull-unit-test-next-gen-2`


ti-chi-bot Bot added the release-note and dco-signoff: yes labels May 19, 2026

ti-chi-bot Bot added the size/XL label May 19, 2026

coderabbitai Bot reviewed May 19, 2026

Comment thread server/server.go

rleungx reviewed May 20, 2026

okJiang force-pushed the codex/follower-region-reset-cache-resync branch 2 times, most recently from 71e13be to 8036fae on May 20, 2026 07:04

coderabbitai Bot reviewed May 20, 2026

Comment thread server/server.go

okJiang force-pushed the codex/follower-region-reset-cache-resync branch from 8036fae to 1a8a89c on May 20, 2026 07:12

okJiang commented May 20, 2026

okJiang force-pushed the codex/follower-region-reset-cache-resync branch from 1a8a89c to ead9946 on May 20, 2026 08:02

coderabbitai Bot reviewed May 20, 2026

Comment thread pkg/syncer/server_test.go Outdated

okJiang force-pushed the codex/follower-region-reset-cache-resync branch from ead9946 to 825d496 on May 20, 2026 08:12

coderabbitai Bot reviewed May 20, 2026

Comment thread server/api/admin.go

Comment thread server/server.go

Comment thread server/server.go

okJiang force-pushed the codex/follower-region-reset-cache-resync branch 2 times, most recently from 0c0111f to 07b65b7 on May 20, 2026 08:42

rleungx reviewed May 20, 2026

okJiang force-pushed the codex/follower-region-reset-cache-resync branch from 07b65b7 to b3c130e on May 20, 2026 10:20

okJiang added 2 commits May 21, 2026 10:52

syncer: catch up history after full sync

2a3f13d

Signed-off-by: okjiang <819421878@qq.com>

syncer: lower full sync send failure log level

b778b7a

Signed-off-by: okjiang <819421878@qq.com>

rleungx reviewed May 21, 2026

server: avoid loading regions during follower storage reset

d095867

Signed-off-by: okjiang <819421878@qq.com>

okJiang commented May 21, 2026

server: batch delete follower region storage

776055e

Signed-off-by: okjiang <819421878@qq.com>

syncer: test full sync history overflow

b47a9eb

Signed-off-by: okjiang <819421878@qq.com>

okJiang mentioned this pull request May 21, 2026

TestTSOKeyspaceGroupSplitElection is flaky #9842

Open

server: tidy follower region reset tests

7e4ffd0

Signed-off-by: okjiang <819421878@qq.com>
