server: resync follower region cache reset#10689
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds follower-local HTTP handling for region reads and cache reset: refactors syncer history reset, adjusts syncer full-sync behavior, implements Server.ResetFollowerRegionCache with deletion/storage helpers, wires middleware/router/admin for follower-scoped requests, and adds integration and unit tests. ChangesFollower Region Cache and Reset Implementation
Sequence DiagramsequenceDiagram
participant Server
participant PDLeaderResolver
participant RegionSyncer
participant RegionStorage
participant BasicCluster
Server->>PDLeaderResolver: Resolve current PD leader client URL
Server->>RegionSyncer: Stop sync with leader
Server->>RegionStorage: Flush local region storage
Server->>BasicCluster: Delete region metadata (one/all)
Server->>RegionSyncer: ResetHistoryIndex(0)
Server->>RegionSyncer: Restart sync with leader client URL
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/syncer/server.go (1)
234-251:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMissing-history path should force full sync, not silently continue.
When
startIndex != 0and no history is available, returningnilbinds the stream without backfilling. After leader restart/history loss, followers can remain permanently incomplete for unchanged regions.🔧 Suggested fix
if len(records) == 0 { if s.history.getNextIndex() == startIndex { ... return stream.Send(resp) } log.Warn("no history regions from index, the leader may be restarted", zap.Uint64("index", startIndex)) - return nil + return s.syncFullRegions(ctx, name, stream) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/syncer/server.go` around lines 234 - 251, When len(records)==0 and s.history.getNextIndex() != startIndex (i.e. history is missing after a restart) do not return nil; instead force a full sync by replying to the follower with a SyncRegionResponse that indicates StartIndex = 0 (so the follower knows to request a full backfill). Update the branch that currently logs "no history regions from index..." to construct and send a pdpb.SyncRegionResponse (same shape as the other response) with StartIndex set to 0 via stream.Send(...) instead of returning nil; use the same symbols in this file (records, startIndex, s.history.getNextIndex(), stream.Send, pdpb.SyncRegionResponse) to locate and change the behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@server/server.go`:
- Around line 944-952: The code currently calls leader.GetClientUrls() and then
uses leaderURLs with syncer.StartSyncWithLeader, but StartSyncWithLeader expects
gRPC listen URLs; replace the call to leader.GetClientUrls() with
leader.GetListenUrls() (update the leaderURLs variable accordingly) and change
the error message to reflect "no listen url" -> e.g., "pd leader has no listen
url from GetListenUrls"; keep the existing
StopSyncWithLeader()/StartSyncWithLeader(leaderURLs[0]) usage so the syncer is
restarted with the proper listen URL.
---
Outside diff comments:
In `@pkg/syncer/server.go`:
- Around line 234-251: When len(records)==0 and s.history.getNextIndex() !=
startIndex (i.e. history is missing after a restart) do not return nil; instead
force a full sync by replying to the follower with a SyncRegionResponse that
indicates StartIndex = 0 (so the follower knows to request a full backfill).
Update the branch that currently logs "no history regions from index..." to
construct and send a pdpb.SyncRegionResponse (same shape as the other response)
with StartIndex set to 0 via stream.Send(...) instead of returning nil; use the
same symbols in this file (records, startIndex, s.history.getNextIndex(),
stream.Send, pdpb.SyncRegionResponse) to locate and change the behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 20f7e27c-b470-4645-bfa1-5f5ce82b3d02
📒 Files selected for processing (8)
pkg/syncer/client.gopkg/syncer/history_buffer.gopkg/syncer/server.goserver/api/admin.goserver/api/middleware.goserver/api/router.goserver/server.gotests/server/api/api_test.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #10689 +/- ##
==========================================
+ Coverage 79.03% 79.12% +0.09%
==========================================
Files 536 536
Lines 73103 73410 +307
==========================================
+ Hits 57777 58089 +312
+ Misses 11234 11225 -9
- Partials 4092 4096 +4
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
| return errors.New("pd leader has no listen url") | ||
| } | ||
|
|
||
| syncer := s.cluster.GetRegionSyncer() |
There was a problem hiding this comment.
This reset path needs to be serialized. Two concurrent follower reset requests can both call StopSyncWithLeader and then defer StartSyncWithLeader; StartSyncWithLeader overwrites clientCancel each time, so one of the newly started syncer goroutines can become unreachable by future stops and later resets or leader changes may hang in wg.Wait(). Please guard the whole stop/delete/reset/start sequence with a mutex, or provide an atomic RegionSyncer restart method, and add a concurrent DELETE test.
There was a problem hiding this comment.
Fixed in the latest push. ResetFollowerRegionCache now serializes the stop/delete/reset/start sequence with a mutex, and TestFollowerRegionResetCacheWithNoForward covers concurrent DELETE requests for /pd/api/v1/admin/cache/regions.
| return s.syncFullRegions(ctx, name, stream) | ||
| } | ||
| log.Warn("no history regions from index, the leader may be restarted", zap.Uint64("index", startIndex)) | ||
| return nil |
There was a problem hiding this comment.
This branch should trigger a full sync instead of binding the stream and continuing. If the leader has lost the requested history range after restart, the follower can remain permanently incomplete for regions that do not change again, while later keepalives/broadcasts can still make the follower look synced to the local-read API. Please call syncFullRegions here, or otherwise send an explicit full-sync signal, so the follower rebuilds its cache.
There was a problem hiding this comment.
Fixed in the latest push. When the requested history range is missing and the follower is not already at the leader next index, the leader now falls back to full region sync. Added TestSyncFallsBackToFullSyncWhenHistoryMissing for this case.
71e13be to
8036fae
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@server/server.go`:
- Around line 957-968: The reset logic currently drives follower-storage cleanup
from in-memory BasicCluster membership (s.basicCluster.GetRegions()), which
leaves storage-only stale regions undeleted; change both paths so deletion is
driven by on-disk storage: for the regionIDs loop continue to call
deleteFollowerRegion(regionID) (which must perform storage lookups) and for the
"all regions" path replace the GetRegions() iteration with a storage-backed
enumeration that lists all stored follower regions and calls
deleteFollowerRegionStorage/deleteFollowerRegion for each entry, then call
s.basicCluster.ResetRegionCache(); also update the duplicate block (the similar
code around the 978-991 area) the same way so both full-reset and
selective-reset use storage enumeration/lookup instead of BasicCluster
membership, ensuring StartSyncWithLeader cannot reload stale disk-only regions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1db89c88-74c7-4722-841a-aceaf0229b0a
📒 Files selected for processing (8)
pkg/syncer/client.gopkg/syncer/history_buffer.gopkg/syncer/server.goserver/api/admin.goserver/api/middleware.goserver/api/router.goserver/server.gotests/server/api/api_test.go
8036fae to
1a8a89c
Compare
| } | ||
| return stream.Send(resp) | ||
| } | ||
| lastIndex := 0 |
There was a problem hiding this comment.
When the FollowerRegion begins a reset and the server calls Sync4Regions, the LastIndex becomes inconsistent with the actual index in the server's history buffer.
How should we handle this?
There was a problem hiding this comment.
This is the broader recovery problem covered by #10670. To make LastIndex fully consistent after a full snapshot, the server needs to catch up history generated during full sync and then send a final empty response with the catch-up index; the client also needs to keep streamingRunning false while it is still in full-sync mode. I would keep that full index-recovery flow in #10670 rather than expanding this reset-cache PR further.
There was a problem hiding this comment.
If we don't fix this issue, won't there be a problem once this PR is merged?
Even after the region cache is reset, subsequent region synchronization will still have issues. Isn't this effectively a bug?
There was a problem hiding this comment.
Fixed in 2a3f13d. The leader now records the history index before full sync, sends any region history generated while the snapshot is being sent, and only completes the full sync after the catch-up records are sent and the downstream stream is bound.
Verified with:
git diff --checkmake gotest GOTEST_ARGS="./pkg/syncer -count=1"
There was a problem hiding this comment.
What happens if the local History Buffer overflows during a full data synchronization? Specifically, what should be done if the number of changes during this period exceeds the length of the local buffer?
There was a problem hiding this comment.
Covered in b47a9eb.
If the leader loses the catch-up history that was recorded at the start of a full sync, it must not bind the follower stream with a partial cache. The current path detects catchUpIndex < history.firstIndex() and restarts full sync from a fresh snapshot, so the follower only marks local reads as running after it receives the final completion response from a complete sync attempt.
I added TestFullSyncRestartsWhenHistoryBufferOverflowsDuringCatchUp with a one-entry history buffer. The test forces history overflow while the first full-sync response is blocked, then verifies that the leader sends a restarted full snapshot and only then sends the completion response.
Verified with:
git diff --checkmake gotest GOTEST_ARGS='./pkg/syncer -run "TestSyncFallsBackToFullSyncWhenHistoryMissing|TestFullSyncRestartsWhenHistoryBufferOverflowsDuringCatchUp|TestClientWaitsForFullSyncCompletionBeforeRunning" -count=1'
1a8a89c to
ead9946
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/syncer/server_test.go`:
- Around line 362-382: The test creates a cancellable context with ctx, cancel
:= context.WithCancel(...) but only calls cancel() on the success path, risking
a leaked goroutine running syncer.Sync; fix by immediately deferring cancel()
right after context creation (i.e., add defer cancel() immediately after the
context.WithCancel call) so the sync goroutine started with syncer.Sync will
always be signaled to stop on test exit or failure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fb5a8e03-8a5f-41cc-975f-127d6174cb5d
📒 Files selected for processing (9)
pkg/syncer/client.gopkg/syncer/history_buffer.gopkg/syncer/server.gopkg/syncer/server_test.goserver/api/admin.goserver/api/middleware.goserver/api/router.goserver/server.gotests/server/api/api_test.go
ead9946 to
825d496
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/api/middleware.go (1)
132-153:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't gate follower reset requests on
RegionSyncer.IsRunning().The new DELETE endpoints are the recovery path when follower sync is stale, reconnecting, or already stopped. Requiring
IsRunning()here blocks them during exactly that window and can turn concurrent/retried DELETEs into 500s before they ever reachfollowerRegionResetMu.🔧 Narrow fix
rc := m.s.DirectlyGetRaftCluster() - if rc == nil || !rc.GetRegionSyncer().IsRunning() { + if rc == nil { return nil } + if r.Method == http.MethodGet && !rc.GetRegionSyncer().IsRunning() { + return nil + } return rc }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/api/middleware.go` around lines 132 - 153, The getFollowerSyncedCluster currently returns nil when rc.GetRegionSyncer().IsRunning() is false, which incorrectly blocks DELETE recovery endpoints; change the logic so that for DELETE (when m.allowFollowerRegionReset is used) you do NOT require RegionSyncer.IsRunning() — still return nil if rc == nil, but skip the IsRunning() check for the DELETE branch (only enforce IsRunning() for the GET/allowFollowerSyncedRegion path). Update getFollowerSyncedCluster to branch on r.Method after retrieving rc: if rc == nil return nil; if r.Method == http.MethodGet require rc.GetRegionSyncer().IsRunning(); otherwise (DELETE) return rc so the follower reset can reach followerRegionResetMu.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@server/api/admin.go`:
- Around line 72-79: The Swagger response annotations for the follower reset
path need updating to reflect the new behavior in isFollowerSyncedClusterRequest
handling: change the success response description to the follower-specific
message ("The region is removed from follower cache and the follower starts to
resync regions from leader.") and add a 500 response for failures from
ResetFollowerRegionCache (returning the error). Update the corresponding
annotations for the other handler referenced (lines around the
ResetLeaderRegionCache/ResetFollowerRegionCache pairing) so both follower and
leader cases have correct success/500 docs, then regenerate the OpenAPI spec
with make swagger-spec (SWAGGER=1).
In `@server/server.go`:
- Around line 942-956: The reset path races with leaderLoop because this method
only locks followerRegionResetMu while leaderLoop can call
StopSyncWithLeader/StartSyncWithLeader concurrently; either (A) protect
leaderLoop's restart calls with the same followerRegionResetMu (add Lock/Unlock
around leaderLoop's Stop/Start sequence) or (B) implement an atomic syncer API
on the RegionSyncer (e.g., ResetSyncWithLeader([]string) or
RestartWithLeader(url string) that internally does StopSyncWithLeader and
StartSyncWithLeader under its own mutex) and replace the StopSyncWithLeader +
deferred StartSyncWithLeader sequence here (and calls from leaderLoop) with that
single atomic call to avoid overwritten clientCancel/orphaned goroutine and
stale leader URL issues.
- Around line 959-980: The flush is happening before deletions so restarted
syncer (StartSyncWithLeader) may reload the old entries; move the
s.storage.Flush() call to after the deletion branch (after
deleteFollowerRegionStorage / deleteFollowerRegion loop) and before
syncer.ResetHistoryIndex(0), preserving the same error handling/wrapping logic
(update resetErr if Flush returns an error) so storage is persisted only once
deletions are applied.
---
Outside diff comments:
In `@server/api/middleware.go`:
- Around line 132-153: The getFollowerSyncedCluster currently returns nil when
rc.GetRegionSyncer().IsRunning() is false, which incorrectly blocks DELETE
recovery endpoints; change the logic so that for DELETE (when
m.allowFollowerRegionReset is used) you do NOT require RegionSyncer.IsRunning()
— still return nil if rc == nil, but skip the IsRunning() check for the DELETE
branch (only enforce IsRunning() for the GET/allowFollowerSyncedRegion path).
Update getFollowerSyncedCluster to branch on r.Method after retrieving rc: if rc
== nil return nil; if r.Method == http.MethodGet require
rc.GetRegionSyncer().IsRunning(); otherwise (DELETE) return rc so the follower
reset can reach followerRegionResetMu.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 05e0125f-1840-43a7-9d9b-164dc52c999e
📒 Files selected for processing (9)
pkg/syncer/client.gopkg/syncer/history_buffer.gopkg/syncer/server.gopkg/syncer/server_test.goserver/api/admin.goserver/api/middleware.goserver/api/router.goserver/server.gotests/server/api/api_test.go
0c0111f to
07b65b7
Compare
|
Addressed the CodeRabbit outside-diff comment in #10689 (review). DELETE follower reset requests no longer require Verified with |
|
/test pull-unit-test-next-gen-2 |
|
/test pull-unit-test-next-gen-2 |
3 similar comments
|
/test pull-unit-test-next-gen-2 |
|
/test pull-unit-test-next-gen-2 |
|
/test pull-unit-test-next-gen-2 |
| log.Warn("no history regions from index, the leader may be restarted", zap.Uint64("index", startIndex)) | ||
| return nil | ||
| log.Warn("no history regions from index, fall back to full sync", zap.Uint64("index", startIndex)) | ||
| return s.syncFullRegions(ctx, name, stream) |
There was a problem hiding this comment.
When this falls back to full sync, the follower can mark the syncer as running after the first response batch. For large clusters, follower reads may see a partial region cache before full sync finishes. Can we add an explicit completion signal or keep follower local reads disabled until the full sync is done?
There was a problem hiding this comment.
Fixed in b3c130e. Full sync now sends an empty completion response after all full-sync batches, and the follower client keeps local follower reads disabled until that completion response is received.
Coverage added in TestClientWaitsForFullSyncCompletionBeforeRunning and the existing fallback full sync test. Verified with:
make gotest GOTEST_ARGS="./pkg/syncer -run 'TestSyncFallsBackToFullSyncWhenHistoryMissing|TestClientWaitsForFullSyncCompletionBeforeRunning' -count=1"make gotest GOTEST_ARGS="./tests/server/api -tags without_dashboard -run TestFollowerRegionResetCacheWithNoForward -count=1"
| if !m.allowFollowerSyncedRegion { | ||
| return nil | ||
| } | ||
| case http.MethodDelete: |
There was a problem hiding this comment.
This reuses PD-Allow-Follower-Handle for a DELETE operation, but the existing header is read-oriented. Existing clients or proxies that already attach it could unexpectedly reset follower region cache. Could we use a dedicated opt-in signal or add compatibility coverage for this behavior?
There was a problem hiding this comment.
Fixed in b3c130e. The follower reset path now requires an explicit PD-Follower-Region-Reset header in addition to PD-Allow-Follower-Handle, so existing read-oriented callers/proxies that only carry the old header will not trigger a follower reset.
Coverage was added in the existing follower reset test. Verified with:
make gotest GOTEST_ARGS="./tests/server/api -tags without_dashboard -run TestFollowerRegionResetCacheWithNoForward -count=1"
| resetErr = errors.Wrap(err, "flush follower region storage") | ||
| } | ||
| if len(regionIDs) == 0 { | ||
| regions, err := s.loadFollowerRegionStorage() |
There was a problem hiding this comment.
This loads all local regions into memory before deleting them. On large clusters, reset may use a lot of memory or take too long. Can we delete page by page while scanning LoadRange instead?
There was a problem hiding this comment.
Fixed in b3c130e. The all-region follower reset now scans local region storage page by page with LoadRange and deletes each loaded page immediately, instead of loading all local regions into memory first.
Verified with:
make gotest GOTEST_ARGS="./tests/server/api -tags without_dashboard -run TestFollowerRegionResetCacheWithNoForward -count=1"
07b65b7 to
b3c130e
Compare
|
Checked the latest Rerunning the failed required job. /test pull-unit-test-next-gen-3 |
|
Checked the latest Rerunning the failed required job. /test pull-unit-test-next-gen-3 |
|
Checked the latest Rerunning the failed required job. /test pull-unit-test-next-gen-3 |
|
Checked the latest Rerunning the failed required job. /test pull-unit-test-next-gen-3 |
|
Checked the latest Rerunning the failed required job. /test pull-unit-test-next-gen-3 |
|
Checked the latest Also checked the outside-diff CodeRabbit comment in #10689 (review). The current code already allows DELETE follower reset requests through without requiring Rerunning the failed required job. /test pull-unit-test-next-gen-3 |
|
Checked the latest This is outside the follower region reset path changed by this PR, and the PR diff does not touch Rerunning the failed required job. /test pull-unit-test-next-gen-3 |
Signed-off-by: okjiang <819421878@qq.com>
Signed-off-by: okjiang <819421878@qq.com>
|
Checked the latest Fixed in b778b7a by lowering that new full-sync send-failure log to Verified with:
|
|
Checked the latest Also checked the current GitHub Actions failure in Both failures are outside the follower region reset / syncer path changed by this PR; the PR diff does not touch Rerunning the failed required Prow job and the failed GitHub Actions jobs. /test pull-unit-test-next-gen-3 |
| resetErr = errors.Wrap(err, "flush follower region storage") | ||
| } | ||
| if len(regionIDs) == 0 { | ||
| if err := s.deleteFollowerRegionStorage(); resetErr == nil && err != nil { |
There was a problem hiding this comment.
Do we need to delete it? If we have 10M region, how long will it take?
There was a problem hiding this comment.
Fixed in d095867.
The all-region reset still needs to delete local region storage so storage-only stale regions cannot be reloaded later. To reduce the 10M-region cost, it now scans region keys and deletes by parsed region ID, without loading and decoding every region value before deletion.
Verified with:
git diff --checkmake gotest GOTEST_ARGS='./server -run "TestDeleteFollowerRegionStorage|TestDeleteFollowerRegionStorageReturnsStorageErrors|TestDeleteFollowerRegion|TestResetFollowerRegionCacheRequiresRegionStorage|TestParseRegionIDFromStorageKey" -count=1'
There was a problem hiding this comment.
@okJiang can we test it locally and report some test results here
There was a problem hiding this comment.
Local scale test result with the in-memory storage backend:
- 100,000 regions: old path delete cost 74.67 ms, new path delete cost 41.45 ms, about 1.80x faster.
- 1,000,000 regions: old path delete cost 811.82 ms, new path delete cost 466.39 ms, about 1.74x faster.
Command used with a temporary local-only test, not committed:
PD_FOLLOWER_REGION_DELETE_SCALE=100000 go test ./server -run '^TestFollowerRegionStorageDeleteScaleLocalOnly$' -count=1 -vPD_FOLLOWER_REGION_DELETE_SCALE=1000000 go test ./server -run '^TestFollowerRegionStorageDeleteScaleLocalOnly$' -count=1 -v
This is not a real disk/etcd latency benchmark. It only verifies the changed path: the reset is still O(N), but it no longer does one LoadRegion read and protobuf unmarshal per region before deletion. For 10M regions it is still a linear admin recovery operation, but the avoidable per-region value load is removed.
|
Correction on the GitHub Actions rerun: the Prow rerun command above was posted, but |
Signed-off-by: okjiang <819421878@qq.com>
|
Checked the latest Prow failures in #10689 (comment).
The latest PR commit only changed /test pull-unit-test-next-gen-2 |
|
Checked the latest The latest PR commit only changed /test pull-unit-test-next-gen-3 |
| return err | ||
| } | ||
| meta := &metapb.Region{Id: regionID} | ||
| if err := s.deleteFollowerRegionMeta(meta); err != nil && firstErr == nil { |
There was a problem hiding this comment.
Deleting regions one by one feels a bit slow. Is there a way to delete everything at once, or perhaps just recreate a new one from scratch?
There was a problem hiding this comment.
Fixed in 776055e.
The full reset path still scans local region storage page by page, but each page is now deleted in one storage transaction instead of issuing one DeleteRegion write per region. This keeps memory bounded and reduces the write cost to one batch per LoadRange page.
I did not recreate the storage from scratch because current region storage is shared by the running server/syncer and does not expose a safe swap/recreate path. There is also no generic range-delete API on the storage interface, so batching the existing page scan is the minimal safe change here.
Verified with:
git diff --checkmake gotestfor the follower-region reset/delete server tests
Signed-off-by: okjiang <819421878@qq.com>
|
Checked the latest Prow failures in #10689 (comment).
The latest PR commit only changed the follower region storage reset path in /test pull-unit-test-next-gen-2 |
|
Checked the latest It failed in The latest PR commit only changed /test pull-unit-test-next-gen-3 |
|
Checked the latest It failed in The latest PR commit only changed /test pull-unit-test-next-gen-3 |
|
Checked the latest It failed in The latest PR commit only changed /test pull-unit-test-next-gen-3 |
|
Checked the latest It failed again in The latest PR commit only changed /test pull-unit-test-next-gen-3 |
|
Checked the latest It failed in The latest PR commit only changed /test pull-unit-test-next-gen-3 |
|
Checked the latest It failed in The latest PR commit only changed /test pull-unit-test-next-gen-3 |
Signed-off-by: okjiang <819421878@qq.com>
|
Checked the latest It failed in This QPS limiter test is outside the follower region reset / syncer full-sync path changed by this PR. The new commit /test pull-unit-test-next-gen-3 |
Signed-off-by: okjiang <819421878@qq.com>
|
@okJiang: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What problem does this PR solve?
Issue Number: Close #10667
Stacked on #10682. Please review the new commits on this branch until #10682 is merged.
What is changed and how does it work?
Check List
Tests
Code changes
Release note