checker: address split scatter pending follow-ups #10691

Merged
ti-chi-bot[bot] merged 4 commits into tikv:master from lhy1024:split-scatter-pending-fixes-master
May 25, 2026

Conversation

@lhy1024
Member

@lhy1024 lhy1024 commented May 19, 2026

What problem does this PR solve?

Issue Number: ref #10592 pick from #10678

What is changed and how does it work?

Backport the split-scatter pending follow-up fixes from the release-8.5 cherry-pick back to master.

Reset the process-global pending gauge when creating a split-scatter controller, avoid recording or retaining pending split-scatter entries while split-scatter scheduling is disabled, add dispatch backoff for blocked pending work, and count missing pending regions when delaying retries.
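Two of the points above, resetting the gauge on controller creation and skipping recording while scheduling is disabled, can be sketched as follows. This is a minimal illustration and not the PR's code: `pendingGauge`, `controller`, `newController`, and `record` are hypothetical names, and a plain package-level variable stands in for the real metric.

```go
package main

import (
	"fmt"
	"sync"
)

// pendingGauge is a hypothetical stand-in for a process-global metric
// (for example a Prometheus gauge); it only remembers the last value set.
var pendingGauge float64

// controller is a simplified, hypothetical split-scatter controller.
type controller struct {
	mu      sync.Mutex
	limit   uint64              // 0 means split-scatter scheduling is disabled
	pending map[uint64]struct{} // IDs of regions waiting to be scattered
}

// newController resets the global gauge so that a value left behind by an
// earlier controller in the same process is not reported as current.
func newController(limit uint64) *controller {
	pendingGauge = 0
	return &controller{limit: limit, pending: make(map[uint64]struct{})}
}

// record stores region IDs as pending unless scheduling is disabled.
// It reports whether anything was recorded.
func (c *controller) record(regionIDs ...uint64) bool {
	if c.limit == 0 || len(regionIDs) == 0 {
		return false
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, id := range regionIDs {
		c.pending[id] = struct{}{}
	}
	pendingGauge = float64(len(c.pending))
	return true
}

func main() {
	enabled := newController(4)
	ok := enabled.record(1, 2)
	fmt.Println(ok, pendingGauge) // prints "true 2"

	// A new controller starts from a clean gauge, and a disabled one
	// records nothing.
	disabled := newController(0)
	ok = disabled.record(3)
	fmt.Println(ok, pendingGauge) // prints "false 0"
}
```

Because the gauge is shared by the whole process, the reset in the constructor is what keeps a second controller (for example in a test) from inheriting the first one's count.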

Keep the master test style by using prometheus/testutil for metric assertions.

Check List

Tests

  • Unit test

Release note

PD now avoids stale pending load-based split-scatter entries and excessive retry loops when split-scatter scheduling is disabled or pending split regions are not ready.

Summary by CodeRabbit

  • New Features

    • Added dispatch throttling and explicit retry delays for split-scatter scheduling; pending items with missing regions are delayed and retried later. Pending counters are initialized/reset correctly.
  • Bug Fixes

    • Prevented recording/dispatch when scheduling is disabled and ensured pending state is cleared.
    • Improved backoff when operator limits are reached or no candidates are available; missing-region occurrences now increment a metric.
  • Tests

    • Added coverage for pending gauge reset, disabled scheduling, missing-region retries, and dispatch backoff.

lhy1024 added 2 commits May 19, 2026 21:17
Reset the process-global pending gauge when creating a split-scatter controller.

Avoid recording or retaining pending entries while split-scatter is disabled, add a cheap dispatch backoff for blocked pending work, and count missing pending regions when they are delayed.

Signed-off-by: lhy1024 <[email protected]>
@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-triage-completed dco-signoff: yes Indicates the PR's author has signed the dco. labels May 19, 2026
@coderabbitai

coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a840edaa-e256-47de-83e3-1ce58cbaf38e

📥 Commits

Reviewing files that changed from the base of the PR and between 4e5a225 and 64529c3.

📒 Files selected for processing (3)
  • pkg/schedule/checker/checker_controller.go
  • pkg/schedule/checker/split_scatter.go
  • pkg/schedule/checker/split_scatter_test.go

📝 Walkthrough

Adds nextDispatchAt-based throttling to split-scatter dispatch, detects missing regions during candidate collection and sets retry backoff under the pending lock, provides helpers to clear/throttle pending/dispatch state, skips recording when scheduling is disabled, and updates tests for disabled and timing behavior.
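The nextDispatchAt-based throttling mentioned in the walkthrough might look roughly like the sketch below. The helper names `skipDispatchUntil` and `delayNextDispatch` come from the walkthrough, but their bodies, the `throttle` type, the `dispatchBackoff` value, and the `simulate` driver are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// dispatchBackoff is a hypothetical delay; the value used by the PR is not
// shown in this thread.
const dispatchBackoff = 100 * time.Millisecond

// throttle is a simplified stand-in for the nextDispatchAt field and the
// helpers built around it.
type throttle struct {
	mu             sync.Mutex
	nextDispatchAt time.Time
}

// skipDispatchUntil reports whether a dispatch attempt at time now should be
// skipped because an earlier backoff has not elapsed yet.
func (t *throttle) skipDispatchUntil(now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return now.Before(t.nextDispatchAt)
}

// delayNextDispatch moves the next allowed dispatch to now + dispatchBackoff.
func (t *throttle) delayNextDispatch(now time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.nextDispatchAt = now.Add(dispatchBackoff)
}

// simulate runs dispatch attempts at the given millisecond offsets from a
// fixed start time. Every attempt that is not skipped is assumed to find
// nothing to dispatch, so it backs off. The result records which attempts ran.
func simulate(offsetsMs ...int) []bool {
	start := time.Unix(0, 0)
	var t throttle
	ran := make([]bool, 0, len(offsetsMs))
	for _, ms := range offsetsMs {
		now := start.Add(time.Duration(ms) * time.Millisecond)
		if t.skipDispatchUntil(now) {
			ran = append(ran, false)
			continue
		}
		t.delayNextDispatch(now)
		ran = append(ran, true)
	}
	return ran
}

func main() {
	fmt.Println(simulate(0, 50, 100)) // prints "[true false true]"
}
```

The check is deliberately cheap: a blocked controller pays one lock and one time comparison per patrol tick instead of rescanning its pending entries.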

Changes

Split-scatter dispatch scheduling and backoff

Layer / File(s) Summary
Controller state, gauge init, and cleanup
pkg/schedule/checker/split_scatter.go, pkg/schedule/checker/split_scatter_test.go, pkg/schedule/checker/checker_controller.go
Adds nextDispatchAt to splitScatterController, initializes pending gauge to zero, and defers clearing pending state during patrol cleanup.
Collect missing pending entries and schedule retry
pkg/schedule/checker/split_scatter.go, pkg/schedule/checker/split_scatter_test.go
collectTopPendingSplitScatter snapshots pending entries whose target/source regions are missing and calls delayMissingPendingSplitScatter to set retryAt = now + splitScatterRetryBackoff and increment the missing-region metric.
Pending/dispatch control helpers
pkg/schedule/checker/split_scatter.go
Adds clearPendingSplitScatter, skipDispatchUntil, and delayNextDispatch; recordSplitScatterBatch now returns early when GetSplitScatterScheduleLimit() == 0.
Dispatch logic: throttling, limit, and backoff
pkg/schedule/checker/split_scatter.go, pkg/schedule/checker/split_scatter_test.go
dispatchSplitScatterRegions observes nextDispatchAt, clears pending when scheduling is disabled, delays next dispatch when operator limit reached or no candidates, and otherwise dispatches candidates; tests updated/added for these flows.
Tests: disabled behavior, timing, and test helpers
pkg/schedule/checker/split_scatter_test.go
Adds tests verifying controller cleanup resets pending/gauge, recording skipped when limit is 0, dispatch clears pending when disabled, missing-region delays and metric increment, dispatch backoff when no candidates, and includes test helpers to set/read pending.retryAt and nextDispatchAt.
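The missing-region row above (retryAt = now + splitScatterRetryBackoff plus a metric increment) can be illustrated with a small sketch. It assumes a much simpler controller than the real one: `retryBackoff`, `pendingItem`, `collect`, `delayMissing`, the `regions` map, and the `missingRegion` counter are hypothetical stand-ins for the PR's identifiers and for the cluster's region lookup.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// retryBackoff is a hypothetical value standing in for splitScatterRetryBackoff.
const retryBackoff = time.Second

type pendingItem struct {
	retryAt time.Time // earliest time the entry may be considered again
}

type controller struct {
	mu            sync.Mutex
	pending       map[uint64]pendingItem
	regions       map[uint64]bool // stand-in for the cluster's region lookup
	missingRegion int             // stand-in for the missing-region counter
}

// collect returns the pending region IDs that are ready at now, sorted by ID.
// Entries whose region cannot be found are delayed instead of returned.
func (c *controller) collect(now time.Time) []uint64 {
	c.mu.Lock()
	var ready, missing []uint64
	for id, item := range c.pending {
		switch {
		case now.Before(item.retryAt): // still backing off, skip for now
		case !c.regions[id]:
			missing = append(missing, id)
		default:
			ready = append(ready, id)
		}
	}
	c.mu.Unlock()

	c.delayMissing(missing, now)
	sort.Slice(ready, func(i, j int) bool { return ready[i] < ready[j] })
	return ready
}

// delayMissing sets retryAt = now + retryBackoff for each missing entry that
// is still pending and counts it, all under the pending lock.
func (c *controller) delayMissing(ids []uint64, now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, id := range ids {
		item, ok := c.pending[id]
		if !ok {
			continue // removed since the snapshot was taken
		}
		item.retryAt = now.Add(retryBackoff)
		c.pending[id] = item
		c.missingRegion++
	}
}

// demo collects at t0, halfway through the backoff, and once it has elapsed.
func demo() (ready [][]uint64, missing int) {
	t0 := time.Unix(0, 0)
	c := &controller{
		pending: map[uint64]pendingItem{1: {}, 2: {}},
		regions: map[uint64]bool{1: true}, // region 2 is not known yet
	}
	ready = append(ready, c.collect(t0))                     // region 2 is delayed
	ready = append(ready, c.collect(t0.Add(retryBackoff/2))) // region 2 still waiting
	c.regions[2] = true
	ready = append(ready, c.collect(t0.Add(retryBackoff))) // region 2 is retried
	return ready, c.missingRegion
}

func main() {
	ready, missing := demo()
	fmt.Println(ready, missing) // prints "[[1] [1] [1 2]] 1"
}
```

In this sketch the counter increases only once: the second pass skips region 2 on the retryAt check before looking it up, which is the effect the backoff is meant to have on retry loops.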

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • tikv/pd#10621: Follow-up to split-scatter implementation; modifies the same split-scatter controller and recording logic.
  • tikv/pd#10652: Adds RecordSplitScatterBatch usage in the scheduling server that overlaps with these recording/dispatch changes.

Suggested labels

size/XL, needs-cherry-pick-release-8.5

Suggested reviewers

  • rleungx
  • bufferflies
  • okJiang

Poem

🐰 I nudge the pending into timed retreat,
Missing hops wait till clocks repeat,
When limits stand, I clear the trail,
Then back I hop to try again without fail,
A small soft thump — the scheduler’s beat.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: addressing split scatter pending follow-ups by adding backoff handling, dispatch scheduling, and missing-region tracking. |
| Description check | ✅ Passed | The description includes the required issue reference, a detailed commit message explaining the changes, marks unit tests as included, and provides a release note. All key information is present. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@ti-chi-bot ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed do-not-merge/needs-linked-issue do-not-merge/needs-triage-completed labels May 19, 2026
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/schedule/checker/split_scatter.go`:
- Around line 380-391: The early throttle check using skipDispatchUntil(now)
happens before cleanupExpiredPendingSplitScatter() and the disabled-limit
handling, causing stale pending entries/gauges to remain if nextDispatchAt is in
the future; reorder the logic in the function so you first call
cleanupExpiredPendingSplitScatter(), then read limit via
c.cluster.GetCheckerConfig().GetSplitScatterScheduleLimit() and handle the
limit==0 case by incrementing splitScatterDispatchDisabledCounter and calling
c.clearPendingSplitScatter(), and only after those steps perform the
skipDispatchUntil(now) check (using the same skipDispatchUntil function) to
early-return if needed.
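The ordering requested in this comment can be sketched as below: expired-entry cleanup and the limit == 0 branch run first, and the throttle check comes last. The `controller` type, its fields, and the string outcomes are hypothetical and exist only to make the ordering observable; locking is omitted for brevity.

```go
package main

import (
	"fmt"
	"time"
)

// controller is a simplified, hypothetical model of the dispatch state.
type controller struct {
	limit          uint64               // 0 means scheduling is disabled
	pending        map[uint64]time.Time // region ID -> expiry time
	nextDispatchAt time.Time            // set by an earlier backoff
	disabledCount  int                  // stand-in for the dispatch-disabled counter
}

// cleanupExpired drops pending entries whose expiry time has passed.
func (c *controller) cleanupExpired(now time.Time) {
	for id, expireAt := range c.pending {
		if !now.Before(expireAt) {
			delete(c.pending, id)
		}
	}
}

// dispatch runs cleanup and the disabled branch before the throttle check,
// so an active backoff cannot keep stale entries alive.
func (c *controller) dispatch(now time.Time) string {
	c.cleanupExpired(now)
	if c.limit == 0 {
		c.disabledCount++
		c.pending = make(map[uint64]time.Time)
		return "disabled"
	}
	if now.Before(c.nextDispatchAt) {
		return "throttled"
	}
	return "dispatched"
}

// demo dispatches once while a backoff is still active and reports the
// outcome together with the number of pending entries left.
func demo(limit uint64) (outcome string, pendingLeft int) {
	now := time.Unix(0, 0)
	c := &controller{
		limit:          limit,
		pending:        map[uint64]time.Time{1: now.Add(time.Minute)},
		nextDispatchAt: now.Add(time.Second),
	}
	outcome = c.dispatch(now)
	return outcome, len(c.pending)
}

func main() {
	fmt.Println(demo(0)) // prints "disabled 0"
	fmt.Println(demo(4)) // prints "throttled 1"
}
```

With the throttle check first, the `demo(0)` case would return early and leave one pending entry behind, which is the stale state the comment describes.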

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 504320bc-587f-4c8f-a2a2-78a33d6dd1cb

📥 Commits

Reviewing files that changed from the base of the PR and between 089337a and dfa9555.

📒 Files selected for processing (2)
  • pkg/schedule/checker/split_scatter.go
  • pkg/schedule/checker/split_scatter_test.go

Comment thread pkg/schedule/checker/split_scatter.go
@codecov

codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 90.19608% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.05%. Comparing base (94df6da) to head (64529c3).
⚠️ Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10691      +/-   ##
==========================================
+ Coverage   79.03%   79.05%   +0.02%     
==========================================
  Files         536      536              
  Lines       73103    73258     +155     
==========================================
+ Hits        57777    57915     +138     
- Misses      11234    11239       +5     
- Partials     4092     4104      +12     
Flag Coverage Δ
unittests 79.05% <90.19%> (+0.02%) ⬆️


@ti-chi-bot
Contributor

ti-chi-bot Bot commented May 20, 2026

@liyishuai: adding LGTM is restricted to approvers and reviewers in OWNERS files.


@lhy1024
Member Author

lhy1024 commented May 21, 2026

@rleungx @bufferflies @okJiang PTAL

Comment thread pkg/schedule/checker/split_scatter.go Outdated
pendingCount := len(c.pending)
if pendingCount > 0 {
c.pending = make(map[uint64]splitScatterPendingItem)
c.updatePendingGaugeLocked()
Contributor

Will the metric be reset to zero if the pending count is already zero?

Member Author

Updated in 64529c3. clearPendingSplitScatter now always refreshes splitScatterPendingGauge after clearing pending, so the metric is reset to 0 even when the pending count is already 0. PatrolRegions also calls the same cleanup path when it exits.
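A minimal sketch of the behavior described in this reply, assuming a plain variable in place of splitScatterPendingGauge: the gauge refresh sits outside the `pendingCount > 0` branch, so it also runs when there is nothing to clear. `pendingGauge`, `controller`, and `clearPending` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// pendingGauge is a hypothetical stand-in for splitScatterPendingGauge.
var pendingGauge float64

type controller struct {
	mu      sync.Mutex
	pending map[uint64]struct{}
}

// clearPending drops all pending entries and then refreshes the gauge
// unconditionally, so a stale value is overwritten even if pending was
// already empty.
func (c *controller) clearPending() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.pending) > 0 {
		c.pending = make(map[uint64]struct{})
	}
	pendingGauge = float64(len(c.pending))
}

func main() {
	pendingGauge = 3 // pretend an earlier controller left this value behind
	c := &controller{pending: map[uint64]struct{}{}}
	c.clearPending()
	fmt.Println(pendingGauge) // prints "0"
}
```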

c.clearPendingSplitScatter()
return
}
if c.skipDispatchUntil(now) {
Contributor

Do we need to add new metrics recording why the split-scatter controller does not make progress?

Member Author

I think the existing checker event metrics already cover the main reasons why split-scatter dispatch does not make progress. They are reported through pd_checker_event_count{type="split_scatter_checker", name=...}, including dispatch-disabled, dispatch-schedule-limit, dispatch-region-missing, dispatch-schedule-disabled, dispatch-not-fully-replicated, dispatch-scatter-failed, dispatch-store-limit, and dispatch-add-operator-failed. The skipDispatchUntil path is only the retry backoff after one of those reasons has already been recorded, so I did not add another metric for it here.

@ti-chi-bot ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels May 25, 2026
@ti-chi-bot ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels May 25, 2026
@ti-chi-bot
Contributor

ti-chi-bot Bot commented May 25, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bufferflies, liyishuai, rleungx


Needs approval from an approver in each of these files:
  • OWNERS [bufferflies,rleungx]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot
Contributor

ti-chi-bot Bot commented May 25, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-05-25 02:53:48.867725282 +0000 UTC m=+234298.837890343: ☑️ agreed by bufferflies.
  • 2026-05-25 07:40:14.918985997 +0000 UTC m=+251484.889151059: ☑️ agreed by rleungx.

@lhy1024
Member Author

lhy1024 commented May 25, 2026

/retest

@lhy1024
Member Author

lhy1024 commented May 25, 2026

/retest

@lhy1024
Member Author

lhy1024 commented May 25, 2026

/retest

@lhy1024
Member Author

lhy1024 commented May 25, 2026

/retest

@lhy1024
Member Author

lhy1024 commented May 25, 2026

/retest

@lhy1024
Member Author

lhy1024 commented May 25, 2026

/retest

@ti-chi-bot ti-chi-bot Bot merged commit c2665cc into tikv:master May 25, 2026
32 checks passed
@lhy1024 lhy1024 deleted the split-scatter-pending-fixes-master branch May 25, 2026 11:31
lhy1024 added a commit that referenced this pull request May 26, 2026
ref #10592

Backport the split-scatter pending follow-up fixes from the release-8.5 cherry-pick back to master.

Reset the process-global pending gauge when creating a split-scatter controller, avoid recording or retaining pending split-scatter entries while split-scatter scheduling is disabled, add dispatch backoff for blocked pending work, and count missing pending regions when delaying retries.

Keep the master test style by using prometheus/testutil for metric assertions.

Signed-off-by: lhy1024 <[email protected]>
Signed-off-by: lhy1024 <[email protected]>

(cherry picked from commit c2665cc)
Signed-off-by: lhy1024 <[email protected]>

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.


4 participants