checker: address split scatter pending follow-ups#10691
Conversation
Reset the process-global pending gauge when creating a split-scatter controller. Avoid recording or retaining pending entries while split-scatter is disabled, add a cheap dispatch backoff for blocked pending work, and count missing pending regions when they are delayed. Signed-off-by: lhy1024 <[email protected]>
Signed-off-by: lhy1024 <[email protected]>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughAdds nextDispatchAt-based throttling to split-scatter dispatch, detects missing regions during candidate collection and sets retry backoff under the pending lock, provides helpers to clear/throttle pending/dispatch state, skips recording when scheduling is disabled, and updates tests for disabled and timing behavior. ChangesSplit-scatter dispatch scheduling and backoff
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/schedule/checker/split_scatter.go`:
- Around line 380-391: The early throttle check using skipDispatchUntil(now)
happens before cleanupExpiredPendingSplitScatter() and the disabled-limit
handling, causing stale pending entries/gauges to remain if nextDispatchAt is in
the future; reorder the logic in the function so you first call
cleanupExpiredPendingSplitScatter(), then read limit via
c.cluster.GetCheckerConfig().GetSplitScatterScheduleLimit() and handle the
limit==0 case by incrementing splitScatterDispatchDisabledCounter and calling
c.clearPendingSplitScatter(), and only after those steps perform the
skipDispatchUntil(now) check (using the same skipDispatchUntil function) to
early-return if needed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 504320bc-587f-4c8f-a2a2-78a33d6dd1cb
📒 Files selected for processing (2)
pkg/schedule/checker/split_scatter.gopkg/schedule/checker/split_scatter_test.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #10691 +/- ##
==========================================
+ Coverage 79.03% 79.05% +0.02%
==========================================
Files 536 536
Lines 73103 73258 +155
==========================================
+ Hits 57777 57915 +138
- Misses 11234 11239 +5
- Partials 4092 4104 +12
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Signed-off-by: lhy1024 <[email protected]>
|
@liyishuai: adding LGTM is restricted to approvers and reviewers in OWNERS files. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@rleungx @bufferflies @okJiang PTAL |
| pendingCount := len(c.pending) | ||
| if pendingCount > 0 { | ||
| c.pending = make(map[uint64]splitScatterPendingItem) | ||
| c.updatePendingGaugeLocked() |
There was a problem hiding this comment.
Will the metrics reset zero if pending count is zero,
There was a problem hiding this comment.
Updated in 64529c3. clearPendingSplitScatter now always refreshes splitScatterPendingGauge after clearing pending, so the metric is reset to 0 even when the pending count is already 0. PatrolRegions also calls the same cleanup path when it exits.
| c.clearPendingSplitScatter() | ||
| return | ||
| } | ||
| if c.skipDispatchUntil(now) { |
There was a problem hiding this comment.
Do we needs to add new metrics recording why the split controller doesn't work?
There was a problem hiding this comment.
I think the existing checker event metrics already cover the main reasons why split-scatter dispatch does not make progress. They are reported through pd_checker_event_count{type="split_scatter_checker", name=...}, including dispatch-disabled, dispatch-schedule-limit, dispatch-region-missing, dispatch-schedule-disabled, dispatch-not-fully-replicated, dispatch-scatter-failed, dispatch-store-limit, and dispatch-add-operator-failed. The skipDispatchUntil path is only the retry backoff after one of those reasons has already been recorded, so I do not add another metric for it here.
Signed-off-by: lhy1024 <[email protected]>
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bufferflies, liyishuai, rleungx The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[LGTM Timeline notifier]Timeline:
|
|
/retest |
1 similar comment
|
/retest |
|
/retest |
3 similar comments
|
/retest |
|
/retest |
|
/retest |
ref #10592\n\nBackport the split-scatter pending follow-up fixes from the release-8.5 cherry-pick back to master. Reset the process-global pending gauge when creating a split-scatter controller, avoid recording or retaining pending split-scatter entries while split-scatter scheduling is disabled, add dispatch backoff for blocked pending work, and count missing pending regions when delaying retries. Keep the master test style by using prometheus/testutil for metric assertions.\n\nSigned-off-by: lhy1024 <[email protected]>\nSigned-off-by: lhy1024 <[email protected]> (cherry picked from commit c2665cc) Signed-off-by: lhy1024 <[email protected]>
What problem does this PR solve?
Issue Number: ref #10592 pick from #10678
What is changed and how does it work?
Check List
Tests
Release note
Summary by CodeRabbit
New Features
Bug Fixes
Tests