server,mcs: pass split reason to scheduling service#10652

Merged
ti-chi-bot[bot] merged 3 commits into
tikv:masterfrom
lhy1024:mcs-affinity-split-reason
May 12, 2026

Conversation

@lhy1024
Member

@lhy1024 lhy1024 commented May 9, 2026

What problem does this PR solve?

Issue Number: ref #10592, ref #9764

When the scheduling service runs independently, AskBatchSplit needs the split reason to keep MCS behavior consistent with the local PD path. Without it, affinity split checks and load-based split-scatter handling cannot work correctly in the MCS path.

This PR intentionally carries the minimal split-reason plumbing that overlaps PR 10648 so the affinity fix can be validated independently. PR 10648 remains the owner of the broader split-scatter keyspace and range changes; rebasing either PR after the other merges should drop or preserve the overlapping plumbing without removing this affinity guard.

What is changed and how does it work?

  • Forward the split reason when PD sends AskBatchSplit requests to the scheduling service.
  • Use the forwarded reason in the scheduling service AskBatchSplit handler to run affinity AllowSplit checks and reject disallowed auto splits with ChangeSplit{AutoSplitEnabled:false}.
  • Record load-based split-scatter batches in the scheduling service for SplitReason_LOAD, matching the local PD HandleAskBatchSplit path.
  • Update github.com/pingcap/kvproto to the upstream version that includes schedulingpb.AskBatchSplitRequest.reason.
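The forwarding step can be pictured with a minimal sketch. It does not use the real kvproto types: `SplitReason`, `pdAskBatchSplit`, `schedulingAskBatchSplit`, and `newSchedulingRequest` are hypothetical stand-ins invented for illustration, and the enum values are not copied from kvproto. The only point is that the reason has to be copied explicitly, otherwise the scheduling service sees the zero value.

```go
package main

import "fmt"

// SplitReason is a local stand-in for the split-reason enum.
// Names and values are illustrative, not taken from kvproto.
type SplitReason int

const (
	ReasonAdmin SplitReason = iota
	ReasonSize
	ReasonLoad
)

// pdAskBatchSplit models the request PD receives (hypothetical shape).
type pdAskBatchSplit struct {
	RegionID   uint64
	SplitCount uint32
	Reason     SplitReason
}

// schedulingAskBatchSplit models the request PD forwards to the
// scheduling service (hypothetical shape).
type schedulingAskBatchSplit struct {
	RegionID   uint64
	SplitCount uint32
	Reason     SplitReason
}

// newSchedulingRequest copies every field, including Reason. If Reason
// were left out, the forwarded request would carry the zero value and
// the receiver could not tell the split kinds apart.
func newSchedulingRequest(in *pdAskBatchSplit) *schedulingAskBatchSplit {
	return &schedulingAskBatchSplit{
		RegionID:   in.RegionID,
		SplitCount: in.SplitCount,
		Reason:     in.Reason,
	}
}

func main() {
	out := newSchedulingRequest(&pdAskBatchSplit{RegionID: 7, SplitCount: 2, Reason: ReasonLoad})
	fmt.Println(out.Reason == ReasonLoad)
}
```

A test in the spirit of TestNewSchedulingAskBatchSplitRequestPreservesReason would assert exactly this field-by-field equality.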

Check List

Tests

  • Unit test
  • Manual test

Manual test commands:

GOFLAGS='-mod=readonly -buildvcs=false' GO111MODULE=on go test ./pkg/mcs/scheduling/server -count=1
GOFLAGS='-mod=readonly -buildvcs=false' GO111MODULE=on go test ./pkg/mcs/scheduling/server -run TestAskBatchSplitRejectsAffinityAutoSplit -count=20
GOFLAGS='-mod=readonly -buildvcs=false' GO111MODULE=on go test ./pkg/schedule/affinity -count=1
GOFLAGS='-mod=readonly -buildvcs=false' GO111MODULE=on go test ./server -run 'TestNewSchedulingAskBatchSplitRequestPreservesReason|TestConvertSchedulingHeaderPreservesError' -count=1
GOFLAGS='-mod=readonly -buildvcs=false' GO111MODULE=on go test ./server/cluster -run TestHandleAskBatchSplit -count=1

Release note

Fix the scheduling service AskBatchSplit path to respect split reasons for affinity split checks and load-based split scatter.

@ti-chi-bot ti-chi-bot Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/needs-triage-completed needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 9, 2026
@coderabbitai

coderabbitai Bot commented May 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds an ID guard in region validation, forwards split Reason to the scheduler, enforces affinity constraints (denying splits via heartbeat ChangeSplit), records LOAD split-scatter batches, refactors scheduling→PD header/error mapping with tests, and updates kvproto pinned versions across modules.

Changes

Affinity-Constrained Splits

| Layer / File(s) | Summary |
| --- | --- |
| Region ID guard: pkg/core/region.go, tests/server/cluster/cluster_work_test.go | Validate requested region ID matches tree region; test added for "region id mismatch". |
| Request Forwarding: server/grpc_service.go | Server sets Reason when forwarding AskBatchSplit to the scheduling service. |
| Header / Error Mapping: server/grpc_service.go, server/grpc_service_test.go | convertHeader refactored to map scheduling errors to pdpb.ErrorType; tests added verifying mapping and message preservation. |
| Affinity Split Handling: pkg/mcs/scheduling/server/grpc_service.go | Scheduling service checks affinity constraints; denies disallowed splits by emitting RegionHeartbeatResponse.ChangeSplit.AutoSplitEnabled=false and records split-scatter batches for LOAD-reason splits. |
| Scheduling Tests & Helpers: pkg/mcs/scheduling/server/grpc_service_test.go | Adds heartbeat stream capture, wait helper, test region factory, scheduling setup, and TestAskBatchSplitRejectsAffinityAutoSplit. |
| Dependency Version Pins: go.mod, client/go.mod, tests/integrations/go.mod, tools/go.mod | Re-pin github.com/pingcap/kvproto to a newer pseudo-version in module files (no replace directives added). |

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant PDServer as PD Server
  participant Scheduler as SchedulingService
  participant AffMgr as AffinityManager
  participant HBStream as HeartbeatStream

  Client->>PDServer: AskBatchSplit (includes Reason)
  PDServer->>Scheduler: Forward AskBatchSplitRequest (sets Reason)
  Scheduler->>AffMgr: CheckSplitAllowed(region)
  AffMgr-->>Scheduler: Deny / Allow
  alt Deny
    Scheduler->>HBStream: Send ChangeSplit (AutoSplitEnabled=false)
    Scheduler-->>PDServer: Return UNKNOWN error (no split IDs)
    PDServer-->>Client: Return UNKNOWN error
  else Allowed and Reason==LOAD
    Scheduler->>Scheduler: RecordSplitScatterBatch(originalRegionID, epoch+1, newRegionIDs)
    Scheduler-->>PDServer: Return split IDs
    PDServer-->>Client: Return split IDs
  end
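
The two branches of the diagram can be sketched in Go as follows. This is a simplified model under stated assumptions, not the scheduling service code: `affinityChecker`, `handler`, `pinnedRegions`, `newHandler`, and `errSplitDenied` are hypothetical names, the policy in `pinnedRegions` is a toy, and the recorded batch omits the epoch that the diagram mentions.

```go
package main

import (
	"errors"
	"fmt"
)

// SplitReason is a local stand-in for the split-reason enum (values are illustrative).
type SplitReason int

const (
	ReasonAdmin SplitReason = iota
	ReasonSize
	ReasonLoad
)

// affinityChecker is a hypothetical interface standing in for the affinity manager.
type affinityChecker interface {
	AllowSplit(regionID uint64, reason SplitReason) bool
}

// changeSplit imitates the ChangeSplit{AutoSplitEnabled} heartbeat payload.
type changeSplit struct{ AutoSplitEnabled bool }

var errSplitDenied = errors.New("split denied by affinity")

// handler records its side effects in memory so they can be inspected.
type handler struct {
	affinity   affinityChecker
	nextID     uint64
	heartbeats []changeSplit       // messages pushed to the heartbeat stream
	scatter    map[uint64][]uint64 // source region ID -> new region IDs
}

func (h *handler) askBatchSplit(regionID uint64, count int, reason SplitReason) ([]uint64, error) {
	// Deny branch: tell the store to stop auto split and return no IDs.
	if !h.affinity.AllowSplit(regionID, reason) {
		h.heartbeats = append(h.heartbeats, changeSplit{AutoSplitEnabled: false})
		return nil, errSplitDenied
	}
	ids := make([]uint64, 0, count)
	for i := 0; i < count; i++ {
		h.nextID++
		ids = append(ids, h.nextID)
	}
	// Only load-based splits are recorded for later scattering.
	if reason == ReasonLoad {
		h.scatter[regionID] = ids
	}
	return ids, nil
}

// pinnedRegions is a toy policy: pinned regions accept only admin splits.
type pinnedRegions map[uint64]bool

func (p pinnedRegions) AllowSplit(regionID uint64, reason SplitReason) bool {
	return reason == ReasonAdmin || !p[regionID]
}

func newHandler(pinned pinnedRegions) *handler {
	return &handler{affinity: pinned, nextID: 100, scatter: map[uint64][]uint64{}}
}

func main() {
	h := newHandler(pinnedRegions{1: true})
	_, err := h.askBatchSplit(1, 2, ReasonSize)
	fmt.Println(err, h.heartbeats)
	ids, _ := h.askBatchSplit(2, 2, ReasonLoad)
	fmt.Println(ids, h.scatter)
}
```

Keeping the affinity decision behind an interface makes the deny path easy to exercise in a unit test without a running cluster.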

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • tikv/pd#10621: Related changes around propagating SplitReason and recording split-scatter batches for load-based split scatter flows.

Suggested reviewers

  • rleungx
  • okJiang

🐰 I hopped through modules, pins in paw,
I nudged split reasons into law,
When affinity says "hold tight,"
Heartbeats hush until it's right,
Splits wait patient for the light.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'server,mcs: pass split reason to scheduling service' is concise, clear, and directly describes the main change: forwarding the split reason to the scheduling service for proper affinity and load-based split handling. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description follows the template with all required sections properly completed, including problem statement with issue references, detailed explanation of changes and commit message, comprehensive checklist with unit and manual tests listed, and release notes. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/mcs/scheduling/server/grpc_service_test.go`:
- Around line 112-116: The test currently checks
heartbeatResp.GetChangeSplit().GetAutoSplitEnabled() without first asserting
ChangeSplit exists; update the assertion in grpc_service_test.go to first verify
heartbeatResp.GetChangeSplit() is non-nil (or use heartbeatResp.ChangeSplit !=
nil via the generated struct) and then assert GetAutoSplitEnabled() is false;
locate the usage in the test around RegionHeartbeatResponse handling (symbols:
RegionHeartbeatResponse, GetChangeSplit, GetAutoSplitEnabled) and add the nil
check/assert immediately before the AutoSplitEnabled assertion.
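
Why the nil check matters: protobuf-generated Go getters are safe to call on a nil receiver and return the zero value, so `GetChangeSplit().GetAutoSplitEnabled()` yields `false` even when no ChangeSplit was sent. The sketch below imitates that behavior with hand-written types; `changeSplit`, `heartbeatResponse`, and `rejectsAutoSplit` are hypothetical stand-ins, not the generated pdpb code.

```go
package main

import "fmt"

// changeSplit and heartbeatResponse imitate protobuf-generated structs,
// whose getters tolerate nil receivers.
type changeSplit struct{ AutoSplitEnabled bool }

func (c *changeSplit) GetAutoSplitEnabled() bool {
	if c != nil {
		return c.AutoSplitEnabled
	}
	return false
}

type heartbeatResponse struct{ ChangeSplit *changeSplit }

func (r *heartbeatResponse) GetChangeSplit() *changeSplit {
	if r != nil {
		return r.ChangeSplit
	}
	return nil
}

// rejectsAutoSplit reports whether resp explicitly disables auto split.
// It checks for presence first, so a response without ChangeSplit does
// not count as a rejection.
func rejectsAutoSplit(resp *heartbeatResponse) bool {
	cs := resp.GetChangeSplit()
	return cs != nil && !cs.GetAutoSplitEnabled()
}

func main() {
	empty := &heartbeatResponse{}
	// Prints false although ChangeSplit was never set.
	fmt.Println(empty.GetChangeSplit().GetAutoSplitEnabled())
	fmt.Println(rejectsAutoSplit(empty))                                           // false
	fmt.Println(rejectsAutoSplit(&heartbeatResponse{ChangeSplit: &changeSplit{}})) // true
}
```

An assertion on `GetAutoSplitEnabled()` alone would therefore pass for an unrelated heartbeat response; asserting non-nil first closes that gap.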
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 96496609-3e7d-4dc0-904d-6287a904cbb9

📥 Commits

Reviewing files that changed from the base of the PR and between f4813f3 and 509456f4d02eae16090973124aeb3435e060ff5e.

⛔ Files ignored due to path filters (4)
  • client/go.sum is excluded by !**/*.sum
  • go.sum is excluded by !**/*.sum
  • tests/integrations/go.sum is excluded by !**/*.sum
  • tools/go.sum is excluded by !**/*.sum
📒 Files selected for processing (7)
  • client/go.mod
  • go.mod
  • pkg/mcs/scheduling/server/grpc_service.go
  • pkg/mcs/scheduling/server/grpc_service_test.go
  • server/grpc_service.go
  • tests/integrations/go.mod
  • tools/go.mod

Comment thread pkg/mcs/scheduling/server/grpc_service_test.go
@ti-chi-bot ti-chi-bot Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 9, 2026
@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
pkg/mcs/scheduling/server/grpc_service_test.go (1)

67-73: 💤 Low value

Consider adding a comment explaining the drain loop.

The inner loop that drains the channel is correct but subtle. A brief comment would clarify that it's clearing stale messages to ensure a clean state for subsequent test assertions.

📝 Suggested comment
 		select {
 		case <-stream.ch:
+			// Drain any additional stale messages to ensure clean state
 			for {
 				select {
 				case <-stream.ch:
 				default:
 					return
 				}
 			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/mcs/scheduling/server/grpc_service_test.go` around lines 67 - 73, Add a
short inline comment above the drain loop that iterates over stream.ch to
explain its purpose: note that the for-select loop is intentionally non-blocking
(using default) to clear any pending/stale messages from the channel so
subsequent test assertions start from a clean state; reference the channel
variable stream.ch and the non-blocking select pattern so future readers
understand this is a deliberate drain, not an accidental busy-wait or infinite
loop.
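
For readers unfamiliar with the pattern, here is a standalone sketch of a non-blocking drain. The `drain` helper and the `int` element type are illustrative assumptions; the test helper in the PR operates on its own stream type.

```go
package main

import "fmt"

// drain discards every value currently buffered in ch without blocking
// and returns how many values were removed. The default case makes the
// select non-blocking, so the loop ends as soon as the channel is empty.
func drain(ch <-chan int) int {
	n := 0
	for {
		select {
		case _, ok := <-ch:
			if !ok {
				// A closed channel is always ready to receive; stop here
				// to avoid spinning forever.
				return n
			}
			n++
		default:
			return n
		}
	}
}

func main() {
	ch := make(chan int, 4)
	ch <- 1
	ch <- 2
	ch <- 3
	fmt.Println(drain(ch), len(ch)) // prints "3 0"
}
```

The closed-channel check is an addition of this sketch; the loop quoted in the review has no such branch and relies on the channel staying open for the duration of the test.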
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@pkg/mcs/scheduling/server/grpc_service_test.go`:
- Around line 67-73: Add a short inline comment above the drain loop that
iterates over stream.ch to explain its purpose: note that the for-select loop is
intentionally non-blocking (using default) to clear any pending/stale messages
from the channel so subsequent test assertions start from a clean state;
reference the channel variable stream.ch and the non-blocking select pattern so
future readers understand this is a deliberate drain, not an accidental
busy-wait or infinite loop.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 69718e30-1e32-46c1-81da-f1dbcb797474

📥 Commits

Reviewing files that changed from the base of the PR and between 509456f4d02eae16090973124aeb3435e060ff5e and 398a84d0004fd40835dbfd6d8d5ea2da62c33232.

⛔ Files ignored due to path filters (4)
  • client/go.sum is excluded by !**/*.sum
  • go.sum is excluded by !**/*.sum
  • tests/integrations/go.sum is excluded by !**/*.sum
  • tools/go.sum is excluded by !**/*.sum
📒 Files selected for processing (1)
  • pkg/mcs/scheduling/server/grpc_service_test.go

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
server/grpc_service_test.go (1)

33-62: ⚡ Quick win

Add a test case for ALREADY_BOOTSTRAPPED mapping.

convertHeader now maps schedulingpb.ErrorType_ALREADY_BOOTSTRAPPED, but this branch is not covered in the table yet.

Proposed test case addition
 	{
 		name: "not bootstrapped",
 		in:   &schedulingpb.Error{Type: schedulingpb.ErrorType_NOT_BOOTSTRAPPED, Message: "cluster is not initialized"},
 		want: &pdpb.Error{Type: pdpb.ErrorType_NOT_BOOTSTRAPPED, Message: "cluster is not initialized"},
 	},
+	{
+		name: "already bootstrapped",
+		in:   &schedulingpb.Error{Type: schedulingpb.ErrorType_ALREADY_BOOTSTRAPPED, Message: "cluster is already bootstrapped"},
+		want: &pdpb.Error{Type: pdpb.ErrorType_ALREADY_BOOTSTRAPPED, Message: "cluster is already bootstrapped"},
+	},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/grpc_service_test.go` around lines 33 - 62, Add a test case to the
TestConvertSchedulingHeaderPreservesError test table to cover
schedulingpb.ErrorType_ALREADY_BOOTSTRAPPED: update the testCases slice inside
TestConvertSchedulingHeaderPreservesError to include an entry with a name like
"already bootstrapped", with `in` set to &schedulingpb.Error{Type:
schedulingpb.ErrorType_ALREADY_BOOTSTRAPPED, Message: "already bootstrapped"}
and `want` set to &pdpb.Error{Type: pdpb.ErrorType_ALREADY_BOOTSTRAPPED, Message:
"already bootstrapped"} so the convertHeader mapping for ALREADY_BOOTSTRAPPED is
exercised.
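
A table-driven check of such a mapping might look like the sketch below. The enum and struct types are local stand-ins rather than schedulingpb or pdpb, `convertError` is a hypothetical name, and the fallback to an unknown type for unmapped values is an assumption of this sketch, not something the PR states.

```go
package main

import "fmt"

// Local stand-ins for the two error-type enums (illustrative only).
type schedErrType int

const (
	schedOK schedErrType = iota
	schedUnknown
	schedNotBootstrapped
	schedAlreadyBootstrapped
)

type pdErrType int

const (
	pdOK pdErrType = iota
	pdUnknown
	pdNotBootstrapped
	pdAlreadyBootstrapped
)

type schedError struct {
	Type    schedErrType
	Message string
}

type pdError struct {
	Type    pdErrType
	Message string
}

// convertError maps a scheduling error to a PD error and keeps the
// message. Unmapped types fall back to pdUnknown (assumed behavior).
func convertError(in *schedError) *pdError {
	if in == nil {
		return nil
	}
	out := &pdError{Type: pdUnknown, Message: in.Message}
	switch in.Type {
	case schedOK:
		out.Type = pdOK
	case schedNotBootstrapped:
		out.Type = pdNotBootstrapped
	case schedAlreadyBootstrapped:
		out.Type = pdAlreadyBootstrapped
	}
	return out
}

func main() {
	cases := []struct {
		name string
		in   *schedError
		want pdError
	}{
		{"not bootstrapped", &schedError{schedNotBootstrapped, "cluster is not initialized"}, pdError{pdNotBootstrapped, "cluster is not initialized"}},
		{"already bootstrapped", &schedError{schedAlreadyBootstrapped, "already bootstrapped"}, pdError{pdAlreadyBootstrapped, "already bootstrapped"}},
	}
	for _, c := range cases {
		got := convertError(c.in)
		fmt.Println(c.name, *got == c.want)
	}
}
```

Adding one row per enum value keeps every branch of the switch covered, which is the gap the review comment points at.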
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@server/grpc_service_test.go`:
- Around line 33-62: Add a test case to the
TestConvertSchedulingHeaderPreservesError test table to cover
schedulingpb.ErrorType_ALREADY_BOOTSTRAPPED: update the testCases slice inside
TestConvertSchedulingHeaderPreservesError to include an entry with a name like
"already bootstrapped", with `in` set to &schedulingpb.Error{Type:
schedulingpb.ErrorType_ALREADY_BOOTSTRAPPED, Message: "already bootstrapped"}
and `want` set to &pdpb.Error{Type: pdpb.ErrorType_ALREADY_BOOTSTRAPPED, Message:
"already bootstrapped"} so the convertHeader mapping for ALREADY_BOOTSTRAPPED is
exercised.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eb429d38-d9fc-4483-aed3-6d8149a29872

📥 Commits

Reviewing files that changed from the base of the PR and between 398a84d0004fd40835dbfd6d8d5ea2da62c33232 and 9360f1956a8c04df1fc7d6b46e29a0c60fdcaf7e.

📒 Files selected for processing (5)
  • pkg/core/region.go
  • pkg/mcs/scheduling/server/grpc_service_test.go
  • server/grpc_service.go
  • server/grpc_service_test.go
  • tests/server/cluster/cluster_work_test.go

@ti-chi-bot ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 11, 2026
@lhy1024 lhy1024 force-pushed the mcs-affinity-split-reason branch from 37f4f72 to baea05b Compare May 11, 2026 06:13
@codecov

codecov Bot commented May 11, 2026

Codecov Report

❌ Patch coverage is 91.48936% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.06%. Comparing base (c938cb5) to head (02d8521).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10652      +/-   ##
==========================================
- Coverage   79.07%   79.06%   -0.01%     
==========================================
  Files         535      535              
  Lines       72708    72959     +251     
==========================================
+ Hits        57492    57687     +195     
- Misses      11162    11194      +32     
- Partials     4054     4078      +24     
Flag Coverage Δ
unittests 79.06% <91.48%> (-0.01%) ⬇️


@lhy1024 lhy1024 requested review from bufferflies and okJiang May 11, 2026 10:01
@ti-chi-bot ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels May 11, 2026
@ti-chi-bot ti-chi-bot Bot added the lgtm label May 11, 2026
@ti-chi-bot
Contributor

ti-chi-bot Bot commented May 11, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bufferflies, HunDunDM

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [HunDunDM,bufferflies]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label May 11, 2026
@ti-chi-bot
Contributor

ti-chi-bot Bot commented May 11, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-05-11 11:13:02.598583944 +0000 UTC m=+91351.131363263: ☑️ agreed by HunDunDM.
  • 2026-05-11 12:26:45.732316696 +0000 UTC m=+95774.265096015: ☑️ agreed by bufferflies.

@lhy1024
Member Author

lhy1024 commented May 12, 2026

/retest

@lhy1024 lhy1024 force-pushed the mcs-affinity-split-reason branch from 60d426a to 02d8521 Compare May 12, 2026 04:14
@lhy1024
Member Author

lhy1024 commented May 12, 2026

/retest

@ti-chi-bot ti-chi-bot Bot merged commit 428acf2 into tikv:master May 12, 2026
32 checks passed
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #10663.
But this PR has conflicts, please resolve them!

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

4 participants