checker, server: support split-scatter for MCS keyspaces#10648
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR forwards split reasons via gRPC metadata from PD to the scheduling service, the scheduler extracts the reason and records LOAD-driven split-scatter batches (collecting new region IDs), and split-scatter range hint resolution and scatter-group selection are extended to be txn-keyspace-aware; tests and keyspace utilities were added/updated. ChangesSplit-Scatter Metadata Forwarding and Load-Based Recording with Keyspace Support
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
pkg/mcs/scheduling/server/grpc_service.go (1)
353-365: 💤 Low valueConsider validating the parsed split reason enum value.
While the current implementation correctly defaults to
ADMINfor missing or malformed metadata, it doesn't validate that the parsed integer corresponds to a validpdpb.SplitReasonenum value. For example, metadata containing"999"would parse successfully but yield an invalid enum.Since this is an internal PD-to-scheduling-service contract (not a public API), the risk is low. However, adding a simple range check would make the code more defensive:
func splitReasonFromContext(ctx context.Context) pdpb.SplitReason { values := metadata.ValueFromIncomingContext(ctx, grpcutil.SplitReasonMetadataKey) if len(values) == 0 { return pdpb.SplitReason_ADMIN } reason, err := strconv.ParseInt(values[len(values)-1], 10, 32) if err != nil || reason < 0 || reason > int64(pdpb.SplitReason_LOAD) { return pdpb.SplitReason_ADMIN } return pdpb.SplitReason(reason) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/mcs/scheduling/server/grpc_service.go` around lines 353 - 365, splitReasonFromContext currently parses an int from metadata but doesn't verify it maps to a valid pdpb.SplitReason; update splitReasonFromContext to parse the last metadata value (grpcutil.SplitReasonMetadataKey) as before, but after strconv.ParseInt also check the parsed value is within the valid enum range (e.g. >= 0 && <= int64(pdpb.SplitReason_LOAD)) and if not (or on parse error) return pdpb.SplitReason_ADMIN; keep the existing fallback behavior for missing metadata.tests/server/split_scatter_forward_test.go (1)
104-106: 💤 Low valueConsider logging the gRPC server error for test debuggability.
While discarding the
Serveerror is acceptable for test code (sincegrpcServer.Stop()will cleanly terminate it), logging the error would help diagnose failures if the server fails to start:go func() { if err := grpcServer.Serve(listener); err != nil { t.Logf("grpcServer.Serve returned: %v", err) } }()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/server/split_scatter_forward_test.go` around lines 104 - 106, Update the anonymous goroutine that starts the gRPC server to log any Serve error instead of discarding it: call grpcServer.Serve(listener) and if it returns a non-nil error, use the test logger (t.Logf) to report the error (e.g., "grpcServer.Serve returned: %v") so test failures to start the server are visible; modify the goroutine around grpcServer.Serve and listener to capture and log the returned error via t.Logf.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/schedule/checker/split_scatter_group.go`:
- Around line 85-97: The branch in the decoder currently treats any decoded key
starting with 'x' as txn-keyspace and strips the first
splitScatterKeyspacePrefixLen bytes, which can mis-classify raw keys; modify the
logic in the function that returns splitScatterDecodedKey so that before
constructing keyspacePrefix/keyspaceID you fully validate the keyspace encoding
using the canonical parser (e.g., call the same parser used for TiDB txn
keyspace validation) and only strip the prefix when that parser confirms a valid
txn keyspace encoding; otherwise return splitScatterDecodedKey{rawKey: decoded}
unchanged. Also add a regression test targeting the decoder that feeds an
'x'-prefixed raw key (e.g., starts with x\x00\x00\x00t...) and asserts it
remains a rawKey, ensuring splitScatterKeyspacePrefixLen-based heuristics are
not used alone.
---
Nitpick comments:
In `@pkg/mcs/scheduling/server/grpc_service.go`:
- Around line 353-365: splitReasonFromContext currently parses an int from
metadata but doesn't verify it maps to a valid pdpb.SplitReason; update
splitReasonFromContext to parse the last metadata value
(grpcutil.SplitReasonMetadataKey) as before, but after strconv.ParseInt also
check the parsed value is within the valid enum range (e.g. >= 0 && <=
int64(pdpb.SplitReason_LOAD)) and if not (or on parse error) return
pdpb.SplitReason_ADMIN; keep the existing fallback behavior for missing
metadata.
In `@tests/server/split_scatter_forward_test.go`:
- Around line 104-106: Update the anonymous goroutine that starts the gRPC
server to log any Serve error instead of discarding it: call
grpcServer.Serve(listener) and if it returns a non-nil error, use the test
logger (t.Logf) to report the error (e.g., "grpcServer.Serve returned: %v") so
test failures to start the server are visible; modify the goroutine around
grpcServer.Serve and listener to capture and log the returned error via t.Logf.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 932111be-84de-4105-b36a-bf16bf0091e7
📥 Commits
Reviewing files that changed from the base of the PR and between 38a0f9a and fb3f55d9ca32620427e5065b0ad656354a2e8e0f.
📒 Files selected for processing (7)
pkg/mcs/scheduling/server/grpc_service.gopkg/mcs/scheduling/server/split_scatter_test.gopkg/schedule/checker/split_scatter_group.gopkg/schedule/checker/split_scatter_test.gopkg/utils/grpcutil/grpcutil.goserver/grpc_service.gotests/server/split_scatter_forward_test.go
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #10648 +/- ##
========================================
Coverage 79.06% 79.06%
========================================
Files 535 536 +1
Lines 73065 73231 +166
========================================
+ Hits 57767 57901 +134
- Misses 11211 11231 +20
- Partials 4087 4099 +12
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
pkg/schedule/checker/split_scatter_test.go (1)
519-525: 💤 Low valueNit: prefer the existing
splitScatterTestTableIDconstant.
makeSplitScatterTableGroupForKeyspace(splitScatterTestKeyspaceID, 42, true)and the matchingcodec.GenerateTableKey(42)re-encode the literal42, which is already exposed assplitScatterTestTableID. Using the constant keeps the test in lockstep if the test table ID is ever changed.Proposed change
- wantRange := splitScatterKeyspacePrefixRange(splitScatterTestKeyspaceID, codec.GenerateTableKey(42)) - wantRange.scatterGroup = makeSplitScatterTableGroupForKeyspace(splitScatterTestKeyspaceID, 42, true) + wantRange := splitScatterKeyspacePrefixRange(splitScatterTestKeyspaceID, codec.GenerateTableKey(splitScatterTestTableID)) + wantRange.scatterGroup = makeSplitScatterTableGroupForKeyspace(splitScatterTestKeyspaceID, splitScatterTestTableID, true)Same applies to the analogous literals at lines 439-440.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/schedule/checker/split_scatter_test.go` around lines 519 - 525, Replace the hard-coded table ID literal 42 with the existing splitScatterTestTableID constant wherever used in this test; specifically update the codec.GenerateTableKey(42) call and the makeSplitScatterTableGroupForKeyspace(splitScatterTestKeyspaceID, 42, true) call (both used when calling splitScatterKeyspacePrefixRange and when building wantRange.scatterGroup) to use splitScatterTestTableID instead, and do the same for the analogous occurrences earlier in the file (the other codec.GenerateTableKey and makeSplitScatterTableGroupForKeyspace usages).pkg/mcs/scheduling/server/split_scatter_test.go (1)
51-59: 💤 Low value
AllocIDstub ignores the requestedCount.The fake hardcodes
count := uint32(1)regardless ofreq.GetCount(), so when production code requests N IDs in one call (the standard PDAskBatchSplitpath allocates a region ID plus peer IDs per split, often via a singleAllocIDwithcount > 1), the stub silently returns only one ID and a mismatchedCount. IfAskBatchSplitever switches from one-by-one allocations to batched allocations, these tests would either start failing in obscure ways or — worse — pass while exercising a non-realistic allocation path.Proposed change
func (c *splitScatterPDClient) AllocID(context.Context, *pdpb.AllocIDRequest, ...grpc.CallOption) (*pdpb.AllocIDResponse, error) { - count := uint32(1) - c.next += uint64(count) + count := req.GetCount() + if count == 0 { + count = 1 + } + c.next += uint64(count) return &pdpb.AllocIDResponse{ Header: &pdpb.ResponseHeader{}, Id: c.next - 1, Count: count, }, nil }(also requires naming the request parameter)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/mcs/scheduling/server/split_scatter_test.go` around lines 51 - 59, The AllocID stub in splitScatterPDClient should respect the requested count instead of hardcoding 1: rename the unnamed request parameter to req (*pdpb.AllocIDRequest), read count := req.GetCount(), set count = 1 if zero, then advance c.next by uint64(count) and return the response using that count (keep the existing pattern of returning Id: c.next-1 and Count: count); update the function signature AllocID and the body to use req.GetCount() accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/mcs/scheduling/server/split_scatter_test.go`:
- Around line 138-147: mcsSplitScatterPendingIDs is reading
Controller.splitScatter.pending via reflection without holding
splitScatter.pendingMu; fix by honoring the lock: add a test-accessor method on
Controller (e.g., Controller.SplitScatterPendingRegionIDs or
Controller.SplitScatterPendingIDs) that acquires splitScatter.pendingMu, copies
the map keys to a []uint64 and returns them, then update
mcsSplitScatterPendingIDs to call that accessor (or alternatively, if you prefer
reflection, use reflection to obtain the splitScatter.pendingMu and Lock/Unlock
it around reading pending); reference symbols: mcsSplitScatterPendingIDs,
Controller, splitScatter, pending, pendingMu.
---
Nitpick comments:
In `@pkg/mcs/scheduling/server/split_scatter_test.go`:
- Around line 51-59: The AllocID stub in splitScatterPDClient should respect the
requested count instead of hardcoding 1: rename the unnamed request parameter to
req (*pdpb.AllocIDRequest), read count := req.GetCount(), set count = 1 if zero,
then advance c.next by uint64(count) and return the response using that count
(keep the existing pattern of returning Id: c.next-1 and Count: count); update
the function signature AllocID and the body to use req.GetCount() accordingly.
In `@pkg/schedule/checker/split_scatter_test.go`:
- Around line 519-525: Replace the hard-coded table ID literal 42 with the
existing splitScatterTestTableID constant wherever used in this test;
specifically update the codec.GenerateTableKey(42) call and the
makeSplitScatterTableGroupForKeyspace(splitScatterTestKeyspaceID, 42, true) call
(both used when calling splitScatterKeyspacePrefixRange and when building
wantRange.scatterGroup) to use splitScatterTestTableID instead, and do the same
for the analogous occurrences earlier in the file (the other
codec.GenerateTableKey and makeSplitScatterTableGroupForKeyspace usages).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2ce03ca2-1e93-4b5f-a822-3922629f3b8e
📥 Commits
Reviewing files that changed from the base of the PR and between fb3f55d9ca32620427e5065b0ad656354a2e8e0f and cb9e5c4c4183fd9f8d894fbce21585858461cf81.
📒 Files selected for processing (5)
pkg/mcs/scheduling/server/main_test.gopkg/mcs/scheduling/server/split_scatter_test.gopkg/schedule/checker/split_scatter.gopkg/schedule/checker/split_scatter_group.gopkg/schedule/checker/split_scatter_test.go
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/keyspace/util_test.go (1)
64-81: ⚡ Quick winAdd a partial-carry case for
next prefixincrement behavior.Current coverage has no-carry and full-carry, but misses intermediate carry propagation (e.g.
0x0102ff -> 0x010300), which is core to the new boundary logic.Proposed test extension
func TestMakeRegionBound(t *testing.T) { re := require.New(t) encodeKey := func(key []byte) []byte { return []byte(codec.EncodeBytes(key)) } regionBound := MakeRegionBound(0x010203) re.Equal(encodeKey([]byte{'r', 0x01, 0x02, 0x03}), regionBound.RawLeftBound) re.Equal(encodeKey([]byte{'r', 0x01, 0x02, 0x04}), regionBound.RawRightBound) re.Equal(encodeKey([]byte{'x', 0x01, 0x02, 0x03}), regionBound.TxnLeftBound) re.Equal(encodeKey([]byte{'x', 0x01, 0x02, 0x04}), regionBound.TxnRightBound) + carryRegionBound := MakeRegionBound(0x0102ff) + re.Equal(encodeKey([]byte{'r', 0x01, 0x02, 0xff}), carryRegionBound.RawLeftBound) + re.Equal(encodeKey([]byte{'r', 0x01, 0x03, 0x00}), carryRegionBound.RawRightBound) + re.Equal(encodeKey([]byte{'x', 0x01, 0x02, 0xff}), carryRegionBound.TxnLeftBound) + re.Equal(encodeKey([]byte{'x', 0x01, 0x03, 0x00}), carryRegionBound.TxnRightBound) + maxRegionBound := MakeRegionBound(constant.MaxValidKeyspaceID) re.Equal(encodeKey([]byte{'r', 0xff, 0xff, 0xff}), maxRegionBound.RawLeftBound) re.Equal(encodeKey([]byte{'s', 0x00, 0x00, 0x00}), maxRegionBound.RawRightBound) re.Equal(encodeKey([]byte{'x', 0xff, 0xff, 0xff}), maxRegionBound.TxnLeftBound) re.Equal(encodeKey([]byte{'y', 0x00, 0x00, 0x00}), maxRegionBound.TxnRightBound) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/keyspace/util_test.go` around lines 64 - 81, Add a test case in TestMakeRegionBound for a partial-carry next-prefix scenario (e.g. id 0x0102ff) to cover intermediate carry propagation: call MakeRegionBound(0x0102ff) and assert that regionBound.RawLeftBound equals encodeKey([]byte{'r', 0x01, 0x02, 0xff}), regionBound.RawRightBound equals encodeKey([]byte{'r', 0x01, 0x03, 0x00}), regionBound.TxnLeftBound equals encodeKey([]byte{'x', 0x01, 0x02, 0xff}), and regionBound.TxnRightBound equals encodeKey([]byte{'x', 0x01, 0x03, 0x00}) so the next-prefix increment logic in MakeRegionBound is validated for partial carry.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/server/split_scatter_forward_test.go`:
- Around line 81-94: The test currently blocks forever on reading from
fakeScheduling.reasons (in the loop that calls AskBatchSplit) if AskBatchSplit
stops forwarding; change the assertion to use a bounded receive so the test
fails fast: after calling grpcServer.AskBatchSplit, replace the direct receive
from fakeScheduling.reasons with a select that either reads the expected value
(compare to strconv.FormatInt(int64(reason), 10)) or times out after a short
duration (e.g., a few hundred milliseconds) and triggers a test failure;
reference the AskBatchSplit call and the fakeScheduling.reasons channel when
making the change.
---
Nitpick comments:
In `@pkg/keyspace/util_test.go`:
- Around line 64-81: Add a test case in TestMakeRegionBound for a partial-carry
next-prefix scenario (e.g. id 0x0102ff) to cover intermediate carry propagation:
call MakeRegionBound(0x0102ff) and assert that regionBound.RawLeftBound equals
encodeKey([]byte{'r', 0x01, 0x02, 0xff}), regionBound.RawRightBound equals
encodeKey([]byte{'r', 0x01, 0x03, 0x00}), regionBound.TxnLeftBound equals
encodeKey([]byte{'x', 0x01, 0x02, 0xff}), and regionBound.TxnRightBound equals
encodeKey([]byte{'x', 0x01, 0x03, 0x00}) so the next-prefix increment logic in
MakeRegionBound is validated for partial carry.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d1cc3420-05dd-4fe1-8ec7-03d23c32ddee
📥 Commits
Reviewing files that changed from the base of the PR and between db47cd786b2106ac8582eb59f15ba74701cee49c and c07710f3834c33e932532834bd102a5f6d40476b.
📒 Files selected for processing (5)
pkg/keyspace/util.gopkg/keyspace/util_test.gopkg/mcs/scheduling/server/grpc_service.gopkg/mcs/scheduling/server/split_scatter_test.gotests/server/split_scatter_forward_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/mcs/scheduling/server/grpc_service.go
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tools/pd-ctl/pdctl/command/keyspace_command_test.go (1)
57-61: ⚡ Quick winAvoid self-referential expected values in this test.
On Line 57, expected bounds are derived from
keyspace.MakeRegionBound, which is closely related to the logic under test. This weakens regression detection if both implementations drift together. Please keep at least one independently hardcoded/golden boundary case (for one keyspace ID) in addition to these helper-based checks.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/pd-ctl/pdctl/command/keyspace_command_test.go` around lines 57 - 61, The test currently asserts expected bounds by calling keyspace.MakeRegionBound and comparing its outputs (regionBound.RawLeftBound/RawRightBound and regionBound.TxnLeftBound/TxnRightBound), which is self-referential; change it so at least one of the expected values is a hardcoded/golden hex string for a concrete keyspace ID (e.g., pick a specific keyspaceID used in the test), and keep the other assertions using keyspace.MakeRegionBound for coverage. Update the assertion that compares hex.EncodeToString(regionBound.RawLeftBound) (or any one of RawRightBound/TxnLeftBound/TxnRightBound) to use the independent hardcoded expected hex value instead of deriving it from keyspace.MakeRegionBound. Ensure the golden value matches the canonical encoding for that chosen keyspaceID.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tools/pd-ctl/pdctl/command/keyspace_command_test.go`:
- Around line 57-61: The test currently asserts expected bounds by calling
keyspace.MakeRegionBound and comparing its outputs
(regionBound.RawLeftBound/RawRightBound and
regionBound.TxnLeftBound/TxnRightBound), which is self-referential; change it so
at least one of the expected values is a hardcoded/golden hex string for a
concrete keyspace ID (e.g., pick a specific keyspaceID used in the test), and
keep the other assertions using keyspace.MakeRegionBound for coverage. Update
the assertion that compares hex.EncodeToString(regionBound.RawLeftBound) (or any
one of RawRightBound/TxnLeftBound/TxnRightBound) to use the independent
hardcoded expected hex value instead of deriving it from
keyspace.MakeRegionBound. Ensure the golden value matches the canonical encoding
for that chosen keyspaceID.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 087b29f0-5446-4a7c-8635-d868d70dae9e
📥 Commits
Reviewing files that changed from the base of the PR and between c07710f3834c33e932532834bd102a5f6d40476b and cd7966f83949eade8f41034d7f7d104bb341397e.
📒 Files selected for processing (1)
tools/pd-ctl/pdctl/command/keyspace_command_test.go
fba1da3 to
cf12a91
Compare
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: rleungx The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[LGTM Timeline notifier]Timeline:
|
| StartKey: startKey, | ||
| EndKey: endKey, | ||
| }, nil) | ||
| re.Equal(splitScatterRangeHint{}, resolveSplitScatterRangeHint(region)) |
There was a problem hiding this comment.
This test is the only remaining usage of resolveSplitScatterRangeHint. Maybe worth removing the function?
Signed-off-by: lhy1024 <admin@liudos.us>
|
@lhy1024: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@bufferflies @okJiang PTAL |
What problem does this PR solve?
Issue Number: ref #10592
This PR follows up #10621 and #10652. The max-keyspace region-bound fix has been split out and merged by #10665, so this PR now only keeps the remaining keyspace-aware split-scatter changes.
What is changed and how does it work?
Scope in this PR:
keyspace.ParseKeyspacePrefixso split-scatter can decode raw keyspace prefixes through the shared keyspace utility instead of duplicating prefix parsing logic.Check List
Tests
Environment
e3ab037ffe1f5b3dbf06691658c366da307b9a99,Kernel Type: Next Generationf469ce15a327c9b77efe0f2cfab920b65142a6aakeyspace_bid1,keyspace_aid2,SYSTEMid16777214[]after removing balance and evict schedulersResult
keyspace_ahot index regions:1 -> 52Check
keyspace_aregion start/end keys use keyspace prefix0x78000002, matchingkeyspace_aid2.keyspace_asplit boundary uses keyspace id1or16777214.keyspace_aboundaries23937800000274800000FF007800000274800000FF0051223987800000274800000FF007800000274800000FF0051224037800000274800000FF007800000274800000FF0051226317800000274800000FF007800000274800000FF0010902623737800000274800000FF007800000274800000FF00109026Release note