mcs/scheduling: clean primary resources on exit by bufferflies · Pull Request #10645 · tikv/pd

bufferflies · 2026-05-07T08:23:21Z

What problem does this PR solve?

Issue Number: Close #10644

What is changed and how does it work?

Bind scheduling primary resources to the primary campaign context so they stop when
the primary exits. Add cleanup for heartbeat streams, watchers, and in-memory
cluster references, and roll back partially initialized resources on startup
failure.

Check List

Tests

Unit test

Release note

Fix stale in-memory scheduling state retention after scheduling primary exits.

Summary by CodeRabbit

Bug Fixes
- Prevented stale suspect ranges from persisting after watcher shutdown.
- Heartbeat endpoints now return "not bootstrapped" when required stream/monitoring components or cluster state are unavailable.
Improvements
- Refined cluster start/stop flow with explicit, idempotent cleanup to avoid partial state and ensure watchers/streams are closed.
- API handlers now validate cluster readiness and return an error when uninitialized.
- Affinity manager starts availability checks after initialization.
Tests
- Added tests validating cleanup idempotency and safe heartbeat behavior during teardown.

Signed-off-by: tongjian <1045931706@qq.com>

ti-chi-bot · 2026-05-07T08:23:28Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign binshi-bing for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-05-07T08:23:34Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

This PR refactors the scheduling server's cluster lifecycle: context-driven startCluster, explicit watcher/heartbeat initialization and rollback, stopCluster delegating to cleanupClusterResources that closes watchers/hbStreams and clears in-memory cluster/storage, adds API guards and gRPC early-return checks, clears watcher suspect ranges on close, starts affinity availability checks, and adds tests with goleak verification.

Changes

Cluster Lifecycle Cleanup

| Layer / File(s) | Summary |
| --- | --- |
| API basicCluster guard & tests `pkg/mcs/scheduling/server/apis/v1/api.go`, `pkg/mcs/scheduling/server/apis/v1/api_test.go` | Adds `getBasicCluster` helper and `pdcore` alias; v1 handlers early-return 500 when `basicCluster` is missing; adds tests verifying 500 `"not bootstrapped"` response. |
| gRPC heartbeat guards `pkg/mcs/scheduling/server/grpc_service.go` | `RegionHeartbeat` treats missing `hbStreams` like missing cluster; `StoreHeartbeat` treats missing `metaWatcher` like missing cluster; both return the not-bootstrapped response. |
| Watcher suspect range cleanup `pkg/mcs/scheduling/server/rule/watcher.go` | `Watcher.Close()` clears suspect key ranges via `checkerController.ClearSuspectKeyRanges()` when controller is non-nil before cancelling and waiting for goroutines. |
| Cluster Startup (context-driven) `pkg/mcs/scheduling/server/server.go` | `startCluster(ctx)` creates `basicCluster` and in-memory `storage`, starts meta/config watchers and heartbeat streams, constructs `Cluster`, wires schedulers controller, and creates rule/affinity watchers; rolls back partially initialized resources on error. |
| Watcher helpers & stopWatcher `pkg/mcs/scheduling/server/server.go` | `startMetaConfWatcher(ctx, basicCluster, storage)` returns watchers and unwinds metaWatcher on configWatcher failure; `stopWatcher` closes non-nil watchers and nils fields. |
| Cluster shutdown & cleanup `pkg/mcs/scheduling/server/server.go`, `pkg/mcs/scheduling/server/cluster.go` | `stopCluster` stops background jobs and delegates full teardown to `cleanupClusterResources`, which clears the cluster atomic, calls `stopWatcher`, closes/nils `hbStreams`, and nils `basicCluster`/`storage`. `Cluster.StopBackgroundJobs()` now returns `bool`. |
| Cleanup verification & tests `pkg/mcs/scheduling/server/server_test.go` | Adds `TestMain` with goleak verification; `TestCleanupClusterResources` verifies cleanup clears cluster and dependencies; adds tests for heartbeat-not-bootstrapped and RegionHeartbeat panic-safety; includes `mockRegionHeartbeatStream`. |
| Affinity manager init `pkg/schedule/affinity/manager.go` | `Manager.initialize()` starts availability check loop after `loadRegionLabel()` succeeds and logs initialization with group count. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

tikv/pd#10042: Related affinity watcher implementation used by the server's affinity watcher wiring.

Suggested labels

lgtm, approved

Suggested reviewers

lhy1024
okJiang
JmPotato

Poem

🐰 I hopped through code with nimble paws,
Closed streams and watchers without a pause.
Cleared suspect ranges, no ghosts remain,
Tests checked leaks — no panic, no strain.
A tidy burrow, neat and sane.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 29.41% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'mcs/scheduling: clean primary resources on exit' is concise, clear, and accurately summarizes the primary change: cleaning up scheduling resources when the primary exits. |
| Description check | ✅ Passed | The PR description includes the required issue number (Close `#10644`), a commit message explaining the changes, unit tests checkbox, and a release note, meeting the template requirements. |
| Linked Issues check | ✅ Passed | The PR comprehensively addresses all objectives from `#10644`: closes heartbeat streams, stops/releases watchers, clears in-memory cluster/storage references, and rolls back partially initialized resources on startup failure. |
| Out of Scope Changes check | ✅ Passed | All code changes are directly scoped to the linked issue: cleanup of heartbeat streams, watchers, cluster references, and adding unit tests for these operations. No extraneous modifications detected. |

bufferflies · 2026-05-07T08:27:04Z

/check-issue-triage-complete

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/mcs/scheduling/server/server.go`:
- Around line 525-539: NewCluster allocates a Cluster whose NewRegionLabeler
starts a background goroutine (doGC) that listens on the cluster context; when
rule.NewWatcher or affinity.NewWatcher fails the code closes
hbStreams/configWatcher/metaWatcher but never cancels the cluster context so
goroutines leak. Fix by ensuring cluster's context is cancelled on all error
paths: after NewCluster returns, register a short-lived defer or explicit
cleanup to call the cluster cancellation/unconditional shutdown routine (either
call a new public Cluster method that unconditionally cancels the context or
directly invoke the cluster's cancel function) before returning on errors from
rule.NewWatcher and affinity.NewWatcher; update references around NewCluster,
rule.NewWatcher, affinity.NewWatcher, StartBackgroundJobs, StopBackgroundJobs,
and any cluster.cancel/StopBackgroundJobs logic so cleanup runs even if
StartBackgroundJobs was never called.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8eda0be5-dbf6-4c9e-bc86-bd595fded54a

📥 Commits

Reviewing files that changed from the base of the PR and between 410ae3a and 9c92cfd.

📒 Files selected for processing (3)

pkg/mcs/scheduling/server/rule/watcher.go
pkg/mcs/scheduling/server/server.go
pkg/mcs/scheduling/server/server_test.go

Signed-off-by: tongjian <1045931706@qq.com>

rleungx · 2026-05-08T08:44:05Z

-	s.configWatcher.SetSchedulersController(cluster.GetCoordinator().GetSchedulersController())
-	// Start the rule watcher after the cluster is created.
-	s.ruleWatcher, err = rule.NewWatcher(s.Context(), s.GetClient(), s.storage,
+	defer func() {


StopBackgroundJobs returns immediately when StartBackgroundJobs has not run, so this rollback does not cancel the cluster context on partial startup failure. NewCluster already starts context-bound goroutines such as the labeler GC / affinity check loop; please use a cleanup path that always cancels the cluster context.

This still does not fully address @rleungx's comment, and the latest defer also introduces a successful-path cleanup issue.

StopBackgroundJobs() still returns immediately when running=false, so if NewCluster() succeeds but a later watcher creation fails, the cluster context is still not cancelled.

Also, the successful path only sets cluster = nil, while hbStreams / watchers remain non-nil. The defer will close the runtime resources that were just transferred to Cluster before startCluster() returns successfully.

Could we either disarm those local variables after ownership is transferred, or centralize cleanup in a method that always cancels the cluster context?
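
One way to read the "disarm" suggestion is sketched below, under stated assumptions: `resource`, `owner`, and `start` are hypothetical names, not types from this PR. A single `committed` flag turns the deferred rollback into a no-op once ownership has moved to the long-lived owner, so the success path cannot close what it just handed over.

```go
package main

import "fmt"

// resource is a hypothetical closable runtime resource (a watcher or stream).
type resource struct {
	name   string
	closed bool
}

func (r *resource) Close() { r.closed = true }

// owner stands in for the Cluster that takes over the resources.
type owner struct{ resources []*resource }

// start creates resources in order and rolls them back on failure.
// failAt is the index whose creation fails; use -1 for no failure.
// created is returned only so callers can inspect the rollback result.
func start(names []string, failAt int) (o *owner, created []*resource, err error) {
	committed := false
	defer func() {
		if committed {
			return // ownership transferred; rollback is disarmed
		}
		// Close in reverse creation order.
		for i := len(created) - 1; i >= 0; i-- {
			created[i].Close()
		}
	}()
	for i, n := range names {
		if i == failAt {
			return nil, created, fmt.Errorf("create %s failed", n)
		}
		created = append(created, &resource{name: n})
	}
	o = &owner{resources: created}
	committed = true
	return o, created, nil
}

func main() {
	names := []string{"meta", "config", "rule"}

	_, created, err := start(names, 2)
	fmt.Println("failure:", err)
	for _, r := range created {
		fmt.Println(" ", r.name, "closed:", r.closed)
	}

	o, _, err := start(names, -1)
	fmt.Println("success:", err)
	for _, r := range o.resources {
		fmt.Println(" ", r.name, "closed:", r.closed)
	}
}
```

Keying the defer on one flag, rather than setting each local to nil, avoids the situation described above where `cluster = nil` disarms only part of the cleanup.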

coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/mcs/scheduling/server/apis/v1/api_test.go`:
- Around line 29-42: The test
TestGetAllStoresReturnsNotBootstrappedWhenBasicClusterMissing is failing because
getBasicCluster currently returns errs.ErrNotLeader.Error() instead of an error
that produces "not bootstrapped" in getAllStores' response; update the
getBasicCluster function so it returns the appropriate not-bootstrapped error
(or wraps/returns the actual error value that contains "not bootstrapped")
instead of returning errs.ErrNotLeader.Error(), ensuring getAllStores sees and
propagates the not-bootstrapped condition so the test assertion on resp.Body
containing "not bootstrapped" passes.

In `@pkg/mcs/scheduling/server/apis/v1/api.go`:
- Around line 1657-1664: The getBasicCluster helper currently returns
errs.ErrNotLeader when svr.GetBasicCluster() is nil; change it to return
errs.ErrNotBootstrapped instead so handlers relying on getBasicCluster (function
getBasicCluster and call site using svr.GetBasicCluster()) produce the correct
"not bootstrapped" response expected by tests and consistent with other
handlers.

In `@pkg/mcs/scheduling/server/grpc_service.go`:
- Around line 246-248: The current nil-check branch in the StoreHeartbeat
handler returns a StoreHeartbeatResponse with notLeaderHeader() which sets
ErrorType_INVALID_VALUE; instead, change the returned header to a
transient/unavailable error so clients will retry: update the nil-branch that
checks s.metaWatcher (in the StoreHeartbeat method referencing s.metaWatcher and
notLeaderHeader()) to return a header using NOT_BOOTSTRAPPED or a new "leader
unavailable" error header (create a helper if needed, e.g.,
notBootstrappedHeader() or notLeaderAvailableHeader()) rather than
ErrorType_INVALID_VALUE so the semantics match initialization/stepdown transient
state.
- Around line 475-480: Change the error type returned by notLeaderHeader() from
schedulingpb.ErrorType_INVALID_VALUE to schedulingpb.ErrorType_UNKNOWN so the
"not leader" operational/state error is categorized correctly; locate the
notLeaderHeader function and update the schedulingpb.Error construction to set
Type to schedulingpb.ErrorType_UNKNOWN while keeping the Message "not leader"
and returning via errorHeader(&schedulingpb.Error{...}).

In `@pkg/mcs/scheduling/server/server_test.go`:
- Around line 88-99: The test
TestStoreHeartbeatReturnsNotBootstrappedWhenMetaWatcherMissing expects
schedulingpb.ErrorType_NOT_BOOTSTRAPPED but the current implementation in
grpc_service.go uses notLeaderHeader() which returns
schedulingpb.ErrorType_INVALID_VALUE; update the test to assert
schedulingpb.ErrorType_INVALID_VALUE for the response header error (or rename
the test to reflect the "not leader/invalid value" behavior) and ensure the
assertion checks resp.GetHeader().GetError().GetType() equals
schedulingpb.ErrorType_INVALID_VALUE after calling Service.StoreHeartbeat.
- Around line 67-86: The test
TestRegionHeartbeatReturnsNotBootstrappedWhenHeartbeatStreamsMissing asserts
NOT_BOOTSTRAPPED but the Service.RegionHeartbeat implementation returns
notLeaderHeader() which uses schedulingpb.ErrorType_INVALID_VALUE; update the
test to expect schedulingpb.ErrorType_INVALID_VALUE (change the re.Equal
assertion) and optionally rename the test to reflect the actual behavior (e.g.,
TestRegionHeartbeatReturnsInvalidValueWhenHeartbeatStreamsMissing) so the test
name matches the returned error; reference Service.RegionHeartbeat and
notLeaderHeader() when making the change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e13d367b-25bc-4c60-ac16-5a6deb1d1876

📥 Commits

Reviewing files that changed from the base of the PR and between 23174e5 and e24a808d7089758d8e97f90f046efa6eb99e8a2a.

📒 Files selected for processing (5)

pkg/mcs/scheduling/server/apis/v1/api.go
pkg/mcs/scheduling/server/apis/v1/api_test.go
pkg/mcs/scheduling/server/grpc_service.go
pkg/mcs/scheduling/server/server.go
pkg/mcs/scheduling/server/server_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/mcs/scheduling/server/server.go

Signed-off-by: tongjian <1045931706@qq.com>

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/mcs/scheduling/server/cluster.go (1)
658-670: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Race condition: use CompareAndSwap for atomic test-and-set.

The current Load-then-Store pattern has a TOCTOU race. Two concurrent calls could both pass the running.Load() check on line 659, then both execute the stop sequence, leading to duplicate coordinator.Stop(), cancel(), and wg.Wait() calls with undefined behavior.
🔒 Proposed fix using CompareAndSwap
 func (c *Cluster) StopBackgroundJobs() bool {
-	if !c.running.Load() {
+	if !c.running.CompareAndSwap(true, false) {
 		return false
 	}
-	c.running.Store(false)
 	c.coordinator.Stop()
 	c.heartbeatRunner.Stop()
 	c.miscRunner.Stop()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/mcs/scheduling/server/cluster.go` around lines 658 - 670, Replace the
non-atomic Load-then-Store in Cluster.StopBackgroundJobs with an atomic
CompareAndSwap so only the goroutine that flips the running flag from true to
false executes the shutdown sequence; specifically, attempt
c.running.CompareAndSwap(true, false) at the start of StopBackgroundJobs and
return false unless the CAS succeeds, and only then call c.coordinator.Stop(),
c.heartbeatRunner.Stop(), c.miscRunner.Stop(), c.logRunner.Stop(), c.cancel(),
and c.wg.Wait().

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/mcs/scheduling/server/apis/v1/api_test.go`:
- Around line 35-48: Rename the test function
TestStoreHeartbeatReturnsNotLeaderWhenMetaWatcherMissing to a name that reflects
it's testing the HTTP handler for listing stores (e.g.,
TestGetAllStoresReturnsNotBootstrapped) and update any references; the body
exercises getAllStores using gin test context and sets
multiservicesapi.ServiceContextKey with &scheserver.Server{}, so ensure the new
test name mentions getAllStores (or "GetAllStores") to avoid confusion with the
gRPC StoreHeartbeat method.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dc3fd4ed-4660-4c49-a883-4f135390b24c

📥 Commits

Reviewing files that changed from the base of the PR and between e7bea36 and faa8c7108303db0a0f271fa4d3f026e2823a8329.

📒 Files selected for processing (7)

pkg/mcs/scheduling/server/apis/v1/api.go
pkg/mcs/scheduling/server/apis/v1/api_test.go
pkg/mcs/scheduling/server/cluster.go
pkg/mcs/scheduling/server/grpc_service.go
pkg/mcs/scheduling/server/server.go
pkg/mcs/scheduling/server/server_test.go
pkg/schedule/affinity/manager.go

💤 Files with no reviewable changes (1)

pkg/schedule/affinity/manager.go

🚧 Files skipped from review as they are similar to previous changes (2)

pkg/mcs/scheduling/server/apis/v1/api.go
pkg/mcs/scheduling/server/server.go

lhy1024 · 2026-05-13T02:52:41Z

 	return nil
 }

 func (s *Server) stopCluster() {
 	cluster := s.GetCluster()
 	if cluster != nil {
-		s.cluster.Store((*Cluster)(nil))
 		cluster.StopBackgroundJobs()


This changes the shutdown visibility order in a risky way. The old cluster remains published via s.GetCluster() while StopBackgroundJobs() is stopping the coordinator/runners. During this window, RPC/API handlers can still enter the old cluster and operate on resources that are already stopping. The previous code unpublished the cluster before stopping watchers. Could we first atomically unpublish the serving resources, then stop the captured old cluster/resources?

Yes, remove the basicCluster from the service struct; all callers should get the basic cluster from the Cluster field.

Thanks for moving hbStreams and watchers into Cluster; that fixes the separately mutable Server field issue.

I think the shutdown visibility issue in this thread is still valid, though. stopCluster() still calls cluster.StopBackgroundJobs() before cleanupClusterResources(cluster), and cleanupClusterResources is where s.cluster.Store(nil) happens. So while StopBackgroundJobs() is stopping coordinator/runners and canceling the cluster context, the old cluster is still visible through GetCluster() and handlers can still enter it.

Could we unpublish the cluster first, then stop and clean the captured old cluster? For example: capture old := s.GetCluster(), store nil into s.cluster, then call old.StopBackgroundJobs() and old.cleanupRuntimeResources().
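
The ordering proposed here can be sketched in isolation. This is an illustration, not the PR's implementation: `server`, `cluster`, and the `observe` hook are hypothetical, and the sketch assumes the published pointer is a `sync/atomic` `atomic.Pointer`, which may differ from how `s.cluster` is actually stored.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cluster is a hypothetical stand-in for the scheduling Cluster.
type cluster struct{ stopped atomic.Bool }

// stopBackgroundJobs calls observe while the stop is in progress so a
// caller can inspect what handlers would see during that window.
func (c *cluster) stopBackgroundJobs(observe func()) {
	observe()
	c.stopped.Store(true)
}

// server publishes the current cluster through an atomic pointer.
type server struct{ cluster atomic.Pointer[cluster] }

func (s *server) getCluster() *cluster { return s.cluster.Load() }

// stopCluster unpublishes first, then stops the captured old cluster.
func (s *server) stopCluster(observe func()) {
	old := s.cluster.Swap(nil) // handlers now see "no cluster"
	if old == nil {
		return
	}
	old.stopBackgroundJobs(observe)
}

func main() {
	s := &server{}
	s.cluster.Store(&cluster{})

	visibleDuringStop := true
	s.stopCluster(func() {
		visibleDuringStop = s.getCluster() != nil
	})
	fmt.Println("cluster visible while stopping:", visibleDuringStop)
}
```

Using `Swap` also makes concurrent `stopCluster` calls safe in this sketch, since only one caller receives the non-nil old pointer.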

I added a regression test to verify this window.

func TestStopClusterUnpublishesClusterBeforeStoppingBackgroundJobs(t *testing.T) {
	re := require.New(t)
	s, cluster := newServerWithClusterForStopTest(t)
	cluster.heartbeatRunner = syncRunner
	cluster.miscRunner = syncRunner
	cluster.logRunner = syncRunner
	cluster.running.Store(true)
	// Keep StopBackgroundJobs blocked so the test can observe the shutdown window.
	cluster.wg.Add(1)
	done := make(chan struct{})
	released := false
	defer func() {
		if !released {
			cluster.wg.Done()
		}
		select {
		case <-done:
		case <-time.After(5 * time.Second):
			t.Log("stopCluster did not return after releasing the test gate")
		}
	}()
	go func() {
		s.stopCluster()
		close(done)
	}()
	re.Eventually(func() bool {
		return !cluster.IsBackgroundJobsRunning()
	}, 5*time.Second, 10*time.Millisecond)
	select {
	case <-done:
		t.Fatal("stopCluster returned before the test observed the shutdown window")
	default:
	}
	if got := s.GetCluster(); got != nil {
		t.Errorf("expected stopCluster to make cluster invisible before waiting for background jobs to stop, got %p", got)
	}
	cluster.wg.Done()
	released = true
	select {
	case <-done:
	case <-time.After(5 * time.Second):
		t.Fatal("stopCluster did not return")
	}
}

--- FAIL: TestStopClusterUnpublishesClusterBeforeStoppingBackgroundJobs
    server_test.go:100: expected stopCluster to make cluster invisible before waiting for background jobs to stop, got 0xc0002dedc0

good catch, fixed

lhy1024 · 2026-05-13T02:54:09Z

+
+	s := &Server{}
+	s.cluster.Store(&Cluster{BasicCluster: basicCluster})
+	s.hbStreams = nil // cleanupClusterResources has nilled hbStreams before clearing cluster.


This test state does not match the current cleanup order. cleanupClusterResources() clears cluster before nilling hbStreams, so cluster != nil && hbStreams == nil is not the real shutdown window. The real risky window is in stopCluster(): cluster is still published while StopBackgroundJobs() is running. Could we add a test that exercises the actual stopCluster() order, preferably with a concurrent heartbeat/API read?

This thread is outdated, but I don't think the underlying test concern is fully resolved yet. The current test still manually constructs cluster != nil && hbStreams == nil instead of exercising the actual stopCluster() order. The real window is still that stopCluster() keeps the old cluster published while StopBackgroundJobs() is running.

Also, after removing Server.basicCluster, this test file currently does not compile because TestCleanupClusterResources still references basicCluster in the Server literal and assertion.

bufferflies · 2026-05-13T03:45:51Z

/retest

Signed-off-by: tongjian <1045931706@qq.com>

ti-chi-bot · 2026-05-14T13:23:40Z

@bufferflies: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-unit-test-next-gen-2 | `3b30f5c` | link | true | `/test pull-unit-test-next-gen-2` |
| pull-unit-test-next-gen-3 | `3b30f5c` | link | true | `/test pull-unit-test-next-gen-3` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

rleungx · 2026-05-15T12:12:54Z

-// StopBackgroundJobs stops background jobs.
+// StopBackgroundJobs stops background jobs, these jobs is created by NewCluster.
 func (c *Cluster) StopBackgroundJobs() {
 	if !c.running.Load() {


On startup rollback, NewCluster may have already started goroutines while StartBackgroundJobs was never called, so running is false and this returns without canceling c.ctx. Please add a cleanup path that cancels the cluster context even before background jobs start.

rleungx · 2026-05-15T12:12:54Z

+	cluster.SetRuntimeResources(metaWatcher, configWatcher, ruleWatcher, affinityWatcher)
+	s.cluster.Store(cluster)
 	cluster.StartBackgroundJobs()
+	cluster = nil // defer cleanup no longer needed


This only skips cluster.StopBackgroundJobs; the defer will still close hbStreams and all watchers after a successful start because those locals remain non-nil. Please run the defer only on error, or nil out the resources after ownership is transferred to cluster.

rleungx · 2026-05-15T12:12:54Z

+	)
+	metaWatcher, configWatcher, err = s.startMetaConfWatcher(ctx, basicCluster, storage)
 	if err != nil {
+		configWatcher.Close()


startMetaConfWatcher can return nil watchers with an error, so this can panic and hide the real startup error. Please remove these closes here or guard them; the helper already cleans up partial watcher state.
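
If the closes were kept rather than removed, one guard option is a nil-safe `Close`. The sketch below is hypothetical throughout (`watcher` and `startWatchers` are not the PR's types) and only demonstrates that a Go method with a pointer receiver can be called on a nil pointer as long as it checks before dereferencing.

```go
package main

import (
	"errors"
	"fmt"
)

// watcher is a hypothetical stand-in for the meta/config watchers.
type watcher struct{ closed bool }

// Close is safe to call on a nil receiver.
func (w *watcher) Close() {
	if w == nil {
		return
	}
	w.closed = true
}

// startWatchers mimics a helper that cleans up its own partial state
// and returns nil watchers together with the error.
func startWatchers(fail bool) (meta, config *watcher, err error) {
	meta = &watcher{}
	if fail {
		meta.Close() // the helper unwinds what it already created
		return nil, nil, errors.New("config watcher failed")
	}
	return meta, &watcher{}, nil
}

func main() {
	meta, config, err := startWatchers(true)
	if err != nil {
		// Both values are nil here; the guard keeps these calls from
		// panicking, so the real startup error is still reported.
		config.Close()
		meta.Close()
		fmt.Println("startup error:", err)
	}
}
```

Since the helper already unwinds partial state, dropping the caller-side closes, as the comment suggests, is the simpler fix; the guard only matters where `Close` may legitimately be reached with a nil value.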

clean stale cache if exit primary

9c92cfd

Signed-off-by: tongjian <1045931706@qq.com>

ti-chi-bot Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-triage-completed dco-signoff: yes Indicates the PR's author has signed the dco. labels May 7, 2026

ti-chi-bot Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 7, 2026

ti-chi-bot Bot removed the do-not-merge/needs-triage-completed label May 7, 2026

coderabbitai Bot reviewed May 7, 2026

Comment thread pkg/mcs/scheduling/server/server.go Outdated

adopt ai comment

23174e5

Signed-off-by: tongjian <1045931706@qq.com>

lhy1024 reviewed May 7, 2026

Comment thread pkg/mcs/scheduling/server/server.go Outdated

rleungx reviewed May 8, 2026

Comment thread pkg/mcs/scheduling/server/server.go Outdated

ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 9, 2026

coderabbitai Bot reviewed May 9, 2026

address comment

e7bea36

Signed-off-by: tongjian <1045931706@qq.com>

bufferflies force-pushed the mcs-mem-clean branch from e24a808 to e7bea36 Compare May 9, 2026 03:27

coderabbitai Bot reviewed May 13, 2026

Comment thread pkg/mcs/scheduling/server/apis/v1/api_test.go Outdated

bufferflies force-pushed the mcs-mem-clean branch from faa8c71 to a1468c7 Compare May 13, 2026 02:23

bufferflies requested review from lhy1024 and rleungx May 13, 2026 02:25

lhy1024 reviewed May 13, 2026

address comment

b6baff6

Signed-off-by: tongjian <1045931706@qq.com>

bufferflies force-pushed the mcs-mem-clean branch from a1468c7 to b6baff6 Compare May 13, 2026 04:00

lhy1024 reviewed May 13, 2026

Comment thread pkg/mcs/scheduling/server/grpc_service.go Outdated

bufferflies added 2 commits May 13, 2026 14:59

Merge branch 'master' of github.com:tikv/pd into mcs-mem-clean

8871821

resolve conflict

6fff238

Signed-off-by: tongjian <1045931706@qq.com>

bufferflies force-pushed the mcs-mem-clean branch from 02b93c7 to 6fff238 Compare May 13, 2026 07:36

rleungx reviewed May 13, 2026

Comment thread pkg/mcs/scheduling/server/grpc_service.go Outdated

rleungx reviewed May 13, 2026

Comment thread pkg/mcs/scheduling/server/grpc_service.go Outdated

rleungx reviewed May 13, 2026

Comment thread pkg/mcs/scheduling/server/server.go Outdated

bufferflies added 2 commits May 14, 2026 10:40

wrap some filed into cluster struct to make lifetime same

53ef0f2

Signed-off-by: tongjian <1045931706@qq.com>

unpublish first

3b30f5c

Signed-off-by: tongjian <1045931706@qq.com>

rleungx reviewed May 15, 2026
