Skip to content

server, config: support updating leader lease online#10631

Open
JmPotato wants to merge 4 commits into
tikv:masterfrom
JmPotato:bob/leader-lease-online-config
Open

server, config: support updating leader lease online#10631
JmPotato wants to merge 4 commits into
tikv:masterfrom
JmPotato:bob/leader-lease-online-config

Conversation

@JmPotato
Copy link
Copy Markdown
Member

@JmPotato JmPotato commented Apr 29, 2026

What problem does this PR solve?

Issue Number: ref #10630

Production clusters can see PD leader lease keepalive p99 latency approach the
5s default lease timeout, leaving little safety margin before leader churn.
Operators need a way to tune the PD leader lease online between releases without
editing the local config file and restarting PD.

What is changed and how does it work?

Support updating the PD leader lease timeout through POST /pd/api/v1/config with the top-level lease field.
Store the effective leader lease in PersistOptions and persist it with the existing config blob.
Before campaigning, load only the persisted leader lease so the next PD leader campaign uses the runtime lease value.
Keep existing toml startup behavior unchanged and reject only non-positive API values.

Implementation notes:

  • GET /pd/api/v1/config returns the effective top-level lease value.
  • POST /pd/api/v1/config accepts {"lease": <seconds>} and persists it through the existing config blob.
  • The current leader's etcd lease TTL is fixed at grant time, so an online update takes effect on the next PD leader campaign. Use leader transfer or resign to apply it immediately.
  • Campaign loads only the persisted leader lease before Campaign, then falls back to the current in-memory value if the load fails. The existing full config reload after leadership is kept is unchanged.
  • Older persisted config blobs that do not contain top-level lease keep the startup toml value; legacy non-positive persisted values are ignored.
  • The online API rejects 0 and negative values because they are explicit invalid updates. File config lease = 0 keeps the existing startup behavior and defaults to 5 through AdjustInt64.
  • No online-only upper bound is introduced; the API keeps the same effective range as file config except for rejecting explicit non-positive updates.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

Code changes

Side effects

  • Increased code complexity

Related changes

Manual test

  • go test ./server/config -run 'TestLeaderLease|TestReloadKeepsCustomLeaderLeaseWhenPersistedConfigMissesLease|TestValidateLeaderLease|TestLoadPersistedLeaderLease' -count=1
  • go test -tags without_dashboard ./tests/server/api -run 'TestLeaderLeaseConfigAPI|TestLeaderLeaseTakesEffectAfterTransfer|TestLeaderLeaseCampaignSurvivesLoadFailure' -count=1
  • go vet ./server/config
  • gofmt -l server/config/config.go server/config/config_test.go server/config/persist_options.go server/server.go tests/server/api/api_test.go
  • git diff --check

Release note

Support updating the PD leader lease timeout online through `POST /pd/api/v1/config` with the `lease` field. The new value takes effect on the next PD leader campaign; use leader transfer or resign to apply it immediately. The API rejects non-positive values, while file config `lease = 0` keeps the existing default-to-5 startup behavior. No additional upper bound is introduced.

Summary by CodeRabbit

  • New Features

    • Leader lease is now persisted, surfaced in the config API (GET/POST), and applied at runtime; server provides a validated API to update the lease.
  • Bug Fixes

    • Reload now ignores non‑positive persisted lease values and preserves other persisted options when updating the lease.
  • Tests

    • Added unit and end‑to‑end tests for API behavior, persistence, reload semantics, and validation.

@ti-chi-bot ti-chi-bot Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has signed the dco. labels Apr 29, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 29, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Persist leader-lease into PersistOptions, expose it via the top-level lease config API, validate and persist updates through Server.SetLeaderLease, source campaign/get-lease behavior from the persisted lease, and add tests and reload helpers for persistence/reload semantics.

Changes

Leader-lease persistence and runtime control

Layer / File(s) Summary
Data Shape / State
server/config/persist_options.go
Add leaderLease atomic.Int64 to PersistOptions; persistedConfig records leaseLoaded; add hasValidLeaderLease() and IsValidLeaderLease().
Core Persistence Logic
server/config/persist_options.go
Persist writes LeaderLease: o.GetLeaderLease(); introduce loadPersistedConfig(storage) helper; Reload applies loaded lease only when valid; add GetLeaderLease() / SetLeaderLease().
API / Handler
server/api/config.go
GetConfig merges persisted LeaderLease into returned config; updateConfig routes top-level "lease" to new updateLeaderLease which uses jsonutil.AddKeyValue and calls svr.SetLeaderLease when updated.
Server Integration
server/server.go
Add SetLeaderLease(lease int64) error to validate and persist (with rollback on failure); GetLease() and leader campaigning use persistOptions.GetLeaderLease(); reload persisted lease before campaigning and after keepalive.
Tests / API E2E
server/config/config_test.go, tests/server/api/api_test.go
Add unit tests for persist/reload semantics, ignoring non-positive persisted values, preserving custom in-memory lease when persisted config lacks lease, and an end-to-end API test exercising GET/POST lease and invalid inputs.

Sequence Diagram

sequenceDiagram
    participant Client
    participant API as Config API
    participant Server as PD Server
    participant Persist as PersistOptions
    participant Storage as Config Storage

    Client->>API: POST /pd/api/v1/config {"lease": 3000}
    API->>API: Detect top-level "lease"
    API->>Server: SetLeaderLease(3000)
    Server->>Server: Validate lease (>0)
    Server->>Persist: SetLeaderLease(3000)
    Persist->>Persist: Update atomic leaderLease
    Server->>Storage: Persist updated config
    Storage-->>Server: OK
    Server-->>API: success
    API-->>Client: 200 OK

    Note over Client,Storage: On restart / reload
    Server->>Storage: Load persisted config
    Storage-->>Server: Config{..., lease:3000}
    Server->>Persist: loadPersistedConfig(...)
    Persist->>Persist: Validate lease -> true
    Persist-->>Server: leaderLease=3000
    Server->>Server: Use persisted leaderLease for campaigning
Loading

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

  • tikv/pd#10150: Similar change adding a top-level config handler; patterns for routing top-level config keys are related.

Suggested labels

size/L

Suggested reviewers

  • nolouch
  • okJiang
  • lhy1024
  • rleungx

Poem

🐰 I hopped through bytes and held a lease,
Atomic heartbeats keeping leader peace,
API whispers, storage hums along,
Persisted hops keep leadership strong,
Clusters dance to a rabbit's gentle lease.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding support for online updates to the PD leader lease configuration, which aligns with all file changes across server/api/config.go, persist_options.go, server.go, and related tests.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description is comprehensive and follows the template structure with all required sections completed, including problem statement, implementation details, checklist items, and release notes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 29, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/config/persist_options.go`:
- Around line 829-831: Reload currently treats a default-initialized
cfg.LeaderLease as a present, valid value and overwrites any startup lease; fix
this by making the persisted presence explicit (either change cfg.LeaderLease to
a pointer type or add a boolean flag set by LoadConfig that indicates the field
was present) and only call IsValidLeaderLease and o.SetLeaderLease when that
presence indicator shows the field existed in the persisted blob; apply the same
change/guard for the other lease-related block at the 838-840 region so you only
apply persisted leases if the field was actually present in the loaded config.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b9f77e78-96a8-454f-a2b0-f0b11648ce41

📥 Commits

Reviewing files that changed from the base of the PR and between f941a7e and c8a100b984c363cae21ad8bec4ea51029b6b2175.

📒 Files selected for processing (5)
  • server/api/config.go
  • server/config/config_test.go
  • server/config/persist_options.go
  • server/server.go
  • tests/server/api/api_test.go

Comment thread server/api/config.go Outdated
Comment thread server/config/persist_options.go Outdated
@JmPotato JmPotato force-pushed the bob/leader-lease-online-config branch from c8a100b to f66c676 Compare April 29, 2026 04:56
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
server/config/persist_options.go (1)

829-831: ⚠️ Potential issue | 🟠 Major

Do not apply LeaderLease unless the persisted lease field actually exists.

On Line 829, IsValidLeaderLease(cfg.LeaderLease) is not enough to distinguish “field missing” vs “field present.” For older stored blobs without lease, reload can still overwrite a custom startup lease.

💡 Suggested fix (presence check before applying)
 func (o *PersistOptions) Reload(storage endpoint.ConfigStorage) error {
 	cfg := &persistedConfig{Config: &Config{}}
+	leaseProbe := struct {
+		LeaderLease *int64 `json:"lease"`
+	}{}
 	// Pass nil to initialize cfg to default values (all items undefined)
 	if err := cfg.Adjust(nil, true); err != nil {
 		return err
 	}
 
 	isExist, err := storage.LoadConfig(cfg)
 	if err != nil {
 		return err
 	}
+	_, _ = storage.LoadConfig(&leaseProbe)
 	adjustScheduleCfg(&cfg.Schedule)
 	...
 	if isExist {
 		...
-		if IsValidLeaderLease(cfg.LeaderLease) {
-			o.SetLeaderLease(cfg.LeaderLease)
+		if leaseProbe.LeaderLease != nil && IsValidLeaderLease(*leaseProbe.LeaderLease) {
+			o.SetLeaderLease(*leaseProbe.LeaderLease)
 		}
 		...
 	}
 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/config/persist_options.go` around lines 829 - 831, The current code
calls o.SetLeaderLease when IsValidLeaderLease(cfg.LeaderLease) is true, which
overwrites a startup lease even if the persisted blob lacked a `lease` field;
change the condition to require the persisted field's presence as well (e.g.,
check the config wrapper/flag that indicates the `lease` field was present or
use a nil/presence check on cfg.LeaderLease) before calling o.SetLeaderLease;
update the if around IsValidLeaderLease(cfg.LeaderLease) to something like "if
cfg.HasLeaderLease() && IsValidLeaderLease(cfg.LeaderLease) {
o.SetLeaderLease(cfg.LeaderLease) }" (replace HasLeaderLease with the actual
presence indicator in the config struct).
server/api/config.go (1)

218-219: ⚠️ Potential issue | 🟠 Major

lease is updateable, but follower GET /config can still serve stale lease.

With Line 218 enabling dynamic lease, follower reads need to merge lease from leader too. Today follower GetConfig only syncs schedule/replication (Lines 81-82), so lease may lag until reload/leadership change.

💡 Suggested fix in follower merge path
 	mergedCfg := localCfg
 	mergedCfg.Replication = leaderCfg.Replication
 	mergedCfg.Schedule = leaderCfg.Schedule
+	mergedCfg.LeaderLease = leaderCfg.LeaderLease
 	h.rd.JSON(w, http.StatusOK, mergedCfg)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/api/config.go` around lines 218 - 219, GetConfig on followers
currently syncs only schedule/replication and can return a stale lease; update
the follower merge path to include the leader's lease value as well. In the
follower GET /config handling (where schedule/replication are merged) add logic
to fetch and merge the leader's lease into the response so that the value
updated via updateLeaderLease is reflected to followers; ensure you reuse the
same merge/order semantics as schedule/replication so updateLeaderLease (case
"lease") changes propagate immediately to follower GetConfig responses.
🧹 Nitpick comments (1)
server/config/config_test.go (1)

131-148: Add one backward-compat test for missing lease key.

Please add a case where stored config omits lease entirely and assert reload keeps the in-memory startup lease unchanged.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/config/config_test.go` around lines 131 - 148, Extend
TestLeaderLeaseReloadIgnoresNonPositivePersistedValues to include a case where
the persisted config omits the lease key entirely: call storage.SaveConfig with
an empty value (e.g., struct{}{} or a map with no "lease" key), then run the
same reload flow (NewConfig, cfg.Adjust, set cfg.LeaderLease to 7,
NewPersistOptions, opt.Reload) and assert opt.GetLeaderLease() remains 7;
reference TestLeaderLeaseReloadIgnoresNonPositivePersistedValues,
storage.SaveConfig, NewConfig, NewPersistOptions, opt.Reload, and
opt.GetLeaderLease to locate the code to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@server/api/config.go`:
- Around line 218-219: GetConfig on followers currently syncs only
schedule/replication and can return a stale lease; update the follower merge
path to include the leader's lease value as well. In the follower GET /config
handling (where schedule/replication are merged) add logic to fetch and merge
the leader's lease into the response so that the value updated via
updateLeaderLease is reflected to followers; ensure you reuse the same
merge/order semantics as schedule/replication so updateLeaderLease (case
"lease") changes propagate immediately to follower GetConfig responses.

In `@server/config/persist_options.go`:
- Around line 829-831: The current code calls o.SetLeaderLease when
IsValidLeaderLease(cfg.LeaderLease) is true, which overwrites a startup lease
even if the persisted blob lacked a `lease` field; change the condition to
require the persisted field's presence as well (e.g., check the config
wrapper/flag that indicates the `lease` field was present or use a nil/presence
check on cfg.LeaderLease) before calling o.SetLeaderLease; update the if around
IsValidLeaderLease(cfg.LeaderLease) to something like "if cfg.HasLeaderLease()
&& IsValidLeaderLease(cfg.LeaderLease) { o.SetLeaderLease(cfg.LeaderLease) }"
(replace HasLeaderLease with the actual presence indicator in the config
struct).

---

Nitpick comments:
In `@server/config/config_test.go`:
- Around line 131-148: Extend
TestLeaderLeaseReloadIgnoresNonPositivePersistedValues to include a case where
the persisted config omits the lease key entirely: call storage.SaveConfig with
an empty value (e.g., struct{}{} or a map with no "lease" key), then run the
same reload flow (NewConfig, cfg.Adjust, set cfg.LeaderLease to 7,
NewPersistOptions, opt.Reload) and assert opt.GetLeaderLease() remains 7;
reference TestLeaderLeaseReloadIgnoresNonPositivePersistedValues,
storage.SaveConfig, NewConfig, NewPersistOptions, opt.Reload, and
opt.GetLeaderLease to locate the code to change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5e0a21ab-1f69-44ae-a4af-914af870c79f

📥 Commits

Reviewing files that changed from the base of the PR and between c8a100b984c363cae21ad8bec4ea51029b6b2175 and f66c67600cffe2e472032a0970f6964761f65450.

📒 Files selected for processing (5)
  • server/api/config.go
  • server/config/config_test.go
  • server/config/persist_options.go
  • server/server.go
  • tests/server/api/api_test.go

@JmPotato JmPotato force-pushed the bob/leader-lease-online-config branch from f66c676 to 6168127 Compare April 29, 2026 06:03
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
server/server.go (1)

1013-1025: ⚠️ Potential issue | 🟠 Major

Keep lease in the follower merge path.

GetConfig() still only overlays Schedule and Replication from leaderCfg. After a runtime lease update, a follower can keep serving an old lease until its local cache reloads, so /pd/api/v1/config can disagree with the leader. Please copy LeaderLease from leaderCfg here too.

Suggested patch
 mergedCfg := localCfg
 mergedCfg.Replication = leaderCfg.Replication
 mergedCfg.Schedule = leaderCfg.Schedule
+mergedCfg.LeaderLease = leaderCfg.LeaderLease
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/server.go` around lines 1013 - 1025, GetConfig currently doesn't copy
the leader's runtime lease into the returned cfg causing followers to serve
stale LeaderLease; in GetConfig ensure you overlay LeaderLease from the
persisted leader config (use the value returned by persistOptions.GetLeaderLease
/ the leaderCfg's lease) into cfg so the returned config reflects the leader's
lease update (update the LeaderLease field in GetConfig to copy/clone the
leader's lease value rather than leaving a stale local one).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@server/server.go`:
- Around line 1013-1025: GetConfig currently doesn't copy the leader's runtime
lease into the returned cfg causing followers to serve stale LeaderLease; in
GetConfig ensure you overlay LeaderLease from the persisted leader config (use
the value returned by persistOptions.GetLeaderLease / the leaderCfg's lease)
into cfg so the returned config reflects the leader's lease update (update the
LeaderLease field in GetConfig to copy/clone the leader's lease value rather
than leaving a stale local one).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4124abaa-3025-491c-8fe5-0b2d872988cd

📥 Commits

Reviewing files that changed from the base of the PR and between f66c67600cffe2e472032a0970f6964761f65450 and 61681277780107c2387e22f0fc8429e0ad14bb45.

📒 Files selected for processing (5)
  • server/api/config.go
  • server/config/config_test.go
  • server/config/persist_options.go
  • server/server.go
  • tests/server/api/api_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/config/persist_options.go

Comment thread server/server.go
@@ -1013,6 +1013,7 @@ func (s *Server) GetServiceMiddlewareConfig() *config.ServiceMiddlewareConfig {
// GetConfig gets the config information.
func (s *Server) GetConfig() *config.Config {
cfg := s.cfg.Clone()
cfg.LeaderLease = s.persistOptions.GetLeaderLease()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This keeps the leader response fresh, but follower GET /config still only merges schedule/replication from the leader. Can we also copy LeaderLease there so followers don't return the old lease after a POST?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c334f2830 by copying LeaderLease from the leader config in the follower GET /config merge path.

Comment thread server/config/persist_options.go Outdated
@@ -813,12 +820,42 @@ func (o *PersistOptions) Reload(storage endpoint.ConfigStorage) error {
o.labelProperty.Store(cfg.LabelProperty)
o.keyspace.Store(&cfg.Keyspace)
o.microservice.Store(&cfg.Microservice)
if IsValidLeaderLease(cfg.LeaderLease) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still can't tell "missing lease field" from "lease was persisted". For old config blobs without lease, Adjust fills the default 5 and this can override a custom startup lease.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c334f2830. persistedConfig now records whether the top-level lease key was actually present during JSON unmarshal, and Reload only applies a present positive lease.

Comment thread server/config/persist_options.go Outdated
if err != nil {
return err
}
if isExist && IsValidLeaderLease(cfg.LeaderLease) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: before applying this value, we should first know the lease key was actually present in storage, not just filled by defaults.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c334f2830 with the same presence check in ReloadLeaderLease, so narrow reload keeps the startup lease when old config blobs do not contain lease.

@JmPotato JmPotato force-pushed the bob/leader-lease-online-config branch from 6168127 to c334f28 Compare May 6, 2026 08:15
@ti-chi-bot ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 6, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/server/api/api_test.go (1)

128-156: ⚡ Quick win

Reset cfg per request and verify all followers.

Reusing the same cfg map across multiple ReadGetJSON calls is fragile: json.Unmarshal into an existing map merges rather than replaces keys, so if a future regression causes lease to be omitted from a response, the residual value from the previous call would mask the bug. Additionally, break after the first follower only validates one of the two non-leader members.

♻️ Proposed fix
-	var cfg map[string]any
-	err = testutil.ReadGetJSON(re, tests.TestDialClient, configURL, &cfg)
+	cfg := map[string]any{}
+	err = testutil.ReadGetJSON(re, tests.TestDialClient, configURL, &cfg)
 	re.NoError(err)
 	re.Equal(float64(7), cfg["lease"])
 	_, hasLeader := cfg["leader"]
 	re.False(hasLeader)
@@
-	err = testutil.ReadGetJSON(re, tests.TestDialClient, configURL, &cfg)
+	cfg = map[string]any{}
+	err = testutil.ReadGetJSON(re, tests.TestDialClient, configURL, &cfg)
 	re.NoError(err)
 	re.Equal(float64(10), cfg["lease"])
 	_, hasLeader = cfg["leader"]
 	re.False(hasLeader)
@@
-	for _, s := range cluster.GetServers() {
-		if s.GetServer().Name() == leader.GetServer().Name() {
-			continue
-		}
-		followerConfigURL := s.GetAddr() + "/pd/api/v1/config"
-		err = testutil.ReadGetJSON(re, tests.TestDialClient, followerConfigURL, &cfg)
-		re.NoError(err)
-		re.Equal(float64(10), cfg["lease"])
-		break
-	}
+	for _, s := range cluster.GetServers() {
+		if s.GetServer().Name() == leader.GetServer().Name() {
+			continue
+		}
+		followerConfigURL := s.GetAddr() + "/pd/api/v1/config"
+		followerCfg := map[string]any{}
+		err = testutil.ReadGetJSON(re, tests.TestDialClient, followerConfigURL, &followerCfg)
+		re.NoError(err)
+		re.Equal(float64(10), followerCfg["lease"])
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/server/api/api_test.go` around lines 128 - 156, The test reuses the
same cfg map across multiple testutil.ReadGetJSON calls which can cause stale
keys to persist because json.Unmarshal merges into existing maps; change each
ReadGetJSON call to use a fresh map (redeclare or make a new map[string]any
before each call to testutil.ReadGetJSON) so responses replace prior state, and
in the follower loop remove the break so you assert the lease value for all
non-leader servers (identify servers via cluster.GetServers() and compare
against leader.GetServer().Name()).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/server/api/api_test.go`:
- Around line 128-156: The test reuses the same cfg map across multiple
testutil.ReadGetJSON calls which can cause stale keys to persist because
json.Unmarshal merges into existing maps; change each ReadGetJSON call to use a
fresh map (redeclare or make a new map[string]any before each call to
testutil.ReadGetJSON) so responses replace prior state, and in the follower loop
remove the break so you assert the lease value for all non-leader servers
(identify servers via cluster.GetServers() and compare against
leader.GetServer().Name()).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 093b3c9c-8c7c-4f03-a3ed-bf02d043ad33

📥 Commits

Reviewing files that changed from the base of the PR and between 61681277780107c2387e22f0fc8429e0ad14bb45 and c334f2830db2eb70524bf7954eaa6e156dea4847.

📒 Files selected for processing (5)
  • server/api/config.go
  • server/config/config_test.go
  • server/config/persist_options.go
  • server/server.go
  • tests/server/api/api_test.go

@JmPotato JmPotato force-pushed the bob/leader-lease-online-config branch from c334f28 to 75cda2d Compare May 6, 2026 08:24
@ti-chi-bot ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 6, 2026
@JmPotato JmPotato force-pushed the bob/leader-lease-online-config branch from 75cda2d to 93d6178 Compare May 6, 2026 08:29
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/server/api/api_test.go (1)

147-156: 💤 Low value

Optional: replace the single-iteration loop with a direct follower lookup (or check all followers).

The for ... { continue ... break } pattern only validates one non-leader server but reads like an iteration over all servers. Either drop the break to assert lease propagation on every follower (stronger guarantee), or use cluster.GetServer(cluster.GetFollower()) to make the single-follower intent explicit.

♻️ Suggested refactor (assert on every follower)
-	for _, s := range cluster.GetServers() {
-		if s.GetServer().Name() == leader.GetServer().Name() {
-			continue
-		}
-		followerConfigURL := s.GetAddr() + "/pd/api/v1/config"
-		err = testutil.ReadGetJSON(re, tests.TestDialClient, followerConfigURL, &cfg)
-		re.NoError(err)
-		re.Equal(float64(10), cfg["lease"])
-		_, hasLeader = cfg["leader"]
-		re.False(hasLeader)
-		break
-	}
+	for _, s := range cluster.GetServers() {
+		if s.GetServer().Name() == leader.GetServer().Name() {
+			continue
+		}
+		followerConfigURL := s.GetAddr() + "/pd/api/v1/config"
+		err = testutil.ReadGetJSON(re, tests.TestDialClient, followerConfigURL, &cfg)
+		re.NoError(err)
+		re.Equal(float64(10), cfg["lease"])
+		_, hasLeader = cfg["leader"]
+		re.False(hasLeader)
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/server/api/api_test.go` around lines 147 - 156, The test currently
iterates servers but uses continue+break which only checks a single non-leader;
either (A) assert on every follower by removing the break so the loop checks
lease propagation for all servers returned by cluster.GetServers() (keep the
existing comparison against leader.GetServer().Name(), use s.GetAddr() +
"/pd/api/v1/config", testutil.ReadGetJSON, and re.Equal(float64(10),
cfg["lease"]) for each), or (B) make the single-follower intent explicit by
replacing the loop with a direct lookup like
cluster.GetServer(cluster.GetFollower()) and then call testutil.ReadGetJSON
against that server address and assert cfg["lease"] equals 10. Choose one
approach and update the test accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/server/api/api_test.go`:
- Around line 147-156: The test currently iterates servers but uses
continue+break which only checks a single non-leader; either (A) assert on every
follower by removing the break so the loop checks lease propagation for all
servers returned by cluster.GetServers() (keep the existing comparison against
leader.GetServer().Name(), use s.GetAddr() + "/pd/api/v1/config",
testutil.ReadGetJSON, and re.Equal(float64(10), cfg["lease"]) for each), or (B)
make the single-follower intent explicit by replacing the loop with a direct
lookup like cluster.GetServer(cluster.GetFollower()) and then call
testutil.ReadGetJSON against that server address and assert cfg["lease"] equals
10. Choose one approach and update the test accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 43b7378c-d7f7-4313-b380-bfe93cba0972

📥 Commits

Reviewing files that changed from the base of the PR and between c334f2830db2eb70524bf7954eaa6e156dea4847 and 75cda2dc509ce9a990702aca9ffc41b4bae9d8a7.

📒 Files selected for processing (5)
  • server/api/config.go
  • server/config/config_test.go
  • server/config/persist_options.go
  • server/server.go
  • tests/server/api/api_test.go

@JmPotato JmPotato force-pushed the bob/leader-lease-online-config branch from 93d6178 to 8834eea Compare May 6, 2026 08:42
@JmPotato
Copy link
Copy Markdown
Member Author

JmPotato commented May 6, 2026

/test pull-unit-test-next-gen-2
/test pull-unit-test-next-gen-3

@JmPotato
Copy link
Copy Markdown
Member Author

JmPotato commented May 6, 2026

/test pull-unit-test-next-gen-3

1 similar comment
@JmPotato
Copy link
Copy Markdown
Member Author

JmPotato commented May 6, 2026

/test pull-unit-test-next-gen-3

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented May 18, 2026

@YuhaoZhang00: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented May 18, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: YuhaoZhang00
Once this PR has been reviewed and has the lgtm label, please assign okjiang for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 20, 2026
@JmPotato
Copy link
Copy Markdown
Member Author

/test pull-unit-test-next-gen-2
/test pull-unit-test-next-gen-3

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented May 20, 2026

@JmPotato: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-unit-test-next-gen-2 fef4d94 link true /test pull-unit-test-next-gen-2
pull-unit-test-next-gen-3 fef4d94 link true /test pull-unit-test-next-gen-3

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment thread server/api/config.go
return h.updateMicroserviceConfig(cfg, kp[len(kp)-1], value)
case "controller":
return h.updateControllerConfig(kp[len(kp)-1], value)
case "lease":
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider version-gating this API until all PD members understand the persisted lease. During rolling upgrade, an old PD can still persist lease: 0 in the config blob, which makes the new code ignore the online value and fall back to local toml on the next campaign.

// the online update API. Keep the accepted range aligned with file config
// behavior: reject explicit non-positive updates, but do not add an online-only
// upper bound.
func ValidateLeaderLease(lease int64) error {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add an upper bound here? Values larger than math.MaxInt64 / int64(time.Second) can be persisted successfully, but time.Duration(lease) * time.Second overflows later in election keepalive and may break the next campaign.

failpoint.Inject("loadLeaderLeaseFail", func() {
failpoint.Return(int64(0), errors.New("failpoint: fail to load persisted leader lease"))
})
cfg := &persistedConfig{Config: &Config{}}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still unmarshals the full persisted config on the election path. Could we decode only a small {lease} struct here? That avoids extra work and prevents unrelated config decode issues from changing campaign behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the dco. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants