election: use fixed lease keepalive interval #10654

Open
JmPotato wants to merge 7 commits into tikv:master from JmPotato:bob/lease-keepalive-fixed-interval

Conversation

@JmPotato
Member

@JmPotato JmPotato commented May 9, 2026

What problem does this PR solve?

Issue Number: ref #10653

What is changed and how does it work?

Cap the election lease keepalive interval at 500ms.

PD's election lease renewal previously derived its `KeepAliveOnce` cadence
from `leaseTimeout/3`, so raising the lease TTL also slowed renewal. The
keepalive ticker interval is now `min(leaseTimeout/3, 500ms)`:

- Small leases keep the existing `leaseTimeout/3` behavior, matching etcd's
  built-in `clientv3.KeepAlive` cadence.
- Large leases are capped at 500ms so PD's leader-failover reaction time
  stays sub-second regardless of the configured TTL.

The per-call `KeepAliveOnce` context timeout continues to use the full
`leaseTimeout`, leaving enough RPC budget to absorb etcd tail latency above
the keepalive interval without prematurely cancelling renewals.
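The interval rule above can be sketched as a small helper. This is a minimal illustration, not the code in `pkg/election/lease.go`: the names `maxLeaseKeepAliveInterval` and `getLeaseKeepAliveInterval` follow the ones mentioned in the review walkthrough, while the function body and the sample TTLs in `main` are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// maxLeaseKeepAliveInterval is the 500ms cap described in this PR.
const maxLeaseKeepAliveInterval = 500 * time.Millisecond

// getLeaseKeepAliveInterval returns min(leaseTimeout/3, 500ms).
// Sketch only: the body is an assumption about how the cap could be written.
func getLeaseKeepAliveInterval(leaseTimeout time.Duration) time.Duration {
	interval := leaseTimeout / 3
	if interval > maxLeaseKeepAliveInterval {
		return maxLeaseKeepAliveInterval
	}
	return interval
}

func main() {
	// Illustrative TTLs: one small lease and two large ones.
	for _, ttl := range []time.Duration{time.Second, 3 * time.Second, 30 * time.Second} {
		fmt.Printf("lease=%v interval=%v\n", ttl, getLeaseKeepAliveInterval(ttl))
	}
}
```

With this rule, any lease longer than 1.5s renews every 500ms, and shorter leases keep the `leaseTimeout/3` cadence.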

Check List

Tests

  • Unit test
  • Manual test

Release note

Fixed an issue where increasing the PD leader lease also slowed down lease keepalive renewal.

Summary by CodeRabbit

  • Bug Fixes

    • Capped lease keep-alive renewal cadence at 500ms for more consistent and reliable lease renewals.
  • Tests

    • Updated keep-alive tests to wait for renewal signals with improved timeout handling to validate the new cadence.

Review Change Stack

@ti-chi-bot ti-chi-bot Bot added the release-note and dco-signoff: yes labels May 9, 2026
@coderabbitai

coderabbitai Bot commented May 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a 500ms cap and helper to compute the lease keep-alive interval, refactors keepAliveWorker to compute the interval internally, updates Lease.KeepAlive to the new signature, and adjusts the keep-alive test to wait with a timeout.

Changes

Lease keep-alive changes

| Layer / File(s) | Summary |
| --- | --- |
| Interval cap and helper (pkg/election/lease.go) | Adds maxLeaseKeepAliveInterval (500ms) and getLeaseKeepAliveInterval(leaseTimeout) returning min(leaseTimeout/3, 500ms). |
| keepAliveWorker refactor (pkg/election/lease.go) | Removes interval parameter from keepAliveWorker and computes interval := getLeaseKeepAliveInterval(l.leaseTimeout) internally for ticker and timeouts. |
| KeepAlive invocation (pkg/election/lease.go) | Lease.KeepAlive now calls l.keepAliveWorker(ctx) without an interval argument. |
| Test update (pkg/election/lease_test.go) | TestLeaseKeepAlive updated to call lease.keepAliveWorker(ctx) and waits for a keep-alive signal with a 2s timeout instead of fixed sleep. |

Sequence Diagram

sequenceDiagram
  participant Lease
  participant KeepAliveWorker
  participant IntervalHelper as getLeaseKeepAliveInterval
  participant Ticker as time.Ticker
  Lease->>KeepAliveWorker: keepAliveWorker(ctx)
  KeepAliveWorker->>IntervalHelper: getLeaseKeepAliveInterval(leaseTimeout)
  KeepAliveWorker->>Ticker: NewTicker(interval)
  Ticker->>KeepAliveWorker: tick -> send keep-alive

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Possibly related PRs

  • tikv/pd#10622: Modifies the same keepalive path in pkg/election/lease.go; related by adjustments to interval/timing logic.
  • tikv/pd#10649: Overlaps with changes to keepAliveWorker and tick interval computation.

Suggested reviewers

  • lhy1024
  • rleungx
  • okJiang

Poem

🐰 I nibble code at break of day,
A tidy cap to keep delays away,
Five hundred milliseconds sets the beat,
The worker hums, steady and neat,
Hops of joy for tests that pass today!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: capping the lease keepalive interval, which is the primary purpose of this PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description comprehensively addresses all required template sections with detailed problem statement, technical explanation, test coverage, and release notes. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot Bot added the size/S label May 9, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/election/lease.go`:
- Around line 191-197: The keepalive interval returned by
getLeaseKeepAliveInterval may be clamped to maxLeaseKeepAliveInterval but
callers (e.g., the loop that calls l.KeepAliveOnce) still use l.leaseTimeout as
the RPC/context timeout, allowing long timeouts to overlap many short-interval
ticks; change the keepalive call-site to use a bounded timeout based on the
computed interval (e.g., use min(l.leaseTimeout,
getLeaseKeepAliveInterval(l.leaseTimeout)) or derive a separate
keepaliveTimeout) and ensure the KeepAliveOnce invocation is given a context
with that bounded timeout and cancelled after each call to prevent goroutine
leaks; also consider adding simple backoff/retry logic or using an errgroup
around repeated KeepAliveOnce calls.
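One way to read the suggestion above is to give each renewal attempt its own deadline and to release the context as soon as the call returns. The sketch below assumes a hypothetical `renewOnce` stand-in for the `KeepAliveOnce` RPC and a hypothetical `attemptSucceeds` wrapper; neither is PD code, and the durations are illustrative only.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// renewOnce is a hypothetical stand-in for the KeepAliveOnce RPC.
// It blocks for `latency` or until the context is done, whichever comes first.
func renewOnce(ctx context.Context, latency time.Duration) error {
	select {
	case <-time.After(latency):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// renewWithTimeout bounds one renewal attempt and always cancels its context,
// so the timer behind the deadline is released when the call returns.
func renewWithTimeout(parent context.Context, timeout, latency time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	return renewOnce(ctx, latency)
}

// attemptSucceeds is a hypothetical convenience wrapper used for the demo.
func attemptSucceeds(timeout, latency time.Duration) bool {
	return renewWithTimeout(context.Background(), timeout, latency) == nil
}

func main() {
	// Fast backend: the attempt finishes well inside its deadline.
	fmt.Println("fast backend ok:", attemptSucceeds(200*time.Millisecond, 10*time.Millisecond))

	// Slow backend: the attempt is cancelled once the deadline passes.
	err := renewWithTimeout(context.Background(), 10*time.Millisecond, 200*time.Millisecond)
	fmt.Println("slow backend timed out:", errors.Is(err, context.DeadlineExceeded))
}
```

The choice of bound is the contested part: a short timeout limits how many calls can overlap, but it also cancels renewals whenever the backend latency exceeds it.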

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ad155e57-40f9-45ec-9d81-8a3d262ff28c

📥 Commits

Reviewing files that changed from the base of the PR and between 521db060bdfcbe86fd2996883c4026cd10a74051 and dcb2ee69b0ae6982a60e3781f08080faa36d5002.

📒 Files selected for processing (3)
  • pkg/election/lease.go
  • pkg/election/lease_test.go
  • server/config/config.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/election/lease_test.go

Comment thread pkg/election/lease.go Outdated
@ti-chi-bot ti-chi-bot Bot added the size/M label and removed the size/S label May 9, 2026

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
pkg/election/lease.go (1)

200-206: ⚡ Quick win

Simplify the unreachable defensive check.

The condition leaseTimeout < timeout at Line 202 can never be true because timeout is computed as min(leaseTimeout/3, maxLeaseKeepAliveInterval), which is always ≤ leaseTimeout/3 < leaseTimeout for positive durations. This defensive check creates unreachable code and adds confusion.

♻️ Simplify to remove unreachable branch
 func getLeaseKeepAliveTimeout(leaseTimeout time.Duration) time.Duration {
-	timeout := getLeaseKeepAliveInterval(leaseTimeout)
-	if leaseTimeout < timeout {
-		return leaseTimeout
-	}
-	return timeout
+	return getLeaseKeepAliveInterval(leaseTimeout)
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/election/lease.go` around lines 200 - 206, The conditional in
getLeaseKeepAliveTimeout is unreachable because timeout is computed by
getLeaseKeepAliveInterval(leaseTimeout) and will always be less than
leaseTimeout for positive durations; remove the defensive branch and simplify
getLeaseKeepAliveTimeout to directly return the computed timeout (keep the call
to getLeaseKeepAliveInterval to compute timeout), referencing
getLeaseKeepAliveTimeout and getLeaseKeepAliveInterval to locate the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ce1dec28-905f-4a04-8c9e-a207a1b59dae

📥 Commits

Reviewing files that changed from the base of the PR and between dcb2ee69b0ae6982a60e3781f08080faa36d5002 and 417966511e950b667022bdd47be76d691929f045.

📒 Files selected for processing (2)
  • pkg/election/lease.go
  • pkg/election/lease_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/election/lease_test.go

@codecov

codecov Bot commented May 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.11%. Comparing base (f6653ed) to head (d7db677).

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10654      +/-   ##
==========================================
+ Coverage   79.06%   79.11%   +0.05%     
==========================================
  Files         535      535              
  Lines       73065    73071       +6     
==========================================
+ Hits        57767    57810      +43     
+ Misses      11211    11192      -19     
+ Partials     4087     4069      -18     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 79.11% <100.00%> (+0.05%) ⬆️ |


@ti-chi-bot
Contributor

ti-chi-bot Bot commented May 9, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign nolouch for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the size/S label and removed the size/M label May 9, 2026
@JmPotato JmPotato changed the title from "election: cap lease keepalive interval" to "election: use fixed lease keepalive interval" May 9, 2026

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
pkg/election/lease_test.go (1)

114-120: ⚡ Quick win

Strengthen this test to assert cadence is truly fixed (not TTL-derived).

Right now it only checks “some keepalive arrives within 2s”, which may still pass even if cadence regresses. Consider granting a much larger TTL and asserting first keepalive still arrives near the fixed interval window.

Suggested test tightening
-	re.NoError(lease.Grant(defaultLeaseTimeout))
+	re.NoError(lease.Grant(defaultLeaseTimeout * 6))
 	ch := lease.keepAliveWorker(ctx)
+	start := time.Now()
 	select {
 	case <-ch:
+		re.Less(time.Since(start), 1500*time.Millisecond)
 	case <-time.After(2 * time.Second):
 		re.Fail("timed out waiting for lease keepalive")
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/election/lease_test.go` around lines 114 - 120, Grant the lease with a
much larger TTL (instead of defaultLeaseTimeout) so the keepalive cadence cannot
be explained by TTL expiry, then call lease.keepAliveWorker(ctx), record the
time before waiting on ch, and assert that the first message from ch arrives
within a tight window around the configured keepalive interval (e.g., within
±50% of the expected interval constant or a hardcoded expectedInterval) rather
than just before a 2s timeout; use lease.Grant, lease.keepAliveWorker, ctx and
ch to locate the code and adjust the test timeout accordingly so the assertion
fails if cadence regresses.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9129ba85-68ba-4ec5-a605-0a25779f35fa

📥 Commits

Reviewing files that changed from the base of the PR and between 417966511e950b667022bdd47be76d691929f045 and 1ed6d49b23ec6dbd81d438dab0751c925255202b.

📒 Files selected for processing (2)
  • pkg/election/lease.go
  • pkg/election/lease_test.go

Comment thread pkg/election/lease.go Outdated
@@ -34,6 +34,8 @@ const (
 	revokeLeaseTimeout = time.Second
 	requestTimeout = etcdutil.DefaultRequestTimeout
 	slowRequestTime = etcdutil.DefaultSlowRequestTime
+	// leaseKeepAliveInterval is fixed to renew leases frequently regardless of the configured lease timeout.
+	leaseKeepAliveInterval = 500 * time.Millisecond
Contributor

why choose 500ms?

Member

BTW, should we also consider the case where the lease is less than 2s?

Comment thread pkg/election/lease.go Outdated
 		log.Warn("the interval between keeping alive lease is too long", zap.Time("last-time", lastTime))
 	}
 	go func(start time.Time) {
 		defer logutil.LogPanic()
-		ctx1, cancel := context.WithTimeout(ctx, l.leaseTimeout)
+		ctx1, cancel := context.WithTimeout(ctx, leaseKeepAliveInterval)
Member

Timeout shouldn't be set to 500ms

@JmPotato JmPotato force-pushed the bob/lease-keepalive-fixed-interval branch from 1ed6d49 to 246f734 Compare May 13, 2026 03:47
@ti-chi-bot ti-chi-bot Bot added the size/M label and removed the size/S label May 13, 2026
@JmPotato JmPotato force-pushed the bob/lease-keepalive-fixed-interval branch from 246f734 to 4d2e935 Compare May 13, 2026 03:50
Address review feedback on tikv#10654:

- Replace the fixed 500ms cadence with min(leaseTimeout/3, 500ms) so the
  keepalive interval still scales down for very small leases while staying
  capped at 500ms for the common case where the lease timeout is large.
- Restore the per-call `KeepAliveOnce` context timeout to `l.leaseTimeout`.
  Using the keepalive interval as the RPC timeout was too aggressive: any
  etcd tail latency above 500ms would cancel every renewal and force a
  leader switch. The RPC budget should track the lease TTL instead.

Signed-off-by: JmPotato <[email protected]>
@JmPotato JmPotato force-pushed the bob/lease-keepalive-fixed-interval branch from 4d2e935 to d7db677 Compare May 13, 2026 03:51
@ti-chi-bot
Contributor

ti-chi-bot Bot commented May 13, 2026

@JmPotato: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-unit-test-next-gen-2 | d7db677 | link | true | /test pull-unit-test-next-gen-2 |


ch := lease.keepAliveWorker(ctx)
select {
case <-ch:
case <-time.After(2 * lease.getKeepAliveInterval()):
Member

This does not really verify the capped cadence: the first keepalive is sent immediately, and the timeout is derived from the implementation under test. Please add a direct interval test, or assert timing on the second keepalive.
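A timing check on the second keepalive could look roughly like the sketch below. Everything here is an assumption for illustration: `keepAliveSignals` is a hypothetical stand-in for the real worker (it emits one signal immediately and then one per tick), `secondKeepAliveGap` is a hypothetical helper, and the bounds are hardcoded rather than derived from the code under test.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

const maxLeaseKeepAliveInterval = 500 * time.Millisecond

// getLeaseKeepAliveInterval returns min(leaseTimeout/3, 500ms), as described in the PR.
func getLeaseKeepAliveInterval(leaseTimeout time.Duration) time.Duration {
	if interval := leaseTimeout / 3; interval < maxLeaseKeepAliveInterval {
		return interval
	}
	return maxLeaseKeepAliveInterval
}

// keepAliveSignals is a hypothetical stand-in for the keepalive worker:
// it sends one signal immediately, then one more after every tick.
func keepAliveSignals(ctx context.Context, interval time.Duration) <-chan time.Time {
	ch := make(chan time.Time, 1)
	go func() {
		defer close(ch)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case ch <- time.Now():
			case <-ctx.Done():
				return
			}
			select {
			case <-ticker.C:
			case <-ctx.Done():
				return
			}
		}
	}()
	return ch
}

// secondKeepAliveGap measures the time between the first and second signals,
// which reflects the ticker cadence rather than the immediate first send.
func secondKeepAliveGap(leaseTimeout time.Duration) time.Duration {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	ch := keepAliveSignals(ctx, getLeaseKeepAliveInterval(leaseTimeout))
	first := <-ch
	second := <-ch
	return second.Sub(first)
}

func main() {
	// A 30s lease would tick every 10s without the cap.
	gap := secondKeepAliveGap(30 * time.Second)
	fmt.Println("second keepalive arrived within 2s:", gap < 2*time.Second)
}
```

Because the upper bound is a literal and the lease is large, a regression to the `leaseTimeout/3` cadence would push the gap to about 10s and fail the check.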

Comment thread pkg/election/lease.go
interval := l.getKeepAliveInterval()
go func() {
defer logutil.LogPanic()
ticker := time.NewTicker(interval)
Member

This changes the ticker to 500ms for large leases, but each KeepAliveOnce below still uses l.leaseTimeout as its context timeout. With a large lease and slow etcd, that can accumulate many overlapping RPCs. Please cap the request timeout separately, or limit in-flight attempts.
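Limiting in-flight attempts could be done with a buffered channel used as a non-blocking semaphore, so that a tick is skipped while an earlier renewal is still running. The sketch below is only an illustration of that idea: `inFlightLimiter`, `simulate`, and the sleep that stands in for a slow `KeepAliveOnce` call are all hypothetical and not part of PD.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// inFlightLimiter caps concurrent attempts using a buffered channel as a semaphore.
type inFlightLimiter struct{ slots chan struct{} }

func newInFlightLimiter(n int) *inFlightLimiter {
	return &inFlightLimiter{slots: make(chan struct{}, n)}
}

// tryAcquire reports whether a slot was free; it never blocks.
func (l *inFlightLimiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

func (l *inFlightLimiter) release() { <-l.slots }

// simulate runs `ticks` ticker ticks and starts one attempt per tick unless the
// limiter is full. It returns how many attempts started and the peak concurrency.
func simulate(maxInFlight, ticks int, interval, latency time.Duration) (started, peak int) {
	limiter := newInFlightLimiter(maxInFlight)
	var (
		mu      sync.Mutex
		current int
		wg      sync.WaitGroup
	)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for i := 0; i < ticks; i++ {
		<-ticker.C
		if !limiter.tryAcquire() {
			continue // an earlier attempt is still running; skip this tick
		}
		started++
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer limiter.release()
			mu.Lock()
			current++
			if current > peak {
				peak = current
			}
			mu.Unlock()
			time.Sleep(latency) // stand-in for a slow renewal RPC
			mu.Lock()
			current--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return started, peak
}

func main() {
	started, peak := simulate(1, 10, 10*time.Millisecond, 100*time.Millisecond)
	fmt.Printf("started=%d peak=%d\n", started, peak)
}
```

Skipping ticks keeps the number of outstanding calls bounded even when each call is allowed the full lease timeout, at the cost of fewer renewal attempts while the backend is slow.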

Labels

dco-signoff: yes, release-note, size/M

3 participants