election: use fixed lease keepalive interval#10654
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a 500ms cap and helper to compute the lease keep-alive interval, refactors keepAliveWorker to compute the interval internally, updates Lease.KeepAlive to the new signature, and adjusts the keep-alive test to wait with a timeout. ChangesLease keep-alive changes
Sequence DiagramsequenceDiagram
participant Lease
participant KeepAliveWorker
participant IntervalHelper as getLeaseKeepAliveInterval
participant Ticker as time.Ticker
Lease->>KeepAliveWorker: keepAliveWorker(ctx)
KeepAliveWorker->>IntervalHelper: getLeaseKeepAliveInterval(leaseTimeout)
KeepAliveWorker->>Ticker: NewTicker(interval)
Ticker->>KeepAliveWorker: tick -> send keep-alive
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/election/lease.go`:
- Around line 191-197: The keepalive interval returned by
getLeaseKeepAliveInterval may be clamped to maxLeaseKeepAliveInterval but
callers (e.g., the loop that calls l.KeepAliveOnce) still use l.leaseTimeout as
the RPC/context timeout, allowing long timeouts to overlap many short-interval
ticks; change the keepalive call-site to use a bounded timeout based on the
computed interval (e.g., use min(l.leaseTimeout,
getLeaseKeepAliveInterval(l.leaseTimeout)) or derive a separate
keepaliveTimeout) and ensure the KeepAliveOnce invocation is given a context
with that bounded timeout and cancelled after each call to prevent goroutine
leaks; also consider adding simple backoff/retry logic or using an errgroup
around repeated KeepAliveOnce calls.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ad155e57-40f9-45ec-9d81-8a3d262ff28c
📥 Commits
Reviewing files that changed from the base of the PR and between 521db060bdfcbe86fd2996883c4026cd10a74051 and dcb2ee69b0ae6982a60e3781f08080faa36d5002.
📒 Files selected for processing (3)
pkg/election/lease.gopkg/election/lease_test.goserver/config/config.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/election/lease_test.go
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/election/lease.go (1)
200-206: ⚡ Quick winSimplify the unreachable defensive check.
The condition
leaseTimeout < timeoutat Line 202 can never be true becausetimeoutis computed asmin(leaseTimeout/3, maxLeaseKeepAliveInterval), which is always ≤leaseTimeout/3<leaseTimeoutfor positive durations. This defensive check creates unreachable code and adds confusion.♻️ Simplify to remove unreachable branch
func getLeaseKeepAliveTimeout(leaseTimeout time.Duration) time.Duration { - timeout := getLeaseKeepAliveInterval(leaseTimeout) - if leaseTimeout < timeout { - return leaseTimeout - } - return timeout + return getLeaseKeepAliveInterval(leaseTimeout) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/election/lease.go` around lines 200 - 206, The conditional in getLeaseKeepAliveTimeout is unreachable because timeout is computed by getLeaseKeepAliveInterval(leaseTimeout) and will always be less than leaseTimeout for positive durations; remove the defensive branch and simplify getLeaseKeepAliveTimeout to directly return the computed timeout (keep the call to getLeaseKeepAliveInterval to compute timeout), referencing getLeaseKeepAliveTimeout and getLeaseKeepAliveInterval to locate the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/election/lease.go`:
- Around line 200-206: The conditional in getLeaseKeepAliveTimeout is
unreachable because timeout is computed by
getLeaseKeepAliveInterval(leaseTimeout) and will always be less than
leaseTimeout for positive durations; remove the defensive branch and simplify
getLeaseKeepAliveTimeout to directly return the computed timeout (keep the call
to getLeaseKeepAliveInterval to compute timeout), referencing
getLeaseKeepAliveTimeout and getLeaseKeepAliveInterval to locate the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ce1dec28-905f-4a04-8c9e-a207a1b59dae
📥 Commits
Reviewing files that changed from the base of the PR and between dcb2ee69b0ae6982a60e3781f08080faa36d5002 and 417966511e950b667022bdd47be76d691929f045.
📒 Files selected for processing (2)
pkg/election/lease.gopkg/election/lease_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/election/lease_test.go
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #10654 +/- ##
==========================================
+ Coverage 79.06% 79.11% +0.05%
==========================================
Files 535 535
Lines 73065 73071 +6
==========================================
+ Hits 57767 57810 +43
+ Misses 11211 11192 -19
+ Partials 4087 4069 -18
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/election/lease_test.go (1)
114-120: ⚡ Quick winStrengthen this test to assert cadence is truly fixed (not TTL-derived).
Right now it only checks “some keepalive arrives within 2s”, which may still pass even if cadence regresses. Consider granting a much larger TTL and asserting first keepalive still arrives near the fixed interval window.
Suggested test tightening
- re.NoError(lease.Grant(defaultLeaseTimeout)) + re.NoError(lease.Grant(defaultLeaseTimeout * 6)) ch := lease.keepAliveWorker(ctx) + start := time.Now() select { case <-ch: + re.Less(time.Since(start), 1500*time.Millisecond) case <-time.After(2 * time.Second): re.Fail("timed out waiting for lease keepalive") }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/election/lease_test.go` around lines 114 - 120, Grant the lease with a much larger TTL (instead of defaultLeaseTimeout) so the keepalive cadence cannot be explained by TTL expiry, then call lease.keepAliveWorker(ctx), record the time before waiting on ch, and assert that the first message from ch arrives within a tight window around the configured keepalive interval (e.g., within ±50% of the expected interval constant or a hardcoded expectedInterval) rather than just before a 2s timeout; use lease.Grant, lease.keepAliveWorker, ctx and ch to locate the code and adjust the test timeout accordingly so the assertion fails if cadence regresses.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/election/lease_test.go`:
- Around line 114-120: Grant the lease with a much larger TTL (instead of
defaultLeaseTimeout) so the keepalive cadence cannot be explained by TTL expiry,
then call lease.keepAliveWorker(ctx), record the time before waiting on ch, and
assert that the first message from ch arrives within a tight window around the
configured keepalive interval (e.g., within ±50% of the expected interval
constant or a hardcoded expectedInterval) rather than just before a 2s timeout;
use lease.Grant, lease.keepAliveWorker, ctx and ch to locate the code and adjust
the test timeout accordingly so the assertion fails if cadence regresses.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9129ba85-68ba-4ec5-a605-0a25779f35fa
📥 Commits
Reviewing files that changed from the base of the PR and between 417966511e950b667022bdd47be76d691929f045 and 1ed6d49b23ec6dbd81d438dab0751c925255202b.
📒 Files selected for processing (2)
pkg/election/lease.gopkg/election/lease_test.go
| @@ -34,6 +34,8 @@ const ( | |||
| revokeLeaseTimeout = time.Second | |||
| requestTimeout = etcdutil.DefaultRequestTimeout | |||
| slowRequestTime = etcdutil.DefaultSlowRequestTime | |||
| // leaseKeepAliveInterval is fixed to renew leases frequently regardless of the configured lease timeout. | |||
| leaseKeepAliveInterval = 500 * time.Millisecond | |||
There was a problem hiding this comment.
BTW, shall we also consider if the lease is less than 2s?
| log.Warn("the interval between keeping alive lease is too long", zap.Time("last-time", lastTime)) | ||
| } | ||
| go func(start time.Time) { | ||
| defer logutil.LogPanic() | ||
| ctx1, cancel := context.WithTimeout(ctx, l.leaseTimeout) | ||
| ctx1, cancel := context.WithTimeout(ctx, leaseKeepAliveInterval) |
There was a problem hiding this comment.
Timeout shouldn't be set to 500ms
Signed-off-by: JmPotato <[email protected]>
Signed-off-by: JmPotato <[email protected]>
Signed-off-by: JmPotato <[email protected]>
Signed-off-by: JmPotato <[email protected]>
Signed-off-by: JmPotato <[email protected]>
Signed-off-by: JmPotato <[email protected]>
1ed6d49 to
246f734
Compare
246f734 to
4d2e935
Compare
Address review feedback on tikv#10654: - Replace the fixed 500ms cadence with min(leaseTimeout/3, 500ms) so the keepalive interval still scales down for very small leases while staying capped at 500ms for the common case where the lease timeout is large. - Restore the per-call `KeepAliveOnce` context timeout to `l.leaseTimeout`. Using the keepalive interval as the RPC timeout was too aggressive: any etcd tail latency above 500ms would cancel every renewal and force a leader switch. The RPC budget should track the lease TTL instead. Signed-off-by: JmPotato <[email protected]>
4d2e935 to
d7db677
Compare
|
@JmPotato: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| ch := lease.keepAliveWorker(ctx) | ||
| select { | ||
| case <-ch: | ||
| case <-time.After(2 * lease.getKeepAliveInterval()): |
There was a problem hiding this comment.
This does not really verify the capped cadence: the first keepalive is sent immediately, and the timeout is derived from the implementation under test. Please add a direct interval test, or assert timing on the second keepalive.
| interval := l.getKeepAliveInterval() | ||
| go func() { | ||
| defer logutil.LogPanic() | ||
| ticker := time.NewTicker(interval) |
There was a problem hiding this comment.
This changes the ticker to 500ms for large leases, but each KeepAliveOnce below still uses l.leaseTimeout as its context timeout. With a large lease and slow etcd, that can accumulate many overlapping RPCs. Please cap the request timeout separately, or limit in-flight attempts.
What problem does this PR solve?
Issue Number: ref #10653
What is changed and how does it work?
Check List
Tests
Release note
Summary by CodeRabbit
Bug Fixes
Tests