🐛 fix: allow reconciliation of deadline-exceeded ClusterObjectSets#2643
🐛 fix: allow reconciliation of deadline-exceeded ClusterObjectSets#2643joelanford wants to merge 1 commit intooperator-framework:mainfrom
Conversation
✅ Deploy Preview for olmv1 ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
9dece97 to
a586063
Compare
There was a problem hiding this comment.
Pull request overview
This PR fixes a reconciliation dead-end where ClusterObjectSet (COS) updates were being dropped once a revision hit ProgressDeadlineExceeded, preventing stuck revisions from being archived and cleaned up. It removes the update-blocking predicate, makes progress-deadline handling “sticky” in status updates, and adds a deadline-aware rate limiter plus an E2E scenario to validate archival cleanup.
Changes:
- Remove the
ProgressDeadlineExceeded-skipping watch predicate and introduce a controllerRateLimiterthat caps exponential backoff at the progress deadline. - Refactor progress-deadline computation into a shared
durationUntilDeadlinehelper and adjust progressing/retrying status updates to preferProgressDeadlineExceededonce exceeded. - Add an E2E scenario (and step helper) that archives a deadline-exceeded COS and verifies resource cleanup.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
internal/operator-controller/controllers/clusterobjectset_controller.go |
Removes the skip predicate, adds deadline-aware rate limiting, refactors deadline computation, and changes progressing/retrying condition behavior when the deadline is exceeded. |
test/e2e/steps/steps.go |
Adds a new Godog step to patch a COS lifecycle state to Archived. |
test/e2e/features/revision.feature |
Adds an E2E scenario that forces ProgressDeadlineExceeded, archives the COS, and asserts resources are removed. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
internal/operator-controller/controllers/clusterobjectset_controller.go
Outdated
Show resolved
Hide resolved
| When ClusterObjectSet "${COS_NAME}" lifecycle is set to "Archived" | ||
| Then ClusterObjectSet "${COS_NAME}" is archived | ||
| And resource "configmap/test-configmap" is eventually not found | ||
| And resource "deployment/test-deployment" is eventually not found |
There was a problem hiding this comment.
It does not seems to be testing the same scenario @joelanford
Could we ensure the same scenario here?
There was a problem hiding this comment.
- User installs a ClusterExtension. The CE controller creates COS-rev-1.
- COS-rev-1 gets stuck (e.g. a Deployment never becomes ready). After ProgressDeadlineMinutes, the reconciler sets Progressing=False/ProgressDeadlineExceeded.
- User updates the ClusterExtension. The CE controller creates COS-rev-2.
- COS-rev-2 rolls out successfully. It patches COS-rev-1 with lifecycleState: Archived so the old revision gets cleaned up.
- The watch predicate sees COS-rev-1 has ProgressDeadlineExceeded and drops the event.
- COS-rev-1 never reconciles, never processes the archival, and stays stuck forever.
There was a problem hiding this comment.
Is it not the same? Making a CE that stamps out the COS-1 and COS-2 such that the CE reconciler eventually tries to set COS-1 as archived is the same thing, but just more ceremony around a standalone COS that exceeds the deadline and then is directly archived by the test code.
I'll verify manually to make sure. Is there something else happening with the CE/COS interaction that would make the COS-only test in the PR different?
a586063 to
46dcf54
Compare
46dcf54 to
aea3d72
Compare
There was a problem hiding this comment.
Pull request overview
This PR adjusts ClusterObjectSet (COS) reconciliation behavior so revisions that hit ProgressDeadlineExceeded still reconcile (enabling archival/cleanup), while making the deadline-exceeded state “sticky” to avoid the previous reconcile loop. It also adds a deadline-capped rate limiter and an e2e scenario to verify archival cleans up resources after a progress deadline is exceeded.
Changes:
- Remove the watch predicate that dropped COS update events when
ProgressDeadlineExceededwas set, allowing spec patches like archival to be reconciled. - Refactor progress-deadline handling (shared
durationUntilDeadline, stickyProgressDeadlineExceeded, requeue-at-deadline behavior) and add a deadline-aware rate limiter to cap backoff until the deadline. - Add an e2e scenario and a new step to patch COS lifecycle to
Archived, plus improve e2e cleanup synchronization.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| test/e2e/steps/steps.go | Adds a Godog step to patch a ClusterObjectSet’s spec.lifecycleState. |
| test/e2e/steps/hooks.go | Makes scenario cleanup wait for concurrent deletions to finish. |
| test/e2e/features/revision.feature | Adds an e2e scenario covering archival + resource cleanup after ProgressDeadlineExceeded. |
| internal/operator-controller/controllers/clusterobjectset_controller.go | Removes the skip predicate; makes deadline handling sticky; adds deadline-aware requeue + custom rate limiter. |
| internal/operator-controller/controllers/clusterobjectset_controller_test.go | Updates the progress-deadline requeue expectation to match the new scheduling behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if _, err := k8sClient(args...); err != nil { | ||
| logger.Info("Error deleting resource", "name", res.name, "namespace", res.namespace, "stderr", stderrOutput(err)) | ||
| } | ||
| wg.Done() | ||
| }(r) |
| return true | ||
| }, | ||
| } | ||
| c.Clock = clock.RealClock{} |
There was a problem hiding this comment.
Not something I changed in this PR, so out-of-scope. Regardless, caller's don't call SetupWithManager in tests, so this isn't a problem in practice.
| cos := &ocv1.ClusterObjectSet{} | ||
| if err := r.client.Get(context.Background(), item.NamespacedName, cos); err != nil { |
There was a problem hiding this comment.
Our use of the rate limiter uses the informer cache which is in-memory and does not block.
aea3d72 to
e1b56c2
Compare
e1b56c2 to
00c3c75
Compare
There was a problem hiding this comment.
Pull request overview
This PR fixes a reconciliation dead-end for ClusterObjectSet (COS) revisions that hit ProgressDeadlineExceeded, ensuring they can still reconcile (e.g., to process archival) and improving deadline-related requeue behavior so deadline status is set promptly.
Changes:
- Removes the update predicate that dropped COS updates when
ProgressDeadlineExceededwas set. - Refactors progress-deadline handling into shared helpers (
durationUntilDeadline,requeueForDeadline) and adds a deadline-aware rate limiter to cap exponential backoff at the deadline. - Extends e2e coverage with a new scenario that archives a deadline-exceeded COS and verifies resource cleanup; improves scenario cleanup to wait for deletions to complete.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/e2e/steps/steps.go | Adds an e2e step to patch a COS lifecycle state (used for archival in scenarios). |
| test/e2e/steps/hooks.go | Ensures scenario cleanup waits for concurrent deletion commands to finish. |
| test/e2e/features/revision.feature | Adds an e2e scenario that drives a COS into ProgressDeadlineExceeded, archives it, and asserts resources are removed. |
| internal/operator-controller/controllers/clusterobjectset_controller.go | Removes the deadline-exceeded predicate; adds deadline-aware requeue/rate-limiting and updates Progressing condition handling. |
| internal/operator-controller/controllers/clusterobjectset_controller_test.go | Updates expected requeue timing behavior for progress deadline handling. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| remaining, hasDeadline := c.durationUntilDeadline(cos) | ||
| isDeadlineExceeded := hasDeadline && remaining <= 0 | ||
|
|
| // deadlineAwareRateLimiter wraps a delegate rate limiter and caps the backoff | ||
| // duration to the time remaining until the COS progress deadline (+2s), ensuring | ||
| // that ProgressDeadlineExceeded is set promptly even during exponential backoff. | ||
| type deadlineAwareRateLimiter struct { | ||
| delegate workqueue.TypedRateLimiter[ctrl.Request] |
00c3c75 to
9768528
Compare
Remove the skipProgressDeadlineExceededPredicate that blocked all update events for COS objects with ProgressDeadlineExceeded. This predicate prevented archival of stuck revisions because the lifecycle state patch was dropped as an update event. To prevent the reconcile loop that the predicate was masking, markAsProgressing now sets ProgressDeadlineExceeded instead of RollingOut/Retrying when the deadline has been exceeded. Terminal reasons (Succeeded) always apply. Unregistered reasons panic. Continue reconciling after ProgressDeadlineExceeded rather than clearing the error and stopping requeue. This allows revisions to recover if a transient error resolves itself, even after the deadline was exceeded. Extract durationUntilDeadline as a shared helper for deadline computation. Add a deadlineAwareRateLimiter that caps exponential backoff at the deadline so ProgressDeadlineExceeded is set promptly even during error retries. Move the deadline requeue logic into requeueForDeadline, called from within reconcile when probes are still failing. Fix e2e scenario cleanup to wait for resource deletions to complete before returning. Add an e2e test that creates a COS with a never-ready deployment, waits for ProgressDeadlineExceeded, archives the COS, and verifies cleanup. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
9768528 to
ef08a54
Compare
Description
The
skipProgressDeadlineExceededPredicateblocked all update events for COS objectswith
ProgressDeadlineExceeded, which prevented archival of stuck revisions — thelifecycle state patch was silently dropped.
This PR:
markAsProgressingto setProgressDeadlineExceededinstead ofRollingOut/Retryingwhen the deadline has been exceeded, preventing the reconcileloop the predicate was masking.
Succeededalways applies; unregistered reasons panicProgressDeadlineExceededrather than clearing theerror and stopping requeue. This allows revisions to recover if a transient error
resolves itself, even after the deadline was exceeded
durationUntilDeadlineas a shared helper for deadline computationdeadlineAwareRateLimiterthat caps exponential backoff at the deadline soProgressDeadlineExceededis set promptly even during error retriesrequeueForDeadline, called from withinreconcilewhen probes are still failingProgressDeadlineExceeded, archives the COS, and verifies resource cleanupAddresses feedback from:
#2610 (comment)
Reviewer Checklist