CORENET-6572: only report Progressing for active network rollouts #2937
@jluhrsen: This pull request references CORENET-6572, which is a valid Jira issue. Warning: The referenced Jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
Walkthrough
Removed global MCP-progressing check; per-rendered-MC processing now uses source-based machineconfig helpers and defers pruning until a non-paused pool confirms removal. Persisted per-pod snapshot fields (
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/controller/statusmanager/status_manager_test.go (1)
1700-1736: Please add a cold-start mid-rollout regression case.
Current tests validate rollout continuation with existing state and reboot churn without rollout, but they don’t cover first-observation behavior when last-seen state is empty and the workload is already at `observedGeneration==generation`, `updated==replicas`, `unavailable>0`. That scenario should be pinned to avoid a false-negative Progressing.
As per coding guidelines, "**: Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Also applies to: 1826-1856
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/statusmanager/status_manager_test.go` around lines 1700 - 1736, Add a new unit test that covers the cold-start mid-rollout regression: create a Deployment (reuse depB or name it depColdStart) with Status.ObservedGeneration == Generation, Status.UpdatedReplicas == Status.Replicas, and Status.UnavailableReplicas > 0 but simulate an empty last-seen state (i.e., do not pre-populate any prior status in the status manager). Then call setStatus(t, client, depColdStart), run status.SetFromPods(), and call getStatuses(client, "testing"); finally assert via conditionsInclude on oc.Status.Conditions that OperatorStatusTypeProgressing is ConditionTrue (and other expected conditions mirror the other tests). Use the same helpers referenced in the diff (setStatus, status.SetFromPods, getStatuses, conditionsInclude) and place the test near the existing rollout tests so it executes in the same context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/statusmanager/pod_status.go`:
- Around line 93-110: The dsState.RolloutGeneration must be initialized when the
controller first sees a DaemonSet mid-convergence even if
generation==observedGeneration; change the initialization logic in the DaemonSet
handling (variables: dsState, RolloutGeneration, currentRolloutGeneration) so
that when a state entry is new or RolloutGeneration==0 and the DaemonSet has
in-flight convergence (ds.Status.NumberUnavailable > 0 OR
(ds.Status.DesiredNumberScheduled > 0 && ds.Status.UpdatedNumberScheduled <
ds.Status.DesiredNumberScheduled)), you set dsState.RolloutGeneration =
currentRolloutGeneration (respecting the existing status.installComplete guard
if needed); apply the same initialization pattern to the analogous StatefulSet
and Deployment blocks (the corresponding variables and checks around
observedGeneration, UpdatedNumberScheduled/Ready/DesiredNumberScheduled, and
NumberUnavailable/NotReady).
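The in-flight condition quoted in the inline comment above can be sketched in isolation. This is a minimal illustration of the reviewer's suggested check, not the code in the PR: `dsStatus` is a stand-in for the real DaemonSet status type, and `shouldSeedRollout` is a hypothetical helper name. It also omits the `RolloutGeneration==0` and `status.installComplete` parts of the suggestion.

```go
package main

import "fmt"

// dsStatus mirrors only the DaemonSet status fields named in the review
// comment. It is a stand-in for illustration, not the real API type.
type dsStatus struct {
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberUnavailable      int32
}

// convergenceInFlight applies the two conditions quoted in the comment:
// some pods are unavailable, or not every desired pod is updated yet.
func convergenceInFlight(s dsStatus) bool {
	return s.NumberUnavailable > 0 ||
		(s.DesiredNumberScheduled > 0 && s.UpdatedNumberScheduled < s.DesiredNumberScheduled)
}

// shouldSeedRollout is a hypothetical helper: it is true only when the
// controller has no prior state for the workload (cold start) and the
// workload is already mid-convergence.
func shouldSeedRollout(hadState bool, s dsStatus) bool {
	return !hadState && convergenceInFlight(s)
}

func main() {
	coldStart := dsStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 3, NumberUnavailable: 1}
	settled := dsStatus{DesiredNumberScheduled: 3, UpdatedNumberScheduled: 3}
	fmt.Println(shouldSeedRollout(false, coldStart)) // first observation, pod unavailable
	fmt.Println(shouldSeedRollout(false, settled))   // first observation, fully converged
}
```

Note that this check treats any unavailability on first observation as a rollout, which is exactly the case that has to be weighed against reboot churn elsewhere in this thread.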
📒 Files selected for processing (3)
- pkg/controller/statusmanager/machineconfig_status.go
- pkg/controller/statusmanager/pod_status.go
- pkg/controller/statusmanager/status_manager_test.go
da9acf2 to 2e2113f
|
@danwinship, if you have a chance, please check this one out. It looks like the network operator will hit this issue in 90% of the 4.21->4.22 aws-ovn-upgrade jobs, so running it here a few times will be enough to validate the fix at that high level.
/payload-job-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5 |
|
/retest |
2e2113f to bb1965f
|
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5 |
|
@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/0c24b550-23ae-11f1-97df-f16d2b83fd79-0 |
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/statusmanager/status_manager_test.go`:
- Line 520: The test is asserting co.Status.Versions from a stale
ClusterOperator object after calling getStatuses; update the test to re-fetch or
assign the returned updated ClusterOperator before asserting Versions (e.g., use
the returned oc from getStatuses to set co = oc or call getStatuses again) so
that assertions against Status.Versions use the latest ClusterOperator state;
specifically adjust the places around the getStatuses call and the subsequent
assertions that reference co.Status.Versions (occurrences around getStatuses, co
variable, and Status.Versions) to use the refreshed object.
📒 Files selected for processing (3)
- pkg/controller/statusmanager/machineconfig_status.go
- pkg/controller/statusmanager/pod_status.go
- pkg/controller/statusmanager/status_manager_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/controller/statusmanager/pod_status.go
|
sorry about the |
|
I hope it should work now.
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5 |
|
@petr-muller: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5e2eeb00-23b2-11f1-9a0c-ae4d6f10dbf5-0 |
no worries @petr-muller . thanks for kicking it off again |
|
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5 |
|
@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2429a040-2582-11f1-86c7-0543ceeba5f2-0 |
|
/retest /payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5 |
|
@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2c409070-261a-11f1-8f79-8f7632132838-0 |
danwinship left a comment
I feel like #2936 was probably closer to correct...
        sourceNames.Insert(source.Name)
    }
    return sourceNames.IsSuperset(machineConfigs)
}
So this is just duplicating the logic of mcutil.AreMachineConfigsRemovedFromPool but removing the check that status.MachineCount == status.UpdatedMachineCount. We should not be duplicating that logic here; it should remain in mcutil.
If it is actually correct that we need to check the machine count here, but not from the place in pkg/network/ovn_kubernetes.go that calls mcutil.AreMachineConfigsRemovedFromPool, then we should fix this by splitting out that check from mcutil.AreMachineConfigsRemovedFromPool. But if we don't want that check here, then I'm not convinced we want that check in pkg/network/ovn_kubernetes.go either.
@danwinship, I've gone back and forth on this, and it's a bit different now from what you first reviewed. I'll try to respond to each comment, though:
We've moved the helpers AreMachineConfigsRenderedOnPoolSource() / AreMachineConfigsRemovedFromPoolSource() to mcutil.
The logic should be that we only care whether the CNO MC is part of the pool's rendered source yet, not whether it has finished converging on nodes. Once it is, we don't need to keep CNO as Progressing, even in the case of generic MCP node churn.
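The source-only idea in this reply can be sketched with plain maps. This is an illustration under assumptions, not the mcutil implementation: the function names `renderedOnSource` and `removedFromSource` and the machine config names are hypothetical, and the set logic simply mirrors the `IsSuperset` / `!HasAny` calls visible in the diff snippets in this thread.

```go
package main

import "fmt"

// toSet builds a lookup set from a list of names.
func toSet(names []string) map[string]struct{} {
	set := make(map[string]struct{}, len(names))
	for _, n := range names {
		set[n] = struct{}{}
	}
	return set
}

// renderedOnSource reports whether every machine config is listed in the
// pool's rendered source (the IsSuperset check). Node convergence is
// deliberately not consulted.
func renderedOnSource(sourceNames, machineConfigs []string) bool {
	set := toSet(sourceNames)
	for _, mc := range machineConfigs {
		if _, ok := set[mc]; !ok {
			return false
		}
	}
	return true
}

// removedFromSource reports whether none of the machine configs is listed
// in the pool's rendered source (the !HasAny check).
func removedFromSource(sourceNames, machineConfigs []string) bool {
	set := toSet(sourceNames)
	for _, mc := range machineConfigs {
		if _, ok := set[mc]; ok {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical names, for illustration only.
	source := []string{"00-worker", "80-example-worker"}
	cno := []string{"80-example-worker"}
	fmt.Println(renderedOnSource(source, cno), removedFromSource(source, cno))
}
```

Neither helper looks at machine counts, which is the point of contention in the review: whether a config is in the rendered source is answered separately from whether nodes have finished updating.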
        sourceNames.Insert(source.Name)
    }
    return !sourceNames.HasAny(machineConfigs.UnsortedList()...)
}
hope the above reply makes sense.
    default:
        return 1
    }
}
observedGeneration must be less than or equal to generation, so this function never returns observedGeneration. It's just max(generation, 1).
got it. should be good now
        (ds.Status.DesiredNumberScheduled == 0 || ds.Status.NumberAvailable >= ds.Status.DesiredNumberScheduled) {
        dsState.RolloutGeneration = 0
    }
    dsRolloutActive := dsState.RolloutGeneration != 0
This seems like it must be way more complicated than it needs to be. I don't remember the exact details of how the daemonset controller reports status, but you should not need all of this just to distinguish the cases of "daemonset controller is doing stuff because the DaemonSet was updated" vs "daemonset controller is doing stuff because the nodes were updated".
What does dsState.RolloutGeneration represent? It seems that sometimes it is 0 and sometimes it is 1 and sometimes it is the current ds.Generation. This seems like it's overloaded and trying to represent multiple things at once.
Yes, TBH I was not very happy with how this was showing up, and it is a little complicated to look at. The newest update puts the work into dedicated helpers instead, with a comment block that hopefully explains it better. There is no more RolloutGeneration; instead we have a RolloutActive bool, set in these new dedicated helpers, that tells us whether we already saw a rollout start. Keeping track of that should help with knowing whether we are in a real rollout vs. reboot churn.
    reachedAvailableLevel = false

    dsState, exists := daemonsetStates[dsName]
    if !exists || !reflect.DeepEqual(dsState.LastSeenStatus, ds.Status) {
(with the dsState, exists := daemonsetStates[dsName] moved, it's no longer clear what exists means at this point, so if the code stays like this it would need to be renamed)
agreed. it's changed to hadState here now if that helps it make more sense?
|
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5 |
|
@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3d993720-27f8-11f1-9bbb-e6c2fe987086-0 |
Actionable comments posted: 1
♻️ Duplicate comments (1)
pkg/controller/statusmanager/status_manager_test.go (1)
520-520: ⚠️ Potential issue | 🟡 Minor
Refresh `co` before the `Status.Versions` assertions.
These two calls still discard the updated `ClusterOperator`, but Lines 540 and 576 keep reading `co.Status.Versions`. That means the test is still validating the object fetched earlier in the method and can miss a regression in version mirroring.
Suggested fix
    - _, oc, err = getStatuses(client, "testing")
    + co, oc, err = getStatuses(client, "testing")
    @@
    - _, oc, err = getStatuses(client, "testing")
    + co, oc, err = getStatuses(client, "testing")
Also applies to: 564-564
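The stale-object problem described here is easy to reproduce outside the test suite. The sketch below is a simplified model, assuming a hypothetical in-memory `store` in place of the fake client and a `fetch` method in place of `getStatuses`; it only shows why discarding the returned object with `_` leaves later assertions reading an old snapshot.

```go
package main

import "fmt"

// clusterOperator is a stand-in carrying only the field under test.
type clusterOperator struct {
	Versions []string
}

// store is a hypothetical in-memory replacement for the fake API client.
type store struct {
	co clusterOperator
}

// fetch returns a copy of the stored object, the way a client GET returns
// a snapshot rather than a live reference.
func (s *store) fetch() clusterOperator {
	return clusterOperator{Versions: append([]string(nil), s.co.Versions...)}
}

func main() {
	s := &store{}
	co := s.fetch() // snapshot taken before the update

	s.co.Versions = []string{"v2"} // the controller updates the object

	_ = s.fetch()                 // result discarded: co is still the old snapshot
	fmt.Println(len(co.Versions)) // assertions here would see no versions

	co = s.fetch()                // refreshed snapshot
	fmt.Println(len(co.Versions)) // assertions now see the update
}
```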
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/statusmanager/status_manager_test.go` at line 520, Test reads stale ClusterOperator into variable co then asserts on co.Status.Versions; refresh co from the API before those assertions. After the call to getStatuses (or where the operator update occurs), re-fetch the updated ClusterOperator (e.g., call getStatuses again or perform a fresh GET for the ClusterOperator) and use that refreshed co when asserting on co.Status.Versions so the test validates the latest mirrored versions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/controller/statusmanager/pod_status.go`:
- Line 41: The InstallComplete field is a plain bool so legacy annotations
deserialize as "missing" => false; change the field type to *bool
(InstallComplete *bool) and update all restore/read sites to treat nil as
"unknown" (fall back to existing ClusterOperator availability or previous
versioned state) before deciding to flip/install-complete; specifically update
the annotation restore logic, any IsInstallComplete checks, and the reconcile
paths that currently assume false (references to InstallComplete usage around
the earlier reconcile check and the bootstrap re-entry logic) to handle nil
safely and only treat explicit false when the pointer is non-nil.
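The `*bool` suggestion relies on how `encoding/json` treats absent keys: a plain `bool` decodes a missing field as `false`, while a pointer stays `nil`. A minimal sketch follows, assuming a hypothetical `lastSeenState` payload and an `installComplete` helper that takes the fallback value from the caller; the real annotation format and restore logic are not shown in this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lastSeenState is a hypothetical stand-in for the persisted annotation
// payload. The pointer lets "field absent" differ from "explicit false".
type lastSeenState struct {
	InstallComplete *bool `json:"InstallComplete,omitempty"`
}

// installComplete decodes the payload and resolves the tri-state value:
// nil means legacy data written before the field existed, so the caller's
// fallback (for example, existing ClusterOperator availability) is used.
func installComplete(raw string, fallback bool) (bool, error) {
	var s lastSeenState
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return false, err
	}
	if s.InstallComplete == nil {
		return fallback, nil
	}
	return *s.InstallComplete, nil
}

func main() {
	legacy, _ := installComplete(`{}`, true)
	explicit, _ := installComplete(`{"InstallComplete":false}`, true)
	fmt.Println(legacy, explicit)
}
```

With a plain `bool`, both inputs above would decode to `false`, and an upgraded cluster could be treated as if installation had never completed.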
📒 Files selected for processing (4)
- pkg/controller/statusmanager/machineconfig_status.go
- pkg/controller/statusmanager/pod_status.go
- pkg/controller/statusmanager/status_manager_test.go
- pkg/util/machineconfig/util.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/controller/statusmanager/machineconfig_status.go
727c270 to 075dccd
|
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 5 |
|
@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/41aa3610-27f9-11f1-8f25-d6288f44483f-0 |
|
@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a9c8c0d0-3312-11f1-8eac-8a15b15994db-0 |
|
/retest |
|
/payload-aggregate periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 10 |
|
@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3fe44330-336a-11f1-8103-beaac89a22c3-0 |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: danwinship, jluhrsen |
|
/verified by the 20 jobs in the previous two aggregate runs (10 each) of periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade |
|
@jluhrsen: This PR has been marked as verified by |
|
/retest |
|
/test e2e-azure-ovn-upgrade |
|
/test e2e-azure-ovn-upgrade |
|
/retest |
|
/tide refresh |
|
/test e2e-aws-ovn-hypershift-conformance |
|
/test e2e-aws-ovn-hypershift-conformance |
|
/override ci/prow/e2e-aws-ovn-hypershift-conformance |
|
@danwinship: Overrode contexts on behalf of danwinship: ci/prow/e2e-aws-ovn-hypershift-conformance |
|
/tide refresh |
|
/skip |
|
/override ci/prow/e2e-metal-ipi-ovn-dualstack-bgp |
|
@danwinship: Overrode contexts on behalf of danwinship: ci/prow/e2e-metal-ipi-ovn-dualstack-bgp |
|
@jluhrsen: The following tests failed, say
|
/override ci/prow/e2e-ovn-ipsec-step-registry |
|
@danwinship: Overrode contexts on behalf of danwinship: ci/prow/e2e-ovn-ipsec-step-registry |
avoid false Progressing during reboot churn
Keep pod-based Progressing tied to an actual CNO rollout instead of
temporary unavailability during node reboot churn.
Detect rollouts with a simple two-phase approach:
Once Updated >= Current, treat pod unavailability as node reboot churn
rather than network rollout progress, avoiding false Progressing conditions.
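The rule stated above can be sketched as a small classifier. This is an illustration under assumptions rather than the operator's code: `podCounts` and its fields are hypothetical stand-ins for the per-workload status counters, and the first phase (fewer updated pods than current pods means a rollout is underway) is inferred from the "Once Updated >= Current" wording.

```go
package main

import "fmt"

// podCounts holds hypothetical per-workload counters. The names follow the
// commit message wording, not a specific Kubernetes status type.
type podCounts struct {
	Current     int32 // pods that currently exist
	Updated     int32 // pods already running the newest revision
	Unavailable int32 // pods that are not available right now
}

// classify applies the two phases in order.
func classify(c podCounts) string {
	switch {
	case c.Updated < c.Current:
		// Phase 1: some pods still run the old revision.
		return "rollout"
	case c.Unavailable > 0:
		// Phase 2: all pods are updated, so unavailability is attributed
		// to node reboot churn instead of a network rollout.
		return "churn"
	default:
		return "steady"
	}
}

// reportsProgressing is true only for an actual rollout.
func reportsProgressing(c podCounts) bool {
	return classify(c) == "rollout"
}

func main() {
	fmt.Println(classify(podCounts{Current: 6, Updated: 4, Unavailable: 1}))
	fmt.Println(classify(podCounts{Current: 6, Updated: 6, Unavailable: 2}))
	fmt.Println(classify(podCounts{Current: 6, Updated: 6}))
}
```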
Restore install-complete state safely across upgrades by treating older
last-seen annotations that omit InstallComplete as legacy data instead
of assuming install is incomplete again.
For machine config status, stop treating generic MCP node convergence
as a CNO rollout signal. Reuse shared mcutil helpers for source-only
checks, and only prune removed machine-config state after every matching
non-paused pool has dropped that config from its rendered source.
Co-authored-by: Codex
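The pruning rule in the machine config paragraph of the commit message can be sketched as follows. The `pool` type, its fields, and `canPrune` are hypothetical, and the input slice is assumed to be already filtered to the pools that match the config. Requiring at least one non-paused pool to confirm removal is an assumption drawn from the walkthrough wording ("defers pruning until a non-paused pool confirms removal"), not something the commit message spells out.

```go
package main

import "fmt"

// pool is a hypothetical, reduced view of a machine config pool.
type pool struct {
	Name    string
	Paused  bool
	Sources []string // machine config names in the pool's rendered source
}

// contains reports whether name appears in names.
func contains(names []string, name string) bool {
	for _, n := range names {
		if n == name {
			return true
		}
	}
	return false
}

// canPrune reports whether state for machineConfig may be dropped: every
// non-paused pool must no longer list it, and at least one non-paused pool
// must exist to confirm that (an assumption, see the lead-in).
func canPrune(pools []pool, machineConfig string) bool {
	confirmed := false
	for _, p := range pools {
		if p.Paused {
			continue // a paused pool cannot confirm removal
		}
		if contains(p.Sources, machineConfig) {
			return false
		}
		confirmed = true
	}
	return confirmed
}

func main() {
	pools := []pool{
		{Name: "worker", Sources: []string{"00-worker"}},
		{Name: "infra", Paused: true, Sources: []string{"80-example"}},
	}
	fmt.Println(canPrune(pools, "80-example"))
}
```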