NVIDIA-596: Enable dpu healthcheck #2941
Conversation
|
@tsorya: This pull request references NVIDIA-596 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
1 similar comment
|
@tsorya: This pull request references NVIDIA-596 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughAdds DPU node lease support: new bootstrap config fields and defaults, reads and validates values from hardware-offload ConfigMap, exposes stringified values to templates, injects env vars and ovnkube CLI flags for DPU node modes, adds ConfigMap defaults and tests. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 9 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (9 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.11.4)level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/Masterminds/semver@v1.5.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/Masterminds/sprig/v3@v3.2.3: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/containernetworking/cni@v0.8.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ghodss/yaml@v1.0.1-0.20190212211648-25d852aebe32: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/go-bindata/go-bindata@v3.1.2+incompatible: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/onsi/gomega@v1.39.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ope ... [truncated 17356 characters] ... ired in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/gengo/v2@v2.0.0-20251215205346-5ee0d033ba5b: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kms@v0.35.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kube-aggregator@v0.35.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/randfill@v1.0.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/structured-merge-diff/v6@v6.3.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n" Comment |
62c31b1 to
b5a3d66
Compare
|
/retest-required |
|
Blocked by k8snetworkplumbingwg/multus-cni#1490 |
|
@tsorya Could you please help rebase this PR, then I can build an image to run some pre-merge testing. |
…Sets Add configurable DPU node lease renew interval and duration as env vars on ovnkube-controller, gated to dpu-host/dpu modes. Script-lib builds CLI flags from env vars. Values read from hardware-offload-config ConfigMap with defaults 10s/40s. Setting either to 0 disables the health check. Lease namespace derived via fieldRef. Jira: https://issues.redhat.com/browse/NVIDIA-596
1eb0381 to
6b9ed3a
Compare
done |
|
@yingwang-0320: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@tsorya: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| fi | ||
|
|
||
| if [ "${OVN_NODE_MODE}" == "dpu-host" ] || [ "${OVN_NODE_MODE}" == "dpu" ]; then | ||
| if [[ -n "${OVNKUBE_NODE_LEASE_RENEW_INTERVAL}" ]]; then |
There was a problem hiding this comment.
nitpick: be consistent with [[ / ]] rather than [ / ]
| dpu_lease_flags="--dpu-node-lease-renew-interval ${OVNKUBE_NODE_LEASE_RENEW_INTERVAL}" | ||
| fi | ||
| if [[ -n "${OVNKUBE_NODE_LEASE_DURATION}" ]]; then | ||
| dpu_lease_flags="$dpu_lease_flags --dpu-node-lease-duration ${OVNKUBE_NODE_LEASE_DURATION}" |
There was a problem hiding this comment.
ok, since I'm already nitpicking on shell script style...
| dpu_lease_flags="$dpu_lease_flags --dpu-node-lease-duration ${OVNKUBE_NODE_LEASE_DURATION}" | |
| dpu_lease_flags="${dpu_lease_flags} --dpu-node-lease-duration ${OVNKUBE_NODE_LEASE_DURATION}" |
| fi | ||
|
|
||
| if [ "${OVN_NODE_MODE}" == "dpu-host" ] || [ "${OVN_NODE_MODE}" == "dpu" ]; then | ||
| if [[ -n "${OVNKUBE_NODE_LEASE_RENEW_INTERVAL}" ]]; then |
There was a problem hiding this comment.
Actually, you could remove line 606 since the other two variables will always be unset in non-DPU modes...
| smart-nic-mode-label: "network.operator.openshift.io/smart-nic=" | ||
| mgmt-port-resource-name: "openshift.io/mgmtvf" | ||
| dpu-node-lease-renew-interval: "10" | ||
| dpu-node-lease-duration: "40" |
There was a problem hiding this comment.
10 and 40 what?
if you made this a time.Duration (dpu-node-lease-renew-interval: "10s"), it would be more self-documenting and more obvious that it had to be a string rather than a raw number
There was a problem hiding this comment.
Sounds good, will change
There was a problem hiding this comment.
The issue is it is the same name as in ovnk but i can add some suffix maybe
| daemon-config.json: | | ||
| { | ||
| "cniVersion": "0.3.1", | ||
| "cniVersion": "1.1.0", |
There was a problem hiding this comment.
Multus was updated to the new version which enables CNI status
- Remove outer DPU mode guard around lease flag construction since
the env vars are only injected for DPU modes anyway.
- Use ${var} style for string interpolation consistency.
- Rename ConfigMap keys to dpu-node-lease-renew-interval-in-seconds
and dpu-node-lease-duration-in-seconds to make the unit
self-documenting.
Made-with: Cursor
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: tsorya The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@tsorya: This pull request references NVIDIA-596 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/network/ovn_kubernetes.go`:
- Around line 1094-1102: If either ovnConfigResult.DpuNodeLeaseRenewInterval or
ovnConfigResult.DpuNodeLeaseDuration is 0 we should normalize both to 0 so the
disable semantics are consistent; update the logic around the current checks to
first detect if either field == 0 and set both
ovnConfigResult.DpuNodeLeaseRenewInterval = 0 and
ovnConfigResult.DpuNodeLeaseDuration = 0, otherwise keep the existing validation
that when both are non-zero and DpuNodeLeaseDuration <=
DpuNodeLeaseRenewInterval you log the warning and reset to
DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT and DPU_NODE_LEASE_DURATION_DEFAULT.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 9e7b87eb-40e5-48db-92c5-608250f639d9
📒 Files selected for processing (3)
bindata/network/ovn-kubernetes/common/008-script-lib.yamlhack/hardware-offload-config.yamlpkg/network/ovn_kubernetes.go
✅ Files skipped from review due to trivial changes (1)
- hack/hardware-offload-config.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- bindata/network/ovn-kubernetes/common/008-script-lib.yaml
| // Setting either value to 0 disables the DPU health check. | ||
| // When both are non-zero, duration must be greater than interval. | ||
| if ovnConfigResult.DpuNodeLeaseRenewInterval != 0 && ovnConfigResult.DpuNodeLeaseDuration != 0 && | ||
| ovnConfigResult.DpuNodeLeaseDuration <= ovnConfigResult.DpuNodeLeaseRenewInterval { | ||
| klog.Warningf("dpu-node-lease-duration-in-seconds (%d) must be greater than dpu-node-lease-renew-interval-in-seconds (%d), using defaults", | ||
| ovnConfigResult.DpuNodeLeaseDuration, ovnConfigResult.DpuNodeLeaseRenewInterval) | ||
| ovnConfigResult.DpuNodeLeaseRenewInterval = DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT | ||
| ovnConfigResult.DpuNodeLeaseDuration = DPU_NODE_LEASE_DURATION_DEFAULT | ||
| } |
There was a problem hiding this comment.
Normalize disable semantics when either lease value is 0.
Line 1094 says either value disables health check, but current flow can still render flags when renew interval is non-zero and duration is 0. Please normalize both fields to 0 when either is 0 so behavior is consistent and not dependent on downstream flag parsing behavior.
Proposed fix
- // Setting either value to 0 disables the DPU health check.
- // When both are non-zero, duration must be greater than interval.
- if ovnConfigResult.DpuNodeLeaseRenewInterval != 0 && ovnConfigResult.DpuNodeLeaseDuration != 0 &&
- ovnConfigResult.DpuNodeLeaseDuration <= ovnConfigResult.DpuNodeLeaseRenewInterval {
+ // Setting either value to 0 disables the DPU health check.
+ if ovnConfigResult.DpuNodeLeaseRenewInterval == 0 || ovnConfigResult.DpuNodeLeaseDuration == 0 {
+ ovnConfigResult.DpuNodeLeaseRenewInterval = 0
+ ovnConfigResult.DpuNodeLeaseDuration = 0
+ } else if ovnConfigResult.DpuNodeLeaseDuration <= ovnConfigResult.DpuNodeLeaseRenewInterval {
+ // When both are non-zero, duration must be greater than interval.
klog.Warningf("dpu-node-lease-duration-in-seconds (%d) must be greater than dpu-node-lease-renew-interval-in-seconds (%d), using defaults",
ovnConfigResult.DpuNodeLeaseDuration, ovnConfigResult.DpuNodeLeaseRenewInterval)
ovnConfigResult.DpuNodeLeaseRenewInterval = DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT
ovnConfigResult.DpuNodeLeaseDuration = DPU_NODE_LEASE_DURATION_DEFAULT
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Setting either value to 0 disables the DPU health check. | |
| // When both are non-zero, duration must be greater than interval. | |
| if ovnConfigResult.DpuNodeLeaseRenewInterval != 0 && ovnConfigResult.DpuNodeLeaseDuration != 0 && | |
| ovnConfigResult.DpuNodeLeaseDuration <= ovnConfigResult.DpuNodeLeaseRenewInterval { | |
| klog.Warningf("dpu-node-lease-duration-in-seconds (%d) must be greater than dpu-node-lease-renew-interval-in-seconds (%d), using defaults", | |
| ovnConfigResult.DpuNodeLeaseDuration, ovnConfigResult.DpuNodeLeaseRenewInterval) | |
| ovnConfigResult.DpuNodeLeaseRenewInterval = DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT | |
| ovnConfigResult.DpuNodeLeaseDuration = DPU_NODE_LEASE_DURATION_DEFAULT | |
| } | |
| // Setting either value to 0 disables the DPU health check. | |
| if ovnConfigResult.DpuNodeLeaseRenewInterval == 0 || ovnConfigResult.DpuNodeLeaseDuration == 0 { | |
| ovnConfigResult.DpuNodeLeaseRenewInterval = 0 | |
| ovnConfigResult.DpuNodeLeaseDuration = 0 | |
| } else if ovnConfigResult.DpuNodeLeaseDuration <= ovnConfigResult.DpuNodeLeaseRenewInterval { | |
| // When both are non-zero, duration must be greater than interval. | |
| klog.Warningf("dpu-node-lease-duration-in-seconds (%d) must be greater than dpu-node-lease-renew-interval-in-seconds (%d), using defaults", | |
| ovnConfigResult.DpuNodeLeaseDuration, ovnConfigResult.DpuNodeLeaseRenewInterval) | |
| ovnConfigResult.DpuNodeLeaseRenewInterval = DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT | |
| ovnConfigResult.DpuNodeLeaseDuration = DPU_NODE_LEASE_DURATION_DEFAULT | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/network/ovn_kubernetes.go` around lines 1094 - 1102, If either
ovnConfigResult.DpuNodeLeaseRenewInterval or
ovnConfigResult.DpuNodeLeaseDuration is 0 we should normalize both to 0 so the
disable semantics are consistent; update the logic around the current checks to
first detect if either field == 0 and set both
ovnConfigResult.DpuNodeLeaseRenewInterval = 0 and
ovnConfigResult.DpuNodeLeaseDuration = 0, otherwise keep the existing validation
that when both are non-zero and DpuNodeLeaseDuration <=
DpuNodeLeaseRenewInterval you log the warning and reset to
DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT and DPU_NODE_LEASE_DURATION_DEFAULT.
NVIDIA-596: pass DPU lease config via env vars on dpu-host/dpu DaemonSets
Add configurable DPU node lease renew interval and duration as env vars on ovnkube-controller, gated to dpu-host/dpu modes. Script-lib builds CLI flags from env vars. Values read from hardware-offload-config ConfigMap with defaults 10s/40s. Setting either to 0 disables the health check. Lease namespace derived via fieldRef.
Jira: https://issues.redhat.com/browse/NVIDIA-596
Summary by CodeRabbit
New Features
Tests