OCPEDGE-2491: Log pcs status and etcd member list after every recovery test (#30949)
Add post-test logging of `sudo pcs status` and `sudo podman exec etcd etcdctl member list -w table` via SSH through the hypervisor after every recovery test (pass or fail). This provides visibility into the final cluster state without relying on the Kubernetes API, which may be unavailable after recovery tests. The logging is registered via DeferCleanup in BeforeEach and is gated on HasHypervisorConfig(). Errors are logged but never fail the test.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Skipping CI for Draft Pull Request.
@lucaconsalvi: This pull request references OCPEDGE-2491, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough: Added a post-spec cleanup hook that always collects concise Pacemaker (`pcs`) and etcd status after every recovery spec.

Estimated code review effort: 🎯 3 (Moderate), ⏱️ ~20 minutes
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/two_node/tnf_recovery.go`:
- Around line 828-843: The SSH helper calls currently discard stderr (using `_`)
and only log the error object on failure, losing valuable diagnostics; update
the calls to services.PcsStatus and core.ExecuteRemoteSSHCommand so they capture
both stdout and stderr (e.g., pcsOutput, pcsErrOutput, pcsErr and etcdOutput,
etcdErrOutput, etcdErr), and when pcsErr or etcdErr is non-nil include both the
stdout and stderr variables in the framework.Logf message (alongside the error)
to preserve command output for debugging while keeping the existing success-path
logging unchanged.
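The capture-both-streams pattern the prompt describes can be sketched with the standard library alone. The `services.PcsStatus` and `core.ExecuteRemoteSSHCommand` helpers above are the repo's own; this standalone sketch substitutes a local shell command to show the same idea:

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// runAndCapture runs a command with separate stdout/stderr buffers, so both
// streams can be included in a log message when the command fails.
func runAndCapture(name string, args ...string) (stdout, stderr string, err error) {
	var outBuf, errBuf bytes.Buffer
	cmd := exec.Command(name, args...)
	cmd.Stdout = &outBuf
	cmd.Stderr = &errBuf
	err = cmd.Run()
	return outBuf.String(), errBuf.String(), err
}

func main() {
	// A failing command: stderr carries the diagnostic that would otherwise be lost.
	stdout, stderr, err := runAndCapture("sh", "-c", "echo ok; echo oops >&2; exit 1")
	if err != nil {
		// Mirrors the suggested framework.Logf call: error, stdout, and stderr together.
		fmt.Printf("command failed: %v (stdout=%q, stderr=%q)\n", err, stdout, stderr)
	}
}
```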
ℹ️ Review info: Configuration used: Organization UI · Review profile: CHILL · Plan: Pro · Run ID: 562d68bd-fc7c-43f9-8584-c2979c2c83dc

📒 Files selected for processing (1): test/extended/two_node/tnf_recovery.go
Scheduling required tests:
Include stdout and stderr in error log messages for pcs status and etcd member list commands, so diagnostic output is not lost on failure.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
/test e2e-gcp-ovn-upgrade

/test e2e-metal-ipi-ovn-ipv6
Job Failure Risk Analysis for sha: d451f3c
fonta-rh left a comment:
LGTM — clean diagnostic addition, follows existing patterns well.

Three optional nit-level suggestions inline (not blocking):

- `os.Stat` guard: `err != nil` catches all stat failures, not just "not found"
- Nil-guard on the `GetHypervisorConfig()` return (defensive; matches a pre-existing gap in `setupMinimalTestEnvironment`)
- A comment clarifying the relationship with `deferDiagnosticsOnFailure`
On `// Log final pcs and etcd status after every test (pass or fail) via SSH`:

Nit: a reader seeing both this and `deferDiagnosticsOnFailure` in the `It` blocks may wonder about redundancy. Consider clarifying:

```diff
-// Log final pcs and etcd status after every test (pass or fail) via SSH
+// Log concise pcs and etcd status after every test (pass or fail) via SSH.
+// Complements deferDiagnosticsOnFailure which gathers verbose diagnostics only on failure.
```
On `sshConfig := exutil.GetHypervisorConfig()`:
Nit: HasHypervisorConfig() only checks that the config string is non-empty — if the JSON is malformed, GetHypervisorConfig() returns nil and the next line panics. Same gap exists in setupMinimalTestEnvironment (line 673), but a panic inside DeferCleanup is uglier since it fails the test.
```diff
 sshConfig := exutil.GetHypervisorConfig()
+if sshConfig == nil {
+	framework.Logf("Skipping final cluster status: failed to parse hypervisor config")
+	return
+}
```
On the private-key check:

```go
if _, err := os.Stat(hypervisorConfig.PrivateKeyPath); os.IsNotExist(err) {
	framework.Logf("Skipping final cluster status: private key not found at %s", hypervisorConfig.PrivateKeyPath)
	return
}
```
Nit: os.IsNotExist(err) only matches one specific stat error. If os.Stat fails with e.g. EACCES, the guard is false, execution continues, and SSH fails later with a misleading auth error. Checking err != nil is strictly better here.
```diff
-if _, err := os.Stat(hypervisorConfig.PrivateKeyPath); os.IsNotExist(err) {
-	framework.Logf("Skipping final cluster status: private key not found at %s", hypervisorConfig.PrivateKeyPath)
+if _, err := os.Stat(hypervisorConfig.PrivateKeyPath); err != nil {
+	framework.Logf("Skipping final cluster status: cannot access private key at %s: %v", hypervisorConfig.PrivateKeyPath, err)
 	return
 }
```
/hold to leave time to address optional nits, feel free to /unhold |
/approve |
/lgtm |
Address review feedback on PR openshift#30949:

- Guard against nil GetHypervisorConfig() return on malformed JSON
- Broaden os.Stat check from os.IsNotExist to err != nil
- Clarify logFinalClusterStatus vs deferDiagnosticsOnFailure roles

Applied to both logFinalClusterStatus and setupMinimalTestEnvironment.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
/lgtm |
🧹 Nitpick comments (1)

test/extended/two_node/tnf_recovery.go (1)

799-819: Extract the hypervisor SSH bootstrap into one helper. Lines 799 through 819 duplicate the same config-parse / `core.SSHConfig` population / key validation / known-hosts setup flow already in `setupMinimalTestEnvironment` at line 679 onward. This PR already had to harden both paths the same way, so they're likely to drift again. A small shared helper that returns a validated SSH config plus known-hosts path would keep the skip/fail behavior consistent in one place.

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/two_node/tnf_recovery.go` around lines 799 - 819, Extract the duplicated hypervisor SSH bootstrap sequence into a single helper (e.g., PrepareValidatedHypervisorSSH or GetValidatedHypervisorSSHConfig) that performs exutil.GetHypervisorConfig(), populates and returns a core.SSHConfig, validates the private key file (os.Stat), calls core.PrepareLocalKnownHostsFile and returns the knownHostsPath and error; then replace the duplicated blocks in tnf_recovery.go (the final-cluster-status skip block) and in setupMinimalTestEnvironment with calls to this helper and propagate/handle its error to keep skip/fail behavior consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@test/extended/two_node/tnf_recovery.go`:
- Around line 799-819: Extract the duplicated hypervisor SSH bootstrap sequence
into a single helper (e.g., PrepareValidatedHypervisorSSH or
GetValidatedHypervisorSSHConfig) that performs exutil.GetHypervisorConfig(),
populates and returns a core.SSHConfig, validates the private key file
(os.Stat), calls core.PrepareLocalKnownHostsFile and returns the knownHostsPath
and error; then replace the duplicated blocks in tnf_recovery.go (the
final-cluster-status skip block) and in setupMinimalTestEnvironment with calls
to this helper and propagate/handle its error to keep skip/fail behavior
consistent.
ℹ️ Review info: Configuration used: Organization UI · Review profile: CHILL · Plan: Pro · Run ID: d09c518c-879a-4dce-8ebe-32cb7512c300

📒 Files selected for processing (1): test/extended/two_node/tnf_recovery.go
Test results

Ran the double node failure (cold-boot) recovery test on a

What was verified

Sample output from the new logging:

```
========== FINAL CLUSTER STATUS ==========
etcd member list from node master-1:
```
/unhold
/approve

Thank you for putting this together! This will help a lot for CI debugging.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: fonta-rh, jaypoulz, lucaconsalvi. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@jaypoulz: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3312eb00-344d-11f1-9853-1c153a7c7070-0
/verified by lucaconsalvi
@lucaconsalvi: This PR has been marked as verified by lucaconsalvi.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@lucaconsalvi: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Summary

- Log `sudo pcs status` and `sudo podman exec etcd etcdctl member list -w table` after every recovery test (pass or fail)
- Provides visibility into the final cluster state without relying on the Kubernetes API, which may be unavailable after recovery tests
- Only runs when a hypervisor SSH config is configured

Details
Recovery tests currently don't log final cluster state, making it harder to diagnose
failures. This adds a logFinalClusterStatus helper registered via DeferCleanup in
BeforeEach that SSHs through the hypervisor to collect pacemaker and etcd membership
status from the cluster nodes after every test completes.
The function tries each node in order and stops after the first node where both commands
succeed (since both commands return cluster-wide state). If a node is unreachable or a
command fails, it logs the error and tries the next node.
Test plan
Fixes: OCPEDGE-2491