SRE-3703 ci: Fault injection testing stage on VM/bare metal #17953
grom72 wants to merge 28 commits into
Test stage Test RPMs on EL 9.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17953/56/execution/node/424/log |
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17953/56/display/redirect |
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17953/62/execution/node/369/log |
Test stage Unit Test completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17953/69/testReport/ |
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17953/76/testReport/ |
Test stage NLT Fault injection testing completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17953/90/execution/node/333/log |
Test stage NLT completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17953/90/execution/node/346/log |
Test stage NLT Fault injection testing completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17953/91/execution/node/346/log |
Test stage NLT completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17953/91/execution/node/344/log |
} // post
} // stage('Functional on Ubuntu 20.04')
stage('Fault injection testing') {
stage('NLT Fault injection testing') {
FYI this will require an update to merge requirements
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true Skip-func-test-leap15: true
pipeline-lib now supports overriding NLT/FI defaults (always_script, testResults, valgrind_pattern, with_valgrind, NLT, FI) via the config map, taking priority over the values auto-detected from the stage name by parseStageInfo. Make the Jenkinsfile stages explicit to take advantage of this and to make the stage configuration self-documenting.

NLT stage (unitTest call):
- Add with_valgrind: 'memcheck', valgrind_pattern: '*memcheck.xml', always_script: 'ci/unit/test_nlt_post.sh', testResults: 'nlt-junit.xml'

NLT stage (unitTestPost call):
- Remove always_script (now passed to unitTest above)
- Add NLT: true to explicitly activate the NLT post-processing block (recordIssues, discoverGitReferenceBuild) instead of relying on stage name detection
- Add valgrind_pattern: '*memcheck.xml' for the valgrind_stash

NLT Fault injection testing stage (unitTest call):
- Add always_script: 'ci/unit/test_nlt_post.sh', testResults: 'nlt-junit.xml'
- Add with_valgrind: '' to explicitly suppress valgrind for FI

NLT Fault injection testing stage (unitTestPost call):
- Replace always_script with FI: true to explicitly activate fault injection post-processing (nlt-client-leaks.json, 'Fault injection' naming, discoverGitReferenceBuild) instead of relying on the now-removed stage name auto-detection of FI in parseStageInfo

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true Skip-func-test-leap15: true
NLT in the stage name is no longer needed, as the required information is transferred via parameters of the unitTest and unitTestPost procedures. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-python-bandit: true Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true Skip-func-test-leap15: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true Skip-func-test-leap15: true
// To use a test branch (i.e. PR) until it lands to master
// I.e. for testing library changes
//@Library(value='pipeline-lib@your_branch') _
@Library(value=['pipeline-lib@grom72/SRE-3704']) _
To be removed before landing
…Test-FI Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true Skip-func-test-leap15: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com> Priority: 2 Cancel-prev-build: false Skip-unit-test: true Skip-unit-test-memcheck: true Skip-func-vm-all: true Skip-test-el-9-rpms: true Skip-test-leap-15-rpms: true Skip-func-hw-test: true Skip-build-el8-gcc: true Skip-build-leap15-gcc: true Skip-func-test-el9: true Skip-func-test-leap15: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@hpe.com>
This PR introduces logic that simplifies the Fault Injection testing stage in CI (the Jenkinsfile)
by moving it from a Docker container environment to a provisioned VM/bare metal environment.
Requires:
or
or
Background

The old `Fault injection testing` stage ran NLT fault injection inside a Docker container (`docker_runner_fi` + `Dockerfile.el.9`) on a shared Jenkins agent host. Up to 10 FI containers could execute simultaneously on the same host alongside other CI workloads, resulting in severe CPU and network resource contention. The symptoms were well-documented: RPC timeouts, SWIM protocol failures to make progress, and "Sluggish EC boundary" warnings, all caused by infrastructure overload rather than real code defects.

Additionally, `nlt_server.yaml` had `ABT_STACK_OVERFLOW_CHECK=mprotect` set, which causes Argobots to issue `mprotect()` calls for ULT stack overflow detection. On KVM-based VMs, each such call triggers TLB shootdown IPIs across all vCPUs, making test execution significantly slower on VMs than on bare metal or inside Docker containers where this overhead is less pronounced. This was a known cause of very long and unpredictable FI test execution times when running on VMs. Now that the stage runs on a dedicated provisioned VM with proper resources, `ABT_STACK_OVERFLOW_CHECK=mprotect` is removed from `nlt_server.yaml`, restoring test execution duration comparable to bare metal.

Two workarounds were introduced to mask this instability:
- DAOS-623 test: add allowed error for FI, commit e0fd4e3): added `skip_substrings` filters in `node_local_test.py` and `cart_logtest.py` to suppress SWIM/network-related error conditions ("sluggish ec boundary report from rank", "sluggish stable epoch reporting", "progress callback was not called for too long", "rpc failed; rc:") that were firing due to Docker resource contention.
- DAOS-623 test: ignore the server errors in client FI tests too): extended the same suppression to server-side errors seen in NLT client FI runs.
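The effect of such a suppression list can be pictured as a substring filter over error lines. The following is a minimal sketch, not the code from `node_local_test.py` or `cart_logtest.py`: the function name `reported_errors`, the ` ERR ` marker, the case-insensitive match, and the sample log lines are all assumptions made for illustration. Only the four substrings come from the description above.

```python
# Illustrative only: the log format and function name are assumptions,
# not the real node_local_test.py / cart_logtest.py code.
SKIP_SUBSTRINGS = [
    "sluggish ec boundary report from rank",
    "sluggish stable epoch reporting",
    "progress callback was not called for too long",
    "rpc failed; rc:",
]


def reported_errors(log_lines, skip_substrings=()):
    """Return the error lines that would still be reported.

    A line counts as an error if it contains the (assumed) ' ERR ' marker.
    Error lines containing any skip substring are dropped, which is how a
    suppression list hides them from the test result.
    """
    kept = []
    for line in log_lines:
        if " ERR " not in line:
            continue
        lowered = line.lower()
        if any(sub in lowered for sub in skip_substrings):
            continue
        kept.append(line)
    return kept


# Made-up log lines, for demonstration only.
sample = [
    "01/01 rank 0 ERR rpc failed; rc: -1011",
    "01/01 rank 1 ERR Sluggish EC boundary report from rank 2",
    "01/01 rank 0 ERR checksum mismatch",
    "01/01 rank 0 INFO pool created",
]
with_workaround = reported_errors(sample, SKIP_SUBSTRINGS)
without_workaround = reported_errors(sample)
```

With the list in place only the unrelated error survives; with an empty list all three error lines are reported, which is the behaviour that reverting the workarounds restores.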
Both PRs were explicitly described as temporary workarounds, with the expectation that they would be reverted once FI testing was moved to a dedicated, stable environment. This PR delivers that fix and reverts both workarounds (e0fd4e3 / #17959 and #17999), restoring full error checking in `node_local_test.py` and `cart_logtest.py`.

Solution
The `NLT Fault Injection testing` stage now runs on a dedicated provisioned VM (`CI_FI_1_LABEL`, default `ci_fi_vm1`) using the same `unitTest`/`unitTestPost` pipeline procedures as the NLT and Unit Test stages. This mirrors how NLT tests have always been run (on bare metal/VM nodes exclusively allocated for that purpose) and brings the same benefits to FI testing:
- `VM_CPUS=20` in pipeline-lib) eliminate the resource contention that caused SWIM and RPC failures. With 20+ cores, `AllocFailTest.launch()` can run FI tests in parallel (`max_child = 15`) instead of the forced serial mode (`max_child = 1`) that occurred when the Docker container saw fewer than 20 vCPUs.
- `ABT_STACK_OVERFLOW_CHECK=mprotect` is removed from `utils/nlt_server.yaml`, eliminating the cascading TLB shootdown IPIs that occurred when multiple FI containers ran simultaneously on a shared KVM host.
- Docker containers on a shared host.
- removing a full SCons build from the critical path and significantly reducing stage runtime.
- `node_local_test.py` and `cart_logtest.py`; the `skip_substrings` suppression introduced in DAOS-623 test: add allowed error for FI #17959 and DAOS-623 test: ignore the server errors in client FI tests too #17999 is removed.
- `unitTestPost` path, consistent with all other test stages.
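The CPU-count rule mentioned in the first item (15 parallel children with 20 or more cores, serial otherwise) can be sketched as follows. This is a hedged illustration, not the actual `AllocFailTest.launch()` code: the function name `pick_max_child` and its parameters are hypothetical, and only the values 20, 15 and 1 come from the description.

```python
import os


def pick_max_child(cpu_count, threshold=20, parallel=15):
    """Return how many FI child processes to run concurrently.

    Hypothetical helper: with at least `threshold` CPUs the tests run in
    parallel, otherwise they fall back to serial mode (one child).
    """
    return parallel if cpu_count >= threshold else 1


# On a VM provisioned with 20 vCPUs this selects parallel mode;
# a container that sees fewer CPUs falls back to serial mode.
max_child = pick_max_child(os.cpu_count() or 1)
```

Under this rule a node that exposes 19 or fewer CPUs runs every fault injection iteration one at a time, which is why the vCPU count of the provisioned VM matters for stage runtime.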
The stage is renamed from `Fault injection testing` to `NLT Fault Injection testing` to avoid confusion with the existing `Fault injection testing` stage and to enable detection in `parseStageInfo`/`skipStage` in pipeline-lib.

Jenkinsfile:
- `Fault injection testing` stage (Docker build + `nlt_test()`) with the new `NLT Fault Injection testing` stage running on a provisioned VM via `unitTest`.
- `nlt_test()` helper function entirely; its logic is now handled by `unitTest`/`unitTestPost` in pipeline-lib.
- `CI_FI_1_LABEL` parameter (`ci_fi_vm1`) for the new FI VM pool; rename `CI_NLT_1_LABEL` default from `ci_nlt_1` to `ci_nlt_vm1`.
- `fault-inject-valgrind` stash from `valgrindReportPublish`: FI runs with `--memcheck no` and produces no memcheck XML.

ci/docker_nlt.sh:
- via the standard `unitTest` path.

ci/provisioning/post_provision_config_common_functions.sh:
- `maldet` on provisioned nodes; `maldet` scans add CPU load during NLT tests.

ci/unit/test_nlt.sh:
- `ssh -tt` + inline heredoc execution with `ssh -T … bash -s -- $*` piping `test_nlt_node.sh` over stdin, so that command-line arguments (`$*`) are forwarded correctly to `test_nlt_node.sh` (required for the `--memcheck no --class-name fault-injection fi` arguments passed by the FI stage).

ci/unit/test_nlt_node.sh:
- `sudo mkdir -p /mnt/daos` (no longer needed on provisioned VMs).
- `$*`; default to the original NLT run parameters when no arguments are given, making the script reusable for both plain NLT and FI.
- `tmpfs` on `nlt_logs/` and set `TMPDIR` to it before executing `node_local_test.py`, so NLT log files land on a fast in-memory filesystem.
- `exec env` to set `HTTPS_PROXY`/`NO_PROXY` cleanly.

ci/unit/test_nlt_post.sh:
- `rsync` pass to also collect logs from `build/nlt_logs/` on the node (NLT with `--no-root` writes logs there instead of `/tmp/`).
- `rsync` calls non-fatal (`|| true`) so post steps do not fail on missing log directories.
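The argument forwarding described for `ci/unit/test_nlt.sh` relies on `bash -s -- ARGS`: bash reads the script from stdin, and everything after `--` becomes the script's positional parameters. A minimal local sketch of that mechanism follows; it runs plain `bash` rather than `ssh`, and the helper name `run_script_over_stdin` and the one-line stand-in script are assumptions, not the contents of `test_nlt_node.sh`.

```python
import subprocess


def run_script_over_stdin(script_text, args, prefix=("bash",)):
    """Feed a script to `bash -s` on stdin and pass `args` as $1, $2, ...

    `prefix` is the command that ends up running bash. Locally it is just
    ("bash",); over SSH it could be ("ssh", "-T", "<host>", "bash"), which
    is an assumption about how the real script is invoked.
    """
    cmd = [*prefix, "-s", "--", *args]
    return subprocess.run(
        cmd, input=script_text, text=True, capture_output=True, check=False
    )


# Stand-in for the node script: print each positional parameter on its own line.
script = 'printf "%s\\n" "$@"\n'
fi_args = ["--memcheck", "no", "--class-name", "fault-injection", "fi"]
result = run_script_over_stdin(script, fi_args)
forwarded = result.stdout.splitlines()
```

Because the script body travels over stdin while the arguments travel on the command line, the same node script can serve both the plain NLT run (no arguments) and the FI run (the arguments shown above).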
utils/nlt_server.yaml:
- `ABT_STACK_OVERFLOW_CHECK=mprotect` from engine `env_vars`; the mprotect-based ULT stack overflow detection is no longer needed and was a source of TLB shootdown overhead on shared KVM hosts.
utils/node_local_test.py:
- `skip_substrings` workaround block (revert of DAOS-623 test: add allowed error for FI #17959 / e0fd4e3 and DAOS-623 test: ignore the server errors in client FI tests too #17999): "sluggish ec boundary report from rank", "sluggish stable epoch reporting", "progress callback was not called for too long", "rpc failed; rc:" are no longer suppressed; these conditions should not occur on a dedicated VM.
- `fault_status` detection: if the initial detection fails, try `fault_status` on `$PATH` and then `/usr/bin/fault_status` before giving up, improving robustness when the binary is installed via RPM rather than built in-tree.
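The `fault_status` lookup order described above (initial location, then `$PATH`, then `/usr/bin/fault_status`) can be sketched with the standard library. This is an illustration under assumptions, not the code in `node_local_test.py`: the function name `find_fault_status` and its parameters are hypothetical, and how the initial path is computed is left to the caller.

```python
import os
import shutil


def _is_executable_file(path):
    """True if `path` names an existing regular file we may execute."""
    return bool(path) and os.path.isfile(path) and os.access(path, os.X_OK)


def find_fault_status(initial_path, fallback="/usr/bin/fault_status"):
    """Return a usable fault_status path, or None if none is found.

    Order: the initially detected path (e.g. an in-tree build), then a
    search of $PATH, then the fixed fallback location.
    """
    if _is_executable_file(initial_path):
        return initial_path
    on_path = shutil.which("fault_status")
    if on_path:
        return on_path
    if _is_executable_file(fallback):
        return fallback
    return None
```

Returning `None` leaves the decision about how to fail to the caller, which matches the "before giving up" wording: the fallbacks are tried first and the error path is taken only when all three lookups miss.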
src/tests/ftest/cart/util/cart_logtest.py:
- `self.skip_substrings = []` and the associated substring-suppression check block (revert of the `cart_logtest.py` portion of DAOS-623 test: add allowed error for FI #17959 / DAOS-623 test: ignore the server errors in client FI tests too #17999), restoring full log error detection.