DAOS-18949 container: Fix sched_seq assert failures #18269
|
Ticket title is 'erasurecode/multiple_rank_failure.py:EcodOnlineMultiRankFail.test_ec_multiple_rank_failure - timeout destroying container in tearDown' |
6a51f89 to a0f84ea
|
First NLT, then NLT memcheck broke; adding |
|
Test stage Unit Test bdev completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18269/3/display/redirect |
|
Test stage Test RPMs on EL 9.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18269/4/execution/node/857/log |
The assertion "sched_seq1 != sched_seq2" in cont_child_create_start has
been triggered likely by the following scenario:
  cont_child_create_start
    cont_child_start (the one near the beginning)
      if cont_child->sc_destroy
        returned -DER_CONT_NONEXIST
    vos_cont_create: -DER_EXIST
    assertion failed due to no scheduling
This patch changes cont_child_start and ds_cont_child_lookup to return
-DER_CONT_DESTROYING instead of -DER_CONT_NONEXIST, so that the scenario
above won't reach the vos_cont_create call.
Features: container
Allow-unstable-test: true
Signed-off-by: Li Wei <[email protected]>
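
The scenario in the commit message can be sketched as a small toy model. This is not the DAOS code: every `toy_` name, the struct layout, the `patched` flag, and the numeric error values are assumptions made for illustration. It also assumes that the caller only proceeds to the create step when the start step returns -DER_CONT_NONEXIST, and that a create returning -DER_EXIST does not yield.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder error codes; the numeric values are NOT the real DAOS values. */
enum {
	DER_EXIST           = 1,
	DER_CONT_NONEXIST   = 2,
	DER_CONT_DESTROYING = 3,
};

/* Hypothetical stand-in for a container child that is being destroyed. */
struct toy_cont_child {
	bool sc_destroy; /* a destroy is in progress */
	bool in_vos;     /* the VOS container has not been removed yet */
};

/* Hypothetical scheduler sequence: bumped whenever the toy task "yields". */
static unsigned long toy_sched_seq;

/* Start step: fails while a destroy is in progress. The error code depends
 * on whether the behavior described by the patch is modeled. */
static int toy_cont_child_start(const struct toy_cont_child *c, bool patched)
{
	if (c->sc_destroy)
		return patched ? -DER_CONT_DESTROYING : -DER_CONT_NONEXIST;
	return 0;
}

/* Create step: reports -DER_EXIST immediately, without yielding (assumed). */
static int toy_vos_cont_create(struct toy_cont_child *c)
{
	if (c->in_vos)
		return -DER_EXIST;
	toy_sched_seq++; /* models a yield during an actual create */
	c->in_vos = true;
	return 0;
}

/* Returns true when the flow reaches the state in which an assertion like
 * "sched_seq1 != sched_seq2" would fail; *rc_out receives the last rc. */
static bool toy_create_start_hits_assert(struct toy_cont_child *c,
					 bool patched, int *rc_out)
{
	int rc = toy_cont_child_start(c, patched);

	if (rc != -DER_CONT_NONEXIST) {
		/* Success or another error: the create step is never reached. */
		*rc_out = rc;
		return false;
	}

	unsigned long seq1 = toy_sched_seq;
	rc = toy_vos_cont_create(c);
	unsigned long seq2 = toy_sched_seq;

	*rc_out = rc;
	/* -DER_EXIST with no yield in between is the failing combination. */
	return rc == -DER_EXIST && seq1 == seq2;
}
```

Calling `toy_create_start_hits_assert` on a child with `sc_destroy` and `in_vos` both set reaches the failing combination when `patched` is false, and returns early with -DER_CONT_DESTROYING when `patched` is true.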
|
Now Fault Injection and Test RPM failures, sigh; rebasing. |
|
Requesting reviews early, since after 4 builds still 0 CI coverage. |
wangshilong
left a comment
Thanks for fixing this.
|
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18269/5/execution/node/1384/log |
|
Build 5
|
|
Also triggered erasurecode/multiple_rank_failure on top of the current pull request (equivalent to build 5): https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18324/2/artifact/Functional%20Hardware%20Large%20MD%20on%20SSD/erasurecode/multiple_rank_failure.py/. All 3 repeats passed without the sched_seq assertion failure. |
|
Please see my last two comments on the latest round of testing. Thanks. |