DAOS-18949 container: Fix sched_seq assert failures#18269

Open
liw wants to merge 1 commit into master from liw/cont-start-rc

Conversation

@liw
Contributor

@liw liw commented May 18, 2026

The assertion "sched_seq1 != sched_seq2" in cont_child_create_start was likely triggered by the following scenario:

cont_child_create_start
  cont_child_start (the one near the beginning)
    if cont_child->sc_destroy
      returned -DER_CONT_NONEXIST
  vos_cont_create: -DER_EXIST
  assertion failed due to no scheduling

This patch changes cont_child_start and ds_cont_child_lookup to return -DER_CONT_DESTROYING instead of -DER_CONT_NONEXIST, so that the scenario above won't reach the vos_cont_create call.
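
The scenario above can be sketched as a small standalone model. This is not DAOS source: every model_* name is hypothetical, the numeric codes are placeholders for the real -DER_* values, and the control flow of cont_child_create_start (where sched_seq is sampled, and that vos_cont_create returns -DER_EXIST without yielding) is an assumption inferred from the description rather than copied from the code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the DAOS return codes named above. */
enum {
	MODEL_OK              = 0,
	MODEL_CONT_NONEXIST   = -1, /* models -DER_CONT_NONEXIST */
	MODEL_EXIST           = -2, /* models -DER_EXIST */
	MODEL_CONT_DESTROYING = -3, /* models -DER_CONT_DESTROYING */
};

/* Toy container state; only sc_destroy is a name taken from the description. */
struct model_cont {
	bool vos_exists; /* the container still exists in VOS */
	bool sc_destroy; /* a destroy is in progress */
};

/* Toy scheduler sequence number: bumped whenever the model "yields". */
static uint64_t model_sched_seq;

/* Models cont_child_start; `patched` selects the return code this PR introduces. */
static int
model_cont_child_start(const struct model_cont *cont, bool patched)
{
	if (cont->sc_destroy)
		return patched ? MODEL_CONT_DESTROYING : MODEL_CONT_NONEXIST;
	if (!cont->vos_exists)
		return MODEL_CONT_NONEXIST;
	return MODEL_OK;
}

/* Models vos_cont_create; assumed to yield only when it actually creates. */
static int
model_vos_cont_create(struct model_cont *cont)
{
	if (cont->vos_exists)
		return MODEL_EXIST; /* fails fast, no yield */
	cont->vos_exists = true;
	model_sched_seq++; /* creating is assumed to yield */
	return MODEL_OK;
}

/*
 * Models cont_child_create_start. Returns true if the
 * "sched_seq1 != sched_seq2" assertion would have fired; *rc receives the
 * last return code seen.
 */
static bool
model_cont_child_create_start(struct model_cont *cont, bool patched, int *rc)
{
	uint64_t sched_seq1;
	uint64_t sched_seq2;

	assert(cont != NULL && rc != NULL);

	*rc = model_cont_child_start(cont, patched);
	if (*rc != MODEL_CONT_NONEXIST)
		return false; /* started, or failed with another error */

	sched_seq1 = model_sched_seq;
	*rc = model_vos_cont_create(cont);
	sched_seq2 = model_sched_seq;

	/* EXIST is only expected if someone else created it while we yielded. */
	if (*rc == MODEL_EXIST && sched_seq1 == sched_seq2)
		return true;
	return false;
}
```

In this model, a container with both vos_exists and sc_destroy set makes the unpatched path (patched = false) report that the assertion would fire, while the patched path returns the DESTROYING code before reaching the create call.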

Features: container

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

Ticket title is 'erasurecode/multiple_rank_failure.py:EcodOnlineMultiRankFail.test_ec_multiple_rank_failure - timeout destroying container in tearDown'
Status is 'In Progress'
Labels: 'ci_master_weekly,weekly_test'
https://daosio.atlassian.net/browse/DAOS-18949

@liw liw force-pushed the liw/cont-start-rc branch 2 times, most recently from 6a51f89 to a0f84ea Compare May 18, 2026 23:41
@liw
Contributor Author

liw commented May 18, 2026

First NLT, then NLT memcheck broke; adding Allow-unstable-test: true.

@daosbuild3
Collaborator

Test stage Unit Test bdev completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18269/3/display/redirect

@daosbuild3
Collaborator

The assertion "sched_seq1 != sched_seq2" in cont_child_create_start was
likely triggered by the following scenario:

  cont_child_create_start
    cont_child_start (the one near the beginning)
      if cont_child->sc_destroy
        returned -DER_CONT_NONEXIST
    vos_cont_create: -DER_EXIST
    assertion failed due to no scheduling

This patch changes cont_child_start and ds_cont_child_lookup to return
-DER_CONT_DESTROYING instead of -DER_CONT_NONEXIST, so that the scenario
above won't reach the vos_cont_create call.

Features: container
Allow-unstable-test: true
Signed-off-by: Li Wei <[email protected]>
@liw liw force-pushed the liw/cont-start-rc branch from a0f84ea to 0db65d2 Compare May 20, 2026 00:48
@liw
Contributor Author

liw commented May 20, 2026

Now Fault Injection and Test RPM failures, sigh; rebasing.

@liw liw marked this pull request as ready for review May 20, 2026 01:36
@liw liw requested review from a team as code owners May 20, 2026 01:36
@liw liw requested review from liuxuezhao and wangshilong May 20, 2026 01:36
@liw
Contributor Author

liw commented May 20, 2026

Requesting reviews early, since after 4 builds still 0 CI coverage.

Contributor

@wangshilong wangshilong left a comment

Thanks for fixing this.

@daosbuild3
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18269/5/execution/node/1384/log

@liw
Contributor Author

liw commented May 22, 2026

Build 5

  • [Features: container] container/boundary: DAOS-18610 (timed out, but without the sched_seq assertion failures)

@liw
Contributor Author

liw commented May 23, 2026

Also triggered erasurecode/multiple_rank_failure on top of the current pull request (equivalent to build 5): https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18324/2/artifact/Functional%20Hardware%20Large%20MD%20on%20SSD/erasurecode/multiple_rank_failure.py/. All 3 repeats passed without the sched_seq assertion failure.

@liw liw requested a review from a team May 23, 2026 00:33
@liw liw added the forced-landing label May 23, 2026
@liw
Contributor Author

liw commented May 23, 2026

Please see my last two comments on the latest round of testing. Thanks.

Labels

forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed.

4 participants