DAOS-18618 pool: fix pool destroy hang during stop #18284

Open: wangshilong wants to merge 1 commit into master from shilongw/DAOS-18618-destroy-pool

@wangshilong (Contributor)

Pool destroy can hang with an active scrubber because two scrubber cleanup bugs combine during teardown:

  1. cont_iter_is_loaded_cb() can exit without calling sc_cont_teardown(), leaving sc_scrubbing set and blocking destroy on sc_scrub_cond.
  2. sc_ensure_containers_are_loaded() can loop forever on DER_CONT_NONEXIST after the container has already been removed.
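
The first bug can be illustrated with a minimal sketch. Everything here is hypothetical: `struct scrub_sketch`, `sketch_teardown()`, and both callbacks are stand-ins invented for illustration, not the DAOS code, and the early-return shape of the buggy path is an assumption about how a callback could skip teardown. The `scrubbing` field models `sc_scrubbing`, and `wakeups` counts the signals that a waiter on a condition such as `sc_scrub_cond` would receive.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the scrubber context; not the DAOS struct. */
struct scrub_sketch {
	bool scrubbing; /* models sc_scrubbing */
	int  wakeups;   /* signals a real condition variable would receive */
};

/* Models teardown: clear the flag and wake whoever waits to destroy. */
static void sketch_teardown(struct scrub_sketch *s)
{
	s->scrubbing = false;
	s->wakeups++;
}

/* Buggy shape (assumed): the early return on error skips teardown. */
static int sketch_cb_buggy(struct scrub_sketch *s, int step_rc)
{
	s->scrubbing = true;
	if (step_rc != 0)
		return step_rc; /* flag stays set; a waiter would never wake */
	sketch_teardown(s);
	return 0;
}

/* Fixed shape: every exit path goes through teardown. */
static int sketch_cb_fixed(struct scrub_sketch *s, int step_rc)
{
	int rc = 0;

	s->scrubbing = true;
	if (step_rc != 0) {
		rc = step_rc;
		goto out;
	}
	/* ... scrub work would go here ... */
out:
	sketch_teardown(s);
	return rc;
}
```

With a failing step, the buggy variant returns with `scrubbing` still set, while the fixed variant clears it and records one wakeup before returning the same error code.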

Fix the teardown path so scrubber state is always released, skip removed containers, and stop scrubber work promptly when the pool is stopping.
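
The "skip removed containers" and "stop promptly" parts of the fix can be sketched as a retry loop with two extra exits. This is an illustration under stated assumptions, not the patch itself: `SKETCH_NONEXIST` stands in for the `DER_CONT_NONEXIST` result, `SKETCH_BUSY` is an invented "not loaded yet" code, and `struct cont_sketch`, `sketch_poll()`, and `sketch_ensure_loaded()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NONEXIST (-1) /* hypothetical stand-in for DER_CONT_NONEXIST */
#define SKETCH_BUSY     (-2) /* hypothetical "not loaded yet, retry" code */

/* Hypothetical container model: negative means it was already removed. */
struct cont_sketch {
	int polls_until_loaded;
};

/* One load check: 0 when loaded, otherwise one of the codes above. */
static int sketch_poll(struct cont_sketch *c)
{
	if (c->polls_until_loaded < 0)
		return SKETCH_NONEXIST;
	if (c->polls_until_loaded > 0) {
		c->polls_until_loaded--;
		return SKETCH_BUSY;
	}
	return 0;
}

/* Returns how many containers were loaded; exits early once *stopping is set. */
static int sketch_ensure_loaded(struct cont_sketch *conts, size_t n,
				const bool *stopping)
{
	int loaded = 0;

	for (size_t i = 0; i < n; i++) {
		for (;;) {
			int rc;

			if (*stopping)
				return loaded; /* pool is stopping: give up now */
			rc = sketch_poll(&conts[i]);
			if (rc == 0) {
				loaded++;
				break;
			}
			if (rc == SKETCH_NONEXIST)
				break; /* container is gone: skip, do not retry */
			/* SKETCH_BUSY: retry (real code would yield here) */
		}
	}
	return loaded;
}
```

Without the `SKETCH_NONEXIST` branch, a removed container would keep the inner loop retrying indefinitely, which is the shape of the second bug described above. Checking `*stopping` before each poll bounds how long the loop can delay teardown.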

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
@wangshilong wangshilong marked this pull request as ready for review May 19, 2026 08:04
@wangshilong wangshilong requested review from a team as code owners May 19, 2026 08:04
@wangshilong wangshilong requested review from NiuYawei and liw May 19, 2026 08:05
@github-actions

Ticket title is 'daos_test/suite.py:DaosCoreTest.test_daos_rebuild_simple - test timeout w/ DER_CSUM err'
Status is 'Reopened'
Labels: '2.8.0tb5,ci_2.6_daily,ci_2.8_daily,ci_master_daily,pr_test,scrubbed_2.8,test_2.8'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-18618

@github-actions github-actions Bot added the priority Ticket has high priority (automatically managed) label May 19, 2026

3 participants