
DAOS-18976 rebuild: refine rebuild SCAN process #18289

Open: gnailzenh wants to merge 3 commits into release/2.6 from liang/b2_6_rebuild_hang

@gnailzenh gnailzenh commented May 19, 2026

1. In rebuild_obj_scan_cb():
   Check rt_finishing so that rebuild_scan_leader can exit quickly when
   rebuild_tgt_fini() is waiting for the refcount to drop. Without this check,
   a stale scan_leader keeps scanning all VOS objects indefinitely,
   blocking TLS cleanup and causing retries to fail with -DER_BUSY.
2. In rebuild_tgt_scan_handler():
   Fix a race window between rebuild_tgt_fini() ->
   rebuild_pool_tls_destroy() and rebuild_pool_tls_lookup().

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Co-authored-by: Xuezhao Liu <xuezhao.liu@hpe.com>
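
The early-exit check described in item 1 can be pictured with a small, single-threaded sketch. This is not the DAOS implementation: `struct tracker`, `scan_cb()`, `scan_leader()`, `SCAN_ABORT`, and the `finish_at` parameter are hypothetical stand-ins, and only the flag name `rt_finishing` is taken from the description above. The sketch assumes that teardown sets the flag and then waits for the reference held by the scan to be dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the rebuild tracker state. */
struct tracker {
    int  refcount;     /* references held by in-flight scan work */
    bool rt_finishing; /* set once teardown has started */
};

#define SCAN_ABORT 1 /* hypothetical "stop iterating" return code */

/* Per-object callback: stop as soon as teardown has been requested. */
static int scan_cb(struct tracker *rpt, int *scanned)
{
    assert(rpt != NULL && scanned != NULL);
    if (rpt->rt_finishing)
        return SCAN_ABORT;
    (*scanned)++;
    return 0;
}

/*
 * Scan loop: takes a reference, visits up to nr_objects objects, and
 * drops the reference on exit. finish_at simulates teardown starting
 * while the scan is at that object index (use -1 for "never").
 * Returns the number of objects actually scanned.
 */
static int scan_leader(struct tracker *rpt, int nr_objects, int finish_at)
{
    int scanned = 0;

    rpt->refcount++;
    for (int i = 0; i < nr_objects; i++) {
        if (i == finish_at)
            rpt->rt_finishing = true;
        if (scan_cb(rpt, &scanned) != 0)
            break; /* exit early so the waiter sees the refcount drop */
    }
    rpt->refcount--;
    return scanned;
}
```

With the check in place, a call such as `scan_leader(&t, 1000, 10)` stops after 10 objects and releases its reference. Without the check in `scan_cb()`, the loop would visit all 1000 objects before the reference is dropped, which is the stall the description attributes to a stale scan_leader.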

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@gnailzenh gnailzenh requested review from a team as code owners May 19, 2026 18:30
@github-actions

Ticket title is 'Aurora rebuild failing with DER_HG / DER_SHUTDOWN'
Status is 'In Progress'
Labels: 'test_2.6.5rc1'
https://daosio.atlassian.net/browse/DAOS-18976

Comment thread src/object/srv_obj_migrate.c Outdated
@daosbuild3

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18289/2/display/redirect

@daosbuild3

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18289/2/display/redirect

@gnailzenh gnailzenh force-pushed the liang/b2_6_rebuild_hang branch from 2b2479c to 0e86c5d Compare May 21, 2026 07:57
@liuxuezhao liuxuezhao force-pushed the liang/b2_6_rebuild_hang branch from 0e86c5d to eae35f3 Compare May 22, 2026 07:56
@liuxuezhao liuxuezhao changed the title from "DAOS-18976 rebuild: avoid to triggerFAIL_RECLAIM too early" to "DAOS-18976 rebuild: refine rebuild SCAN process" May 22, 2026
@daosbuild3

Test stage Functional Hardware Medium completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-18289/4/display/redirect

@liuxuezhao liuxuezhao requested a review from wangshilong May 22, 2026 07:58
liuxuezhao previously approved these changes May 22, 2026
wangshilong previously approved these changes May 22, 2026
Co-authored-by: Xuezhao Liu <xuezhao.liu@hpe.com>
@liuxuezhao liuxuezhao dismissed stale reviews from wangshilong and themself via ae8f65c May 22, 2026 08:23
liuxuezhao previously approved these changes May 22, 2026
@liuxuezhao liuxuezhao requested a review from wangshilong May 22, 2026 08:25
wangshilong previously approved these changes May 22, 2026
Co-authored-by: Xuezhao Liu <xuezhao.liu@hpe.com>
@liuxuezhao liuxuezhao dismissed stale reviews from wangshilong and themself via 8b95c22 May 22, 2026 09:56
@liuxuezhao liuxuezhao requested a review from wangshilong May 22, 2026 09:58
4 participants