DAOS-18889 object: client retry conditional ops for DER_TX_RESTART case #18270

Draft
Nasf-Fan wants to merge 1 commit into master from Nasf-Fan/DAOS-18889

@Nasf-Fan Nasf-Fan commented May 18, 2026

On a DTX non-leader, the order in which two conditional modifications against the same object shard are handled is not controlled. If the one with the newer epoch is handled before the older one, the ilog logic may treat them as a potential conflict and return -DER_TX_RESTART to the caller when it handles the older one. In that case, directly restarting the DTX (on the DTX leader) with a newer epoch may still hit a conflict, because the HLC epsilon boundary covers a relatively large range of epochs. Instead, ask the client to retry the operation after a random delay, which greatly reduces the chance of a subsequent epoch conflict.
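
The retry idea described above can be sketched roughly as follows. This is only an illustrative sketch, not the code in this patch: `SKETCH_TX_RESTART`, `retry_delay_us`, `retry_cond_op`, `flaky_op`, the callback types, and the 100 us / 10 ms delay window are all hypothetical names and values chosen for the example, and the real DAOS client retry path may look quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Placeholder code for illustration only; NOT the real DER_TX_RESTART value. */
#define SKETCH_TX_RESTART 1

/* Hypothetical callback types: the conditional operation and a sleep hook. */
typedef int (*cond_op_fn)(void *arg);
typedef void (*sleep_us_fn)(unsigned int usec);

/* Pick a random delay in [1, window], where the window doubles per attempt
 * starting from base_us and is capped at cap_us. */
static unsigned int
retry_delay_us(unsigned int attempt, unsigned int base_us, unsigned int cap_us)
{
    unsigned int window = base_us;

    assert(base_us > 0 && cap_us > 0);
    while (attempt > 0 && window < cap_us) {
        window = (window > cap_us / 2) ? cap_us : window * 2;
        attempt--;
    }
    if (window > cap_us)
        window = cap_us;
    return (unsigned int)rand() % window + 1;
}

/* Re-issue the operation while it reports the restart code, waiting a random
 * delay between attempts. Gives up after max_retries retries and returns the
 * last result. A NULL sleep hook skips the wait (useful in tests). */
static int
retry_cond_op(cond_op_fn op, void *arg, unsigned int max_retries,
              sleep_us_fn sleep_us)
{
    unsigned int attempt = 0;

    for (;;) {
        int rc = op(arg);

        if (rc != -SKETCH_TX_RESTART || attempt >= max_retries)
            return rc;
        if (sleep_us != NULL)
            sleep_us(retry_delay_us(attempt, 100, 10000));
        attempt++;
    }
}

/* Stand-in operation: reports a restart until the counter reaches zero. */
static int
flaky_op(void *arg)
{
    int *remaining = arg;

    if (*remaining > 0) {
        (*remaining)--;
        return -SKETCH_TX_RESTART;
    }
    return 0;
}
```

A real caller would pass a sleep hook that wraps something like `nanosleep()` and would seed the random generator per client, so that two clients that collided once are unlikely to pick the same delay and collide again.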

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

Ticket title is 'Random timeouts on IOR file creation causing slowdowns'
Status is 'In Progress'
Labels: 'triaged'
https://daosio.atlassian.net/browse/DAOS-18889

On a DTX non-leader, the order in which two conditional modifications
against the same object shard are handled is not controlled. If the one
with the newer epoch is handled before the older one, the ilog logic may
treat them as a potential conflict and return -DER_TX_RESTART to the
caller when it handles the older one. In that case, directly restarting
the DTX (on the DTX leader) with a newer epoch may still hit a conflict,
because the HLC epsilon boundary covers a relatively large range of
epochs. Instead, ask the client to retry the operation after a random
delay, which greatly reduces the chance of a subsequent epoch conflict.

Signed-off-by: Fan Yong <fan.yong@hpe.com>

Nasf-Fan force-pushed the Nasf-Fan/DAOS-18889 branch from 07a9da9 to f986ecf on May 18, 2026 15:57

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18270/3/execution/node/1252/log

@Nasf-Fan
Contributor Author

> Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18270/3/execution/node/1252/log

osa_online_drain failed due to DAOS-18218; to be retested.
