DAOS-19036 dtx: handle DTX race issues #18428

Draft
Nasf-Fan wants to merge 1 commit into master from Nasf-Fan/DAOS-19036_1

@Nasf-Fan Nasf-Fan commented Jun 3, 2026

This patch mainly includes the following fixes:

  1. When the DTX leader switches, the old DTX leader may have started to abort a DTX but not completed the abort before it was evicted. The new DTX leader may then re-execute the related modification successfully and try to commit that DTX. Without additional control, an in-flight DTX ABORT RPC from the old DTX leader may abort the DTX that the new DTX leader is about to commit, which breaks DTX semantics.

    The patch adds a @Version parameter to DTX abort. When the new DTX leader handles an RPC resent by the client, it refreshes the version of the related DTX if that DTX was already prepared by the old DTX leader. Whenever a DTX is aborted locally, the version carried by the ABORT request is compared with the version of the related DTX, and a stale ABORT RPC is skipped.

  2. vos_dtx_load_mbs() may be triggered before the related DTX has been prepared locally. In that case the related MBS information is empty, and this must be handled to avoid a segmentation fault.

  3. Explicitly clean up a non-prepared DTX after a modification failure, to avoid leaking a stale active DTX (header) in the DTX table.
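
For reviewers who want a compact mental model of the three fixes, here is a minimal sketch in C. It is not the code in this patch: `struct dtx_entry`, the state enum, and every function name below are hypothetical, and the rule that an ABORT is stale when its version is older than the entry's version is an assumption (the description only says the versions are compared).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of an active DTX entry (not the VOS layout). */
enum dtx_state { DTX_INITED, DTX_PREPARED, DTX_COMMITTED, DTX_ABORTED };

struct dtx_entry {
	enum dtx_state	 state;
	uint32_t	 ver;		/* version the entry was last refreshed to */
	const void	*mbs;		/* member info; NULL until prepared locally */
	size_t		 mbs_len;
	bool		 in_table;	/* still tracked in the active DTX table */
};

/* Fix 1a: the new leader handles a resent RPC and refreshes the version of
 * an entry that the old leader had already prepared. */
void dtx_refresh_on_resend(struct dtx_entry *e, uint32_t new_ver)
{
	assert(e != NULL);
	if (e->state == DTX_PREPARED && new_ver > e->ver)
		e->ver = new_ver;
}

/* Fix 1b: local abort compares the request version with the entry version.
 * Returns true if the entry was aborted, false if the request was skipped.
 * Assumption: "stale" means req_ver is older than the entry's version. */
bool dtx_abort_local(struct dtx_entry *e, uint32_t req_ver)
{
	assert(e != NULL);
	if (req_ver < e->ver)
		return false;	/* in-flight ABORT from the evicted leader */
	if (e->state == DTX_COMMITTED)
		return false;	/* never abort a committed DTX */
	e->state = DTX_ABORTED;
	return true;
}

/* Fix 2: loading MBS before local prepare must not dereference empty data.
 * Returns 0 on success, -1 if no MBS information is available yet. */
int dtx_load_mbs(const struct dtx_entry *e, const void **out, size_t *len)
{
	assert(e != NULL && out != NULL && len != NULL);
	if (e->mbs == NULL || e->mbs_len == 0)
		return -1;
	*out = e->mbs;
	*len = e->mbs_len;
	return 0;
}

/* Fix 3: after a failed modification (rc != 0), drop an entry that never
 * reached the prepared state. Returns true if the entry was released. */
bool dtx_cleanup_unprepared(struct dtx_entry *e, int rc)
{
	assert(e != NULL);
	if (rc == 0 || e->state != DTX_INITED)
		return false;
	e->in_table = false;
	return true;
}
```

In this model, an entry prepared at version 1 and refreshed to version 2 by the new leader ignores an abort that carries version 1, while an abort that carries version 2 still takes effect. Loading MBS from an entry with no member information returns an error instead of crashing, and a failed modification on a never-prepared entry removes it from the table.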

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions Bot commented Jun 3, 2026

Ticket title is 'Argonne Daos_user : Engine ranks 590, 593, and 596 entered Errored state unexpectedly'
Status is 'In Progress'
Labels: 'ALCF'
https://daosio.atlassian.net/browse/DAOS-19036

Signed-off-by: Fan Yong <fan.yong@hpe.com>