DAOS-19173 pool: destroy timeout tiering and serialization #18518

Draft
kccain wants to merge 1 commit into master from kccain/daos_19173

kccain commented Jun 18, 2026

For MD-on-SSD configurations, set the destroy CoRPC timeout based on the engine-local SCM capacity:

  • Add pool_destroy_local_scm_size() to get the engine-local SCM size
  • Add pool_destroy_rpc_timeout() to customize the destroy CoRPC timeout for ds_mgmt_tgt_pool_destroy_ranks()
  • The non-MD-on-SSD case falls back to the default timeout
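
To illustrate the tiering idea only, here is a minimal sketch of choosing a timeout from the local SCM size. It is not the code in this PR: the function name `sketch_destroy_rpc_timeout()`, the tier thresholds, and the timeout values are all hypothetical, and the real pool_destroy_rpc_timeout() may use a different signature and different tiers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GIB (1024ULL * 1024ULL * 1024ULL)

/* Hypothetical tier: applies when the local SCM size is <= scm_bytes_max. */
struct timeout_tier {
	uint64_t scm_bytes_max;
	uint32_t timeout_sec;
};

/* Illustrative values only; not taken from the PR. */
static const struct timeout_tier sketch_tiers[] = {
	{ 64 * GIB, 120 },
	{ 256 * GIB, 300 },
	{ UINT64_MAX, 600 },
};

/*
 * Pick a destroy CoRPC timeout (seconds) from the engine-local SCM size.
 * Without MD-on-SSD, return the caller's default timeout unchanged.
 */
static uint32_t
sketch_destroy_rpc_timeout(bool md_on_ssd, uint64_t local_scm_bytes,
			   uint32_t default_sec)
{
	size_t i;

	if (!md_on_ssd)
		return default_sec;

	/* Tiers are sorted by ascending size; the first match wins. */
	for (i = 0; i < sizeof(sketch_tiers) / sizeof(sketch_tiers[0]); i++) {
		if (local_scm_bytes <= sketch_tiers[i].scm_bytes_max)
			return sketch_tiers[i].timeout_sec;
	}
	return default_sec;
}
```

In this sketch the last tier uses UINT64_MAX as its bound, so every size maps to some tier and the trailing return is only a safety net.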

Also serialize pool target destroy handling for the case where a previous handler invocation (whose CoRPC initiator timed out) is still busy performing expensive subtree destruction / file unlinking:

  • Refactor ds_pooltgts into two synchronization domains:
      ◦ Create: dpt_create_mutex/cv + dpt_creates_ht (existing create-cancel)
      ◦ Destroy: dpt_destroy_mutex/cv + dpt_destroys_ht (new serialization)
  • Add ds_pooltgts_destroy_rec for the hash table that tracks in-flight destroys
  • Add the destroy serialization to ds_mgmt_hdlr_tgt_destroy()
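
The serialization pattern can be sketched roughly as follows. This is an assumption-laden illustration, not the PR's code: it uses POSIX threads and a small fixed-size table in place of the actual mutex, condition variable, and hash table types, and every `sketch_*` name is hypothetical. A handler waits while a destroy of the same pool is in flight, records its own entry, and wakes waiters when it finishes.

```c
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_MAX_INFLIGHT 16
#define SKETCH_UUID_LEN     16

/* Hypothetical stand-in for the destroy synchronization domain. */
struct sketch_destroy_domain {
	pthread_mutex_t mutex;
	pthread_cond_t  cv;
	bool            used[SKETCH_MAX_INFLIGHT];
	unsigned char   uuid[SKETCH_MAX_INFLIGHT][SKETCH_UUID_LEN];
};

static struct sketch_destroy_domain sketch_dom = {
	.mutex = PTHREAD_MUTEX_INITIALIZER,
	.cv    = PTHREAD_COND_INITIALIZER,
};

/* Return the slot tracking this pool, or -1. Caller must hold the mutex. */
static int
sketch_find(const unsigned char *uuid)
{
	int i;

	for (i = 0; i < SKETCH_MAX_INFLIGHT; i++) {
		if (sketch_dom.used[i] &&
		    memcmp(sketch_dom.uuid[i], uuid, SKETCH_UUID_LEN) == 0)
			return i;
	}
	return -1;
}

/* Wait for any in-flight destroy of this pool, then register ours. */
static int
sketch_destroy_begin(const unsigned char *uuid)
{
	int i, rc = -1; /* -1: no free slot in this fixed-size sketch */

	pthread_mutex_lock(&sketch_dom.mutex);
	while (sketch_find(uuid) >= 0)
		pthread_cond_wait(&sketch_dom.cv, &sketch_dom.mutex);
	for (i = 0; i < SKETCH_MAX_INFLIGHT; i++) {
		if (!sketch_dom.used[i]) {
			sketch_dom.used[i] = true;
			memcpy(sketch_dom.uuid[i], uuid, SKETCH_UUID_LEN);
			rc = 0;
			break;
		}
	}
	pthread_mutex_unlock(&sketch_dom.mutex);
	return rc;
}

/* Drop our in-flight record and wake handlers waiting on this domain. */
static void
sketch_destroy_end(const unsigned char *uuid)
{
	int i;

	pthread_mutex_lock(&sketch_dom.mutex);
	i = sketch_find(uuid);
	if (i >= 0)
		sketch_dom.used[i] = false;
	pthread_cond_broadcast(&sketch_dom.cv);
	pthread_mutex_unlock(&sketch_dom.mutex);
}

/* Report whether a destroy of this pool is currently registered. */
static bool
sketch_destroy_inflight(const unsigned char *uuid)
{
	bool found;

	pthread_mutex_lock(&sketch_dom.mutex);
	found = sketch_find(uuid) >= 0;
	pthread_mutex_unlock(&sketch_dom.mutex);
	return found;
}
```

A handler in this sketch would call sketch_destroy_begin() before the expensive subtree destruction and sketch_destroy_end() afterwards, so a retried destroy for the same pool blocks until the earlier one completes instead of running concurrently with it.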

Features: pool

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
@github-actions

Ticket title is 'dmg pool destroy very slow'
Status is 'In Progress'
https://daosio.atlassian.net/browse/DAOS-19173

@daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18518/1/execution/node/1319/log
