Skip to content

Fix race between resilver wait and offline/detach #17269

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

amotin
Copy link
Member

@amotin amotin commented Apr 24, 2025

We should not clear scn_state and notify waiters until we call vdev_dtl_reassess(), otherwise following offline/detach request may fail with "no valid replicas" error.

How Has This Been Tested?

I was unable to reproduce it by hands, but #17259 seems uncovered it on some of later online_offline_001_pos test runs: https://github.com/openzfs/zfs/actions/runs/14625988592/job/41037499459 , resulting in this error:

20:05:36.83 SUCCESS: zpool online testpool md0
20:05:36.84 SUCCESS: check_state testpool md0 online
20:05:36.85 SUCCESS: zpool wait -t resilver testpool
20:05:36.86 cannot offline md1: no valid replicas
20:05:36.86 ERROR: zpool offline testpool md1 exited 1

I hope this to fix it.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Apr 24, 2025
amotin added 2 commits April 25, 2025 12:08
We should not clear scn_state and notify waiters until we call
vdev_dtl_reassess(), otherwise following offline/detach request
may fail with "no valid replicas".

Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
After more CI runs and code reading after openzfs#17259 I've found that
online starts resilver via async mechanism, which does not provide
wait primitives at this time.  Restore some delays to restore CI
until this is properly fixed.

Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
@amotin
Copy link
Member Author

amotin commented Apr 25, 2025

While I still think the fix is valid, it apparently fixes a different problem. After more code reading I've found another race window, caused by the fact that zpool online starts resilver asynchronously and zpool wait can't wait for it. I don't have enough motivation to start refactoring there, so just restore some delays I removed before to cover it for now.

@amotin amotin added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Apr 29, 2025
@tonyhutter
Copy link
Contributor

Can we get one more reviewer on this one?

@amotin amotin added Status: Code Review Needed Ready for review and testing and removed Status: Accepted Ready to integrate (reviewed, tested) labels May 2, 2025
@tonyhutter
Copy link
Contributor

Merged as:
f85c96e ZTS: Restore some delays in online_offline tests
f86d9af Fix race between resilver wait and offline/detach

@tonyhutter tonyhutter closed this May 2, 2025
@amotin amotin added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels May 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants