Feature Request: Emergency "Best-Effort" Mode for failing vdevs during resilver #18522

@cukilabanza-oss

Proposal: "Read-Only Parity Assist" / Forensic Override Mode for Faulted Devices

Overview:
Currently, when a drive exhibits a high rate of checksum errors or I/O failures during resilvering, ZFS marks the drive FAULTED and stops using it for I/O. While this protects against further corruption, it can lead to total pool loss if a second drive in a RAID-Z1 (or a third in a RAID-Z2) fails during that same window.

The Problem:
ZFS’s "integrity first" philosophy becomes a liability when it discards a drive that is 99% readable. In a near-disaster scenario, that 99% of readable data could be the difference between a successful resilver and a total loss of the pool.

Why Existing Mitigations (Checkpoints/Spares) Are Not Enough:
Developers often point to zpool checkpoint or spares as solutions, but these are pre-planning tools. This proposal addresses the unplanned emergency:
• Checkpoints are "Rewind" buttons: If a drive fails during a resilver after days of new data have been written, rolling back to a checkpoint results in massive data loss.
• Spares require capacity: Many users (especially in TrueNAS/Small-Biz environments) operate with all drive bays full. You cannot "pre-plan" a spare when you have no physical slots left.

Proposed Feature:
Introduce an optional "Quarantine" or "Best-Effort" read-only state for drives that would otherwise be faulted.
• Non-Destructive Participation: The drive is kept in the pool but marked as "Suspect."
• Parity Aid: During a resilver of another drive, ZFS should be allowed to attempt to read from the "Suspect" drive if it is the only remaining way to reconstruct a block.
• Strict Validation: Any data pulled from a Suspect drive must be strictly validated against existing checksums. If the checksum fails, ZFS continues as it does now (declaring a read error), but it doesn't discard the drive's other healthy blocks.
• Admin Override: This mode would be manually toggled by an administrator who accepts the risk of keeping a failing drive online to prioritize data availability over absolute drive health standards.
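
The read policy described in the bullets above can be sketched as a toy model. This is only an illustration under stated assumptions, not OpenZFS code: it models a single-parity stripe with plain XOR, uses SHA-256 as a stand-in for the block checksum, and checks the rebuilt column directly rather than a whole logical block. The names SUSPECT, best_effort, and reconstruct_column are hypothetical.

```python
import hashlib
from functools import reduce
from typing import Dict, Optional

# Device states; SUSPECT is the hypothetical state from this proposal.
ONLINE, SUSPECT, FAULTED = "online", "suspect", "faulted"


def xor_columns(columns):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*columns))


def reconstruct_column(stripe: Dict[str, Optional[bytes]],
                       states: Dict[str, str],
                       missing: str,
                       expected_sha256: bytes,
                       best_effort: bool) -> Optional[bytes]:
    """Rebuild the column of the device being resilvered (`missing`).

    `stripe` maps device name -> column bytes, or None for an I/O error.
    Returns the verified column, or None to signal a read error.
    """
    survivors = []
    for dev, state in states.items():
        if dev == missing:
            continue
        # A suspect device may only be read when the admin enabled the override.
        if state == FAULTED or (state == SUSPECT and not best_effort):
            return None
        column = stripe.get(dev)
        if column is None:
            return None  # this particular sector is unreadable
        survivors.append(column)
    candidate = xor_columns(survivors)
    # Strict validation: never accept data that fails the stored checksum.
    if hashlib.sha256(candidate).digest() != expected_sha256:
        return None
    return candidate
```

In this sketch a failed checksum or an I/O error on one stripe only fails that stripe. The suspect device stays eligible for the next stripe, which is the behavior the "Strict Validation" bullet asks for.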

Use Case:
In a zero-redundancy state, if a remaining drive throws intermittent errors during a resilver, ZFS often faults that drive to "protect" integrity. Because no other parity exists, this immediately kills the resilver and renders the entire pool unrecoverable, even if 99% of the faulted drive was still readable.

With the proposed "Best-Effort" mode, the suspect drive stays online instead of triggering a total collapse. The specific bad sectors may still result in localized Permanent Data Errors (corrupted files), but the healthy 99% of the drive is used to complete the resilver. This prevents a survivable hardware hiccup from escalating into a Total Pool Loss.
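
To make the difference concrete, here is a small toy model of the two outcomes. It is a sketch with made-up numbers, not a simulation of real ZFS behavior: fault_threshold is a hypothetical stand-in for whatever error count gets a drive faulted, and real fault decisions are more involved.

```python
from typing import Dict, Set


def resilver_outcome(total_blocks: int, bad_blocks: Set[int],
                     fault_threshold: int, best_effort: bool) -> Dict[str, object]:
    """Walk every block once, reading from the last remaining source drive."""
    errors = 0
    recovered = 0
    for block in range(total_blocks):
        if block in bad_blocks:
            errors += 1
            if not best_effort and errors >= fault_threshold:
                # Drive is faulted with no redundancy left: the resilver cannot finish.
                return {"pool_lost": True, "recovered": recovered, "damaged": errors}
            continue  # best-effort: record a localized error and keep going
        recovered += 1
    return {"pool_lost": False, "recovered": recovered, "damaged": errors}


# 10 unreadable blocks out of 1000, i.e. a drive that is 99% readable.
bad = set(range(0, 1000, 100))
print(resilver_outcome(1000, bad, fault_threshold=3, best_effort=False))
print(resilver_outcome(1000, bad, fault_threshold=3, best_effort=True))
```

With these assumed numbers, the first call gives up at the third error and reports the pool as lost, while the second finishes with 990 of 1000 blocks rebuilt and 10 blocks flagged as damaged.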
