Is your feature request related to a problem?/Why is this needed
external-snapshotter treats codes.ResourceExhausted from the CSI driver as a non-final error and keeps retrying. For capacity-driven failures (no pool with enough free space to host the snapshot or clone), the condition persists until an operator frees or adds capacity, but the side-car keeps hammering the control-plane until either a human intervenes or the pool happens to get more space. The error never surfaces on the VolumeSnapshot object, so the user has no terminal signal either.
external-provisioner already made this change in kubernetes-csi/external-provisioner#675 for exactly this reason. external-snapshotter was left out, and the two side-cars now disagree on what ResourceExhausted means.
We've been hitting this in production on OpenEBS Mayastor. Snapshot create or restore returns ResourceExhausted when a pool is full (HTTP 507 from the REST layer, mapped to codes.ResourceExhausted in the CSI gRPC response), and the snapshotter retries forever.
Describe the solution you'd like in detail
Move codes.ResourceExhausted from the non-final list to the final list on the snapshot create path so the side-car surfaces the error to the VolumeSnapshot status instead of retrying indefinitely. Same shape as the external-provisioner fix in kubernetes-csi/external-provisioner#675, which was the clean "treat ResourceExhausted as final" change in that side-car.
kubernetes-csi/external-snapshotter#1334 (authored by @xing-yang) already did exactly this. It was small (+11/-4 lines) and had the approved label, but it stayed in WIP and was auto-closed by the triage bot after ~150 days of inactivity.
I can open a new PR with the same change if that helps move things along.
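For illustration, the reclassification requested above can be sketched roughly as follows. This is a minimal standalone sketch, not the side-car's actual code: the `Code` type is a local stand-in for `google.golang.org/grpc/codes.Code` (with arbitrary values), the function names `isFinalBefore` and `isFinalAfter` are hypothetical, and the other codes shown in the retry list are assumptions included only to give the switch some context.

```go
package main

import "fmt"

// Code is a local stand-in for google.golang.org/grpc/codes.Code so this
// sketch builds with the standard library alone. The numeric values are
// arbitrary and do NOT match the real gRPC code values.
type Code int

const (
	Canceled Code = iota
	DeadlineExceeded
	Unavailable
	Aborted
	ResourceExhausted
	FailedPrecondition
)

// isFinalBefore models the behaviour described in this issue:
// ResourceExhausted sits in the non-final (retry) list. The other codes
// in the list are assumptions for illustration.
func isFinalBefore(c Code) bool {
	switch c {
	case Canceled, DeadlineExceeded, Unavailable, Aborted, ResourceExhausted:
		return false // non-final: the side-car keeps retrying
	}
	return true // final: the error is surfaced and retries stop
}

// isFinalAfter models the requested behaviour: ResourceExhausted is
// removed from the non-final list, so it falls through to "final".
func isFinalAfter(c Code) bool {
	switch c {
	case Canceled, DeadlineExceeded, Unavailable, Aborted:
		return false
	}
	return true
}

func main() {
	fmt.Println("ResourceExhausted final before:", isFinalBefore(ResourceExhausted))
	fmt.Println("ResourceExhausted final after:", isFinalAfter(ResourceExhausted))
}
```

The only difference between the two functions is the membership of ResourceExhausted in the retry list, which is why the change can stay as small as the closed PR was.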
Describe alternatives you've considered
- Returning FailedPrecondition from the CSI driver instead of ResourceExhausted for capacity errors. This works as a workaround (we're doing this in openebs/mayastor-control-plane#1114) but it's semantically misleading, and we'd rather revert it once the side-car is fixed.
- Letting the retries continue and handling the noise on the control-plane side. This doesn't help the user, who still sees no terminal error on the VolumeSnapshot object.
Additional context
- ... ResourceExhausted should be final or retriable. external-provisioner picked "final".
cc @xing-yang @jingxu97