`ResourceExhausted` is retried indefinitely on snapshot/clone create, doesn't match external-provisioner behavior

**Is your feature request related to a problem?/Why is this needed**

`external-snapshotter` treats `codes.ResourceExhausted` from the CSI driver as a non-final error and keeps retrying. For capacity-driven failures (no pool has enough free space to host the snapshot or clone), the condition persists until an operator frees or adds capacity, so the side-car keeps hammering the control-plane with retries that cannot succeed in the meantime. The error never surfaces on the VolumeSnapshot object, so the user has no terminal signal either.

`external-provisioner` already made this change in [kubernetes-csi/external-provisioner#675](https://github.com/kubernetes-csi/external-provisioner/pull/675) for exactly this reason. `external-snapshotter` was left out, and the two side-cars now disagree on what `ResourceExhausted` means.

We've been hitting this in production on OpenEBS Mayastor. Snapshot create or restore returns `ResourceExhausted` when a pool is full (HTTP 507 from the REST layer, mapped to `codes.ResourceExhausted` in the CSI gRPC response), and the snapshotter retries forever.
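For context, the driver-side mapping described above can be sketched roughly as follows. This is an illustration only, not Mayastor's actual code: `Code` is a local stand-in for `google.golang.org/grpc/codes.Code` (the numeric values mirror gRPC's), `codeFromHTTPStatus` is a hypothetical helper, and the fallback to `Internal` for every other non-2xx status is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
)

// Code is a local stand-in for google.golang.org/grpc/codes.Code so the
// sketch runs without external dependencies. Values mirror the gRPC ones.
type Code uint32

const (
	OK                Code = 0
	ResourceExhausted Code = 8
	Internal          Code = 13
)

// codeFromHTTPStatus is a hypothetical helper showing how a REST-layer
// status could be translated into the code returned over CSI gRPC.
func codeFromHTTPStatus(status int) Code {
	switch {
	case status >= 200 && status < 300:
		return OK
	case status == http.StatusInsufficientStorage: // HTTP 507: pool is full
		return ResourceExhausted
	default:
		// Assumption for this sketch: everything else is reported as Internal.
		return Internal
	}
}

func main() {
	code := codeFromHTTPStatus(http.StatusInsufficientStorage)
	fmt.Println(code == ResourceExhausted)
}
```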

**Describe the solution you'd like in detail**

Move `codes.ResourceExhausted` from the non-final list to the final list on the snapshot create path so the side-car surfaces the error to the VolumeSnapshot status instead of retrying indefinitely. This is the same shape as the external-provisioner fix in [kubernetes-csi/external-provisioner#675](https://github.com/kubernetes-csi/external-provisioner/pull/675), which was a clean "treat ResourceExhausted as final" change in that side-car.
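A minimal sketch of the proposed classification, to make the intent concrete. It is not the side-car's actual code: `Code` is a local stand-in for `google.golang.org/grpc/codes.Code`, `isFinalCode` is a hypothetical name, and the other entries in the non-final list are assumed for illustration. The only point is that `ResourceExhausted` no longer appears in that list.

```go
package main

import "fmt"

// Code is a local stand-in for google.golang.org/grpc/codes.Code so the
// sketch runs without external dependencies. Values mirror the gRPC ones.
type Code uint32

const (
	Canceled          Code = 1
	DeadlineExceeded  Code = 4
	ResourceExhausted Code = 8
	Aborted           Code = 10
	Unavailable       Code = 14
)

// isFinalCode is a hypothetical classifier. It returns false for codes
// where the operation may still be in progress (retry), and true for
// codes that should be surfaced as a terminal error.
func isFinalCode(c Code) bool {
	switch c {
	case Canceled, DeadlineExceeded, Unavailable, Aborted:
		// Non-final list (assumed for illustration): keep retrying.
		return false
	}
	// ResourceExhausted is deliberately absent from the list above, so it
	// falls through and is reported as final.
	return true
}

func main() {
	fmt.Println("ResourceExhausted final:", isFinalCode(ResourceExhausted))
	fmt.Println("Unavailable final:", isFinalCode(Unavailable))
}
```

With this shape, a full pool produces one terminal error on the VolumeSnapshot instead of an unbounded retry loop, while transient conditions such as `Unavailable` are still retried.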

[kubernetes-csi/external-snapshotter#1334](https://github.com/kubernetes-csi/external-snapshotter/pull/1334) (authored by @xing-yang) already did exactly this. It was small (11/4 lines) and had the `approved` label, but it stayed in WIP and was auto-closed by the triage bot after ~150 days of inactivity.

I can open a new PR with the same change if that helps move things along.

**Describe alternatives you've considered**

- Returning `FailedPrecondition` from the CSI driver instead of `ResourceExhausted` for capacity errors. This works as a workaround (we're doing this in [openebs/mayastor-control-plane#1114](https://github.com/openebs/mayastor-control-plane/pull/1114)) but it's misleading semantically, and we'd rather revert it once the side-car is fixed.
- Letting the retries continue and handling the noise on the control-plane side. Doesn't help the user, who still sees no terminal error on the VolumeSnapshot object.

**Additional context**

- [container-storage-interface/spec#419](https://github.com/container-storage-interface/spec/issues/419): the CSI spec is ambiguous about whether `ResourceExhausted` should be final or retriable. external-provisioner picked "final".
- xing-yang's reasoning on the original PR thread covers the same ground, including a [#sig-storage Slack discussion](https://kubernetes.slack.com/archives/C09QZFCE5/p1688119656274719) from 2023.

cc @xing-yang @jingxu97 
