Description
What would you like to be added:
A --machine-delete-policy option that supports retaining failed machines. Two strategies are proposed:
Delete: current behavior
Orphan: remove the failed machine from the machine set (orphan it)
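To make the proposal concrete, here is a minimal Go sketch of how such an option could be validated. It is only an illustration: the type name MachineDeletePolicy, the constants, and the helper parsePolicy are hypothetical, and the standard library flag package is used just to keep the example self-contained (the actual controller may wire its flags differently). Delete is assumed to be the default so that existing behavior is unchanged.

```go
package main

import (
	"flag"
	"fmt"
)

// MachineDeletePolicy is a hypothetical type for the proposed option.
type MachineDeletePolicy string

const (
	// PolicyDelete keeps the current behavior: failed machines are deleted.
	PolicyDelete MachineDeletePolicy = "Delete"
	// PolicyOrphan detaches failed machines from the machine set instead.
	PolicyOrphan MachineDeletePolicy = "Orphan"
)

// String implements flag.Value.
func (p *MachineDeletePolicy) String() string {
	if p == nil {
		return ""
	}
	return string(*p)
}

// Set implements flag.Value and rejects unknown policies.
func (p *MachineDeletePolicy) Set(v string) error {
	switch MachineDeletePolicy(v) {
	case PolicyDelete, PolicyOrphan:
		*p = MachineDeletePolicy(v)
		return nil
	default:
		return fmt.Errorf("invalid machine delete policy %q: must be %q or %q",
			v, PolicyDelete, PolicyOrphan)
	}
}

// parsePolicy parses the (hypothetical) flag from args.
// It falls back to Delete when the flag is not given.
func parsePolicy(args []string) (MachineDeletePolicy, error) {
	policy := PolicyDelete
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	fs.Var(&policy, "machine-delete-policy",
		"what to do with failed machines: Delete or Orphan")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return policy, nil
}

func main() {
	policy, err := parsePolicy([]string{"--machine-delete-policy=Orphan"})
	fmt.Println(policy, err)
}
```

Rejecting unknown values at startup seems preferable to silently falling back to Delete, since a typo in the policy name would otherwise lead to exactly the data loss this option is meant to prevent.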
Why is this needed:
I run stateful applications on Gardener using local persistent volumes. Machine deletion is a critical operation because it also deletes all data in the local PVs. I have witnessed all nodes being replaced, and all their local data lost, when the load balancer of the shoot API server was unhealthy.
As a more conservative strategy, the machineset controller could orphan the failed machine from the machineset without actually deleting it. The orphaned machines can then be deleted by human operators after manual confirmation.
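The intended behavior could be sketched roughly as follows. This is a simplified, self-contained model, not the controller's real code: the Machine struct, its fields, and the function handleFailedMachines are all hypothetical. In an actual Kubernetes controller, orphaning would presumably mean removing the owner reference and making sure the machine set's selector no longer matches the machine, so that it is not adopted again.

```go
package main

import "fmt"

// Machine is a simplified, hypothetical stand-in for the real machine object.
type Machine struct {
	Name   string
	Owner  string // name of the owning machine set; empty once orphaned
	Failed bool
}

// handleFailedMachines splits the machines of one machine set into three groups:
// healthy machines that stay in the set, failed machines that were orphaned,
// and the names of failed machines that would be deleted.
// With orphan == false it models the current behavior (delete failed machines).
func handleFailedMachines(machines []Machine, orphan bool) (active, orphaned []Machine, deleted []string) {
	for _, m := range machines {
		switch {
		case !m.Failed:
			active = append(active, m)
		case orphan:
			// Detach from the set but keep the machine (and its local data).
			m.Owner = ""
			orphaned = append(orphaned, m)
		default:
			deleted = append(deleted, m.Name)
		}
	}
	return active, orphaned, deleted
}

func main() {
	machines := []Machine{
		{Name: "node-a", Owner: "set-1"},
		{Name: "node-b", Owner: "set-1", Failed: true},
	}
	active, orphaned, deleted := handleFailedMachines(machines, true)
	fmt.Println(len(active), len(orphaned), len(deleted))
}
```

In both modes the machine set would still see fewer active machines than desired and could create replacements; the difference is only that with orphaning the failed machine and its local PV data remain until an operator removes them.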