Provide an option to retain failed machine #449
Comments
Overall sounds like a good idea to me.
Also, out of curiosity: if you are not willing to replace the unhealthy machine, you can simply set a large health-timeout flag, and MCM should not touch it. There could be an improvement to set
I think we have not faced it, but we have thought about this case. Basically, the kubelets can't reach the apiserver, but the rest of the control-plane components (KCM, MCM) can. Hence node objects go stale, KCM starts evicting pods, and for MCM the machines appear unhealthy and are therefore replaced. @amshuman-kr @ggaurav10 you worked on this case and also implemented the mitigation of disabling the control plane if the load balancer is unhealthy, right?
Out of curiosity: what is important about that local volume data? I ask because pods managing state typically do so on PVCs. Your pod can be evicted and respawned on another node at any time; even with no machine failures, pods "lose" their local data in such cases.
Good call! I've considered something similar and it is great to know we already have this.
Yes, I do want to spawn new machines to keep the healthy node count matching the spec.
Actually, we use local PVs, and all the data of our database goes to the local volume. The pods do consume the local PV via PVCs, but the PVC topology constraint can no longer be satisfied if the node backing the local PV is down. BTW, I'm going to elaborate on this in depth in today's Gardener community meeting. Hope that will make things clear.
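To illustrate the constraint described above: a local PV is pinned to one node via a required `nodeAffinity`, so a pod bound to it through a PVC cannot be rescheduled elsewhere once that node is gone. A minimal sketch (the names, path, and node hostname are illustrative, not from this issue):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-example
spec:
  capacity:
    storage: 100Gi
  accessModes: [ReadWriteOnce]
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1        # data lives on this node's disk only
  nodeAffinity:
    required:                    # hard constraint: PV is usable only on this node
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: [worker-node-1]
```

If `worker-node-1`'s machine is deleted, this PV (and the data under its `local.path`) is permanently unreachable, which is why machine replacement is destructive here.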
@aylei Gardener deploys the dependency-watchdog probe as part of the seed bootstrap, which probes the shoot apiservers via the load balancer and scales down the kube-controller-manager to avoid losing nodes in that scenario. Did that not work in your case? The probe was explicitly introduced to handle the case you mentioned above.
Minor sidenote on naming convention: I think we should reuse keywords from StorageClass, i.e. something like
This way, we do not create new terminology and can attach to already existing concepts. [BTW, maybe then we could have other tools that could still operate on the machine (depends on the infrastructure, I know) and extract data or reanimate the machine and so on.]
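For reference, the existing StorageClass concept this comment alludes to is the `reclaimPolicy` field, which already uses the values `Delete` and `Retain`:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-retain
provisioner: kubernetes.io/no-provisioner   # static local volumes, no dynamic provisioning
reclaimPolicy: Retain                       # keep the underlying volume after the PVC is released
volumeBindingMode: WaitForFirstConsumer     # bind only once a consuming pod is scheduled
```

Reusing `Retain`/`Delete` for the machine-deletion policy would keep the vocabulary consistent with what Kubernetes users already know from volume reclaim semantics.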
I like the feature as well, something like a "quarantine". If we do that, we should also specify a maximum number of such machines that may be "parked"/"ignored". We don't want them to pile up unbounded.
@aylei Do you still need this feature?
So effectively the other pods on such a node would suffer.
This use case needs to be reviewed in light of #818.
What would you like to be added:
A `--machine-delete-policy` option that supports retaining failed machines. Two strategies are proposed:
- `Delete`: current behavior
- `Orphan`: remove the failed machine from the machine set (orphan it)

Why is this needed:
I run stateful applications on Gardener using local persistent volumes, so machine deletion is a critical operation because it also deletes all data in the local PV. I have witnessed all nodes get replaced when the load balancer of the shoot apiserver was unhealthy, losing all local data.
As a more conservative strategy, the machineset controller could orphan the failed machine from the machineset without actually deleting it. The orphaned machines could then be deleted by human operators after manual confirmation.
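A rough sketch of how the proposal could surface on the MachineSet API, as an alternative to (or backing for) the proposed controller flag. This is hypothetical: the `machineDeletePolicy` field does not exist in MCM; only the `Delete`/`Orphan` semantics come from this issue:

```yaml
apiVersion: machine.sapcloud.io/v1alpha1
kind: MachineSet
metadata:
  name: my-machineset
spec:
  replicas: 3
  # Hypothetical field per this proposal -- not an implemented API.
  #   Delete: current behavior; the failed machine and its local data are removed
  #   Orphan: detach the failed machine from the set (a replacement is still
  #           created to satisfy replicas) and leave it for manual cleanup
  machineDeletePolicy: Orphan
```

Under `Orphan`, the replica count would still be restored by spawning a new machine, matching the requester's clarification above that healthy node count should continue to match the spec.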