Provide an option to retain failed machine #449

Open
aylei opened this issue Apr 21, 2020 · 10 comments
Labels
area/usability Usability related kind/enhancement Enhancement, improvement, extension lifecycle/rotten Nobody worked on this for 12 months (final aging stage) needs/planning Needs (more) planning with other MCM maintainers priority/5 Priority (lower number equals higher priority)

Comments

@aylei

aylei commented Apr 21, 2020

What would you like to be added:
A --machine-delete-policy option that supports retaining failed machines. Two strategies are proposed (a minimal sketch follows the list):

  • Delete: current behavior
  • Orphan: remove the failed machine from the machine set (orphan it)
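
A minimal sketch of how such a policy could be wired into the machine-set controller (the flag name, types, and helpers below are hypothetical illustrations, not existing MCM code):

```go
// Hypothetical sketch only: illustrates how a --machine-delete-policy flag
// could gate what happens to a machine that exceeds its health timeout.
package main

import (
	"flag"
	"fmt"
)

type deletePolicy string

const (
	policyDelete deletePolicy = "Delete" // current behavior: terminate the failed machine
	policyOrphan deletePolicy = "Orphan" // detach it from the machine set, keep the VM for inspection
)

var machineDeletePolicy = flag.String("machine-delete-policy", string(policyDelete),
	"what to do with machines that exceed the health timeout: Delete or Orphan")

// reconcileFailedMachine sketches the decision a machine-set controller could make
// for a machine that has been marked as failed.
func reconcileFailedMachine(machineName string) {
	switch deletePolicy(*machineDeletePolicy) {
	case policyOrphan:
		// Drop the machine from the machine set's selection (e.g. strip its labels or
		// owner reference) so a replacement is created, but leave the VM running for
		// a human operator to inspect and delete manually.
		fmt.Printf("orphaning %s for manual inspection\n", machineName)
	default:
		// Current behavior: drain and delete; the machine set creates a replacement.
		fmt.Printf("deleting %s\n", machineName)
	}
}

func main() {
	flag.Parse()
	reconcileFailedMachine("shoot-worker-z1-abc123")
}
```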

Why is this needed:
I run stateful applications on Gardener using local persistent volumes, so machine deletion is a critical operation because it also deletes all data in the local PV. I've witnessed all nodes getting replaced when the load balancer of the shoot apiserver was unhealthy, which lost all local data.
As a more conservative strategy, the machineset controller could orphan the failed machine from the machineset without actually deleting it. The orphaned machines can then be deleted by human operators after manual confirmation.

@aylei aylei added the kind/enhancement Enhancement, improvement, extension label Apr 21, 2020
@hardikdr
Member

Overall this sounds like a good idea to me.
The one aspect we might want to consider: if there are constant failures, such machines could stack up and lead to unnecessary costs if the user does not notice them. We have a freeze mechanism in place where, if the number of machine objects goes beyond a certain threshold, MCM stops scaling up.

  • You might want to complement the above-suggested feature with freeze, so that at some point we stop creating new machines.

Also, out of curiosity: if you do not want the unhealthy machine to be replaced at all, you can simply set a large health-timeout flag and MCM should not touch it. A possible improvement would be to allow -1 to signify an infinite timeout.

  • Or is it that you do want to spawn new machines in place of the unhealthy ones, but are also willing to keep the old ones?

@hardikdr
Member

I've witnessed all nodes getting replaced when the load balancer of the shoot apiserver was unhealthy, which lost all local data.

I think we have not faced this ourselves, but we have thought about the case. Basically, the kubelets can't reach the apiserver, while the rest of the control-plane components [kcm, mcm] can still reach it. Hence the node objects go stale, KCM starts evicting, and to MCM the machines look unhealthy and will therefore be replaced.

@amshuman-kr @ggaurav10 you worked on this case and also implemented the mitigation by disabling the control-plane if the load-balancer is unhealthy, right?
Can you please elaborate a little on how this can be avoided?

@dguendisch
Member

dguendisch commented Apr 23, 2020

Out of curiosity: what is important about that local volume data? I ask because pods managing state typically do this via PVCs. Your pod can be evicted and respawned on another node at any time, even without machine failures, and the pods also "lose" the local data in such cases.

@aylei
Author

aylei commented Apr 24, 2020

  • You might want to complement the above-suggested feature with freeze, so that at some point we stop creating new machines.

Good call! I've considered something similar and it is great to know we already have this.

  • Or is it that you do want to spawn new machines in place of the unhealthy ones, but are also willing to keep the old ones?

Yes, I do want to spawn new machines to keep the healthy node count matching the spec.

@aylei
Author

aylei commented Apr 24, 2020

Out of curiosity: what is important about that local volume data? I ask because pods managing state typically do this via PVCs. Your pod can be evicted and respawned on another node at any time, even without machine failures, and the pods also "lose" the local data in such cases.

Actually we use local PVs and all the data of our database goes to the local volume. The Pods do consume the local PV via PVCs, but the PVC topology constraint can no longer be satisfied if the Node backing the local PV is gone (see the sketch below).
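
For illustration, here is a minimal sketch of that topology constraint using the Kubernetes Go API (the node name, PV name, and path are made up): a local PV carries a required nodeAffinity pinning it to exactly one node, so once that node's machine is deleted the claim can never be satisfied again and the data is gone.

```go
// Illustrative only: builds a local PersistentVolume object to show why its data
// is tied to a single node.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pv := corev1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{Name: "local-pv-node-a"},
		Spec: corev1.PersistentVolumeSpec{
			Capacity: corev1.ResourceList{
				corev1.ResourceStorage: resource.MustParse("100Gi"),
			},
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			PersistentVolumeSource: corev1.PersistentVolumeSource{
				Local: &corev1.LocalVolumeSource{Path: "/mnt/disks/ssd1"},
			},
			// The topology constraint: this PV is only usable on "node-a". If that
			// machine is deleted, pods bound to the PVC cannot be scheduled anywhere.
			NodeAffinity: &corev1.VolumeNodeAffinity{
				Required: &corev1.NodeSelector{
					NodeSelectorTerms: []corev1.NodeSelectorTerm{{
						MatchExpressions: []corev1.NodeSelectorRequirement{{
							Key:      "kubernetes.io/hostname",
							Operator: corev1.NodeSelectorOpIn,
							Values:   []string{"node-a"},
						}},
					}},
				},
			},
		},
	}
	fmt.Printf("PV %s is pinned to a single node; deleting that machine deletes the data\n", pv.Name)
}
```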

BTW, I'm going to elaborate on this in depth in today's Gardener community meeting. I hope that will make things clearer.

@amshuman-kr

I've witnessed all nodes getting replaced when the load balancer of the shoot apiserver was unhealthy, which lost all local data.

@aylei Gardener deploys the dependency-watchdog probe as part of the seed bootstrap; it probes the shoot apiservers via the load balancer and scales down the kube-controller-manager to avoid losing nodes in that scenario. Did that not work in your case? The probe was introduced explicitly to handle the case you mention.

@vasu1124
Member

Minor sidenote on naming convention: I think we should reuse keywords from storageClass, i.e. something like --machine-reclaimPolicy:

  • Delete: default behavior.
  • Retain: remove from the working pool, but retain the machine.

This way, we do not create new terminology and can attach to already existing concepts.

[btw. maybe then we could have other tools that somehow could still operate on the machine (depends on the infrastructure, I know) and extract data or re-animate the machine and so on.]
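
As a small illustration of that analogy, a hypothetical MachineReclaimPolicy type could reuse the existing corev1.PersistentVolumeReclaimPolicy constants (no such type or flag exists in MCM today):

```go
// Sketch of the naming analogy only; MachineReclaimPolicy is hypothetical.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

type MachineReclaimPolicy string

const (
	// Delete: default behavior, terminate the failed machine ("Delete" from the PV vocabulary).
	MachineReclaimDelete = MachineReclaimPolicy(corev1.PersistentVolumeReclaimDelete)
	// Retain: remove from the working pool but keep the machine ("Retain" from the PV vocabulary).
	MachineReclaimRetain = MachineReclaimPolicy(corev1.PersistentVolumeReclaimRetain)
)

func main() {
	fmt.Println(MachineReclaimDelete, MachineReclaimRetain) // prints: Delete Retain
}
```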

@vlerenc
Member

vlerenc commented May 26, 2020

I like the feature as well, something like a "quarantine". If we do that, we should also specify a maximum number of such machines that shall be "parked"/"ignored". We don't want them to pile up unbounded like retained volumes. It's a debugging feature, so 1 or 2 is what I would expect people to normally desire.

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jul 26, 2020
@prashanth26 prashanth26 added area/usability Usability related size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Aug 16, 2020
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 16, 2020
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Dec 16, 2020
@gardener-robot gardener-robot added effort/2w Effort for issue is around 2 weeks and removed size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) labels Mar 8, 2021
@prashanth26 prashanth26 added the priority/5 Priority (lower number equals higher priority) label Jul 21, 2021
@himanshu-kun
Contributor

@aylei Do you still need this feature?
Have you considered these side-effects:

  • this node is not registered to the load-balancer anymore, so no internal traffic can reach the other pods on this node
  • any other pods on the orphaned machine would not be evicted, as we won't drain this node
  • the cluster autoscaler would still consider this node part of the node group, see it as a LongNotReady node, and try to remove it; MCM would reject that, and this would become a loop.

So effectively the other pods on such a node would suffer.

@himanshu-kun himanshu-kun added needs/planning Needs (more) planning with other MCM maintainers and removed lifecycle/rotten Nobody worked on this for 12 months (final aging stage) effort/2w Effort for issue is around 2 weeks labels Feb 21, 2023
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 31, 2023
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jul 9, 2024
@aaronfern
Contributor

This use case needs to be reviewed in light of #818.
