
During downscaling the pods don't get killed gracefully #156

Closed
Templum opened this issue Mar 9, 2018 · 10 comments

Comments

@Templum

Templum commented Mar 9, 2018

Expected Behaviour

I expect that, after experiencing heavy load, the system slowly decreases the number of available pods.
Additionally, during the downscaling no request should be lost due to a killed pod, similar to the behaviour that can be achieved using $ kubectl delete pod lambda-xy --grace-period=30. You can also set that behaviour for a replica set.

Current Behaviour

Currently, the pod gets removed without any grace period at all, resulting in some requests getting lost.

Possible Solution

Using the available functionality of a replica set to define a graceful shutdown period should already address this issue, as Kubernetes will avoid passing new traffic/requests to dying instances. In the long term, it might be a logical idea to handle this in the watchdog. However, I would recommend considering only Linux behaviour for the first version, as Windows containers are rare.
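For reference, this is roughly what the replica-set side of that solution looks like. The helper below is a hypothetical sketch, not faas-netes code; only the TerminationGracePeriodSeconds field (which maps to terminationGracePeriodSeconds in the pod spec YAML) comes from the Kubernetes API.

```go
// Hypothetical sketch: give the function's pods a termination grace period
// so Kubernetes sends SIGTERM, removes the pod from the Service endpoints,
// and only sends SIGKILL after the period expires.
package deploy

import appsv1 "k8s.io/api/apps/v1"

// applyGracePeriod is an illustrative helper; it sets the pod spec field
// that corresponds to terminationGracePeriodSeconds in a Deployment manifest.
func applyGracePeriod(d *appsv1.Deployment, seconds int64) {
	d.Spec.Template.Spec.TerminationGracePeriodSeconds = &seconds
}
```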

Steps to Reproduce (for bugs)

I set up a repository with a sample script which should allow reproducing the error. At least on macOS it works. See below for the link.

Your Environment

  • Kubernetes Version: v1.8.8
  • Minikube v0.25.0

Are you using Docker Swarm or Kubernetes (FaaS-netes)?

  • Kubernetes

Operating System and version (e.g. Linux, Windows, MacOS):

  • MacOS

Link to your project or a code example to reproduce issue:

Have a look here

@alexellis
Member

Hi Simon,

Thanks for taking the time to test this and to give your feedback. It would be extremely useful for us if you could provide deterministic instructions and example test results showing the problem. For instance, can you simulate the problem by scaling the OpenFaaS deployment with kubectl, thereby isolating some of the "moving parts"?

Kubernetes should remove any pods in a Terminating status from the Service's load-balancing pool, which should prevent any requests being sent to a pod in a Terminating status.

There is probably something we can try in the watchdog to handle the initial SIGTERM signal sent by Kubernetes when terminating a Pod - here's a good starting point - https://golang.org/pkg/net/http/#Server.Shutdown
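For context, a minimal sketch of the pattern the linked Server.Shutdown docs describe, not the actual watchdog code: trap the termination signal and drain in-flight requests before exiting. The 30-second timeout is an assumption matching the default pod grace period.

```go
// Minimal graceful-shutdown sketch: stop accepting new work on SIGTERM/SIGINT
// and let in-flight requests finish before the process exits.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("listen: %v", err)
		}
	}()

	// Wait for the termination signal Kubernetes sends when the pod is deleted.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM, syscall.SIGINT)
	<-sig

	// Drain in-flight requests for up to 30 seconds, then exit.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown failed: %v", err)
	}
}
```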

Are you willing to work on this further?

@Templum
Author

Templum commented Mar 12, 2018

Hi @alexellis ,

I went ahead and created a simple Node script which demonstrates the issue. I will also try to play around with the behaviour when it is triggered using $ kubectl.

You are completely right regarding the Terminating status. It might be that the shutdown period is too short, and simply increasing it in the deployment might already fix the issue. However, in the long run it probably makes sense to handle the SIGTERM signal as well. But I guess it will be more than enough to set up the whole thing for Linux only first, as Windows containers are very rare.

I might have a look at this; however, I'm still new to Go and would definitely need someone to review any changes.

Best regards,
Simon

@alexellis
Member

I believe this is fixed now and released in the latest watchdog versions, if you'd like to re-try your test to validate, @Templum? Thanks for bringing this up and helping us improve OpenFaaS as a whole.

@Templum
Author

Templum commented Apr 6, 2018

Can you please point me to the Docker image version which includes your patch? Then I can retry the tests.

@alexellis
Member

Sure, the shutdown is implemented in the 0.7.9 watchdog if you would like to try that?

@Templum
Author

Templum commented Apr 19, 2018

@alexellis Just to ensure this is not an issue related to my machine, could you have a look:

  1. Deploy OpenFaaS on minikube
  2. Deploy a function nodeinfo from image templum/nodeinfo:latest
  3. Clone the test suite: https://github.com/Templum/OpenFaaS-Gracefully-Shutdown-Test.git
    • $ cd OpenFaaS-Gracefully-Shutdown-Test
    • $ yarn install
    • $ node open_faas_test.js

Note: I have built a new version of nodeinfo (as the last one was built months ago) to ensure it contains your changes.

My result is that when I start it, at some point it starts failing and I receive Bad Gateway errors. And this happens even before the auto-scaling kicks in.
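For anyone reproducing this without the linked repo, the test boils down to the loop below. This is a rough Go equivalent of the Node script, not the script itself; the gateway URL, NodePort and function name are assumptions for a default minikube deployment.

```go
// Rough reproduction sketch: fire a steady stream of requests at the function
// while replicas scale up and down, and count Bad Gateway responses. Any 502
// suggests a request was routed to a pod that was already being killed.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Assumption: gateway exposed on the default faas-netes NodePort of a minikube cluster.
	url := "http://192.168.99.100:31112/function/nodeinfo"
	var total, badGateway int

	for start := time.Now(); time.Since(start) < 5*time.Minute; {
		resp, err := http.Post(url, "text/plain", nil)
		total++
		if err != nil || resp.StatusCode == http.StatusBadGateway {
			badGateway++
		}
		if resp != nil {
			resp.Body.Close()
		}
		fmt.Printf("\rrequests=%d bad-gateway=%d", total, badGateway)
		time.Sleep(50 * time.Millisecond)
	}
	fmt.Println()
}
```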

@alexellis
Member

What’s in your image?

@Templum
Author

Templum commented Apr 20, 2018

@alexellis Nothing that should differ from the original. In fact, I built it from the official openfaas/faas repo, as I saw on Docker Hub that the last nodeinfo image was built around 6 months ago.

@alexellis
Member

Derek close: resolved/released

@derek derek bot closed this as completed Oct 28, 2018
@alexellis
Member

I'm not sure that openfaas/faas#573 actually fixed this issue for functions that run for longer than 30 seconds.

@kevin-lindsay-1 raised #853 to flag it for functions that can run for longer than 30 seconds and it's being looked into now.

cc @Templum
