During downscaling the pods don't get killed gracefully #156
Hi Simon, thanks for taking the time to test this and give your feedback. It would be extremely useful for us if you could provide deterministic instructions and example test results showing the problem. For instance, can you simulate the problem by scaling the OpenFaaS deployment with `kubectl scale`? Kubernetes should remove any pods in a `Terminating` state. There is probably something we can try in the watchdog to handle the initial SIGINT message sent by Kubernetes when terminating a Pod - here's a good starting point - https://golang.org/pkg/net/http/#Server.Shutdown Are you willing to work on this further?
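A minimal sketch of that Server.Shutdown approach (not the actual watchdog code; the port, handler, and drain timeout are illustrative assumptions):

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("listen: %v", err)
		}
	}()

	// Kubernetes sends SIGTERM when a Pod enters the Terminating state;
	// SIGINT is included for local Ctrl-C testing.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM, syscall.SIGINT)
	<-sig

	// Stop accepting new connections and wait for in-flight requests to
	// finish, up to the deadline. This should stay below the Pod's
	// terminationGracePeriodSeconds so Kubernetes never has to SIGKILL.
	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown failed: %v", err)
	}
}
```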
Hi @alexellis, I went ahead and created a simple Node script which demonstrates the issue. I will also try to play around with the behaviour when it is triggered using `kubectl delete pod --grace-period`. You are completely right regarding the Terminating status. It might be that the shutdown period is too short, and simply increasing it in the deployment might already fix the issue. However, in the long run it makes sense to handle the SIGINT message. But I guess it will be more than enough to set up the whole thing for Linux only first, as Windows containers are very rare. I might have a look at this; however, I'm still fresh to Go and would definitely need someone to review any changes. Best regards,
I believe this is fixed now and released in the latest watchdog versions - would you like to re-try your test to validate, @Templum? Thanks for bringing this up and helping us improve OpenFaaS as a whole.
Can you please point me to the Docker image version which includes your patch? Then I can retry the tests.
Sure, the shutdown is implemented in the 0.7.9 watchdog if you would like to try that?
@alexellis Just to ensure this is not an issue related to my machine, could you have a look:
My result is that when I start it, at some point it starts failing and I receive
What's in your image?
@alexellis Nothing which should differ from the original. In fact, I built it from the official openfaas/faas, as I saw on Docker Hub that the last nodeinfo image was built around 6 months ago.
Derek close: resolved/released
I'm not sure that openfaas/faas#573 actually fixed this issue for functions that run for longer than 30 seconds. @kevin-lindsay-1 raised #853 to flag it and it's being looked into now. cc @Templum
Expected Behaviour
I expect that, after experiencing heavy load, the system slowly decreases the number of available pods.
Additionally, during downscaling no request should be lost due to a killed pod, similar to the behaviour that can be achieved using
$ kubectl delete pod lambda-xy --grace-period=30
You can also set that behaviour for a replica set.

Current Behaviour
Currently, the pod gets removed without any grace period at all, resulting in some requests getting lost.
Possible Solution
Using the available ReplicaSet functionality to define a graceful shutdown period should already address this issue, as Kubernetes will avoid routing new traffic/requests to dying instances - see the sketch below. In the long term, it might be a logical idea to handle the SIGINT message in the watchdog. However, I would recommend that the first version only consider Linux behaviour, as Windows containers are rare.
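A rough sketch of that grace-period setting on a function Deployment (the function name, labels, and 30-second value are hypothetical; faas-netes generates these manifests for you):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nodeinfo                 # hypothetical function name
spec:
  replicas: 4
  selector:
    matchLabels:
      faas_function: nodeinfo
  template:
    metadata:
      labels:
        faas_function: nodeinfo
    spec:
      # How long Kubernetes waits between sending SIGTERM and SIGKILL;
      # in-flight requests must complete within this window.
      terminationGracePeriodSeconds: 30
      containers:
        - name: nodeinfo
          image: functions/nodeinfo:latest
```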
Steps to Reproduce (for bugs)
I set up a repository with a sample script which should allow reproducing the error. At least on macOS it works. See below for the link.
Your Environment
Are you using Docker Swarm or Kubernetes (FaaS-netes)?
Operating System and version (e.g. Linux, Windows, MacOS):
Link to your project or a code example to reproduce issue:
Have a look here