
During downscaling the pods don't get killed gracefully #156

Closed
Templum opened this issue Mar 9, 2018 · 10 comments

Comments

@Templum

Templum commented Mar 9, 2018

Expected Behaviour

I expect that, after experiencing heavy load, the system slowly decreases the number of available pods.
Additionally, during the downscaling no request should be lost due to a killed pod, similar to the behaviour that can be achieved using $ kubectl delete pod lambda-xy --grace-period=30. You can also set that behaviour for a replica set.

Current Behaviour

Currently, the pod gets removed without any grace period at all, resulting in some requests getting lost.

Possible Solution

Using the available functionality of a replica set to define a graceful shutdown period should already address this issue, as Kubernetes will avoid passing new traffic/requests to dying instances. In the long term, it might be a logical idea to handle this in the watchdog. However, I would recommend considering only Linux behaviour for the first version, as Windows containers are rare.
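For reference, this is roughly what the replica-set side of that solution looks like. The helper below is a hypothetical sketch, not faas-netes code; only the TerminationGracePeriodSeconds field (which maps to terminationGracePeriodSeconds in the pod spec YAML) comes from the Kubernetes API.

```go
// Hypothetical sketch: give the function's pods a termination grace period
// so Kubernetes sends SIGTERM, removes the pod from the Service endpoints,
// and only sends SIGKILL after the period expires.
package deploy

import appsv1 "k8s.io/api/apps/v1"

// applyGracePeriod is an illustrative helper; it sets the pod spec field
// that corresponds to terminationGracePeriodSeconds in a Deployment manifest.
func applyGracePeriod(d *appsv1.Deployment, seconds int64) {
	d.Spec.Template.Spec.TerminationGracePeriodSeconds = &seconds
}
```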

Steps to Reproduce (for bugs)

I set up a repository with a sample script which should allow reproducing the error. At least on macOS it works. See below for the link.

Your Environment

  • Kubernetes Version: v1.8.8
  • Minikube v0.25.0

Are you using Docker Swarm or Kubernetes (FaaS-netes)?

  • Kubernetes

Operating System and version (e.g. Linux, Windows, MacOS):

  • MacOS

Link to your project or a code example to reproduce issue:

Have a look here

@alexellis
Member

Hi Simon,

Thanks for taking the time to test this and to give your feedback. It would be extremely useful for us if you could provide deterministic instructions and example test results showing the problem. For instance, can you simulate the problem by scaling the OpenFaaS deployment with kubectl, thereby isolating some of the "moving parts"?

Kubernetes should remove any pods in a Terminating status from the Service's load-balancing pool, which should prevent any requests being sent to a pod in a Terminating status.

There is probably something we can try in the watchdog to handle the initial SIGTERM signal sent by Kubernetes when terminating a Pod - here's a good starting point - https://golang.org/pkg/net/http/#Server.Shutdown
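For context, a minimal sketch of the pattern the linked Server.Shutdown docs describe, not the actual watchdog code: trap the termination signal and drain in-flight requests before exiting. The 30-second timeout is an assumption matching the default pod grace period.

```go
// Minimal graceful-shutdown sketch: stop accepting new work on SIGTERM/SIGINT
// and let in-flight requests finish before the process exits.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("listen: %v", err)
		}
	}()

	// Wait for the termination signal Kubernetes sends when the pod is deleted.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM, syscall.SIGINT)
	<-sig

	// Drain in-flight requests for up to 30 seconds, then exit.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("graceful shutdown failed: %v", err)
	}
}
```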

Are you willing to work on this further?

@Templum
Author

Templum commented Mar 12, 2018

Hi @alexellis ,

I went ahead and created a simple Node script which demonstrates the issue. I will also try to play around with the behaviour when it is triggered using $ kubectl.

You are completely right regarding the Terminating status. It might be that the shutdown period is too short, and simply increasing it in the deployment might already fix the issue. However, in the long run it probably makes sense to handle the SIGTERM signal as well. But I guess it will be more than enough to set up the whole thing for Linux only first, as Windows containers are very rare.

I might have a look at this; however, I'm still new to Go and would definitely need someone to review any changes.

Best regards,
Simon

@alexellis
Member

I believe this is fixed now and released in the latest watchdog versions, if you'd like to re-try your test to validate, @Templum? Thanks for bringing this up and helping us improve OpenFaaS as a whole.

@Templum
Author

Templum commented Apr 6, 2018

Can you please point me to the Docker image version which includes your patch? Then I can retry the tests.

@alexellis
Member

Sure, the shutdown is implemented in the 0.7.9 watchdog if you would like to try that?

@Templum
Author

Templum commented Apr 19, 2018

@alexellis Just to ensure this is not an issue related to my machine, could you have a look:

  1. Deploy OpenFaaS on minikube
  2. Deploy a function nodeinfo from image templum/nodeinfo:latest
  3. Clone the test suite: https://github.com/Templum/OpenFaaS-Gracefully-Shutdown-Test.git
    • $ cd OpenFaaS-Gracefully-Shutdown-Test
    • $ yarn install
    • $ node open_faas_test.js

Note: I have built a new version of nodeinfo (as the last one was built months ago) to ensure it contains your changes.

My result is that when I start it, at some point it starts failing and I receive Bad Gateway errors. And this happens even before the auto-scaling kicks in.
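For anyone reproducing this without the linked repo, the test boils down to the loop below. This is a rough Go equivalent of the Node script, not the script itself; the gateway URL, NodePort and function name are assumptions for a default minikube deployment.

```go
// Rough reproduction sketch: fire a steady stream of requests at the function
// while replicas scale up and down, and count Bad Gateway responses. Any 502
// suggests a request was routed to a pod that was already being killed.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Assumption: gateway exposed on the default faas-netes NodePort of a minikube cluster.
	url := "http://192.168.99.100:31112/function/nodeinfo"
	var total, badGateway int

	for start := time.Now(); time.Since(start) < 5*time.Minute; {
		resp, err := http.Post(url, "text/plain", nil)
		total++
		if err != nil || resp.StatusCode == http.StatusBadGateway {
			badGateway++
		}
		if resp != nil {
			resp.Body.Close()
		}
		fmt.Printf("\rrequests=%d bad-gateway=%d", total, badGateway)
		time.Sleep(50 * time.Millisecond)
	}
	fmt.Println()
}
```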

@alexellis
Member

What’s in your image?

@Templum
Author

Templum commented Apr 20, 2018

@alexellis Nothing that should differ from the original. In fact, I built it from the official openfaas/faas repo, as I saw on Docker Hub that the last nodeinfo image was built around 6 months ago.

@alexellis
Member

Derek close: resolved/released

@derek derek bot closed this as completed Oct 28, 2018
@alexellis
Member

I'm not sure that openfaas/faas#573 actually fixed this issue for functions that run for longer than 30 seconds.

@kevin-lindsay-1 raised #853 to flag it for functions that can run for longer than 30 seconds and it's being looked into now.

cc @Templum
