Workers go into restarting/crash cycle (WORKER TIMEOUT / signal 6) #339
-
I am struggling to know which layer is the root cause here. My app runs fine, but then suddenly it is unable to serve requests for a while and then "fixes itself". While it's unable to serve requests, my logs show the WORKER TIMEOUT / signal 6 errors from the title.
Initially, I thought it was related to load and resource limits, but it seems to also happen during "typical load" and when resources are nowhere near their limits.
-
BTW, I saw this ticket here #47, but I think it's not the same issue.
-
Facing the same issue. Whenever my email-sending code is executed with an incorrect smtp_url or port, my worker crashes. There is no crash if the smtp_url and port are valid.
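If it helps to picture it, here is a minimal hypothetical sketch of that failure mode, assuming smtplib called from inside a FastAPI async endpoint (the names and values are placeholders, not the original snippet):

```python
# Hypothetical sketch: a blocking smtplib call inside an async endpoint.
import smtplib
from fastapi import FastAPI

app = FastAPI()

SMTP_URL = "smtp.invalid.example"  # placeholder; a wrong host/port triggers the hang
SMTP_PORT = 2525

@app.post("/notify")
async def notify():
    # smtplib connects synchronously in the constructor. With an unreachable
    # host or filtered port this can block for the full TCP connect timeout,
    # stalling the event loop until gunicorn hits WORKER TIMEOUT and kills
    # the worker (the signal 6 from the title).
    with smtplib.SMTP(SMTP_URL, SMTP_PORT) as server:
        server.sendmail("from@example.com", "to@example.com", "Subject: hi\n\nhello")
    return {"sent": True}
```

A wrong but reachable port usually fails fast with a connection error, while an unreachable host can block far longer than gunicorn's default 30-second worker timeout, which would match "no crash when the smtp_url and port are valid".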
-
I'm also facing the same issue. Any workarounds?
-
I resolved this issue by adding a worker timeout when starting my gunicorn application.
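For reference, the same thing can go in a gunicorn config file; a minimal sketch with an arbitrary 120-second value (not necessarily the poster's setting):

```python
# gunicorn.conf.py -- picked up automatically from the working directory,
# or passed explicitly with `gunicorn -c gunicorn.conf.py app:app`.
bind = "0.0.0.0:8000"
workers = 2
timeout = 120  # seconds of worker silence before the master kills and restarts it
```

This is equivalent to passing --timeout 120 on the command line; gunicorn's default is 30 seconds.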
-
Facing the same issue when running long processes over websockets, and it ends up terminating the websocket connection. Any fixes?
-
Facing the same issue when I use haystack; I have modified the docker-compose.yml for my setup.
-
Facing this issue while using docker; it works perfectly fine when run directly. None of the suggested solutions worked for me. Can someone please point me in the right direction to resolve this issue?
-
Facing this same issue (on both CentOS and Ubuntu VMs); it happens during typical load, and resources are nowhere near their limits.
-
Same here. Can anyone suggest a good alternative?
-
In my case it seemed to happen when a request to an external service timed out.
-
Any solution for this? Facing the same issue when calling an endpoint that takes 1-2 minutes to execute.
-
@atTheShikhar For me, the answer was switching to a Flask + uWSGI combination.
-
Unfortunately I cannot change the server and framework, since most of the work is already done in my case. I just need this one endpoint to work.
-
Can you show your launch command? What parameters do you use? I assume gunicorn should run smoothly when using 1 worker in a single process.
-
I am using docker, and my final run command includes the timeout, added after reading the discussion above. Btw, this runs just fine locally; the problem only happens after I deploy it on GCR.
-
Hmm, for me it struggles both on an EC2 machine and on Fargate... but did you try running 1 worker, just to get it working?
-
Yup, I tried with 1 worker, still no luck.
-
Anyone have any update?
-
I've had the same issue. BUT, it only came up after I started using the max_requests setting...
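For anyone unfamiliar with it, max_requests tells gunicorn to recycle a worker after that many requests; a minimal config sketch with illustrative numbers (not the poster's values):

```python
# gunicorn.conf.py -- illustrative values only
workers = 4
max_requests = 1000        # recycle each worker after ~1000 requests
max_requests_jitter = 100  # stagger the recycle point per worker
timeout = 120
```

Without max_requests_jitter, all workers can hit the limit and restart at the same moment, which can briefly look like the whole app going unresponsive.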
-
Adding some info in case it helps: my gunicorn config and how it is run. Server RAM: 1.9GB. Thanks
-
Managed to resolve this issue, sharing in case this helps.

Our issue originated from making external API calls from within an async endpoint. These API calls did not support async, which introduced blocking calls to the event loop, resulting in the uvicorn worker timing out. Our reliance on FastAPI Cache decorators for these async endpoints prevented us from simply redefining these endpoints as sync (async def -> def).

To resolve, we made use of the run_in_threadpool() utility function to ensure these sync calls are run in a separate threadpool, outside the event loop. Alongside this, we updated our gunicorn config so the workers and threads count was equal, setting both to 4.

```python
from fastapi.concurrency import run_in_threadpool

@api.get('/handler')
async def handler():
    ...
    # Slow async function
    await my_async_function()
    ...
    # Slow-running sync function, offloaded to a threadpool
    await run_in_threadpool(sync_function)
```

We released this update over 2 weeks ago and haven't seen any worker timeouts. Hopefully this helps 🙂
-
@nicholasmccrea Wow, that must have been hard to identify! Great news!
-
@nicholasmccrea Your solution looks good, but I imagined that removing the "async" from the endpoint functions already lets the requests be handled in a separate threadpool. Do you need to explicitly run it in a separate function? I am taking this knowledge from a comment on another post.
-
@mcazim98 In our case we were not able to redefine our endpoints as sync due to our reliance on FastAPI cache decorators; the FastAPI cache version we were using did not support sync functions. Thankfully, FastAPI cache now supports sync functions as of a more recent version.
-
Yes, I was having similar issues where an endpoint was requesting a response from an ML model running within the same FastAPI application. Redefining the endpoints as sync versions fixed the issue.
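A minimal sketch of that change, with a placeholder standing in for the real model call: FastAPI runs plain def endpoints in its threadpool, so the blocking inference no longer stalls the event loop.

```python
import time
from fastapi import FastAPI

app = FastAPI()

def slow_model_predict() -> dict:
    # Placeholder for a blocking ML inference call.
    time.sleep(5)
    return {"label": "ok"}

# With `async def`, the blocking call above would hold the event loop for 5 s.
# With plain `def`, FastAPI dispatches the handler to a worker thread instead.
@app.get("/predict")
def predict():
    return slow_model_predict()
```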
-
I encountered a similar issue running a FastAPI backend, where an endpoint sometimes seemed to randomly ignore my requests but then 'fixed itself' and worked again. After spending 2 weeks on this, I realized it had to do with ports, and my fix was on the port-configuration side.
-
I experienced a similar issue with slightly different symptoms. I was able to resolve it by updating my connection pool to use asynchronous connections, so that acquiring a connection and running queries no longer block the event loop. This change resolved my issue.
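The comment does not say which database driver was involved; as a rough sketch of the sync-vs-async pool difference, assuming psycopg 3 with psycopg_pool and FastAPI:

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI
from psycopg_pool import AsyncConnectionPool  # vs. the blocking ConnectionPool

# Placeholder DSN; the pool is opened during application startup rather than
# at import time.
pool = AsyncConnectionPool("postgresql://user:pass@localhost/db", open=False)

@asynccontextmanager
async def lifespan(app: FastAPI):
    await pool.open()
    yield
    await pool.close()

app = FastAPI(lifespan=lifespan)

@app.get("/rows")
async def rows():
    # Acquiring the connection and running the query are both awaited, so the
    # event loop is never blocked waiting on the database.
    async with pool.connection() as conn:
        cur = await conn.execute("SELECT 1")
        return await cur.fetchall()
```

The same idea applies with other drivers such as asyncpg or SQLAlchemy's async engine: waiting for a pooled connection and for query results happens with await instead of blocking the worker's event loop.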