Workers go into restarting/crash cycle (WORKER TIMEOUT / signal 6) #339
-
I am struggling to know which layer is the root cause here. My app runs fine, but then suddenly it is unable to serve requests for a while and then "fixes itself". While it's unable to serve requests, my logs show the WORKER TIMEOUT / signal 6 errors from the title.
Initially, I thought it was related to load and resource limits, but it seems to also happen during "typical load" and when resources are nowhere near their limits.
-
BTW, I saw this ticket here #47, but I think it's not the same issue.
-
Facing the same issue. Whenever my email-sending code is executed with an incorrect smtp_url or port, my worker crashes. There is no crash if the smtp_url and port are valid.
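If it helps to picture it, here is a minimal hypothetical sketch of that failure mode, assuming smtplib called from inside a FastAPI async endpoint (the names and values are placeholders, not the original snippet):

```python
# Hypothetical sketch: a blocking smtplib call inside an async endpoint.
import smtplib
from fastapi import FastAPI

app = FastAPI()

SMTP_URL = "smtp.invalid.example"  # placeholder; a wrong host/port triggers the hang
SMTP_PORT = 2525

@app.post("/notify")
async def notify():
    # smtplib connects synchronously in the constructor. With an unreachable
    # host or filtered port this can block for the full TCP connect timeout,
    # stalling the event loop until gunicorn hits WORKER TIMEOUT and kills
    # the worker (the signal 6 from the title).
    with smtplib.SMTP(SMTP_URL, SMTP_PORT) as server:
        server.sendmail("from@example.com", "to@example.com", "Subject: hi\n\nhello")
    return {"sent": True}
```

A wrong but reachable port usually fails fast with a connection error, while an unreachable host can block far longer than gunicorn's default 30-second worker timeout, which would match "no crash when the smtp_url and port are valid".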
-
I'm also facing the same issue. Any workarounds?
-
I resolved this issue by adding a worker timeout when starting my gunicorn application.
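For reference, the same thing can go in a gunicorn config file; a minimal sketch with an arbitrary 120-second value (not necessarily the poster's setting):

```python
# gunicorn.conf.py -- picked up automatically from the working directory,
# or passed explicitly with `gunicorn -c gunicorn.conf.py app:app`.
bind = "0.0.0.0:8000"
workers = 2
timeout = 120  # seconds of worker silence before the master kills and restarts it
```

This is equivalent to passing --timeout 120 on the command line; gunicorn's default is 30 seconds.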
-
Facing the same issue when running long processes over websockets, and it ends up terminating the websocket connection. Any fixes?
-
Facing the same issue when I use haystack; I have modified the docker-compose.yml for my setup.
-
Facing this issue while using docker; it works perfectly fine when run directly. None of the suggested solutions worked for me. Can someone please point me in the right direction to resolve this issue?
-
Facing this same issue (on both CentOS and Ubuntu VMs); it happens during typical load, and resources are nowhere near their limits.
-
Same here. Can anyone suggest a good alternative?
-
In my case it seemed to happen when a request to an external service timed out.
-
Any solution for this? Facing the same issue when calling an endpoint that takes 1-2 minutes to execute.
-
@atTheShikhar For me, the answer was switching to a Flask + uWSGI combination.
-
Unfortunately I cannot change the server and framework, since most of the work is already done in my case. I just need this one endpoint to work.
-
Can you show your launch command? What parameters do you use? I assume gunicorn should run smoothly when using 1 worker in a single process.
-
I am using docker, and my final run command includes the timeout, added after reading the discussion above. Btw, this runs just fine locally; the problem only happens after I deploy it on GCR.
-
Hmm, for me it struggles both on an EC2 machine and on Fargate... but did you try running 1 worker, just to get it working?
-
Yup, I tried with 1 worker, still no luck.
-
Anyone have any update?
-
I've had the same issue. BUT, it only came up after I started using the max_requests setting...
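For anyone unfamiliar with it, max_requests tells gunicorn to recycle a worker after that many requests; a minimal config sketch with illustrative numbers (not the poster's values):

```python
# gunicorn.conf.py -- illustrative values only
workers = 4
max_requests = 1000        # recycle each worker after ~1000 requests
max_requests_jitter = 100  # stagger the recycle point per worker
timeout = 120
```

Without max_requests_jitter, all workers can hit the limit and restart at the same moment, which can briefly look like the whole app going unresponsive.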
-
Adding some info in case it helps: my gunicorn config and how it is run. Server RAM: 1.9GB. Thanks
-
Managed to resolve this issue, sharing in case this helps.

Our issue originated from making external API calls from within an async endpoint. These API calls did not support async, which introduced blocking calls to the event loop, resulting in the uvicorn worker timing out. Our reliance on FastAPI Cache decorators for these async endpoints prevented us from simply redefining these endpoints as sync (async def -> def).

To resolve, we made use of the run_in_threadpool() utility function to ensure these sync calls are run in a separate threadpool, outside the event loop. Alongside this, we updated our gunicorn config so the workers and threads count was equal, setting both to 4.

```python
from fastapi.concurrency import run_in_threadpool

@api.get('/handler')
async def handler():
    ...
    # Slow async function
    await my_async_function()
    ...
    # Slow-running sync function, offloaded to a threadpool
    await run_in_threadpool(sync_function)
```

We released this update over 2 weeks ago and haven't seen any worker timeouts. Hopefully this helps 🙂
-
@nicholasmccrea Wow, that must have been hard to identify! Great news!
-
@nicholasmccrea Your solution looks good, but I imagined that removing the "async" from the endpoint functions already lets the requests be handled in a separate threadpool. Do you need to explicitly run it in a separate function? I am taking this knowledge from a comment on another post.
-
@mcazim98 In our case we were not able to redefine our endpoints as sync due to our reliance on FastAPI cache decorators; the FastAPI cache version we were using did not support sync functions. Thankfully, FastAPI cache now supports sync functions as of a more recent version.
-
Yes, I was having similar issues where an endpoint was requesting a response from an ML model running within the same FastAPI application. Redefining the endpoints as sync versions fixed the issue.
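A minimal sketch of that change, with a placeholder standing in for the real model call: FastAPI runs plain def endpoints in its threadpool, so the blocking inference no longer stalls the event loop.

```python
import time
from fastapi import FastAPI

app = FastAPI()

def slow_model_predict() -> dict:
    # Placeholder for a blocking ML inference call.
    time.sleep(5)
    return {"label": "ok"}

# With `async def`, the blocking call above would hold the event loop for 5 s.
# With plain `def`, FastAPI dispatches the handler to a worker thread instead.
@app.get("/predict")
def predict():
    return slow_model_predict()
```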
-
I encountered a similar issue running a FastAPI backend, where an endpoint sometimes seemed to randomly ignore my requests but then 'fixed itself' and worked again. After spending 2 weeks on this, I realized it had to do with ports, and my fix was on the port-configuration side.
-
I experienced a similar issue with slightly different symptoms. I was able to resolve it by updating my connection pool to use asynchronous connections, so that acquiring a connection and running queries no longer block the event loop. This change resolved my issue.
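The comment does not say which database driver was involved; as a rough sketch of the sync-vs-async pool difference, assuming psycopg 3 with psycopg_pool and FastAPI:

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI
from psycopg_pool import AsyncConnectionPool  # vs. the blocking ConnectionPool

# Placeholder DSN; the pool is opened during application startup rather than
# at import time.
pool = AsyncConnectionPool("postgresql://user:pass@localhost/db", open=False)

@asynccontextmanager
async def lifespan(app: FastAPI):
    await pool.open()
    yield
    await pool.close()

app = FastAPI(lifespan=lifespan)

@app.get("/rows")
async def rows():
    # Acquiring the connection and running the query are both awaited, so the
    # event loop is never blocked waiting on the database.
    async with pool.connection() as conn:
        cur = await conn.execute("SELECT 1")
        return await cur.fetchall()
```

The same idea applies with other drivers such as asyncpg or SQLAlchemy's async engine: waiting for a pooled connection and for query results happens with await instead of blocking the worker's event loop.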