Possible race condition leading to a connection reset if worker is gracefully terminating #2315
Replies: 10 comments 5 replies
-
Hi, we are seeing the same(?) problem: when a worker restarts due to max-requests, sometimes a request gets lost. In these cases, an RST/ACK can be observed. Setup:
Our app is an API server with async FastAPI endpoints. It receives relatively large requests (say 2-15 KB), and I think the larger requests have a better chance of triggering the race. While trying to repro this, I also started seeing a new error message in the error log.
-
Oh yeah, it's not very rare for us. With `max_requests = 10000` and 4 workers, we hit this every few hours :-)
-
Repro code, `asgi_sample.py`:

```python
async def app(scope, receive, send):
    headers = [(b"content-type", b"text/html")]
    body = b"<html>hi!<br>"
    await send({"type": "http.response.start", "status": 200, "headers": headers})
    await send({"type": "http.response.body", "body": body})
```

gunicorn invocation:
curl script (`repro_vpt.sh`):

```bash
#!/bin/bash
REMOTE='192.168.103.39:9222'
echo 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' | \
curl -XPUT 'http://'$REMOTE'/404/404/404/404/404/404/404' \
  -H 'user-agent: python-requests/2.31.0' \
  -H 'sentry-trace: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' \
  -H 'baggage: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE' \
  -H 'x-request-id: FFFFFFFFFFFFFFFFFFFF' \
  -H 'accept-encoding: gzip, deflate' \
  -H 'accept: */*' \
  -H 'content-type: application/json' \
  --data-binary @-
echo Running
```

Note that `repro_vpt.sh` has the IP address of the host running gunicorn in the curl command line. Having the "useless data" in the curl call seems to help with reproducing, but it is not completely necessary.
-
Having the same issue with gunicorn + uvicorn + Django 4. My workers handle a lot of requests, so at peak RPM with `--max-requests=8000` (25 containers, 1 worker each, round-robin load balancing) it is noticeable; I even get it randomly from time to time (it shows up in Cloudflare as unknown error code 502). Without `--max-requests` it works fine, except that I'm using it to keep memory leaks from taking down my workers. It doesn't seem related to request duration, headers, or content-length; it's just random. A corresponding error shows up in the logs when max-requests+1 is reached and the issue is triggered.
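For concreteness, a deployment like this can be captured in a gunicorn config file. The following is only a hypothetical sketch: `workers`, `worker_class`, and `max_requests` reflect the comment above, while the file name, bind address, and jitter value are assumptions.

```python
# gunicorn.conf.py -- hypothetical sketch of the deployment described above
bind = "0.0.0.0:8000"                           # assumed listen address
workers = 1                                     # one uvicorn worker per container
worker_class = "uvicorn.workers.UvicornWorker"  # run the Django 4 ASGI app under uvicorn
max_requests = 8000                             # recycle the worker to contain memory leaks
max_requests_jitter = 500                       # optional: stagger restarts across containers
```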
-
Same issue here.
-
@rbagd what if you use async endpoints instead of sync endpoints/routes?
-
@Kludex can you please have a look at this issue?
-
Hello! I too am observing Uvicorn sending a TCP RST without ever responding to an accepted request. It's definitely not rare; my MRE can reproduce it within a couple of seconds. There seems to be a race condition between the moment a connection is accepted / data is received by the worker and the moment the parser in `H11Protocol` or `HttpToolsProtocol` gets to handle it. This is tied to worker shutdown, and is easiest to reproduce with a low max-requests limit. The captured traffic looks similar to the OP's.

This, for example, causes nginx to mark the upstream connection as 'down' because of the resulting exception.

But the ACK from the service (Uvicorn) is a clear sign that the client is indeed waiting for an HTTP response, and that the connection should not be closed.

**MRE:** I created a repo with this example for ease of setup and testing: https://github.com/JacKeTUs/uvicorn-rst-mre

Launched with:

Here we're trying to send as many valid requests as fast as possible, and eventually we send a request at the exact moment the worker exits (at the end of a tick).

**Expected result:** The MRE should run indefinitely without any errors. All sent requests should return a valid response (HTTP 200 in this case).

**Observed result:** The MRE exits with an exception.

**Possible solution:** This can be bypassed easily by moving one of the shutdown steps, but the proper fix is probably to introduce some kind of flag or event like 'connection_made' inside the protocol classes, and to have the shutdown routine check whether that flag (connection has been made) is set instead of relying on the current check. With both solutions I was able to run my example for a substantial amount of time without any issues and without any unexpected connection resets. I would love to create the corresponding PRs if one of these solutions is acceptable 🙏 @Kludex, this bug negatively affects all users who recycle workers with a max-requests limit.
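A rough sketch of the flag idea, written against plain `asyncio.Protocol` rather than uvicorn's real protocol classes; every name here is illustrative, and it is only meant to show the shape of the check, not a drop-in fix:

```python
import asyncio
from typing import Optional


class SketchHttpProtocol(asyncio.Protocol):
    """Illustrative only: tracks enough state for shutdown() to decide safely."""

    def __init__(self) -> None:
        self.transport: Optional[asyncio.BaseTransport] = None
        self.connection_made_flag = False  # the proposed flag/event
        self.response_complete = False     # set once a response has been written
        self.keep_alive = True

    def connection_made(self, transport: asyncio.BaseTransport) -> None:
        # The socket has been accepted; request bytes may already be queued
        # in the kernel buffer even though nothing has been parsed yet.
        self.transport = transport
        self.connection_made_flag = True

    def data_received(self, data: bytes) -> None:
        # ... parse the request, run the app, write the response ...
        self.response_complete = True

    def shutdown(self) -> None:
        # Consult the proposed flag instead of only checking whether a
        # request-cycle object exists: a connection that has been made but
        # not yet answered is kept open (merely marked non-keep-alive) so
        # its pending request can still be served during the drain phase;
        # everything else is closed outright.
        if self.connection_made_flag and not self.response_complete:
            self.keep_alive = False
            return
        if self.transport is not None:
            self.transport.close()
```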
-
@JacKeTUs, you can remove the code block below, as the connections will be closed by `_wait_tasks_to_complete()`.
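In case it helps readers follow that argument, here is a generic sketch of such a "wait for in-flight work, then close what's left" shutdown phase. It is not uvicorn's actual `_wait_tasks_to_complete()` implementation, just an illustration of the role the comment attributes to it:

```python
import asyncio


async def wait_tasks_to_complete(
    tasks: set,          # in-flight request-handler tasks
    connections: set,    # objects exposing a .transport attribute
    timeout: float = 10.0,
) -> None:
    # Give in-flight request handlers a chance to finish...
    if tasks:
        await asyncio.wait(tasks, timeout=timeout)
    # ...then close whatever connections are still open.
    for connection in list(connections):
        connection.transport.close()
```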
-
@Kludex
As the socket is closed first, how can the ongoing response be sent over the closed socket?
-
We have encountered a relatively rare connection error which is probably due to a race condition while a `uvicorn` worker is trying to shut down.

Here is the setup:

- `uvicorn` worker
- `gunicorn` with `--max-requests` for regularly restarting workers

I can reproduce it with Python 3.11, both with `uvloop` and with `asyncio`, but couldn't reproduce it with `asyncio` and Python 3.12.

To reproduce, I launch the app below and then stress test the application with many concurrent users constantly hitting the API. After some waiting I eventually hit `Connection reset by peer` errors.

I did some initial investigation into what is happening. Here's a `tcpdump` for one of these errors, which I tried to correlate with some events in the code. It always happens around the time when `max-requests` is reached and the worker is shutting down. It seems that in certain cases the worker doesn't shut down gracefully despite data having just arrived in the TCP stack.
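For reference, a stress client of the kind described above ("many concurrent users constantly hitting the API") can be sketched as follows. This is a hypothetical script, not the author's load test, and the host, port, request path, and concurrency level are assumptions:

```python
import asyncio

HOST, PORT, CONCURRENCY = "127.0.0.1", 8000, 50  # assumed target and load level


async def hammer() -> None:
    while True:
        reader, writer = await asyncio.open_connection(HOST, PORT)
        writer.write(
            b"GET / HTTP/1.1\r\nHost: " + HOST.encode() + b"\r\nConnection: close\r\n\r\n"
        )
        await writer.drain()
        data = await reader.read()  # raises ConnectionResetError on a TCP RST
        if not data.startswith(b"HTTP/1.1 200"):
            raise RuntimeError(f"unexpected response: {data[:60]!r}")
        writer.close()
        await writer.wait_closed()


async def main() -> None:
    # Many concurrent clients; the run stops on the first reset or bad response.
    await asyncio.gather(*(hammer() for _ in range(CONCURRENCY)))


if __name__ == "__main__":
    asyncio.run(main())
```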
After some deep-diving I noticed that whenever the error is triggered, `self.cycle` is `None` in `HttpToolsProtocol.shutdown()`, and if I am correct the reverse is true as well. It seems to me that adding a block into `httptools_impl.py` or `h11_impl.py` solves the issue, but I am not really sure what this means.
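For context, the graceful-shutdown hook in uvicorn's HTTP protocol implementations is, at the time of writing, roughly of the shape sketched below (paraphrased, not copied from a specific release; check your installed version). It shows why a `None` cycle matters: with no parsed request, the transport is closed immediately, and closing a socket that still has unread bytes in its kernel receive buffer makes the kernel send a TCP RST instead of a normal FIN, which matches the captures described in this thread.

```python
# Paraphrased shape of HttpToolsProtocol.shutdown() / H11Protocol.shutdown();
# illustrative only, details differ between uvicorn versions.
def shutdown(self) -> None:
    if self.cycle is None or self.cycle.response_complete:
        # No request has been parsed (or the last response finished):
        # close right away. If request bytes are already queued but not
        # yet parsed, this is where the reset described above originates.
        self.transport.close()
    else:
        # A response is still being produced: let it finish, but drop
        # keep-alive so the connection closes afterwards.
        self.cycle.keep_alive = False
```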