Possible race condition leading to a connection reset if worker is gracefully terminating #2315
Replies: 10 comments 5 replies
-
Hi, we are seeing the same(?) problem: when a worker restarts due to max-requests, sometimes a request gets lost. In these cases, an RST/ACK can be observed. Setup:
Our app is an API server with async FastAPI endpoints. It receives relatively large requests (say 2-15 KB), and I think the larger requests have a better chance of triggering the race. While trying to repro this, I also started seeing a new error message in the error log.
-
Oh yeah, it's not very rare for us. With `max_requests = 10000` and 4 workers, we hit this every few hours :-)
-
Repro code, `asgi_sample.py`:

```python
async def app(scope, receive, send):
    headers = [(b"content-type", b"text/html")]
    body = b"<html>hi!<br>"
    await send({"type": "http.response.start", "status": 200, "headers": headers})
    await send({"type": "http.response.body", "body": body})
```

gunicorn invocation:
curl script (`repro_vpt.sh`):

```bash
#!/bin/bash
REMOTE='192.168.103.39:9222'
echo 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' | \
curl -XPUT 'http://'$REMOTE'/404/404/404/404/404/404/404' \
  -H 'user-agent: python-requests/2.31.0' \
  -H 'sentry-trace: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' \
  -H 'baggage: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE' \
  -H 'x-request-id: FFFFFFFFFFFFFFFFFFFF' \
  -H 'accept-encoding: gzip, deflate' \
  -H 'accept: */*' \
  -H 'content-type: application/json' \
  --data-binary @-
echo Running
```

Note that `repro_vpt.sh` has the IP address of the host running gunicorn in the curl command line. Having the "useless data" in the curl call seems to help with reproducing, but it is not completely necessary.
-
Having the same issue with gunicorn + uvicorn + Django 4. My workers handle a lot of requests, so at peak RPM with `--max-requests=8000` (25 containers, 1 worker each, round-robin load balancing) it is noticeable; I even get it randomly from time to time (it shows up in Cloudflare as unknown error code 502). Without `--max-requests` it works fine, except that I'm using it to keep memory leaks from taking down my workers. It doesn't seem related to request duration, headers, or content-length; it's just random. A corresponding error shows up in the logs when max-requests+1 is reached and the issue is triggered.
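For concreteness, a deployment like this can be captured in a gunicorn config file. The following is only a hypothetical sketch: `workers`, `worker_class`, and `max_requests` reflect the comment above, while the file name, bind address, and jitter value are assumptions.

```python
# gunicorn.conf.py -- hypothetical sketch of the deployment described above
bind = "0.0.0.0:8000"                           # assumed listen address
workers = 1                                     # one uvicorn worker per container
worker_class = "uvicorn.workers.UvicornWorker"  # run the Django 4 ASGI app under uvicorn
max_requests = 8000                             # recycle the worker to contain memory leaks
max_requests_jitter = 500                       # optional: stagger restarts across containers
```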
-
Same issue here.
-
@rbagd what if you use async endpoints instead of sync endpoints/routes?
-
@Kludex can you please have a look at this issue?
-
Hello! I too am observing Uvicorn sending a TCP RST without ever responding to an accepted request. It's definitely not rare; my MRE can reproduce it within a couple of seconds. There seems to be a race condition between the moment a connection is accepted / data is received by the worker and the moment the parser in `H11Protocol` or `HttpToolsProtocol` gets to handle it. This is tied to worker shutdown, and is easiest to reproduce with a low max-requests limit. The captured traffic looks similar to the OP's.

This, for example, causes nginx to mark the upstream connection as 'down' because of the resulting exception.

But the ACK from the service (Uvicorn) is a clear sign that the client is indeed waiting for an HTTP response, and that the connection should not be closed.

**MRE:** I created a repo with this example for ease of setup and testing: https://github.com/JacKeTUs/uvicorn-rst-mre

Launched with:

Here we're trying to send as many valid requests as fast as possible, and eventually we send a request at the exact moment the worker exits (at the end of a tick).

**Expected result:** The MRE should run indefinitely without any errors. All sent requests should return a valid response (HTTP 200 in this case).

**Observed result:** The MRE exits with an exception.

**Possible solution:** This can be bypassed easily by moving one of the shutdown steps, but the proper fix is probably to introduce some kind of flag or event like 'connection_made' inside the protocol classes, and to have the shutdown routine check whether that flag (connection has been made) is set instead of relying on the current check. With both solutions I was able to run my example for a substantial amount of time without any issues and without any unexpected connection resets. I would love to create the corresponding PRs if one of these solutions is acceptable 🙏 @Kludex, this bug negatively affects all users who recycle workers with a max-requests limit.
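A rough sketch of the flag idea, written against plain `asyncio.Protocol` rather than uvicorn's real protocol classes; every name here is illustrative, and it is only meant to show the shape of the check, not a drop-in fix:

```python
import asyncio
from typing import Optional


class SketchHttpProtocol(asyncio.Protocol):
    """Illustrative only: tracks enough state for shutdown() to decide safely."""

    def __init__(self) -> None:
        self.transport: Optional[asyncio.BaseTransport] = None
        self.connection_made_flag = False  # the proposed flag/event
        self.response_complete = False     # set once a response has been written
        self.keep_alive = True

    def connection_made(self, transport: asyncio.BaseTransport) -> None:
        # The socket has been accepted; request bytes may already be queued
        # in the kernel buffer even though nothing has been parsed yet.
        self.transport = transport
        self.connection_made_flag = True

    def data_received(self, data: bytes) -> None:
        # ... parse the request, run the app, write the response ...
        self.response_complete = True

    def shutdown(self) -> None:
        # Consult the proposed flag instead of only checking whether a
        # request-cycle object exists: a connection that has been made but
        # not yet answered is kept open (merely marked non-keep-alive) so
        # its pending request can still be served during the drain phase;
        # everything else is closed outright.
        if self.connection_made_flag and not self.response_complete:
            self.keep_alive = False
            return
        if self.transport is not None:
            self.transport.close()
```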
-
@JacKeTUs, you can remove the code block below, as the connections will be closed by `_wait_tasks_to_complete()`.
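In case it helps readers follow that argument, here is a generic sketch of such a "wait for in-flight work, then close what's left" shutdown phase. It is not uvicorn's actual `_wait_tasks_to_complete()` implementation, just an illustration of the role the comment attributes to it:

```python
import asyncio


async def wait_tasks_to_complete(
    tasks: set,          # in-flight request-handler tasks
    connections: set,    # objects exposing a .transport attribute
    timeout: float = 10.0,
) -> None:
    # Give in-flight request handlers a chance to finish...
    if tasks:
        await asyncio.wait(tasks, timeout=timeout)
    # ...then close whatever connections are still open.
    for connection in list(connections):
        connection.transport.close()
```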
-
@Kludex
As the socket is closed first, how can the ongoing response be sent over the closed socket?
-
We have encountered a relatively rare connection error which is probably due to a race condition while a `uvicorn` worker is trying to shut down.

Here is the setup:

- `uvicorn` worker
- `gunicorn` with `--max-requests` for regularly restarting workers

I can reproduce it with Python 3.11, both with `uvloop` and with `asyncio`, but couldn't reproduce it with `asyncio` and Python 3.12.

To reproduce, I launch the app below and then stress test the application with many concurrent users constantly hitting the API. After some waiting I eventually hit `Connection reset by peer` errors.

I did some initial investigation into what is happening. Here's a `tcpdump` for one of these errors, which I tried to correlate with some events in the code. It always happens around the time when `max-requests` is reached and the worker is shutting down. It seems that in certain cases the worker doesn't shut down gracefully despite data having just arrived in the TCP stack.
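For reference, a stress client of the kind described above ("many concurrent users constantly hitting the API") can be sketched as follows. This is a hypothetical script, not the author's load test, and the host, port, request path, and concurrency level are assumptions:

```python
import asyncio

HOST, PORT, CONCURRENCY = "127.0.0.1", 8000, 50  # assumed target and load level


async def hammer() -> None:
    while True:
        reader, writer = await asyncio.open_connection(HOST, PORT)
        writer.write(
            b"GET / HTTP/1.1\r\nHost: " + HOST.encode() + b"\r\nConnection: close\r\n\r\n"
        )
        await writer.drain()
        data = await reader.read()  # raises ConnectionResetError on a TCP RST
        if not data.startswith(b"HTTP/1.1 200"):
            raise RuntimeError(f"unexpected response: {data[:60]!r}")
        writer.close()
        await writer.wait_closed()


async def main() -> None:
    # Many concurrent clients; the run stops on the first reset or bad response.
    await asyncio.gather(*(hammer() for _ in range(CONCURRENCY)))


if __name__ == "__main__":
    asyncio.run(main())
```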
After some deep-diving I noticed that whenever the error is triggered, `self.cycle` is `None` in `HttpToolsProtocol.shutdown()`, and if I am correct the reverse is true as well. It seems to me that adding a block into `httptools_impl.py` or `h11_impl.py` solves the issue, but I am not really sure what this means.
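For context, the graceful-shutdown hook in uvicorn's HTTP protocol implementations is, at the time of writing, roughly of the shape sketched below (paraphrased, not copied from a specific release; check your installed version). It shows why a `None` cycle matters: with no parsed request, the transport is closed immediately, and closing a socket that still has unread bytes in its kernel receive buffer makes the kernel send a TCP RST instead of a normal FIN, which matches the captures described in this thread.

```python
# Paraphrased shape of HttpToolsProtocol.shutdown() / H11Protocol.shutdown();
# illustrative only, details differ between uvicorn versions.
def shutdown(self) -> None:
    if self.cycle is None or self.cycle.response_complete:
        # No request has been parsed (or the last response finished):
        # close right away. If request bytes are already queued but not
        # yet parsed, this is where the reset described above originates.
        self.transport.close()
    else:
        # A response is still being produced: let it finish, but drop
        # keep-alive so the connection closes afterwards.
        self.cycle.keep_alive = False
```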