
Unexpected Cancellation propagation when using sse-starlette, fastapi, uvloop, and ...gevent? #10550

antarus12345 opened this issue on Sep 6, 2024 · 6 comments

Summary of problem

TL;DR:

  • When my application is run under ddtrace-run, my SSE endpoints (using sse-starlette) see what look to me like spurious propagations of CancelledError. This mostly results in noisy logs (which I pay to ingest into Datadog) and noisy APM, since the CancelledErrors are reported as errors.
  • When my application is not run under ddtrace-run, I see no errors.
  • When I remove gevent from my venv, I see no errors.
  • I had expected the errors to go away if I ran with DD_TRACE_GEVENT_ENABLED=false, but this did not help; the errors persisted as long as gevent was installed.

To spare you the travesty of seeing the giant pile of spaghetti that is my production application, I have put together a smaller reproduction case that I hope helps.

I'd love to keep gevent installed and either have this bug fixed or just not instrument gevent in this application; I'm not intending to use gevent at all (I'm supposed to be using asyncio / anyio with FastAPI).

Which version of dd-trace-py are you using?

ddtrace-2.11.3

Which version of pip are you using?

23.0.1

Which libraries and their versions are you using?

`pip freeze`:

```
annotated-types==0.7.0
anyio==3.7.1
bytecode==0.15.1
certifi==2024.8.30
click==8.1.7
ddtrace==2.11.3
Deprecated==1.2.14
distro==1.9.0
envier==0.5.2
exceptiongroup==1.2.2
fastapi==0.104.1
gevent==24.2.1
greenlet==3.0.3
gunicorn==21.2.0
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.2
idna==3.8
importlib_metadata==8.4.0
openai==1.1.1
opentelemetry-api==1.27.0
packaging==24.1
protobuf==5.28.0
pydantic==2.9.0
pydantic_core==2.23.2
python-dotenv==1.0.1
PyYAML==6.0.2
sniffio==1.3.1
sse-starlette==1.6.5
starlette==0.27.0
tqdm==4.66.5
typing_extensions==4.12.2
tzdata==2024.1
uv==0.2.24
uvicorn==0.24.0.post1
uvloop==0.20.0
watchfiles==0.24.0
websockets==13.0.1
wrapt==1.16.0
xmltodict==0.13.0
zipp==3.20.1
zope.event==5.0
zope.interface==7.0.3
```

How can we reproduce your problem?

I am on Python 3.10. I created a venv and installed the packages listed above.

My reproducer is:

(small_repro) (base) alec@alec-dev:~/small_repro$ cat dd_test.py 
from fastapi import FastAPI
import uvicorn
from sse_starlette import EventSourceResponse
import os


app = FastAPI()


@app.get('/foo')
def foo():
    def event_gen():
        for x in range(1,100):
            yield str(x)
    return EventSourceResponse(event_gen())

if __name__ == '__main__':
    uvicorn.run(app, host='0.0.0.0', port=int(os.getenv('PORT', 9005)))

What is the result that you get?

I open two terminals. In the first I run my reproducer:

ddtrace-run python dd_test.py

In the second I curl the foo endpoint and throw the result away (the output doesn't matter):

curl localhost:9005/foo &>/dev/null

However, this triggers some odd cancellation exception:

(small_repro) (base) alec@alec-dev:~/small_repro$ ddtrace-run python dd_test.py
INFO:     Started server process [1981883]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:9005 (Press CTRL+C to quit)
INFO:     127.0.0.1:49796 - "GET /foo HTTP/1.1" 200 OK
Exception in callback <function TaskGroup._spawn.<locals>.task_done at 0x7dc79eca2170>
handle: <Handle TaskGroup._spawn.<locals>.task_done>
Traceback (most recent call last):
  File "/home/alec/small_repro/lib/python3.10/site-packages/sse_starlette/sse.py", line 236, in wrap
    await func()
  File "/home/alec/small_repro/lib/python3.10/site-packages/sse_starlette/sse.py", line 281, in _ping
    await anyio.sleep(self._ping_interval)
  File "/home/alec/small_repro/lib/python3.10/site-packages/anyio/_core/_eventloop.py", line 81, in sleep
    return await get_asynclib().sleep(delay)
  File "/home/alec/miniconda3/envs/py3.10/lib/python3.10/asyncio/tasks.py", line 605, in sleep
    return await future
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/alec/small_repro/lib/python3.10/site-packages/sse_starlette/sse.py", line 236, in wrap
    await func()
  File "/home/alec/small_repro/lib/python3.10/site-packages/sse_starlette/sse.py", line 211, in listen_for_exit_signal
    await AppStatus.should_exit_event.wait()
  File "/home/alec/small_repro/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 1778, in wait
    if await self._event.wait():
  File "/home/alec/miniconda3/envs/py3.10/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/home/alec/small_repro/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 668, in task_done
    exc = _task.exception()
asyncio.exceptions.CancelledError
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/alec/small_repro/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/alec/small_repro/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/alec/small_repro/lib/python3.10/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/home/alec/small_repro/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/alec/small_repro/lib/python3.10/site-packages/ddtrace/contrib/asgi/middleware.py", line 291, in __call__
    return await self.app(scope, receive, wrapped_send)
  File "/home/alec/small_repro/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/alec/small_repro/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/alec/small_repro/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/alec/small_repro/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/alec/small_repro/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/alec/small_repro/lib/python3.10/site-packages/starlette/routing.py", line 69, in app
    await response(scope, receive, send)
  File "/home/alec/small_repro/lib/python3.10/site-packages/sse_starlette/sse.py", line 247, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/home/alec/small_repro/lib/python3.10/site-packages/sse_starlette/sse.py", line 236, in wrap
    await func()
  File "/home/alec/small_repro/lib/python3.10/site-packages/sse_starlette/sse.py", line 191, in listen_for_disconnect
    message = await receive()
  File "/home/alec/small_repro/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 587, in receive
    await self.message_event.wait()
  File "/home/alec/miniconda3/envs/py3.10/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError

Some additional context:

sse-starlette uses an anyio task group in its __call__ method to terminate event streaming when any one of several things happens (client disconnect, application shutdown, the ping loop failing, or the event stream completing). You can see this in the sse_starlette/sse.py frames of the traceback above.

Cancellation within this task group is expected. However, something the tracer does when gevent is installed causes the cancellation to propagate as an error. I have not yet waded through the gevent internals.
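
For context, the pattern looks roughly like the sketch below (my paraphrase based on the sse_starlette frames in the traceback, not the library's exact code). Several coroutines (the stream writer, the disconnect listener, the exit-signal listener, and the ping loop) run in a single anyio task group, and whichever finishes first cancels the rest, so CancelledError is raised inside the group on every normal stream completion:

```python
# Paraphrased sketch of the sse-starlette pattern (not the library's exact code).
# The first coroutine to return cancels its siblings through the task group's
# cancel scope, so the remaining tasks see asyncio.CancelledError.
import anyio


async def run_sse_response(stream_response, listen_for_disconnect, listen_for_exit, ping):
    async with anyio.create_task_group() as task_group:

        async def wrap(func):
            await func()
            # Whichever task gets here first tears down the whole group.
            task_group.cancel_scope.cancel()

        task_group.start_soon(wrap, listen_for_disconnect)
        task_group.start_soon(wrap, listen_for_exit)
        task_group.start_soon(wrap, ping)
        await wrap(stream_response)
```

Normally those cancellations are absorbed when the task group exits; with gevent installed and ddtrace's sitecustomize active, they instead escape as the unhandled errors shown above.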

What is the result that you expected?

(small_repro) (me):~/small_repro$ ddtrace-run python dd_test.py
INFO: Started server process [1981883]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:9005 (Press CTRL+C to quit)
INFO: 127.0.0.1:49796 - "GET /foo HTTP/1.1" 200 OK

antarus12345 commented:

Additional weirdness:

`DD_TRACE_ENABLED=false ddtrace-run python dd_test.py` also exhibits this failure, so I don't think it's even because of patch_all() as I expected; instead it seems to be some of the gevent handling that happens in sitecustomize?

antarus12345 commented:

`DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE=no ddtrace-run python dd_test.py` does not exhibit the bug.

antarus12345 commented Sep 6, 2024

Ah I see now:

As part of module cleanup in sitecustomize, cleanup is automatically enabled if the gevent module is installed. The bug appears to be that this module cleanup triggers the problem (presumably by fussing with asyncio or _asyncio).
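
A minimal sketch of the mechanism I suspect (my own illustration, not ddtrace's actual cleanup code): if the asyncio package is dropped from sys.modules and imported again, the second import re-executes the package and produces brand-new class objects, so code holding references to the first copy no longer matches objects from the second copy.

```python
# My own illustration of the suspected failure mode (not ddtrace's cleanup code):
# unloading asyncio from sys.modules and importing it again yields distinct
# class objects, so identity and except-clause checks across the two copies fail.
import sys
import asyncio

original_cancelled_error = asyncio.CancelledError

# Simulate a sitecustomize-style unload of the asyncio package.
for name in [m for m in list(sys.modules) if m == "asyncio" or m.startswith("asyncio.")]:
    del sys.modules[name]

import asyncio as asyncio_reloaded  # fresh import re-executes the package

# The two CancelledError classes are now distinct objects, so an
# `except CancelledError:` written against one copy does not catch instances
# raised with the other. Expected output: False.
print(original_cancelled_error is asyncio_reloaded.CancelledError)
```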

So some Qs:

  • I presume that, since I'm not actually using gevent, it's safe to turn off DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE for this application?
  • I assume this is a heuristic (e.g. if gevent is installed, schedule module cleanup to fix up gevent). I'm curious why this cannot be tied to whether gevent was actually loaded, as opposed to merely installed (see the sketch after this list).
  • I also presume this is partly sloppiness on my end: we have more or less one giant requirements file for a bunch of apps (which is why gevent is installed in this venv even though we are not using it). That is probably bad practice and likely contributes to your pain of making ddtrace-run work sanely in a variety of environments.
  • I'm also just curious whether / how the underlying bug can be fixed:
  • Unloading asyncio and then loading it again appears to break the reproducer app. How much do we care? Presuming most apps don't unload modules, and most apps use either gevent or asyncio (but not both), maybe we care very little.
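
On the "loaded vs. installed" point above, here is a minimal sketch of the distinction I mean (my own illustration; the helper names are hypothetical, not ddtrace's code): "installed" can be detected via importlib without importing anything, while "loaded" is simply membership in sys.modules.

```python
# Hypothetical helpers illustrating "installed" vs. "loaded" (not ddtrace's code).
import importlib.util
import sys


def gevent_installed() -> bool:
    # True if gevent is importable from this environment (present in the venv).
    return importlib.util.find_spec("gevent") is not None


def gevent_loaded() -> bool:
    # True only if something has actually imported gevent in this process.
    return "gevent" in sys.modules


if __name__ == "__main__":
    print("installed:", gevent_installed())
    print("loaded:", gevent_loaded())
```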

antarus12345 commented:

And of course, if I just edit sitecustomize.py (for fun) to remove gevent from MODULE_REQUIRES_CLEANUP, it also works fine. Cool.


victorct-tradeengage commented Sep 7, 2024

Same issue here. We have gevent installed, which we use with Celery for the worker pool; the codebase is shared between FastAPI and Celery.

luciidlou commented:

Same issue here with our FastAPI application. Removing gevent from our dependencies solved the issue.
