feat: drain active SSE streams before agent shutdown #3900

Open
0vertake wants to merge 3 commits into superplanehq:main from 0vertake:feat/agent-graceful-shutdown

Conversation

Collaborator

@0vertake 0vertake commented Apr 1, 2026

Summary

  • Track in-flight SSE streams with ActiveStreamTracker so the agent waits for them to finish on SIGTERM instead of killing them mid-response
  • Reject new requests with 503 during shutdown
  • Set up timeout layering: drain (300s) < uvicorn (310s) < Docker/K8s (330s), each independently configurable
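The timeout layering in the last bullet can be sketched as a small validated config. This is an illustration only: the 300/310/330 defaults and the GRACEFUL_SHUTDOWN_TIMEOUT variable come from this PR, while `ShutdownTimeouts`, `validate()` and `drain_timeout_from_env()` are hypothetical names, and the fallback behaviour for bad values is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping

_DEFAULT_DRAIN_TIMEOUT = 300.0  # seconds, default stated in the PR


@dataclass(frozen=True)
class ShutdownTimeouts:
    """Hypothetical holder for the three layered budgets, in seconds."""

    drain: float = _DEFAULT_DRAIN_TIMEOUT  # budget for wait_for_drain()
    uvicorn: float = 310.0                 # server-level graceful shutdown budget
    orchestrator: float = 330.0            # Docker/K8s grace period before SIGKILL

    def validate(self) -> None:
        # Each inner layer must finish before the outer one gives up on it.
        if not (0 < self.drain < self.uvicorn < self.orchestrator):
            raise ValueError(
                "expected drain < uvicorn < orchestrator, got "
                f"{self.drain} / {self.uvicorn} / {self.orchestrator}"
            )


def drain_timeout_from_env(environ: Mapping[str, str] = os.environ) -> float:
    """Read GRACEFUL_SHUTDOWN_TIMEOUT; fall back to the default on bad input (assumed)."""
    raw = environ.get("GRACEFUL_SHUTDOWN_TIMEOUT")
    if raw is None:
        return _DEFAULT_DRAIN_TIMEOUT
    try:
        value = float(raw)
    except ValueError:
        return _DEFAULT_DRAIN_TIMEOUT
    # `not value > 0` also rejects NaN, which compares False to everything.
    return value if value > 0 else _DEFAULT_DRAIN_TIMEOUT
```

Calling `ShutdownTimeouts(drain=drain_timeout_from_env()).validate()` at startup would surface a misordered configuration early, rather than as a stream cut off by the outer layer.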

How it works

  1. K8s sends SIGTERM and stops routing new traffic to the pod
  2. FastAPI lifespan teardown calls begin_shutdown() (new requests get 503) then wait_for_drain() (blocks until active streams finish or DRAIN_TIMEOUT expires)
  3. Once drained, gRPC server and session store are cleaned up and uvicorn exits
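The ordering in the steps above can be sketched as a lifespan factory. It uses only the standard library; `make_lifespan`, `DrainableTracker` and the `cleanup` callback are hypothetical names, and the real teardown in this PR may be structured differently. FastAPI accepts such a callable through its `lifespan` argument.

```python
import contextlib
from typing import AsyncIterator, Awaitable, Callable, Protocol


class DrainableTracker(Protocol):
    """The two tracker methods the teardown relies on (names from the PR)."""

    def begin_shutdown(self) -> None: ...

    async def wait_for_drain(self, timeout: float) -> bool: ...


def make_lifespan(
    tracker: DrainableTracker,
    cleanup: Callable[[], Awaitable[None]],
    drain_timeout: float = 300.0,
):
    """Build a lifespan context manager that drains before cleaning up."""

    @contextlib.asynccontextmanager
    async def lifespan(app: object) -> AsyncIterator[None]:
        yield  # the application serves requests while suspended here
        # Teardown starts once the server begins shutting down (after SIGTERM).
        tracker.begin_shutdown()  # from now on new requests get 503
        drained = await tracker.wait_for_drain(drain_timeout)
        if not drained:
            print("drain timeout expired with streams still active")
        # Only now release shared resources (gRPC server, session store).
        await cleanup()

    return lifespan
```

The point of the ordering is that `cleanup()` never runs while a stream that depends on those resources is still being served, unless the drain budget has been exhausted.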

Track in-flight SSE streams with an ActiveStreamTracker so the agent
service waits for them to finish when SIGTERM is received, instead of
killing them mid-response. New requests during shutdown are rejected
with HTTP 503. Configurable drain timeout via GRACEFUL_SHUTDOWN_TIMEOUT
env var (default 300s).

Signed-off-by: Milos Jovanovic <milosjovanovic519@gmail.com>
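
A minimal sketch of what such a tracker could look like, assuming everything runs on a single asyncio event loop. The method names acquire(), release(), begin_shutdown() and wait_for_drain() appear in this PR; the bodies, the boolean return from acquire() and the `active` property are assumptions, not the code under review.

```python
import asyncio


class ActiveStreamTracker:
    """Counts in-flight streams so shutdown can wait for them (illustrative)."""

    def __init__(self) -> None:
        self._active = 0
        self._shutting_down = False
        self._drained = asyncio.Event()
        self._drained.set()  # nothing is in flight at start

    @property
    def active(self) -> int:
        return self._active

    def acquire(self) -> bool:
        """Register a stream; False means shutdown has begun (caller returns 503)."""
        if self._shutting_down:
            return False
        self._active += 1
        self._drained.clear()
        return True

    def release(self) -> None:
        """Mark one stream as finished and wake the drain waiter if it was the last."""
        if self._active == 0:
            raise RuntimeError("release() called without a matching acquire()")
        self._active -= 1
        if self._active == 0:
            self._drained.set()

    def begin_shutdown(self) -> None:
        self._shutting_down = True

    async def wait_for_drain(self, timeout: float) -> bool:
        """Return True if all streams finished, False if the timeout expired."""
        try:
            await asyncio.wait_for(self._drained.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False
```

An `asyncio.Event` is used here instead of polling the counter so that the drain returns as soon as the last stream releases. The tracker should be constructed while the event loop is running (for example during application startup), since older Python versions bind the event to a loop at construction time.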
@superplanehq-integration

👋 Commands for maintainers:

  • /sp start - Start an ephemeral machine (takes ~30s)
  • /sp stop - Stop a running machine (auto-executed on pr close)

@0vertake 0vertake requested a review from lucaspin April 1, 2026 18:58
Move tracker.acquire() into the request handler so the active count is
visible to wait_for_drain() before the async generator starts lazily.
Prevents a race where drain completes and closes resources while a
response is still being set up.

Signed-off-by: Milos Jovanovic <milosjovanovic519@gmail.com>
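
The race described in this commit comes from async generators being lazy: nothing in the generator body runs until the first `__anext__`. The sketch below illustrates that with a plain counter standing in for the tracker; `Counter`, `handler_lazy` and `handler_eager` are hypothetical names, not code from this PR.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class Counter:
    """Stand-in for the tracker's active-stream count."""

    active: int = 0


async def stream_lazy(counter: Counter) -> AsyncIterator[str]:
    counter.active += 1  # runs only when the first item is requested
    try:
        yield "data: hello\n\n"
    finally:
        counter.active -= 1


def handler_lazy(counter: Counter) -> AsyncIterator[str]:
    # The count is still 0 when this returns, so a concurrent drain
    # could conclude that nothing is in flight.
    return stream_lazy(counter)


async def stream_eager(counter: Counter) -> AsyncIterator[str]:
    try:
        yield "data: hello\n\n"
    finally:
        counter.active -= 1  # release still happens when the stream ends


def handler_eager(counter: Counter) -> AsyncIterator[str]:
    counter.active += 1  # visible to the drain before the handler returns
    return stream_eager(counter)


async def consume(gen: AsyncIterator[str]) -> list:
    return [chunk async for chunk in gen]
```

With the eager variant the count is already 1 when the handler returns. The trade-off is that the release still lives in the generator's `finally`, which only runs if the generator is actually iterated.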
Contributor

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

If the generator is never iterated (e.g. ASGI cancellation between
handler return and first __anext__), release() in the generator's
finally block never runs. Wrap the acquire-to-return gap so release()
is called on any failure.

Signed-off-by: Milos Jovanovic <milosjovanovic519@gmail.com>
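
One way to guard the acquire-to-return gap is sketched below. It only covers failures raised inside the handler before the response object is returned; `start_stream`, `ShuttingDown`, `StreamTracker` and the boolean return from acquire() are hypothetical and may not match the actual change.

```python
from typing import Callable, Protocol, TypeVar

T = TypeVar("T")


class StreamTracker(Protocol):
    """The two tracker methods this helper needs (names from the PR)."""

    def acquire(self) -> bool: ...

    def release(self) -> None: ...


class ShuttingDown(Exception):
    """Hypothetical signal that the caller would translate into HTTP 503."""


def start_stream(tracker: StreamTracker, build_response: Callable[[], T]) -> T:
    """Acquire a slot, then build the response, releasing the slot on any failure."""
    if not tracker.acquire():
        raise ShuttingDown()
    try:
        # On success, ownership of the slot passes to the response's
        # generator, whose finally block is expected to call release().
        return build_response()
    except BaseException:
        # BaseException so that cancellation is covered as well as
        # ordinary errors; the slot must not leak either way.
        tracker.release()
        raise
```

Without the `except` branch, a failure while setting up the response would leave the count above zero, and wait_for_drain() would then block until the drain timeout expires.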
return _DEFAULT_DRAIN_TIMEOUT


class ActiveStreamTracker:

Contributor

This should be in a separate file
