Shut down the multiprocess supervisor when a worker dies before startup#2987
Closed
Kludex wants to merge 3 commits into
Closed
Shut down the multiprocess supervisor when a worker dies before startup#2987Kludex wants to merge 3 commits into
Kludex wants to merge 3 commits into
Conversation
Contributor
|
📖 Docs preview: https://542b1075-uvicorn.marcelotryle.workers.dev |
This was referenced Jun 11, 2026
There was a problem hiding this comment.
1 issue found across 4 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="uvicorn/supervisors/multiprocess.py">
<violation number="1" location="uvicorn/supervisors/multiprocess.py:43">
P1: Readiness is only learned from a successful healthcheck round-trip, so a worker that finishes startup and then dies before the first ping is still treated as "not ready" and shuts down the whole supervisor instead of being restarted.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
| if self.parent_conn.poll(timeout): | ||
| self.parent_conn.recv() | ||
| started: bool = self.parent_conn.recv() | ||
| self.ready = self.ready or started |
There was a problem hiding this comment.
P1: Readiness is only learned from a successful healthcheck round-trip, so a worker that finishes startup and then dies before the first ping is still treated as "not ready" and shuts down the whole supervisor instead of being restarted.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At uvicorn/supervisors/multiprocess.py, line 43:
<comment>Readiness is only learned from a successful healthcheck round-trip, so a worker that finishes startup and then dies before the first ping is still treated as "not ready" and shuts down the whole supervisor instead of being restarted.</comment>
<file context>
@@ -30,22 +30,25 @@ def __init__(
if self.parent_conn.poll(timeout):
- self.parent_conn.recv()
+ started: bool = self.parent_conn.recv()
+ self.ready = self.ready or started
return True
return False
</file context>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
First step of the worker lifecycle cleanup discussed in #2980. The supervisor previously could not distinguish a worker that failed to boot from one that died mid-run, so its only policy was "restart on death" - which turns any startup failure into an infinite restart loop. This PR gives the supervisor that distinction without adding any new API or channel:
Serverin the child:Multiprocess(config, sockets)no longer takes atarget. The parent no longer pickles aserver.runbound method into children, which is the seam the follow-up PR (Skip importing the app in the parent process when using subprocesses #2988) uses to stop importing the app in the parent entirely (Severe memory regression in the supervisor process when workers > 1 (0.47.0+) #2980).Server.startedinstead of a fixedb"pong". The parent caches it, so the supervisor policy is one rule: a worker that dies before it was ever seen started is a startup failure - terminate everything and exit withSTARTUP_FAILURE(moved touvicorn.server, re-exported fromuvicorn.main); a worker that dies after is restarted as before. No second pipe, no callback, noServerAPI change - the supervisor consumes thestartedflag theServeralready exposes.This fixes a real restart-loop today: a lifespan startup failure with
--workersmakes the worker exit cleanly, so the supervisor respawned it forever. Now:Parent-side eager app loading is intentionally untouched here, so #2440 fail-fast behavior is unchanged. Removing it (the actual #2980 memory fix) is #2988, stacked on this.
How objects are linked today vs where this is going
Today (
main): the parent constructs theServerand imports the app, then ships a pickledserver.runbound method into each worker, which imports the app again inside its running event loop. The only worker→supervisor signal is liveness, so every death looks the same.flowchart TB subgraph P["parent process"] run["uvicorn.run()"] cfg["Config"] srv["Server"] mp["Multiprocess"] run -->|"load_app() - full app import,<br/>this copy is never served (#2980)"| cfg run --> srv run -->|"target=server.run"| mp end subgraph W["worker process"] tgt["Process.target()"] wsrv["server.run()"] app["app import - again,<br/>inside the running event loop (#941)"] tgt --> wsrv wsrv --> app end srv -.->|"pickled bound method"| tgt mp -->|spawn| tgt W -.->|"ping/pong: alive or not,<br/>nothing else"| mp W -->|"any death → restart<br/>(startup failure = infinite loop)"| mpTarget: the parent holds only
Configand sockets; the worker owns itsServerand the single app import. The healthcheck reply reports lifecycle (started), not just liveness, so the supervisor can apply a real policy.flowchart TB subgraph P2["parent process"] run2["uvicorn.run()"] cfg2["Config - just data"] mp2["Multiprocess(config, sockets)"] run2 --> cfg2 run2 --> mp2 end subgraph W2["worker process"] tgt2["Process.target()"] srv2["Server(config) - built in the worker"] app2["app import - once, in the worker"] tgt2 --> srv2 srv2 --> app2 end mp2 -->|"spawn: config + sockets"| tgt2 W2 -.->|"ping → pong carries server.started"| mp2 mp2 -->|"died before started → shut down, exit 3<br/>died after started → restart"| W2This PR implements the target diagram except for one edge: the parent's eager
load_app()stays for now, so #2440 fail-fast behavior is unchanged while this lands. Removing it (the actual #2980 memory fix) is #2988.Notes
started, is classified as a startup failure and shuts the server down. Arguably the right call for a crash-on-start, and far better than the silent infinite restart loop it replaces.MultiprocessandProcessconstructor signatures changed (droppedtarget). They are not documented public API.Serveris untouched.AI Disclaimer
This PR was developed with the assistance of either Claude or Codex. I've reviewed and verified the changes.