Skip to content

Shut down the multiprocess supervisor when a worker dies before startup#2987

Closed
Kludex wants to merge 3 commits into
mainfrom
worker-readiness
Closed

Shut down the multiprocess supervisor when a worker dies before startup#2987
Kludex wants to merge 3 commits into
mainfrom
worker-readiness

Conversation

@Kludex

@Kludex Kludex commented Jun 10, 2026

Copy link
Copy Markdown
Owner

Summary

First step of the worker lifecycle cleanup discussed in #2980. The supervisor previously could not distinguish a worker that failed to boot from one that died mid-run, so its only policy was "restart on death" - which turns any startup failure into an infinite restart loop. This PR gives the supervisor that distinction without adding any new API or channel:

  • The worker now builds its own Server in the child: Multiprocess(config, sockets) no longer takes a target. The parent no longer pickles a server.run bound method into children, which is the seam the follow-up PR (Skip importing the app in the parent process when using subprocesses #2988) uses to stop importing the app in the parent entirely (Severe memory regression in the supervisor process when workers > 1 (0.47.0+) #2980).
  • The existing healthcheck pong now carries Server.started instead of a fixed b"pong". The parent caches it, so the supervisor policy is one rule: a worker that dies before it was ever seen started is a startup failure - terminate everything and exit with STARTUP_FAILURE (moved to uvicorn.server, re-exported from uvicorn.main); a worker that dies after is restarted as before. No second pipe, no callback, no Server API change - the supervisor consumes the started flag the Server already exposes.

This fixes a real restart-loop today: a lifespan startup failure with --workers makes the worker exit cleanly, so the supervisor respawned it forever. Now:

ERROR:    Application startup failed. Exiting.
ERROR:    Child process [84413] died before startup completed, shutting down.
INFO:     Stopping parent process [84411]
(exit code 3)

Parent-side eager app loading is intentionally untouched here, so #2440 fail-fast behavior is unchanged. Removing it (the actual #2980 memory fix) is #2988, stacked on this.

How objects are linked today vs where this is going

Today (main): the parent constructs the Server and imports the app, then ships a pickled server.run bound method into each worker, which imports the app again inside its running event loop. The only worker→supervisor signal is liveness, so every death looks the same.

flowchart TB
    subgraph P["parent process"]
        run["uvicorn.run()"]
        cfg["Config"]
        srv["Server"]
        mp["Multiprocess"]
        run -->|"load_app() - full app import,<br/>this copy is never served (#2980)"| cfg
        run --> srv
        run -->|"target=server.run"| mp
    end
    subgraph W["worker process"]
        tgt["Process.target()"]
        wsrv["server.run()"]
        app["app import - again,<br/>inside the running event loop (#941)"]
        tgt --> wsrv
        wsrv --> app
    end
    srv -.->|"pickled bound method"| tgt
    mp -->|spawn| tgt
    W -.->|"ping/pong: alive or not,<br/>nothing else"| mp
    W -->|"any death → restart<br/>(startup failure = infinite loop)"| mp
Loading

Target: the parent holds only Config and sockets; the worker owns its Server and the single app import. The healthcheck reply reports lifecycle (started), not just liveness, so the supervisor can apply a real policy.

flowchart TB
    subgraph P2["parent process"]
        run2["uvicorn.run()"]
        cfg2["Config - just data"]
        mp2["Multiprocess(config, sockets)"]
        run2 --> cfg2
        run2 --> mp2
    end
    subgraph W2["worker process"]
        tgt2["Process.target()"]
        srv2["Server(config) - built in the worker"]
        app2["app import - once, in the worker"]
        tgt2 --> srv2
        srv2 --> app2
    end
    mp2 -->|"spawn: config + sockets"| tgt2
    W2 -.->|"ping → pong carries server.started"| mp2
    mp2 -->|"died before started → shut down, exit 3<br/>died after started → restart"| W2
Loading

This PR implements the target diagram except for one edge: the parent's eager load_app() stays for now, so #2440 fail-fast behavior is unchanged while this lands. Removing it (the actual #2980 memory fix) is #2988.

Notes

  • One edge case of the pull-based signal: a worker that completes startup but dies within one health tick (0.5s), before the parent ever observed started, is classified as a startup failure and shuts the server down. Arguably the right call for a crash-on-start, and far better than the silent infinite restart loop it replaces.
  • Multiprocess and Process constructor signatures changed (dropped target). They are not documented public API. Server is untouched.

AI Disclaimer

This PR was developed with the assistance of either Claude or Codex. I've reviewed and verified the changes.

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

📖 Docs preview: https://542b1075-uvicorn.marcelotryle.workers.dev

@codspeed-hq

codspeed-hq Bot commented Jun 10, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

✅ 24 untouched benchmarks


Comparing worker-readiness (324e5c6) with main (e8a31bc)

Open in CodSpeed

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 5 files

Re-trigger cubic

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 4 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="uvicorn/supervisors/multiprocess.py">

<violation number="1" location="uvicorn/supervisors/multiprocess.py:43">
P1: Readiness is only learned from a successful healthcheck round-trip, so a worker that finishes startup and then dies before the first ping is still treated as "not ready" and shuts down the whole supervisor instead of being restarted.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

if self.parent_conn.poll(timeout):
self.parent_conn.recv()
started: bool = self.parent_conn.recv()
self.ready = self.ready or started

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Readiness is only learned from a successful healthcheck round-trip, so a worker that finishes startup and then dies before the first ping is still treated as "not ready" and shuts down the whole supervisor instead of being restarted.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At uvicorn/supervisors/multiprocess.py, line 43:

<comment>Readiness is only learned from a successful healthcheck round-trip, so a worker that finishes startup and then dies before the first ping is still treated as "not ready" and shuts down the whole supervisor instead of being restarted.</comment>

<file context>
@@ -30,22 +30,25 @@ def __init__(
         if self.parent_conn.poll(timeout):
-            self.parent_conn.recv()
+            started: bool = self.parent_conn.recv()
+            self.ready = self.ready or started
             return True
         return False
</file context>

@Kludex Kludex closed this Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant