
fix: graceful shutdown instead of SIGKILL on worker processes #1349

Open
sansyrox wants to merge 2 commits into main from fix/graceful-shutdown

Conversation

@sansyrox
Member

@sansyrox sansyrox commented Mar 27, 2026

Summary

  • Replace process.kill() (SIGKILL) with process.terminate() (SIGTERM) + process.join(timeout=30) in the parent signal handler, with a force-kill fallback for workers that don't exit in time.
  • In child processes, register a SIGTERM handler via loop.add_signal_handler(signal.SIGTERM, loop.stop) so the event loop exits cleanly, allowing the Rust shutdown handler to fire and in-flight requests to complete.
  • Move loop.close() into a finally block so it always runs regardless of how the loop exits.
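
The parent-side sequence in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in robyn/processpool.py: the helper name `shutdown_workers` and its `grace` and `kill_wait` parameters are hypothetical, and it spends one shared deadline across all workers, whereas the PR calls `join(timeout=30)` on each process.

```python
import logging
import multiprocessing
import time
from typing import Iterable, List, Optional

logger = logging.getLogger(__name__)


def shutdown_workers(
    processes: Iterable[multiprocessing.Process],
    grace: float = 30.0,      # hypothetical name; the PR hard-codes 30 seconds
    kill_wait: float = 5.0,   # hypothetical name; bounded wait after the force kill
) -> List[Optional[int]]:
    """Terminate workers, wait up to `grace` seconds in total, then kill stragglers."""
    processes = list(processes)

    # Step 1: polite request. On POSIX, terminate() sends SIGTERM.
    for process in processes:
        process.terminate()

    # Step 2: bounded wait. All workers share a single deadline here.
    deadline = time.monotonic() + grace
    for process in processes:
        process.join(timeout=max(0.0, deadline - time.monotonic()))

    # Step 3: escalate only for workers that are still alive.
    for process in processes:
        if process.is_alive():
            logger.warning("Process %s did not exit in time, forcing kill.", process.pid)
            process.kill()  # SIGKILL on POSIX
            process.join(timeout=kill_wait)

    # exitcode is 0 for a clean exit, -15 for SIGTERM, -9 for SIGKILL.
    return [process.exitcode for process in processes]
```

Calling `shutdown_workers(process_pool)` from the parent's signal handler would return the exit codes, which makes the "0 or -15, not -9" expectation from the test plan easy to log.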

Test plan

  • Existing integration tests pass
  • Manual test: start multi-process server, send SIGTERM, verify processes exit with code 0 or -15 (not -9)
  • Manual test: verify in-flight requests complete before the process exits
  • Manual test: verify force-kill fallback works if a process hangs beyond 30s
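
For the exit-code check in the manual tests, a small script along these lines may help. It is only a sketch for POSIX systems: `describe_exit` and `run_and_signal` are hypothetical helpers, and the child is a plain sleeping Python process rather than a Robyn worker.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a POSIX return code into a readable description."""
    if returncode == 0:
        return "clean exit"
    if returncode < 0:
        # A negative return code -N means the child was killed by signal N.
        return f"terminated by {signal.Signals(-returncode).name}"
    return f"exited with error code {returncode}"


def run_and_signal(sig: signal.Signals) -> int:
    """Start a sleeping child, send it `sig`, and return its exit code."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.send_signal(sig)
    return child.wait(timeout=10)


if __name__ == "__main__":
    for sig in (signal.SIGTERM, signal.SIGKILL):
        print(sig.name, "->", describe_exit(run_and_signal(sig)))
```

A worker that handles SIGTERM and stops its loop on its own would report 0 instead of -15, which is why the test plan accepts either value.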

Made with Cursor

Summary by CodeRabbit

  • Bug Fixes
    • Improved shutdown flow: workers now receive a graceful termination window (up to 30s) before forcible stop, reducing abrupt terminations.
    • Safer restart behavior: stopped subprocesses are given a bounded grace period (≈5s) before escalation to forceful kill.
    • More robust signal handling: termination signals are handled more predictably on supported platforms and interruption paths always ensure event-loop cleanup.

Replace process.kill() (SIGKILL) with process.terminate() (SIGTERM)
followed by a 30s join timeout. Workers that don't exit in time are
force-killed as a fallback.

In child processes, register a SIGTERM handler on the asyncio event loop
so that loop.run_forever() exits cleanly, allowing the Rust shutdown
handler to fire and in-flight requests to complete.

Made-with: Cursor
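
The child-side change described in this commit message can be sketched as below. The function name `run_worker_loop` is hypothetical and the sketch leaves out everything Robyn-specific (server startup, the Rust shutdown handler); it only shows the SIGTERM registration and the `finally` cleanup.

```python
import asyncio
import signal
import sys


def run_worker_loop(loop: asyncio.AbstractEventLoop) -> None:
    """Run `loop` until SIGTERM (or Ctrl+C) arrives, then always close it."""
    try:
        # add_signal_handler is only available on Unix event loops.
        if not sys.platform.startswith("win32"):
            # loop.stop makes run_forever() return instead of the process dying.
            loop.add_signal_handler(signal.SIGTERM, loop.stop)

        loop.run_forever()
    except KeyboardInterrupt:
        # Ctrl+C in the worker: fall through to cleanup.
        pass
    finally:
        # Runs on every exit path, so the loop is never left open.
        loop.close()
```

Because `run_forever()` returns normally after `loop.stop`, code placed after the call (or in `finally`) gets a chance to run before the process exits, which is what a SIGKILL would have prevented.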
@vercel

vercel bot commented Mar 27, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| robyn | Ready | Preview, Comment | Mar 28, 2026 8:05pm |

@coderabbitai

coderabbitai bot commented Mar 27, 2026

📝 Walkthrough


Refined shutdown flows: the parent and the reloader now use graceful termination (terminate + bounded wait) and escalate to kill only for lingering processes; spawn_process registers a SIGTERM handler on the asyncio loop on non-Windows platforms and ensures the loop is closed on all paths; the parent exits via sys.exit(0) after shutdown.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Process Shutdown & asyncio integration (robyn/processpool.py) | Replaced immediate kill() with terminate() + join(timeout=30) then kill() for still-alive workers; signal handler now calls sys.exit(0) after shutdown; spawn_process() registers loop.add_signal_handler(signal.SIGTERM, loop.stop) on non-Windows and always closes the event loop; KeyboardInterrupt is swallowed and handled in finally. |
| Reloader subprocess shutdown (robyn/reloader.py) | Changed EventHandler.stop_server() to use process.terminate() with wait(timeout=5), warn then kill() if needed; removed explicit pre-restart kill() in reload() to rely on updated stop logic before spawning new subprocess. |

Sequence Diagram(s)

sequenceDiagram
    participant Signal as Signal Handler
    participant Parent as Parent Process
    participant Workers as Worker Processes
    participant Cleanup as Cleanup Phase

    Signal->>Parent: SIGTERM received
    Parent->>Workers: terminate() [graceful]
    Note over Parent,Workers: Wait up to 30 seconds (join(timeout=30))
    
    alt All workers stopped
        Cleanup->>Parent: All processes stopped
    else Some workers still alive
        Cleanup->>Parent: Log warning about stragglers
        Cleanup->>Workers: kill() [force]
        Note over Cleanup,Workers: Wait additional 5 seconds (join(timeout=5))
    end
    
    Parent->>Parent: sys.exit(0)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Poem

🐰 I nibble logs and watch signals hum,

terminate softly before the drum.
Thirty seconds to tidy the lair,
A stern little hop if stragglers stare,
Goodbye, cleanly — with carrot flair! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: replacing SIGKILL shutdown with graceful SIGTERM-based shutdown for worker processes. |
| Description check | ✅ Passed | The description provides a clear summary of the changes and test plan, though it lacks the GitHub issue reference and some checklist items are incomplete. |


@codspeed-hq

codspeed-hq bot commented Mar 27, 2026

Merging this PR will not alter performance

✅ 189 untouched benchmarks


Comparing fix/graceful-shutdown (834cf7f) with main (3e04c65)



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@robyn/processpool.py`:
- Around line 56-63: The reloader currently bypasses the graceful shutdown logic
by calling prev_process.kill() directly; refactor so the reloader uses the same
terminate→join→kill sequence in robyn/processpool.py instead of force-killing.
Add or expose a helper (e.g., graceful_shutdown or shutdown_process_pool) that
takes a process or process_pool and runs process.terminate(), then joins each
process with timeout=30, logs and calls process.kill() only if still alive and
then joins again with timeout=5; update robyn/reloader.py to call that helper
(replacing prev_process.kill()) so reloads follow the same graceful path.
- Around line 233-240: The worker event loop only registers SIGTERM via
loop.add_signal_handler(signal.SIGTERM, loop.stop), so Ctrl+C (SIGINT) raises
KeyboardInterrupt in children and bypasses coordinated shutdown; update the
worker event loop to also register SIGINT the same way (e.g., add
loop.add_signal_handler(signal.SIGINT, loop.stop) alongside the SIGTERM
registration on non-Windows platforms) or explicitly ignore SIGINT in the worker
before the loop starts; ensure you keep the existing except
KeyboardInterrupt/finally structure but prefer registering SIGINT so
KeyboardInterrupt is not raised in workers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 138fabd6-6b1c-4fb6-828c-29463cd38c1e

📥 Commits

Reviewing files that changed from the base of the PR and between a54ff96 and d53d8b0.

📒 Files selected for processing (1)
  • robyn/processpool.py

Comment on lines +233 to 240
    if not sys.platform.startswith("win32"):
        loop.add_signal_handler(signal.SIGTERM, loop.stop)

    loop.run_forever()
except KeyboardInterrupt:
    pass
finally:
    loop.close()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect whether workers have explicit SIGINT handling or process-group isolation.
rg -n -C2 'signal\.SIGINT|signal\.signal\(signal\.SIGINT|add_signal_handler\(signal\.SIGINT|start_new_session|setsid|setpgrp' .

Repository: sparckles/Robyn

Length of output: 1664


🏁 Script executed:

# First, let's see the overall structure of processpool.py
head -100 robyn/processpool.py | cat -n

Repository: sparckles/Robyn

Length of output: 3620


🏁 Script executed:

# Continue reading to see spawn_process and worker implementation
sed -n '100,250p' robyn/processpool.py | cat -n

Repository: sparckles/Robyn

Length of output: 5761


Workers can still bypass graceful shutdown on Ctrl+C.

Only SIGTERM is trapped in the worker event loop (line 135). The parent installs its SIGINT handler after spawning the workers (lines 67–68 come after lines 36–51), so the children inherit the default SIGINT behavior. When you press Ctrl+C, the signal reaches the workers directly and raises KeyboardInterrupt, which is caught at lines 138–139 and does nothing except let the loop be closed. This skips any custom termination logic. The parent's terminating_signal_handler may still execute, but the worker-side flow is not coordinated with it. Please either ignore SIGINT in the worker or handle it the same way as SIGTERM (calling loop.stop()).
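
One way to apply the second option is sketched below, assuming a Unix event loop. The helper name `install_stop_handlers` is hypothetical and is not part of robyn/processpool.py.

```python
import asyncio
import signal
import sys
from typing import Sequence, Tuple


def install_stop_handlers(
    loop: asyncio.AbstractEventLoop,
    signals: Sequence[signal.Signals] = (signal.SIGINT, signal.SIGTERM),
) -> Tuple[signal.Signals, ...]:
    """Route the given signals to loop.stop and return the ones installed."""
    if sys.platform.startswith("win32"):
        # add_signal_handler is not implemented on Windows event loops.
        return ()

    for sig in signals:
        # With a handler registered for SIGINT, Ctrl+C no longer raises
        # KeyboardInterrupt in the worker; it stops the loop like SIGTERM does.
        loop.add_signal_handler(sig, loop.stop)
    return tuple(signals)
```

The alternative from the comment, ignoring SIGINT in the worker, would be `signal.signal(signal.SIGINT, signal.SIG_IGN)` before the loop starts; the worker then relies entirely on the parent's SIGTERM to shut down.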


The reloader was sending SIGTERM then unconditionally calling kill(),
bypassing the graceful shutdown the PR adds. Now stop_server() uses
terminate(), waits up to 5 seconds, and only force-kills if the process
is still alive. The redundant prev_process.kill() in reload() is removed
since stop_server() already handles full cleanup.

Made-with: Cursor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@robyn/reloader.py`:
- Around line 126-132: The post-kill wait in the reloader's shutdown path blocks
indefinitely because it calls self.process.wait() with no timeout after
self.process.kill(); change this to use a bounded wait like
self.process.wait(timeout=5) (mirroring the pattern in robyn/processpool.py) and
handle subprocess.TimeoutExpired the same way so the reloader cannot hang if the
child fails to reap; update the block around self.process.terminate(),
self.process.wait(...), subprocess.TimeoutExpired, self.process.kill(), and the
final self.process.wait(...) call to include the timeout and appropriate
exception handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d389933c-81c0-4893-8075-f8f3d065c5f9

📥 Commits

Reviewing files that changed from the base of the PR and between d53d8b0 and 834cf7f.

📒 Files selected for processing (1)
  • robyn/reloader.py

Comment on lines +126 to +132
self.process.terminate()
try:
    self.process.wait(timeout=5)
except subprocess.TimeoutExpired:
    logger.warning("Process %s did not shut down in time, forcing kill.", self.process.pid)
    self.process.kill()
    self.process.wait()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read-only verification: inspect kill/wait sequencing in shutdown paths.
rg -nP --type=py -C4 '\.terminate\(\)|\.kill\(\)|\.wait\(' robyn/reloader.py robyn/processpool.py

Repository: sparckles/Robyn

Length of output: 2282


Add a timeout to the post-kill wait to prevent indefinite blocking during reload.

Line 132 calls self.process.wait() without a timeout after process.kill(). If the process fails to reap, this blocks the reloader indefinitely, stalling both reload and shutdown operations. This should match the pattern used in robyn/processpool.py:63, which bounds the fallback wait with a timeout.

Proposed fix
             except subprocess.TimeoutExpired:
                 logger.warning("Process %s did not shut down in time, forcing kill.", self.process.pid)
                 self.process.kill()
-                self.process.wait()
+                try:
+                    self.process.wait(timeout=5)
+                except subprocess.TimeoutExpired:
+                    logger.error("Process %s did not exit after force kill.", self.process.pid)
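
As a standalone illustration of the bounded pattern in the proposed fix, the sketch below wraps it in a function that works on any `subprocess.Popen` object. The name `stop_subprocess` and the `grace` and `kill_wait` parameters are hypothetical; robyn/reloader.py keeps this logic inside `EventHandler.stop_server()`.

```python
import logging
import subprocess
import sys
from typing import Optional

logger = logging.getLogger(__name__)


def stop_subprocess(
    process: subprocess.Popen,
    grace: float = 5.0,      # hypothetical name for the SIGTERM grace period
    kill_wait: float = 5.0,  # hypothetical name for the post-kill wait
) -> Optional[int]:
    """Stop `process` with bounded waits; return its exit code or None if unreaped."""
    if process.poll() is not None:
        return process.returncode  # already exited, nothing to do

    process.terminate()
    try:
        return process.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        logger.warning("Process %s did not shut down in time, forcing kill.", process.pid)
        process.kill()

    try:
        # Bounded as well, so the caller cannot hang if the child is never reaped.
        return process.wait(timeout=kill_wait)
    except subprocess.TimeoutExpired:
        logger.error("Process %s did not exit after force kill.", process.pid)
        return None


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit code:", stop_subprocess(child))
```

Returning `None` rather than raising keeps the reload path moving; whether the reloader should instead abort in that case is a design decision this sketch does not settle.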
