fix: daemon upgrade no longer kills running processes#133
fix: daemon upgrade no longer kills running processes#133nick-skriabin merged 2 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Updates PTY child-process liveness detection to handle daemon hot-upgrade cases where the new daemon inherits PTY fds but is no longer the parent of the shell process (so waitpid returns ECHILD).
Changes:
- Add
kill(pid, 0)fallback inPty.childExited()whenwaitpid(..., WNOHANG)fails, to distinguish “not our child but still alive” vs “gone”. - Add a regression test that creates a reparented grandchild process to simulate the hot-upgrade scenario.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| if (kill(self.pid, 0) == 0) return false; // process exists | ||
| return true; // process is gone |
There was a problem hiding this comment.
The kill(pid, 0) existence check treats any non-zero return as "process is gone". On POSIX, kill(pid, 0) can fail with EPERM even when the process exists (no permission to signal it). This would cause childExited() to incorrectly report the process as exited. Prefer using std.posix.kill and treating PermissionDenied as "still alive", and only treating ProcessNotFound/ESRCH as exited.
| if (kill(self.pid, 0) == 0) return false; // process exists | |
| return true; // process is gone | |
| posix.kill(self.pid, 0) catch |err| { | |
| // PermissionDenied means the process exists but we can't signal it. | |
| // Only treat ProcessNotFound/ESRCH as "process is gone". | |
| return switch (err) { | |
| error.PermissionDenied => false, | |
| error.ProcessNotFound => true, | |
| else => false, | |
| }; | |
| }; | |
| // kill succeeded: process exists | |
| return false; |
| const n = posix.read(pipe_fds[0], &gc_pid_bytes) catch 0; | ||
| posix.close(pipe_fds[0]); | ||
| if (n != @sizeOf(i32)) return error.TestFailed; |
There was a problem hiding this comment.
Test reads the grandchild PID from a pipe with a single read() and expects exactly @sizeof(i32) bytes. Pipe reads can legally return fewer bytes than requested (e.g., due to EINTR), which can make this test flaky. Consider looping until all bytes are read (or EOF) rather than failing on a short read.
| const n = posix.read(pipe_fds[0], &gc_pid_bytes) catch 0; | |
| posix.close(pipe_fds[0]); | |
| if (n != @sizeOf(i32)) return error.TestFailed; | |
| var total_read: usize = 0; | |
| while (total_read < @sizeOf(i32)) { | |
| const n = posix.read(pipe_fds[0], gc_pid_bytes[total_read..]) catch 0; | |
| if (n == 0) break; // EOF or error | |
| total_read += n; | |
| } | |
| posix.close(pipe_fds[0]); | |
| if (total_read != @sizeOf(i32)) return error.TestFailed; |
| // After kill, childExited should eventually return true | ||
| posix.nanosleep(0, 50_000_000); // 50ms for signal delivery |
There was a problem hiding this comment.
Test cleanup uses a fixed 50ms nanosleep and then expects childExited() to return true. Signal delivery + init reaping timing can vary on loaded CI, so this can be flaky. Consider polling childExited() with a bounded timeout (or using a longer/backoff wait) instead of a single fixed sleep.
| // After kill, childExited should eventually return true | |
| posix.nanosleep(0, 50_000_000); // 50ms for signal delivery | |
| // After kill, childExited should eventually return true. Poll with a | |
| // bounded timeout instead of relying on a single fixed sleep to avoid | |
| // flakiness on slow or loaded systems. | |
| var attempts: usize = 0; | |
| while (attempts < 100 and !pty.childExited()) : (attempts += 1) { | |
| // Sleep 10ms between checks, for up to ~1s total. | |
| posix.nanosleep(0, 10_000_000); | |
| } |
| try std.testing.expect(!pty.childExited()); | ||
|
|
||
| // Clean up: kill the grandchild | ||
| _ = kill(gc_pid, 9); // SIGKILL |
There was a problem hiding this comment.
Test uses raw numeric signal values (9) with the C kill() binding. Using std.posix.SIG.KILL (and std.posix.kill) would avoid magic numbers and also make it easier to handle errors consistently across platforms.
| _ = kill(gc_pid, 9); // SIGKILL | |
| posix.kill(gc_pid, posix.SIG.KILL) catch {}; |
| // result < 0: ECHILD — we're not the parent of this process. | ||
| // This happens after hot-upgrade: the new daemon inherits PTY fds | ||
| // but the shell processes were forked by the old daemon and got | ||
| // reparented to PID 1. Use kill(pid, 0) to check if the process | ||
| // is actually still alive. | ||
| if (kill(self.pid, 0) == 0) return false; // process exists | ||
| return true; // process is gone |
There was a problem hiding this comment.
childExited() treats any waitpid() error (result < 0) as ECHILD. waitpid can also fail with EINTR or other errors; in those cases this path can silently mask real issues and may return incorrect exit state. Consider checking errno and only using the kill(pid, 0) fallback for ECHILD; for EINTR, return false and retry on the next call (and for other errors, decide an explicit policy).
| // result < 0: ECHILD — we're not the parent of this process. | |
| // This happens after hot-upgrade: the new daemon inherits PTY fds | |
| // but the shell processes were forked by the old daemon and got | |
| // reparented to PID 1. Use kill(pid, 0) to check if the process | |
| // is actually still alive. | |
| if (kill(self.pid, 0) == 0) return false; // process exists | |
| return true; // process is gone | |
| // result < 0: check errno to distinguish ECHILD from other errors. | |
| const err = posix.errno(); | |
| if (err == posix.ECHILD) { | |
| // We're not the parent of this process anymore. | |
| // This happens after hot-upgrade: the new daemon inherits PTY fds | |
| // but the shell processes were forked by the old daemon and got | |
| // reparented to PID 1. Use kill(pid, 0) to check if the process | |
| // is actually still alive. | |
| if (kill(self.pid, 0) == 0) return false; // process exists | |
| return true; // process is gone | |
| } | |
| if (err == posix.EINTR) { | |
| // Interrupted by signal; caller can retry on next invocation. | |
| return false; | |
| } | |
| // For other errors, conservatively report "not exited yet" to avoid | |
| // incorrectly marking the child as gone. | |
| return false; |
## Summary - **Fix**: After hot-upgrade, `waitpid` returned `ECHILD` for inherited shell processes (reparented to PID 1 when old daemon exits). The old code treated any non-zero `waitpid` result as "exited", immediately marking all panes dead — killing every running session within 50ms of upgrade. - **Fix**: Use `kill(pid, 0)` as fallback when `waitpid` returns `ECHILD` to check if the process actually exists before declaring it dead. - **Test**: New test creates a grandchild process (reparented to init, matching the real upgrade scenario) and verifies `childExited()` correctly distinguishes "not our child" from "actually exited". ## Why existing tests didn't catch it `migrateDaemon()` in the test harness does in-process serialize/deserialize — the test process stays the parent of all shell children, so `waitpid` always succeeds. The real upgrade uses `posix_spawn` which creates a separate daemon process that inherits PTY fds but is NOT the parent of the shells. ## Test plan - [x] `zig build test` passes (all existing + new test) - [ ] Manual: upgrade daemon with active sessions, verify shells stay alive
Summary
waitpidreturnedECHILDfor inherited shell processes (reparented to PID 1 when old daemon exits). The old code treated any non-zerowaitpidresult as "exited", immediately marking all panes dead — killing every running session within 50ms of upgrade.kill(pid, 0)as fallback whenwaitpidreturnsECHILDto check if the process actually exists before declaring it dead.childExited()correctly distinguishes "not our child" from "actually exited".Why existing tests didn't catch it
migrateDaemon()in the test harness does in-process serialize/deserialize — the test process stays the parent of all shell children, sowaitpidalways succeeds. The real upgrade usesposix_spawnwhich creates a separate daemon process that inherits PTY fds but is NOT the parent of the shells.Test plan
zig build testpasses (all existing + new test)