Skip to content

fix: daemon upgrade no longer kills running processes#133

Merged
nick-skriabin merged 2 commits intomainfrom
fb-daemon-fd-inheritance
Mar 12, 2026
Merged

fix: daemon upgrade no longer kills running processes#133
nick-skriabin merged 2 commits intomainfrom
fb-daemon-fd-inheritance

Conversation

@nick-skriabin
Copy link
Contributor

@nick-skriabin nick-skriabin commented Mar 12, 2026

Summary

  • Fix: After hot-upgrade, waitpid returned ECHILD for inherited shell processes (reparented to PID 1 when old daemon exits). The old code treated any non-zero waitpid result as "exited", immediately marking all panes dead — killing every running session within 50ms of upgrade.
  • Fix: Use kill(pid, 0) as fallback when waitpid returns ECHILD to check if the process actually exists before declaring it dead.
  • Test: New test creates a grandchild process (reparented to init, matching the real upgrade scenario) and verifies childExited() correctly distinguishes "not our child" from "actually exited".

Why existing tests didn't catch it

migrateDaemon() in the test harness does in-process serialize/deserialize — the test process stays the parent of all shell children, so waitpid always succeeds. The real upgrade uses posix_spawn which creates a separate daemon process that inherits PTY fds but is NOT the parent of the shells.

Test plan

  • zig build test passes (all existing + new test)
  • Manual: upgrade daemon with active sessions, verify shells stay alive

Copilot AI review requested due to automatic review settings March 12, 2026 20:42
@nick-skriabin nick-skriabin changed the title Fb daemon fd inheritance fix: daemon upgrade no longer kills running processes Mar 12, 2026
@nick-skriabin nick-skriabin enabled auto-merge (squash) March 12, 2026 20:42
@nick-skriabin nick-skriabin merged commit 17aa42f into main Mar 12, 2026
4 checks passed
@nick-skriabin nick-skriabin deleted the fb-daemon-fd-inheritance branch March 12, 2026 20:43
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates PTY child-process liveness detection to handle daemon hot-upgrade cases where the new daemon inherits PTY fds but is no longer the parent of the shell process (so waitpid returns ECHILD).

Changes:

  • Add kill(pid, 0) fallback in Pty.childExited() when waitpid(..., WNOHANG) fails, to distinguish “not our child but still alive” vs “gone”.
  • Add a regression test that creates a reparented grandchild process to simulate the hot-upgrade scenario.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +278 to +279
if (kill(self.pid, 0) == 0) return false; // process exists
return true; // process is gone
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The kill(pid, 0) existence check treats any non-zero return as "process is gone". On POSIX, kill(pid, 0) can fail with EPERM even when the process exists (no permission to signal it). This would cause childExited() to incorrectly report the process as exited. Prefer using std.posix.kill and treating PermissionDenied as "still alive", and only treating ProcessNotFound/ESRCH as exited.

Suggested change
if (kill(self.pid, 0) == 0) return false; // process exists
return true; // process is gone
posix.kill(self.pid, 0) catch |err| {
// PermissionDenied means the process exists but we can't signal it.
// Only treat ProcessNotFound/ESRCH as "process is gone".
return switch (err) {
error.PermissionDenied => false,
error.ProcessNotFound => true,
else => false,
};
};
// kill succeeded: process exists
return false;

Copilot uses AI. Check for mistakes.
Comment on lines +338 to +340
const n = posix.read(pipe_fds[0], &gc_pid_bytes) catch 0;
posix.close(pipe_fds[0]);
if (n != @sizeOf(i32)) return error.TestFailed;
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test reads the grandchild PID from a pipe with a single read() and expects exactly @sizeof(i32) bytes. Pipe reads can legally return fewer bytes than requested (e.g., due to EINTR), which can make this test flaky. Consider looping until all bytes are read (or EOF) rather than failing on a short read.

Suggested change
const n = posix.read(pipe_fds[0], &gc_pid_bytes) catch 0;
posix.close(pipe_fds[0]);
if (n != @sizeOf(i32)) return error.TestFailed;
var total_read: usize = 0;
while (total_read < @sizeOf(i32)) {
const n = posix.read(pipe_fds[0], gc_pid_bytes[total_read..]) catch 0;
if (n == 0) break; // EOF or error
total_read += n;
}
posix.close(pipe_fds[0]);
if (total_read != @sizeOf(i32)) return error.TestFailed;

Copilot uses AI. Check for mistakes.
Comment on lines +355 to +356
// After kill, childExited should eventually return true
posix.nanosleep(0, 50_000_000); // 50ms for signal delivery
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test cleanup uses a fixed 50ms nanosleep and then expects childExited() to return true. Signal delivery + init reaping timing can vary on loaded CI, so this can be flaky. Consider polling childExited() with a bounded timeout (or using a longer/backoff wait) instead of a single fixed sleep.

Suggested change
// After kill, childExited should eventually return true
posix.nanosleep(0, 50_000_000); // 50ms for signal delivery
// After kill, childExited should eventually return true. Poll with a
// bounded timeout instead of relying on a single fixed sleep to avoid
// flakiness on slow or loaded systems.
var attempts: usize = 0;
while (attempts < 100 and !pty.childExited()) : (attempts += 1) {
// Sleep 10ms between checks, for up to ~1s total.
posix.nanosleep(0, 10_000_000);
}

Copilot uses AI. Check for mistakes.
try std.testing.expect(!pty.childExited());

// Clean up: kill the grandchild
_ = kill(gc_pid, 9); // SIGKILL
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test uses raw numeric signal values (9) with the C kill() binding. Using std.posix.SIG.KILL (and std.posix.kill) would avoid magic numbers and also make it easier to handle errors consistently across platforms.

Suggested change
_ = kill(gc_pid, 9); // SIGKILL
posix.kill(gc_pid, posix.SIG.KILL) catch {};

Copilot uses AI. Check for mistakes.
Comment on lines +273 to +279
// result < 0: ECHILD — we're not the parent of this process.
// This happens after hot-upgrade: the new daemon inherits PTY fds
// but the shell processes were forked by the old daemon and got
// reparented to PID 1. Use kill(pid, 0) to check if the process
// is actually still alive.
if (kill(self.pid, 0) == 0) return false; // process exists
return true; // process is gone
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

childExited() treats any waitpid() error (result < 0) as ECHILD. waitpid can also fail with EINTR or other errors; in those cases this path can silently mask real issues and may return incorrect exit state. Consider checking errno and only using the kill(pid, 0) fallback for ECHILD; for EINTR, return false and retry on the next call (and for other errors, decide an explicit policy).

Suggested change
// result < 0: ECHILD — we're not the parent of this process.
// This happens after hot-upgrade: the new daemon inherits PTY fds
// but the shell processes were forked by the old daemon and got
// reparented to PID 1. Use kill(pid, 0) to check if the process
// is actually still alive.
if (kill(self.pid, 0) == 0) return false; // process exists
return true; // process is gone
// result < 0: check errno to distinguish ECHILD from other errors.
const err = posix.errno();
if (err == posix.ECHILD) {
// We're not the parent of this process anymore.
// This happens after hot-upgrade: the new daemon inherits PTY fds
// but the shell processes were forked by the old daemon and got
// reparented to PID 1. Use kill(pid, 0) to check if the process
// is actually still alive.
if (kill(self.pid, 0) == 0) return false; // process exists
return true; // process is gone
}
if (err == posix.EINTR) {
// Interrupted by signal; caller can retry on next invocation.
return false;
}
// For other errors, conservatively report "not exited yet" to avoid
// incorrectly marking the child as gone.
return false;

Copilot uses AI. Check for mistakes.
nick-skriabin added a commit that referenced this pull request Mar 13, 2026
## Summary

- **Fix**: After hot-upgrade, `waitpid` returned `ECHILD` for inherited
shell processes (reparented to PID 1 when old daemon exits). The old
code treated any non-zero `waitpid` result as "exited", immediately
marking all panes dead — killing every running session within 50ms of
upgrade.
- **Fix**: Use `kill(pid, 0)` as fallback when `waitpid` returns
`ECHILD` to check if the process actually exists before declaring it
dead.
- **Test**: New test creates a grandchild process (reparented to init,
matching the real upgrade scenario) and verifies `childExited()`
correctly distinguishes "not our child" from "actually exited".

## Why existing tests didn't catch it

`migrateDaemon()` in the test harness does in-process
serialize/deserialize — the test process stays the parent of all shell
children, so `waitpid` always succeeds. The real upgrade uses
`posix_spawn` which creates a separate daemon process that inherits PTY
fds but is NOT the parent of the shells.

## Test plan

- [x] `zig build test` passes (all existing + new test)
- [ ] Manual: upgrade daemon with active sessions, verify shells stay
alive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants