fix: daemon upgrade no longer kills running processes by nick-skriabin · Pull Request #133 · semos-labs/attyx

nick-skriabin · 2026-03-12T20:42:01Z

Summary

Fix: After hot-upgrade, waitpid returned ECHILD for inherited shell processes (reparented to PID 1 when old daemon exits). The old code treated any non-zero waitpid result as "exited", immediately marking all panes dead — killing every running session within 50ms of upgrade.
Fix: Use kill(pid, 0) as fallback when waitpid returns ECHILD to check if the process actually exists before declaring it dead.
Test: New test creates a grandchild process (reparented to init, matching the real upgrade scenario) and verifies childExited() correctly distinguishes "not our child" from "actually exited".

Why existing tests didn't catch it

migrateDaemon() in the test harness does in-process serialize/deserialize — the test process stays the parent of all shell children, so waitpid always succeeds. The real upgrade uses posix_spawn which creates a separate daemon process that inherits PTY fds but is NOT the parent of the shells.

Test plan

zig build test passes (all existing + new test)
Manual: upgrade daemon with active sessions, verify shells stay alive

Copilot

Pull request overview

Updates PTY child-process liveness detection to handle daemon hot-upgrade cases where the new daemon inherits PTY fds but is no longer the parent of the shell process (so waitpid returns ECHILD).

Changes:

Add kill(pid, 0) fallback in Pty.childExited() when waitpid(..., WNOHANG) fails, to distinguish “not our child but still alive” vs “gone”.
Add a regression test that creates a reparented grandchild process to simulate the hot-upgrade scenario.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Copilot · 2026-03-12T20:46:44Z

src/app/pty.zig

+        if (kill(self.pid, 0) == 0) return false; // process exists
+        return true; // process is gone


The kill(pid, 0) existence check treats any non-zero return as "process is gone". On POSIX, kill(pid, 0) can fail with EPERM even when the process exists (no permission to signal it). This would cause childExited() to incorrectly report the process as exited. Prefer using std.posix.kill and treating PermissionDenied as "still alive", and only treating ProcessNotFound/ESRCH as exited.

Suggested change

if (kill(self.pid, 0) == 0) return false; // process exists

return true; // process is gone

posix.kill(self.pid, 0) catch |err| {

// PermissionDenied means the process exists but we can't signal it.

// Only treat ProcessNotFound/ESRCH as "process is gone".

return switch (err) {

error.PermissionDenied => false,

error.ProcessNotFound => true,

else => false,

};

};

// kill succeeded: process exists

return false;

Copilot · 2026-03-12T20:46:45Z

src/app/pty.zig

+    const n = posix.read(pipe_fds[0], &gc_pid_bytes) catch 0;
+    posix.close(pipe_fds[0]);
+    if (n != @sizeOf(i32)) return error.TestFailed;


Test reads the grandchild PID from a pipe with a single read() and expects exactly @sizeof(i32) bytes. Pipe reads can legally return fewer bytes than requested (e.g., due to EINTR), which can make this test flaky. Consider looping until all bytes are read (or EOF) rather than failing on a short read.

Suggested change

const n = posix.read(pipe_fds[0], &gc_pid_bytes) catch 0;

posix.close(pipe_fds[0]);

if (n != @sizeOf(i32)) return error.TestFailed;

var total_read: usize = 0;

while (total_read < @sizeOf(i32)) {

const n = posix.read(pipe_fds[0], gc_pid_bytes[total_read..]) catch 0;

if (n == 0) break; // EOF or error

total_read += n;

}

posix.close(pipe_fds[0]);

if (total_read != @sizeOf(i32)) return error.TestFailed;

Copilot · 2026-03-12T20:46:45Z

src/app/pty.zig

+    // After kill, childExited should eventually return true
+    posix.nanosleep(0, 50_000_000); // 50ms for signal delivery


Test cleanup uses a fixed 50ms nanosleep and then expects childExited() to return true. Signal delivery + init reaping timing can vary on loaded CI, so this can be flaky. Consider polling childExited() with a bounded timeout (or using a longer/backoff wait) instead of a single fixed sleep.

Suggested change

// After kill, childExited should eventually return true

posix.nanosleep(0, 50_000_000); // 50ms for signal delivery

// After kill, childExited should eventually return true. Poll with a

// bounded timeout instead of relying on a single fixed sleep to avoid

// flakiness on slow or loaded systems.

var attempts: usize = 0;

while (attempts < 100 and !pty.childExited()) : (attempts += 1) {

// Sleep 10ms between checks, for up to ~1s total.

posix.nanosleep(0, 10_000_000);

}

Copilot · 2026-03-12T20:46:45Z

src/app/pty.zig

+    try std.testing.expect(!pty.childExited());
+
+    // Clean up: kill the grandchild
+    _ = kill(gc_pid, 9); // SIGKILL


Test uses raw numeric signal values (9) with the C kill() binding. Using std.posix.SIG.KILL (and std.posix.kill) would avoid magic numbers and also make it easier to handle errors consistently across platforms.

Suggested change

_ = kill(gc_pid, 9); // SIGKILL

posix.kill(gc_pid, posix.SIG.KILL) catch {};

Copilot · 2026-03-12T20:46:46Z

src/app/pty.zig

+        // result < 0: ECHILD — we're not the parent of this process.
+        // This happens after hot-upgrade: the new daemon inherits PTY fds
+        // but the shell processes were forked by the old daemon and got
+        // reparented to PID 1. Use kill(pid, 0) to check if the process
+        // is actually still alive.
+        if (kill(self.pid, 0) == 0) return false; // process exists
+        return true; // process is gone


childExited() treats any waitpid() error (result < 0) as ECHILD. waitpid can also fail with EINTR or other errors; in those cases this path can silently mask real issues and may return incorrect exit state. Consider checking errno and only using the kill(pid, 0) fallback for ECHILD; for EINTR, return false and retry on the next call (and for other errors, decide an explicit policy).

Suggested change

// result < 0: ECHILD — we're not the parent of this process.

// This happens after hot-upgrade: the new daemon inherits PTY fds

// but the shell processes were forked by the old daemon and got

// reparented to PID 1. Use kill(pid, 0) to check if the process

// is actually still alive.

if (kill(self.pid, 0) == 0) return false; // process exists

return true; // process is gone

// result < 0: check errno to distinguish ECHILD from other errors.

const err = posix.errno();

if (err == posix.ECHILD) {

// We're not the parent of this process anymore.

// This happens after hot-upgrade: the new daemon inherits PTY fds

// but the shell processes were forked by the old daemon and got

// reparented to PID 1. Use kill(pid, 0) to check if the process

// is actually still alive.

if (kill(self.pid, 0) == 0) return false; // process exists

return true; // process is gone

}

if (err == posix.EINTR) {

// Interrupted by signal; caller can retry on next invocation.

return false;

}

// For other errors, conservatively report "not exited yet" to avoid

// incorrectly marking the child as gone.

return false;

## Summary - **Fix**: After hot-upgrade, `waitpid` returned `ECHILD` for inherited shell processes (reparented to PID 1 when old daemon exits). The old code treated any non-zero `waitpid` result as "exited", immediately marking all panes dead — killing every running session within 50ms of upgrade. - **Fix**: Use `kill(pid, 0)` as fallback when `waitpid` returns `ECHILD` to check if the process actually exists before declaring it dead. - **Test**: New test creates a grandchild process (reparented to init, matching the real upgrade scenario) and verifies `childExited()` correctly distinguishes "not our child" from "actually exited". ## Why existing tests didn't catch it `migrateDaemon()` in the test harness does in-process serialize/deserialize — the test process stays the parent of all shell children, so `waitpid` always succeeds. The real upgrade uses `posix_spawn` which creates a separate daemon process that inherits PTY fds but is NOT the parent of the shells. ## Test plan - [x] `zig build test` passes (all existing + new test) - [ ] Manual: upgrade daemon with active sessions, verify shells stay alive

nick-skriabin added 2 commits March 13, 2026 00:41

Fix daemon fd inheritance (stop losing child processes)

c707585

Merge branch 'main' into fb-daemon-fd-inheritance

e0cd126

Copilot AI review requested due to automatic review settings March 12, 2026 20:42

Copilot started reviewing on behalf of nick-skriabin March 12, 2026 20:42 View session

nick-skriabin changed the title ~~Fb daemon fd inheritance~~ fix: daemon upgrade no longer kills running processes Mar 12, 2026

nick-skriabin enabled auto-merge (squash) March 12, 2026 20:42

nick-skriabin merged commit 17aa42f into main Mar 12, 2026
4 checks passed

nick-skriabin deleted the fb-daemon-fd-inheritance branch March 12, 2026 20:43

Copilot AI reviewed Mar 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: daemon upgrade no longer kills running processes#133

fix: daemon upgrade no longer kills running processes#133
nick-skriabin merged 2 commits intomainfrom
fb-daemon-fd-inheritance

nick-skriabin commented Mar 12, 2026 •

edited

Loading

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 12, 2026

Uh oh!

Copilot AI Mar 12, 2026

Uh oh!

Copilot AI Mar 12, 2026

Uh oh!

Copilot AI Mar 12, 2026

Uh oh!

Copilot AI Mar 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		if (kill(self.pid, 0) == 0) return false; // process exists
		return true; // process is gone

-        if (kill(self.pid, 0) == 0) return false; // process exists
-        return true; // process is gone
+        posix.kill(self.pid, 0) catch |err| {
+            // PermissionDenied means the process exists but we can't signal it.
+            // Only treat ProcessNotFound/ESRCH as "process is gone".
+            return switch (err) {
+                error.PermissionDenied => false,
+                error.ProcessNotFound => true,
+                else => false,
+            };
+        };
+        // kill succeeded: process exists
+        return false;

-    const n = posix.read(pipe_fds[0], &gc_pid_bytes) catch 0;
-    posix.close(pipe_fds[0]);
-    if (n != @sizeOf(i32)) return error.TestFailed;
+    var total_read: usize = 0;
+    while (total_read < @sizeOf(i32)) {
+        const n = posix.read(pipe_fds[0], gc_pid_bytes[total_read..]) catch 0;
+        if (n == 0) break; // EOF or error
+        total_read += n;
+    }
+    posix.close(pipe_fds[0]);
+    if (total_read != @sizeOf(i32)) return error.TestFailed;

		// After kill, childExited should eventually return true
		posix.nanosleep(0, 50_000_000); // 50ms for signal delivery

-    // After kill, childExited should eventually return true
-    posix.nanosleep(0, 50_000_000); // 50ms for signal delivery
+    // After kill, childExited should eventually return true. Poll with a
+    // bounded timeout instead of relying on a single fixed sleep to avoid
+    // flakiness on slow or loaded systems.
+    var attempts: usize = 0;
+    while (attempts < 100 and !pty.childExited()) : (attempts += 1) {
+        // Sleep 10ms between checks, for up to ~1s total.
+        posix.nanosleep(0, 10_000_000);
+    }

	_ = kill(gc_pid, 9); // SIGKILL
	posix.kill(gc_pid, posix.SIG.KILL) catch {};

Conversation

nick-skriabin commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why existing tests didn't catch it

Test plan

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nick-skriabin commented Mar 12, 2026 •

edited

Loading