
fix: improve worker lifecycle reliability #360

Open
whitmo wants to merge 1 commit into dlorenc:main from whitmo:work/jolly-koala


@whitmo whitmo commented Mar 3, 2026

Summary

  • Detect stale PID=0 workers: Workers where Claude failed to start had PID=0 and were never detected as dead by the health check. Now marks them for cleanup after a 2-minute grace period.
  • Clean up workers with dead processes: When a worker's process dies but its tmux window still exists, the health check now marks non-persistent agents for cleanup instead of silently ignoring them.
  • Rollback on creation failure: If createWorker() fails after creating worktree/tmux window (e.g., daemon registration fails), orphaned resources are now cleaned up.
  • Record task history on worker rm: Manual worker removal via worker rm now records task history, matching automatic cleanup behavior.

Context

This addresses the P0 roadmap item "Reliable worker lifecycle - Workers should start, complete, and clean up without manual intervention." The audit identified four gaps in the worker lifecycle that could require manual intervention:

  1. checkAgentHealth() skipped PID=0 workers entirely (line 338: if agent.PID > 0)
  2. Dead worker processes were logged but not cleaned up if tmux window still existed
  3. createWorker() had no rollback - partial failures left orphaned tmux windows and worktrees
  4. handleRemoveAgent() didn't call recordTaskHistory() unlike cleanupDeadAgents()

Test plan

  • New test: TestCheckAgentHealthPIDZeroStaleWorker - verifies PID=0 grace period logic
  • New test: TestHandleRemoveAgentRecordsTaskHistory - verifies task history is recorded on manual removal
  • All existing tests pass (go test ./... - 27 packages)
  • Build succeeds (go build ./cmd/multiclaude)
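
The behavior that TestHandleRemoveAgentRecordsTaskHistory is described as verifying can be illustrated with a small in-memory model. Everything here is a hypothetical stand-in (`State`, `TaskRecord`, `removeAgent`, the `"removed"` status); the real handler and history format may differ.

```go
package main

import "fmt"

// TaskRecord is a hypothetical history entry written when a worker goes away.
type TaskRecord struct {
	Worker string
	Task   string
	Status string
}

// State is a hypothetical in-memory stand-in for the daemon's state.
type State struct {
	Agents  map[string]string // worker name -> task description
	History []TaskRecord
}

// removeAgent records task history before deleting the worker, so manual
// removal and automatic cleanup leave the same trail.
func (s *State) removeAgent(name, status string) bool {
	task, ok := s.Agents[name]
	if !ok {
		return false
	}
	s.History = append(s.History, TaskRecord{Worker: name, Task: task, Status: status})
	delete(s.Agents, name)
	return true
}

func main() {
	s := &State{Agents: map[string]string{"jolly-koala": "improve worker lifecycle"}}
	s.removeAgent("jolly-koala", "removed")
	fmt.Printf("%+v\n", s.History)
}
```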

Opportunities (not implemented)

  • Add retry backoff for persistent agent auto-restart to avoid spam on repeated failures
  • Improve tmux session restoration to not nuke all agents on transient failure
  • Add a health check diagnostic endpoint showing per-agent status

🤖 Generated with Claude Code

Four targeted fixes to the worker start/complete/cleanup flow:

1. Detect stale workers with PID=0: Workers where Claude failed to start
   (PID=0) were never detected as dead by the health check. Now marks them
   for cleanup after a 2-minute grace period.

2. Clean up transient agents with dead processes: When a worker's process
   dies but its tmux window persists, the health check now marks non-persistent
   agents for cleanup instead of just logging a warning.

3. Rollback on worker creation failure: If createWorker() fails after creating
   resources (worktree, tmux window), those resources are now cleaned up instead
   of being left orphaned.

4. Record task history on manual worker removal: `worker rm` now records task
   history before removing, matching the behavior of automatic cleanup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

whitmo commented Mar 7, 2026

Triage Review

Priority: P0 (Reliable worker lifecycle - roadmap item)
Build: Pass
Tests: Pass (new tests: TestCheckAgentHealthPIDZeroStaleWorker, TestHandleRemoveAgentRecordsTaskHistory)
Merge conflicts: Conflicts with PR #364 in internal/cli/cli.go (both modify createWorker)
Roadmap alignment: Directly addresses P0 "Reliable worker lifecycle" item

Changes:

  • Detects stale PID=0 workers after 2-min grace period
  • Cleans up dead worker processes
  • Adds rollback on worker creation failure
  • Records task history on manual worker rm

Recommendation: Merge before #364. Solid fix with good test coverage.


whitmo commented Mar 12, 2026

Local CI Verification (2026-03-12)

| Check | Result |
| --- | --- |
| go build | PASS |
| go vet | PASS |
| go test ./... | PASS (all 22 packages) |

CI Status: No GitHub Actions checks are running — this is expected for first-time fork PRs. GitHub requires a maintainer to approve workflow runs for PRs from forks.

Branch is rebased on upstream/main (0 commits behind). Ready for maintainer review and CI approval.
