fix: improve worker lifecycle reliability #360
Open
whitmo wants to merge 1 commit into dlorenc:main from
Four targeted fixes to the worker start/complete/cleanup flow:

1. Detect stale workers with PID=0: Workers where Claude failed to start (PID=0) were never detected as dead by the health check. The health check now marks them for cleanup after a 2-minute grace period.
2. Clean up transient agents with dead processes: When a worker's process dies but its tmux window persists, the health check now marks non-persistent agents for cleanup instead of just logging a warning.
3. Rollback on worker creation failure: If `createWorker()` fails after creating resources (worktree, tmux window), those resources are now cleaned up instead of being left orphaned.
4. Record task history on manual worker removal: `worker rm` now records task history before removing, matching the behavior of automatic cleanup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
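To make fixes 1 and 2 concrete, here is a minimal sketch of the cleanup decision they describe. It is not the code from this PR: the `Agent` struct, the `shouldCleanup` helper, and the `pidZeroGracePeriod` constant are hypothetical names, and the sketch assumes the PID=0 rule applies only to non-persistent agents.

```go
package main

import (
	"fmt"
	"time"
)

// Agent is a hypothetical, simplified stand-in for the daemon's agent record.
type Agent struct {
	Name       string
	PID        int       // 0 means the process never started
	CreatedAt  time.Time // when the worker was registered
	Persistent bool      // persistent agents are never auto-cleaned in this sketch
}

// pidZeroGracePeriod mirrors the 2-minute grace period from the description.
const pidZeroGracePeriod = 2 * time.Minute

// shouldCleanup reports whether a health check should mark the agent for cleanup.
// processAlive is the result of a liveness probe for a.PID; it is ignored when PID is 0.
func shouldCleanup(a Agent, processAlive bool, now time.Time) bool {
	if a.Persistent {
		// Assumption: only transient (non-persistent) agents are cleaned up automatically.
		return false
	}
	if a.PID == 0 {
		// The process never started: give it the grace period, then treat it as stale.
		return now.Sub(a.CreatedAt) > pidZeroGracePeriod
	}
	// The process started at some point: clean up once it is no longer running,
	// even if its tmux window is still open.
	return !processAlive
}

func main() {
	start := time.Date(2026, 3, 12, 10, 0, 0, 0, time.UTC)
	stale := Agent{Name: "worker-a", PID: 0, CreatedAt: start}

	// Inside the grace period the worker is left alone.
	fmt.Println(shouldCleanup(stale, false, start.Add(30*time.Second)))
	// After the grace period it is marked for cleanup.
	fmt.Println(shouldCleanup(stale, false, start.Add(3*time.Minute)))
}
```

Keeping the decision in a pure function that takes `now` as a parameter makes the grace-period logic testable without sleeping or faking the clock.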
Author
Triage Review
Priority: P0 (Reliable worker lifecycle - roadmap item)
Changes:
Recommendation: Merge before #364. Solid fix with good test coverage.
This was referenced Mar 7, 2026
Author
Local CI Verification (2026-03-12)
CI Status: No GitHub Actions checks are running; this is expected for first-time fork PRs, because GitHub requires a maintainer to approve workflow runs for PRs from forks. Branch is rebased on
Summary
- If `createWorker()` fails after creating worktree/tmux window (e.g., daemon registration fails), orphaned resources are now cleaned up.
- `worker rm`: Manual worker removal via `worker rm` now records task history, matching automatic cleanup behavior.

Context
This addresses the P0 roadmap item "Reliable worker lifecycle - Workers should start, complete, and clean up without manual intervention." The audit identified four gaps in the worker lifecycle that could require manual intervention:
- `checkAgentHealth()` skipped PID=0 workers entirely (line 338: `if agent.PID > 0`)
- `createWorker()` had no rollback - partial failures left orphaned tmux windows and worktrees
- `handleRemoveAgent()` didn't call `recordTaskHistory()`, unlike `cleanupDeadAgents()`

Test plan
- `TestCheckAgentHealthPIDZeroStaleWorker` - verifies PID=0 grace period logic
- `TestHandleRemoveAgentRecordsTaskHistory` - verifies task history is recorded on manual removal
- `go test ./...` (27 packages)
- `go build ./cmd/multiclaude`

Opportunities (not implemented)
🤖 Generated with Claude Code
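For readers who want a picture of the `createWorker()` rollback described in the summary, one common Go pattern is a stack of undo steps that runs only when the function returns an error. The sketch below is illustrative and not taken from this PR: the `rollback` type, the `register` callback, the `errRegistration` value, and the string trace standing in for real worktree and tmux operations are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errRegistration is a hypothetical error standing in for a daemon registration failure.
var errRegistration = errors.New("daemon registration failed")

// rollback collects undo steps and runs them in reverse order of registration.
type rollback struct{ steps []func() }

func (r *rollback) add(step func()) { r.steps = append(r.steps, step) }

func (r *rollback) run() {
	for i := len(r.steps) - 1; i >= 0; i-- {
		r.steps[i]()
	}
}

// createWorker is a stand-in for the real function. Each resource it creates
// registers an undo step; the deferred check undoes everything on failure.
// trace records what happened so the behavior can be inspected.
func createWorker(name string, register func(string) error, trace *[]string) (err error) {
	var rb rollback
	defer func() {
		if err != nil {
			rb.run() // only roll back when the function is returning an error
		}
	}()

	*trace = append(*trace, "create worktree")
	rb.add(func() { *trace = append(*trace, "remove worktree") })

	*trace = append(*trace, "create tmux window")
	rb.add(func() { *trace = append(*trace, "kill tmux window") })

	if err = register(name); err != nil {
		return fmt.Errorf("register worker %q: %w", name, err)
	}
	return nil
}

func main() {
	var trace []string
	err := createWorker("worker-a", func(string) error { return errRegistration }, &trace)
	fmt.Println("error:", err)
	for _, entry := range trace {
		fmt.Println(" ", entry)
	}
}
```

Undoing in reverse order means the tmux window is removed before the worktree it was started in, which is usually the safer order when later resources depend on earlier ones.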