Description
Summary
We encountered a failure where rancher-system-agent replaced itself with a partially written / corrupted binary during normal Rancher-managed upgrade activity.
The corrupted binary segfaults on startup (SIGSEGV), leaving the node permanently stuck in Rancher with:
“waiting for plan … to be applied”
Manual replacement of the agent binary was required to recover.
Impact
- Node cannot apply upgrade plans
- Rancher UI provides no actionable error
- Agent cannot self-repair once corrupted
- Requires out-of-band/manual node access to fix
- Breaks later Kubernetes upgrades even if corruption happened earlier
Environment
- Rancher Server upgraded from 2.12.2 → 2.13.1
- Managed RKE2 cluster
- Kubernetes upgrade later attempted: 1.32.10 → 1.32.11
- OS: Ubuntu 22.04
Observed Behavior
- Downstream RKE2 updated from 1.32.10 → 1.32.11
- Some nodes had different file sizes and SHA256 hashes for /usr/local/bin/rancher-system-agent (a quick check for this is sketched after this list)
- Affected nodes showed immediate startup crashes:
  rancher-system-agent.service: Main process exited, code=killed, status=11/SEGV
- Rancher UI showed the node stuck indefinitely waiting for a plan to be applied
- Copying a known-good rancher-system-agent binary from another node and restarting the service immediately resolved the issue
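For anyone checking their own nodes, here is a minimal Go sketch (not Rancher tooling, just an illustration) that prints the size and SHA256 of the installed agent binary so output from several nodes can be diffed directly. The path is the default install location referenced above and may differ on other setups.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

func main() {
	// Default install location of the agent binary on the affected nodes;
	// adjust if your installation uses a different path.
	const path = "/usr/local/bin/rancher-system-agent"

	f, err := os.Open(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		os.Exit(1)
	}
	defer f.Close()

	h := sha256.New()
	n, err := io.Copy(h, f)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read:", err)
		os.Exit(1)
	}

	// Print size and digest on one line so per-node outputs can be compared.
	fmt.Printf("%s  size=%d  sha256=%x\n", path, n, h.Sum(nil))
}
```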
Binary modification timing (service stop, not reboot)
The corrupted agent binary was modified within seconds of the service being stopped by systemd.
Service log
Jan 21 14:56:04 systemd[1]: Stopping Rancher System Agent...
Jan 21 14:56:04 rancher-system-agent: signal received: "terminated"
Jan 21 14:56:04 systemd[1]: rancher-system-agent.service: Deactivated successfully.
The node was rebooted later. On the next start of the service, the corrupted binary began crashing immediately:
-- Boot <id> --
Jan 22 15:58:37 k8s029 kernel: Linux version 5.15.0-164-generic
Jan 22 15:58:53 systemd[1]: Started Rancher System Agent.
Jan 22 15:58:53 systemd[1]: rancher-system-agent.service: Main process exited, code=killed, status=11/SEGV
Jan 22 15:58:53 systemd[1]: rancher-system-agent.service: Failed with result 'signal'.
Jan 22 15:58:58 systemd[1]: rancher-system-agent.service: Scheduled restart job, restart counter is at 1.
Jan 22 15:58:58 systemd[1]: Stopped Rancher System Agent.
Jan 22 15:58:58 systemd[1]: Started Rancher System Agent.
Jan 22 15:58:58 systemd[1]: rancher-system-agent.service: Main process exited, code=killed, status=11/SEGV
Corrupted binary metadata
File: rancher-system-agent.bak
Size: 42037248 bytes
Modify: 2026-01-21 14:56:07 -0500
Birth: 2025-01-16 09:46:58 -0500
The modification timestamp is ~3 seconds after the agent was terminated, strongly suggesting the binary was being updated or replaced when the service was stopped, leaving a partially written executable in place.
Delayed Failure Manifestation
The agent did not immediately fail after corruption.
- The corrupted binary existed on disk after the Rancher server upgrade
- The node appeared healthy until a later Kubernetes upgrade (1.32.10 → 1.32.11)
- During that upgrade the node could not drain cleanly (hung task) and was rebooted
- After the reboot the agent was started again, at which point it began crashing with SIGSEGV
Why This Appears to Be a Bug
- The agent replaces itself in-place
- There appears to be no checksum or content-length validation before installing the new binary
- The update is not atomic (no temp file + rename); a sketch of this pattern follows the list
- There is no rollback or last-known-good fallback
- A normal service stop during update can leave the node unrecoverable
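For illustration only (this is not the agent's actual code), the pattern described above behaves roughly like the following Go sketch: the working binary is truncated before the replacement bytes are fully written, so a service stop mid-copy leaves a partial executable and nothing to fall back to.

```go
package update

import (
	"io"
	"os"
)

// unsafeSelfUpdate illustrates a non-atomic in-place replacement.
// If the process receives SIGTERM part-way through io.Copy (as in the
// service log above), dst is left partially written and the original,
// working binary is already gone.
func unsafeSelfUpdate(dst string, newBinary io.Reader) error {
	// O_TRUNC discards the old binary before the new one is fully written.
	f, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o755)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, newBinary) // interruption here => corrupted executable
	return err
}
```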
Expected Behavior
One or more of the following would prevent this class of failure (a sketch of the first three follows the list):
- Download agent binary to a temporary file
- Verify checksum and/or size before replacement
- Use atomic rename to replace the binary
- Keep a last-known-good binary and fall back on startup failure
- Refuse to start if binary integrity checks fail, with a clear error
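A minimal Go sketch of the first three items, assuming the published size and SHA256 are available to the updater (wantSize and wantSHA256 are stand-ins here, not actual Rancher parameters): stage the download next to the target, verify it, fsync, then swap it in with an atomic rename.

```go
package update

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// atomicReplace stages the downloaded binary in the target's directory,
// verifies length and SHA256 against expected values, syncs it to disk,
// and only then makes it visible via an atomic rename.
func atomicReplace(target string, download io.Reader, wantSize int64, wantSHA256 string) error {
	dir := filepath.Dir(target)

	// Stage in the same directory so the final rename stays on one filesystem.
	tmp, err := os.CreateTemp(dir, ".rancher-system-agent-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly after a successful rename

	h := sha256.New()
	n, err := io.Copy(io.MultiWriter(tmp, h), download)
	if err != nil {
		tmp.Close()
		return err
	}

	// Refuse to install anything that is short or has the wrong digest.
	if n != wantSize {
		tmp.Close()
		return fmt.Errorf("size mismatch: got %d, want %d", n, wantSize)
	}
	if got := hex.EncodeToString(h.Sum(nil)); got != wantSHA256 {
		tmp.Close()
		return fmt.Errorf("sha256 mismatch: got %s, want %s", got, wantSHA256)
	}

	// Make sure the bytes are on disk before the new name becomes visible.
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(0o755); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}

	// rename(2) is atomic: readers see either the old binary or the new one.
	return os.Rename(tmp.Name(), target)
}
```

Because rename(2) within one filesystem is atomic, a service stop at any point leaves either the old binary or a fully verified new one in place, never a partial write.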
Workaround
Copied a known-good rancher-system-agent binary from a working node over the truncated binary on the broken node and restarted the service
Additional Notes
- The RKE2 installer tarball itself was checksum-verified successfully
- The corruption appears limited to the rancher-system-agent self-update path