Description
Summary
We encountered a failure where rancher-system-agent replaced itself with a partially written / corrupted binary during normal Rancher-managed upgrade activity.
The corrupted binary segfaults on startup (SIGSEGV), leaving the node permanently stuck in Rancher with:
“waiting for plan … to be applied”
Manual replacement of the agent binary was required to recover.
Impact
- Node cannot apply upgrade plans
- Rancher UI provides no actionable error
- Agent cannot self-repair once corrupted
- Requires out-of-band/manual node access to fix
- Breaks later Kubernetes upgrades even if corruption happened earlier
Environment
- Rancher Server upgraded from 2.12.2 → 2.13.1
- Managed RKE2 cluster
- Kubernetes upgrade later attempted: 1.32.10 → 1.32.11
- OS: Ubuntu 22.04
Observed Behavior
- Downstream RKE2 updated from 1.32.10 → 1.32.11
- Some nodes had different file sizes and SHA256 hashes for /usr/local/bin/rancher-system-agent (a quick check for this is sketched after this list)
- Affected nodes showed immediate startup crashes:
  rancher-system-agent.service: Main process exited, code=killed, status=11/SEGV
- Rancher UI showed the node stuck indefinitely waiting for a plan to be applied
- Copying a known-good rancher-system-agent binary from another node and restarting the service immediately resolved the issue
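For anyone checking their own nodes, here is a minimal Go sketch (not Rancher tooling, just an illustration) that prints the size and SHA256 of the installed agent binary so output from several nodes can be diffed directly. The path is the default install location referenced above and may differ on other setups.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

func main() {
	// Default install location of the agent binary on the affected nodes;
	// adjust if your installation uses a different path.
	const path = "/usr/local/bin/rancher-system-agent"

	f, err := os.Open(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		os.Exit(1)
	}
	defer f.Close()

	h := sha256.New()
	n, err := io.Copy(h, f)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read:", err)
		os.Exit(1)
	}

	// Print size and digest on one line so per-node outputs can be compared.
	fmt.Printf("%s  size=%d  sha256=%x\n", path, n, h.Sum(nil))
}
```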
Binary modification timing (service stop, not reboot)
The corrupted agent binary was modified within seconds of the service being stopped by systemd.
Service log
Jan 21 14:56:04 systemd[1]: Stopping Rancher System Agent...
Jan 21 14:56:04 rancher-system-agent: signal received: "terminated"
Jan 21 14:56:04 systemd[1]: rancher-system-agent.service: Deactivated successfully.
The node was rebooted later. On the next start of the service, the corrupted binary began crashing immediately:
-- Boot <id> --
Jan 22 15:58:37 k8s029 kernel: Linux version 5.15.0-164-generic
Jan 22 15:58:53 systemd[1]: Started Rancher System Agent.
Jan 22 15:58:53 systemd[1]: rancher-system-agent.service: Main process exited, code=killed, status=11/SEGV
Jan 22 15:58:53 systemd[1]: rancher-system-agent.service: Failed with result 'signal'.
Jan 22 15:58:58 systemd[1]: rancher-system-agent.service: Scheduled restart job, restart counter is at 1.
Jan 22 15:58:58 systemd[1]: Stopped Rancher System Agent.
Jan 22 15:58:58 systemd[1]: Started Rancher System Agent.
Jan 22 15:58:58 systemd[1]: rancher-system-agent.service: Main process exited, code=killed, status=11/SEGV
Corrupted binary metadata
File: rancher-system-agent.bak
Size: 42037248 bytes
Modify: 2026-01-21 14:56:07 -0500
Birth: 2025-01-16 09:46:58 -0500
The modification timestamp is ~3 seconds after the agent was terminated, strongly suggesting the binary was being updated or replaced when the service was stopped, leaving a partially written executable in place.
Delayed Failure Manifestation
The agent did not immediately fail after corruption.
- The corrupted binary existed on disk after the Rancher server upgrade
- The node appeared healthy until a later Kubernetes upgrade (1.32.10 → 1.32.11)
- During that upgrade the node could not drain cleanly (hung task) and was rebooted
- After the reboot the agent was started again, at which point it began crashing with SIGSEGV
Why This Appears to Be a Bug
- The agent replaces itself in-place
- There appears to be no checksum or content-length validation before installing the new binary
- The update is not atomic (no temp file + rename); a sketch of this pattern follows the list
- There is no rollback or last-known-good fallback
- A normal service stop during update can leave the node unrecoverable
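For illustration only (this is not the agent's actual code), the pattern described above behaves roughly like the following Go sketch: the working binary is truncated before the replacement bytes are fully written, so a service stop mid-copy leaves a partial executable and nothing to fall back to.

```go
package update

import (
	"io"
	"os"
)

// unsafeSelfUpdate illustrates a non-atomic in-place replacement.
// If the process receives SIGTERM part-way through io.Copy (as in the
// service log above), dst is left partially written and the original,
// working binary is already gone.
func unsafeSelfUpdate(dst string, newBinary io.Reader) error {
	// O_TRUNC discards the old binary before the new one is fully written.
	f, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o755)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, newBinary) // interruption here => corrupted executable
	return err
}
```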
Expected Behavior
One or more of the following would prevent this class of failure (a sketch of the first three follows the list):
- Download agent binary to a temporary file
- Verify checksum and/or size before replacement
- Use atomic rename to replace the binary
- Keep a last-known-good binary and fall back on startup failure
- Refuse to start if binary integrity checks fail, with a clear error
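A minimal Go sketch of the first three items, assuming the published size and SHA256 are available to the updater (wantSize and wantSHA256 are stand-ins here, not actual Rancher parameters): stage the download next to the target, verify it, fsync, then swap it in with an atomic rename.

```go
package update

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// atomicReplace stages the downloaded binary in the target's directory,
// verifies length and SHA256 against expected values, syncs it to disk,
// and only then makes it visible via an atomic rename.
func atomicReplace(target string, download io.Reader, wantSize int64, wantSHA256 string) error {
	dir := filepath.Dir(target)

	// Stage in the same directory so the final rename stays on one filesystem.
	tmp, err := os.CreateTemp(dir, ".rancher-system-agent-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly after a successful rename

	h := sha256.New()
	n, err := io.Copy(io.MultiWriter(tmp, h), download)
	if err != nil {
		tmp.Close()
		return err
	}

	// Refuse to install anything that is short or has the wrong digest.
	if n != wantSize {
		tmp.Close()
		return fmt.Errorf("size mismatch: got %d, want %d", n, wantSize)
	}
	if got := hex.EncodeToString(h.Sum(nil)); got != wantSHA256 {
		tmp.Close()
		return fmt.Errorf("sha256 mismatch: got %s, want %s", got, wantSHA256)
	}

	// Make sure the bytes are on disk before the new name becomes visible.
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(0o755); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}

	// rename(2) is atomic: readers see either the old binary or the new one.
	return os.Rename(tmp.Name(), target)
}
```

Because rename(2) within one filesystem is atomic, a service stop at any point leaves either the old binary or a fully verified new one in place, never a partial write.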
Workaround
Copied a known-good rancher-system-agent binary from a working node over the truncated binary on the broken node and restarted the service
Additional Notes
- The RKE2 installer tarball itself was checksum-verified successfully
- The corruption appears limited to the rancher-system-agent self-update path