
rancher-system-agent self-update can leave truncated binary if interrupted, causing SIGSEGV and stuck upgrades #252

@e100

Description


Summary

We encountered a failure where rancher-system-agent replaced itself with a partially written / corrupted binary during normal Rancher-managed upgrade activity.

The corrupted binary segfaults on startup (SIGSEGV), leaving the node permanently stuck in Rancher with:

“waiting for plan … to be applied”

Manual replacement of the agent binary was required to recover.

Impact

  • Node cannot apply upgrade plans
  • Rancher UI provides no actionable error
  • Agent cannot self-repair once corrupted
  • Requires out-of-band/manual node access to fix
  • Breaks later Kubernetes upgrades even if corruption happened earlier

Environment

  • Rancher Server upgraded from 2.12.2 → 2.13.1
  • Managed RKE2 cluster
  • Kubernetes upgrade later attempted: 1.32.10 → 1.32.11
  • OS: Ubuntu 22.04

Observed Behavior

  • Downstream RKE2 updated from 1.32.10 → 1.32.11
  • Some nodes had different file sizes and SHA256 hashes for /usr/local/bin/rancher-system-agent
  • Affected nodes showed immediate startup crashes:
    rancher-system-agent.service: Main process exited, code=killed, status=11/SEGV
  • Rancher UI showed the node stuck indefinitely waiting for a plan to be applied
  • Copying a known-good rancher-system-agent binary from another node and restarting the service immediately resolved the issue

Binary modification timing (service stop, not reboot)

The corrupted agent binary was modified within seconds of the service being stopped by systemd.

Service log

Jan 21 14:56:04 systemd[1]: Stopping Rancher System Agent...
Jan 21 14:56:04 rancher-system-agent: signal received: "terminated"
Jan 21 14:56:04 systemd[1]: rancher-system-agent.service: Deactivated successfully.

The node was rebooted later. On the next start of the service, the corrupted binary began crashing immediately:

-- Boot <id> --
Jan 22 15:58:37 k8s029 kernel: Linux version 5.15.0-164-generic
Jan 22 15:58:53 systemd[1]: Started Rancher System Agent.
Jan 22 15:58:53 systemd[1]: rancher-system-agent.service: Main process exited, code=killed, status=11/SEGV
Jan 22 15:58:53 systemd[1]: rancher-system-agent.service: Failed with result 'signal'.
Jan 22 15:58:58 systemd[1]: rancher-system-agent.service: Scheduled restart job, restart counter is at 1.
Jan 22 15:58:58 systemd[1]: Stopped Rancher System Agent.
Jan 22 15:58:58 systemd[1]: Started Rancher System Agent.
Jan 22 15:58:58 systemd[1]: rancher-system-agent.service: Main process exited, code=killed, status=11/SEGV

Corrupted binary metadata

File: rancher-system-agent.bak
Size: 42037248 bytes
Modify: 2026-01-21 14:56:07 -0500
Birth: 2025-01-16 09:46:58 -0500

The modification timestamp is ~3 seconds after the agent was terminated, strongly suggesting the binary was being updated or replaced when the service was stopped, leaving a partially written executable in place.

Delayed Failure Manifestation

The agent did not immediately fail after corruption.

  • The corrupted binary existed on disk after the Rancher server upgrade
  • The node appeared healthy until a later Kubernetes upgrade (1.32.10 → 1.32.11)
  • During that upgrade the node could not drain cleanly (hung task) and was rebooted.
  • After the reboot the agent started and immediately began crashing with SIGSEGV

Why This Appears to Be a Bug

  • The agent replaces itself in-place
  • There appears to be no checksum or content-length validation before installing the new binary
  • The update is not atomic (no temp file + rename)
  • There is no rollback or last-known-good fallback
  • A normal service stop during update can leave the node unrecoverable

Expected Behavior

One or more of the following would prevent this class of failure:

  • Download agent binary to a temporary file
  • Verify checksum and/or size before replacement
  • Use atomic rename to replace the binary
  • Keep a last-known-good binary and fall back on startup failure
  • Refuse to start if binary integrity checks fail, with a clear error

Workaround

Copied a known-good rancher-system-agent binary from a working node over the truncated binary on the broken node and restarted the service.

Additional Notes

  • The RKE2 installer tarball itself was checksum-verified successfully
  • The corruption appears limited to the rancher-system-agent self-update path
