Runner pod hangs indefinitely on memory limit instead of OOMKilling (cgroup v2 file-backed page thrashing) #4436
Description
Summary
When a workflow step exhausts the runner pod's limits.memory, the pod hangs indefinitely instead of being OOMKilled. The job only terminates when timeout-minutes expires. Tested on EKS with cgroup v2; should affect any cgroup v2 cluster running ARC in dind mode.
Environment
- ARC chart: gha-runner-scale-set v0.9.2 (`containerMode: dind`)
- Kubernetes: v1.34.4-eks
- Kernel: 6.12.73 (AL2023)
- cgroup version: v2 (`memory.max` enforced by kubelet)
- Runner image: `ghcr.io/actions/actions-runner:2.323.0` (.NET 8 runtime)
Steps to Reproduce
```yaml
jobs:
  memory-hog:
    runs-on: <your-runner-set>
    timeout-minutes: 10
    steps:
      - run: |
          python3 -c "
          import mmap
          blocks = []
          while True:
              m = mmap.mmap(-1, 100 * 1024 * 1024)
              m.write(b'\x01' * (100 * 1024 * 1024))
              blocks.append(m)
          "
```

Expected: Pod exits with code 137 (SIGKILL/OOM), job marked failed.

Actual: Pod memory pegs at the limit, the process stalls, and the job hangs until `timeout-minutes` fires.
Root Cause
The pod live-locks in cgroup v2 direct reclaim because Runner.Listener's .NET runtime provides ~200 MiB of file-backed pages (JIT-compiled assemblies, shared libraries) that the kernel can always reclaim and re-fault, producing an infinite reclaim loop.
How cgroup v2 OOM works
- Allocations push the container past `memory.max`
- The kernel invokes direct reclaim on the cgroup
- If reclaim frees enough pages → the allocation succeeds and the loop continues
- If nothing can be reclaimed → the OOM killer fires and the container exits 137
Why it does not fire here
The runner container has two classes of memory consumers:
- Runner.Listener (.NET 8 runtime): ~200 MiB of file-backed mappings — JIT-compiled assemblies, shared libraries, .NET runtime code pages. These are clean, evictable pages the kernel can always drop and re-fault from disk.
- User step process (python3, etc.): anonymous heap — the actual allocation source.
When the user process pushes past memory.max, the kernel reclaims Runner.Listener's file-backed pages. Runner.Listener re-faults them on next access. The kernel reclaims them again. This loop repeats indefinitely — the kernel always has reclaimable pages, so the OOM killer condition is never met.
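The distinction between the two page classes can be illustrated with `mmap` (a minimal standalone sketch, not ARC code; the file path and sizes are arbitrary):

```python
import mmap
import os
import tempfile

# File-backed mapping: pages are clean and evictable. Under pressure the
# kernel can drop them and re-fault them from disk on the next access --
# the same class of memory as Runner.Listener's JIT assemblies and shared
# libraries, and the fuel for the reclaim loop described above.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"\x01" * (4 * 1024 * 1024))        # 4 MiB backing file
    file_backed = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
    assert file_backed[0] == 1                       # touching may re-fault from disk

    # Anonymous mapping: dirty heap pages with no backing store -- the class
    # of memory the user step allocates. With swap disabled (the Kubernetes
    # default) these cannot be reclaimed at all.
    anonymous = mmap.mmap(-1, 4 * 1024 * 1024)
    anonymous.write(b"\x02" * (4 * 1024 * 1024))
    assert anonymous[0] == 2
finally:
    os.close(fd)
    os.unlink(path)
```

The asymmetry is the whole bug: as long as `file_backed`-style pages exist, direct reclaim always "succeeds", so the kernel never concludes that the cgroup is out of reclaimable memory.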
This is a known phenomenon in Kubernetes sig-node, described in KEP-2570 (MemoryQoS) as the `memory.high` livelock. It blocked MemoryQoS promotion to Beta in v1.28, and the feature remains Alpha.
Isolation test
Tested on the same EKS cluster (kernel 6.12.73, cgroup v2), both with a 3 GiB memory limit:
| Container | Contents | Result |
|---|---|---|
| `python:3-slim` pod | python3 + sh only (~50 MiB file-backed) | OOMKilled at 2900 MiB ✅ |
| ARC runner pod | Runner.Listener + dind + python3 (~200 MiB file-backed) | Hangs indefinitely ❌ |
The only difference is Runner.Listener's .NET runtime providing a sustained supply of reclaimable file-backed pages. This also explains why plain Docker containers OOMKill correctly — they only run python3 + sh (~50 MiB file-backed), which exhausts quickly.
cgroup v2 memory.events from the hung runner pod
```
max 112847            <- kernel invoked reclaim 112k times
oom 0                 <- OOM condition never declared
oom_kill 0            <- OOM killer never fired
oom_group_kill 0
```
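For a watchdog (or anyone debugging this), the counters are trivial to parse; a sketch with the values captured above (the helper name is mine, not from any ARC tooling):

```python
def parse_memory_events(text: str) -> dict:
    """Parse a cgroup v2 memory.events file ("<event> <count>" lines)."""
    events = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            events[key] = int(value.strip())
    return events

# The counters read from the hung runner pod:
hung = parse_memory_events(
    "max 112847\noom 0\noom_kill 0\noom_group_kill 0\n"
)
# A healthy OOM kill would show oom_kill >= 1. Here reclaim was invoked
# 112k times without the OOM condition ever being declared.
assert hung["max"] == 112847
assert hung["oom_kill"] == 0
```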
Mitigations tested (all failed)
| # | Strategy | Result | Why |
|---|---|---|---|
| 1 | `memory.high` = 90% of limit | Hang | Thrashing shifts to the high boundary (high=125,632, max=0, oom_kill=0) |
| 2 | `mlockall(MCL_CURRENT\|MCL_FUTURE)` on Runner.Listener via `LD_PRELOAD` | Hang | Locking Listener pages shifts thrashing to python3/bash's ~50 MiB of file pages. Locking ALL processes breaks Runner.Worker (ENOMEM). |
| 3 | `oom_score_adj=1000` on Runner.Listener | Hang | OOM killer never invoked; `oom_score_adj` only affects victim selection, not the trigger. Runner.Listener had the highest possible score (1334). |
| 4 | `memory.oom.group=1` on container + pod cgroup | Hang | Only affects what happens when OOM fires, not whether it fires. Set from the host via a privileged pod (the cgroup is read-only inside the container). |
| 5 | `ulimit -v` / `RLIMIT_AS` | Runner killed | The .NET runtime reserves ~15 GB of virtual address space for the GC heap, so any practical limit kills the runner itself. |
| 6 | PSI-based sidecar watchdog | PSI stays 0.00 | The kernel considers each reclaim "successful", so PSI reports no stalls; the thrashing is invisible to PSI. |
PSI sidecar experiment detail
Deployed a 64 MiB Alpine sidecar in the runner pod with a read-only hostPath mount to `/sys/fs/cgroup`. The sidecar locates the runner container's cgroup on the host and polls `memory.current` + `memory.events` every 2 seconds.
The sidecar stayed fully responsive while the runner was completely hung — proving the sidecar approach works. But PSI is the wrong signal:
```
mem=540Mi  psi_full=0.00 max=0    oom_kill=0
mem=1049Mi psi_full=0.00 max=0    oom_kill=0
mem=2876Mi psi_full=0.00 max=0    oom_kill=0
mem=3071Mi psi_full=0.00 max=4325 oom_kill=0   <- hit limit, thrashing begins
mem=3071Mi psi_full=0.00 max=5736 oom_kill=0
mem=2100Mi psi_full=0.00 max=6594 oom_kill=0   <- kernel reclaiming faster than stress
mem=1267Mi psi_full=0.00 max=6594 oom_kill=0   <- pod reaped
```
PSI full avg10 stayed at 0.00 the entire time. The kernel considers each reclaim "successful" (pages were freed), so no task registers as stalled. The thrashing is invisible to PSI.
A simpler trigger works: `memory.current` > 95% of `memory.max` AND the `max` counter in `memory.events` > 0.
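That trigger is a two-line predicate (function name and the 95% default are illustrative; the thresholds come from the text above):

```python
def should_cancel(memory_current: int, memory_max: int,
                  max_events: int, threshold: float = 0.95) -> bool:
    """True when the runner cgroup is pegged near its limit AND the kernel
    has already hit memory.max at least once (memory.events 'max' counter).
    The second condition distinguishes thrashing from ordinary high usage."""
    return memory_current > threshold * memory_max and max_events > 0

GIB = 1024 ** 3
# Half the 3 GiB limit, no reclaim events yet: leave the run alone.
assert not should_cancel(int(0.5 * GIB), 3 * GIB, 0)
# Pegged at the limit with 4325 reclaim events (as in the log above): cancel.
assert should_cancel(3 * GIB - 1, 3 * GIB, 4325)
```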
Current workaround
Users must set timeout-minutes on every job. Without it, a memory-exhausted runner hangs until the 6-hour GitHub Actions default.
Possible fixes
- Sidecar watchdog using a memory threshold (not PSI): monitor `memory.current` from a sidecar (separate cgroup), cancel the run and delete the EphemeralRunner CR when the threshold is exceeded. Proven to work; the sidecar stayed responsive during the hang. Requires `actions:write` on the GitHub App to cancel runs.
- KEP-5507: Per-container `memory.oom.group` in the pod spec (still a proposal).
- ARC init container setting `memory.oom.group=1` in the runner cgroup at startup (via `CAP_SYS_ADMIN`). Only helps if OOM fires; does not address the thrashing case where OOM never fires.
Related Issues
- ARC #4020: Steps get stuck when WebSocket drops after 5 min of no log output (thrashing produces zero output, triggering the idle timeout)
- ARC #4155: EphemeralRunner pods stuck Running after OOMKill (zombie runners)
- runner-container-hooks #228: Kubernetes mode terminates early or not at all
- kubernetes/enhancements #2570: MemoryQoS blocked by this exact livelock
- KEP-5507: Per-container OOM mode controls (proposal)