Runner pod hangs indefinitely on memory limit instead of OOMKilling (cgroup v2 file-backed page thrashing) #4436
Description
Summary
When a workflow step exhausts the runner pod's limits.memory, the pod hangs indefinitely instead of being OOMKilled. The job only terminates when timeout-minutes expires. Tested on EKS with cgroup v2; should affect any cgroup v2 cluster running ARC in dind mode.
Environment
- ARC chart: gha-runner-scale-set v0.9.2 (`containerMode: dind`)
- Kubernetes: v1.34.4-eks
- Kernel: 6.12.73 (AL2023)
- cgroup version: v2 (`memory.max` enforced by kubelet)
- Runner image: `ghcr.io/actions/actions-runner:2.323.0` (.NET 8 runtime)
Steps to Reproduce
```yaml
jobs:
  memory-hog:
    runs-on: <your-runner-set>
    timeout-minutes: 10
    steps:
      - run: |
          python3 -c "
          import mmap
          blocks = []
          while True:
              m = mmap.mmap(-1, 100 * 1024 * 1024)
              m.write(b'\x01' * (100 * 1024 * 1024))
              blocks.append(m)
          "
```

Expected: Pod exits with code 137 (SIGKILL/OOM), job marked failed.

Actual: Pod memory pegs at the limit, the process stalls, and the job hangs until `timeout-minutes` fires.
Root Cause
The pod live-locks in cgroup v2 direct reclaim because Runner.Listener's .NET runtime provides ~200 MiB of file-backed pages (JIT-compiled assemblies, shared libraries) that the kernel can always reclaim and re-fault, producing an infinite reclaim loop.
How cgroup v2 OOM works
- Allocations push the container past `memory.max`
- The kernel invokes direct reclaim on the cgroup
- If reclaim frees enough pages → the allocation succeeds and the loop continues
- If nothing can be reclaimed → the OOM killer fires and the container exits 137
Why it does not fire here
The runner container has two classes of memory consumers:
- Runner.Listener (.NET 8 runtime): ~200 MiB of file-backed mappings — JIT-compiled assemblies, shared libraries, .NET runtime code pages. These are clean, evictable pages the kernel can always drop and re-fault from disk.
- User step process (python3, etc.): anonymous heap — the actual allocation source.
When the user process pushes past memory.max, the kernel reclaims Runner.Listener's file-backed pages. Runner.Listener re-faults them on next access. The kernel reclaims them again. This loop repeats indefinitely — the kernel always has reclaimable pages, so the OOM killer condition is never met.
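The distinction between the two page classes can be illustrated with `mmap` (a minimal standalone sketch, not ARC code; the file path and sizes are arbitrary):

```python
import mmap
import os
import tempfile

# File-backed mapping: pages are clean and evictable. Under pressure the
# kernel can drop them and re-fault them from disk on the next access --
# the same class of memory as Runner.Listener's JIT assemblies and shared
# libraries, and the fuel for the reclaim loop described above.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"\x01" * (4 * 1024 * 1024))        # 4 MiB backing file
    file_backed = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
    assert file_backed[0] == 1                       # touching may re-fault from disk

    # Anonymous mapping: dirty heap pages with no backing store -- the class
    # of memory the user step allocates. With swap disabled (the Kubernetes
    # default) these cannot be reclaimed at all.
    anonymous = mmap.mmap(-1, 4 * 1024 * 1024)
    anonymous.write(b"\x02" * (4 * 1024 * 1024))
    assert anonymous[0] == 2
finally:
    os.close(fd)
    os.unlink(path)
```

The asymmetry is the whole bug: as long as `file_backed`-style pages exist, direct reclaim always "succeeds", so the kernel never concludes that the cgroup is out of reclaimable memory.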
This is a known phenomenon in Kubernetes sig-node, described in KEP-2570 (MemoryQoS) as the `memory.high` livelock. It blocked MemoryQoS promotion to Beta in v1.28, and the feature remains Alpha.
Isolation test
Tested on the same EKS cluster (kernel 6.12.73, cgroup v2), both with a 3 GiB memory limit:
| Container | Contents | Result |
|---|---|---|
| `python:3-slim` pod | python3 + sh only (~50 MiB file-backed) | OOMKilled at 2900 MiB ✅ |
| ARC runner pod | Runner.Listener + dind + python3 (~200 MiB file-backed) | Hangs indefinitely ❌ |
The only difference is Runner.Listener's .NET runtime providing a sustained supply of reclaimable file-backed pages. This also explains why plain Docker containers OOMKill correctly — they only run python3 + sh (~50 MiB file-backed), which exhausts quickly.
cgroup v2 memory.events from the hung runner pod
```
max 112847            <- kernel invoked reclaim 112k times
oom 0                 <- OOM condition never declared
oom_kill 0            <- OOM killer never fired
oom_group_kill 0
```
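For a watchdog (or anyone debugging this), the counters are trivial to parse; a sketch with the values captured above (the helper name is mine, not from any ARC tooling):

```python
def parse_memory_events(text: str) -> dict:
    """Parse a cgroup v2 memory.events file ("<event> <count>" lines)."""
    events = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            events[key] = int(value.strip())
    return events

# The counters read from the hung runner pod:
hung = parse_memory_events(
    "max 112847\noom 0\noom_kill 0\noom_group_kill 0\n"
)
# A healthy OOM kill would show oom_kill >= 1. Here reclaim was invoked
# 112k times without the OOM condition ever being declared.
assert hung["max"] == 112847
assert hung["oom_kill"] == 0
```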
Mitigations tested (all failed)
| # | Strategy | Result | Why |
|---|---|---|---|
| 1 | `memory.high` = 90% of limit | Hang | Thrashing shifts to the high boundary (high=125,632, max=0, oom_kill=0) |
| 2 | `mlockall(MCL_CURRENT\|MCL_FUTURE)` on Runner.Listener via `LD_PRELOAD` | Hang | Locking Listener pages shifts thrashing to python3/bash's ~50 MiB of file pages. Locking ALL processes breaks Runner.Worker (ENOMEM). |
| 3 | `oom_score_adj=1000` on Runner.Listener | Hang | OOM killer never invoked; `oom_score_adj` only affects victim selection, not the trigger. Runner.Listener had the highest possible score (1334). |
| 4 | `memory.oom.group=1` on container + pod cgroup | Hang | Only affects what happens when OOM fires, not whether it fires. Set from the host via a privileged pod (the cgroup is read-only inside the container). |
| 5 | `ulimit -v` / `RLIMIT_AS` | Runner killed | The .NET runtime reserves ~15 GB of virtual address space for the GC heap, so any practical limit kills the runner itself. |
| 6 | PSI-based sidecar watchdog | PSI stays 0.00 | The kernel considers each reclaim "successful", so PSI reports no stalls; the thrashing is invisible to PSI. |
PSI sidecar experiment detail
Deployed a 64 MiB Alpine sidecar in the runner pod with a read-only hostPath mount to `/sys/fs/cgroup`. The sidecar locates the runner container's cgroup on the host and polls `memory.current` + `memory.events` every 2 seconds.
The sidecar stayed fully responsive while the runner was completely hung — proving the sidecar approach works. But PSI is the wrong signal:
```
mem=540Mi  psi_full=0.00 max=0    oom_kill=0
mem=1049Mi psi_full=0.00 max=0    oom_kill=0
mem=2876Mi psi_full=0.00 max=0    oom_kill=0
mem=3071Mi psi_full=0.00 max=4325 oom_kill=0   <- hit limit, thrashing begins
mem=3071Mi psi_full=0.00 max=5736 oom_kill=0
mem=2100Mi psi_full=0.00 max=6594 oom_kill=0   <- kernel reclaiming faster than stress
mem=1267Mi psi_full=0.00 max=6594 oom_kill=0   <- pod reaped
```
PSI full avg10 stayed at 0.00 the entire time. The kernel considers each reclaim "successful" (pages were freed), so no task registers as stalled. The thrashing is invisible to PSI.
A simpler trigger works: `memory.current` > 95% of `memory.max` AND the `max` counter in `memory.events` > 0.
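That trigger is a two-line predicate (function name and the 95% default are illustrative; the thresholds come from the text above):

```python
def should_cancel(memory_current: int, memory_max: int,
                  max_events: int, threshold: float = 0.95) -> bool:
    """True when the runner cgroup is pegged near its limit AND the kernel
    has already hit memory.max at least once (memory.events 'max' counter).
    The second condition distinguishes thrashing from ordinary high usage."""
    return memory_current > threshold * memory_max and max_events > 0

GIB = 1024 ** 3
# Half the 3 GiB limit, no reclaim events yet: leave the run alone.
assert not should_cancel(int(0.5 * GIB), 3 * GIB, 0)
# Pegged at the limit with 4325 reclaim events (as in the log above): cancel.
assert should_cancel(3 * GIB - 1, 3 * GIB, 4325)
```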
Current workaround
Users must set timeout-minutes on every job. Without it, a memory-exhausted runner hangs until the 6-hour GitHub Actions default.
Possible fixes
- Sidecar watchdog using a memory threshold (not PSI): monitor `memory.current` from a sidecar (separate cgroup), cancel the run and delete the EphemeralRunner CR when the threshold is exceeded. Proven to work; the sidecar stayed responsive during the hang. Requires `actions:write` on the GitHub App to cancel runs.
- KEP-5507: Per-container `memory.oom.group` in the pod spec (still a proposal).
- ARC init container setting `memory.oom.group=1` in the runner cgroup at startup (via `CAP_SYS_ADMIN`). Only helps if OOM fires; does not address the thrashing case where OOM never fires.
Related Issues
- ARC #4020: Steps get stuck when WebSocket drops after 5 min of no log output (thrashing produces zero output, triggering the idle timeout)
- ARC #4155: EphemeralRunner pods stuck Running after OOMKill (zombie runners)
- runner-container-hooks #228: Kubernetes mode terminates early or not at all
- kubernetes/enhancements #2570: MemoryQoS blocked by this exact livelock
- KEP-5507: Per-container OOM mode controls (proposal)