charts/redpanda: opt-in chroot tuning init container for host tuners by david-yu · Pull Request #1522 · redpanda-data/redpanda-operator

david-yu · 2026-05-13T03:02:16Z

Summary

Adds tuning.apply_host_tuners (default false). When enabled, the tuning init container builds a chroot to the host filesystem and runs rpk redpanda tune all inside the host's network namespace — the tuners that previously failed inside the pod sandbox actually apply.

UX: apply_host_tuners: true is the one switch users need. The chart now default-enables tune_disk_irq, tune_disk_scheduler, tune_disk_nomerges, and tune_network in the rendered rpk config when the flag is on, so the tuners the chroot path exists to fix actually run without the user having to mirror them by hand in config.rpk.tune_*.
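As a rough sketch of the defaulting described above (not the chart's actual `Tuning.Translate()` code; `renderTuneFlags` and `hostTunerFlags` are hypothetical names used only for illustration), the behavior amounts to:

```go
package main

import "fmt"

// hostTunerFlags lists the rpk flags the description says are
// default-enabled when apply_host_tuners is on.
var hostTunerFlags = []string{
	"tune_disk_irq",
	"tune_disk_scheduler",
	"tune_disk_nomerges",
	"tune_network",
}

// renderTuneFlags is a hypothetical simplification: it returns the
// tune_* entries added to the rendered rpk section for a given flag value.
func renderTuneFlags(applyHostTuners bool) map[string]bool {
	flags := map[string]bool{}
	if !applyHostTuners {
		return flags // default path: nothing is added
	}
	for _, name := range hostTunerFlags {
		flags[name] = true
	}
	return flags
}

func main() {
	fmt.Println(len(renderTuneFlags(false)), len(renderTuneFlags(true)))
}
```

With the flag off nothing is added, which matches the claim that default rendering is unchanged.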

Closes K8S-101. Productionizes the chroot pattern from CORE-13685 (Stephan Dollberg's experiment).

Why

The default tuning container runs rpk redpanda tune all inside the pod's own namespaces. That works for sysctl-based tuners (aio_events, swappiness, THP) but the disk and net tuners can't see host /sys/block, /proc/sys/net, or host NICs, so they error with:

disk_irq      ERROR: directory '' does not exists
net           ERROR: open /proc/sys/net/core/rps_sock_flow_entries: no such file

…and rpk exits non-zero, crashlooping the init container for any user who sets these flags.

What changes

| File | Change |
|---|---|
| `charts/redpanda/values.go` | New `Tuning.ApplyHostTuners` bool, default `false`. When `true`, `Tuning.Translate()` also sets `tune_disk_irq`/`tune_disk_scheduler`/`tune_disk_nomerges`/`tune_network` to `true` in the rendered `rpk` section so the chroot path actually has tuners to run. Docs cover SCC / PSA requirements, the one-pod-per-node constraint, and the first-arg-wins merge semantics. |
| `charts/redpanda/statefulset.go` | New `statefulSetInitContainerTuningOnHost` path that produces the chroot-mode init container. `StatefulSetVolumes` appends the host bind volumes only when the feature is enabled. |
| `charts/redpanda/testdata/template-cases.txtar` | New `tuning-host-mode` case + golden. Golden updated to include the four default-enabled flags. |
| `.changes/unreleased/charts-redpanda-Added-20260512-host-tuners.yaml` | Changelog entry. |
Generated: `_statefulset.go.tpl`, `_values.go.tpl`, `values.schema.json`, `values_partial.gen.go`, golden txtar.

User-facing config (operator v2)

The chart is plumbed through clusterSpec on the v1alpha2 Redpanda CR. One flag, no follow-on:

apiVersion: cluster.redpanda.com/v1alpha2
kind: Redpanda
spec:
  clusterSpec:
    tuning:
      tune_aio_events: true
      apply_host_tuners: true

That alone fires the four host-mode tuners. The chart's RPK config merge is first-arg-wins, so a per-tuner override in config.rpk.tune_* loses to the chart's value. This is by design. Users who want a specific tuner off should leave apply_host_tuners unset (it defaults to false) and wire host tuning via their own DaemonSet.
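To illustrate why the override loses, here is a minimal sketch of a flat first-arg-wins merge. `mergeFirstWins` is a hypothetical stand-in, not the chart's actual merge helper, and it assumes the chart's values are passed as the first argument, as the paragraph above implies:

```go
package main

import "fmt"

// mergeFirstWins is a hypothetical, non-recursive first-arg-wins merge:
// keys present in first are kept; keys only in second are copied over.
func mergeFirstWins(first, second map[string]any) map[string]any {
	out := make(map[string]any, len(first)+len(second))
	for k, v := range second {
		out[k] = v
	}
	for k, v := range first {
		out[k] = v // the first argument overrides the second
	}
	return out
}

func main() {
	chart := map[string]any{"tune_disk_irq": true, "tune_network": true}
	user := map[string]any{"tune_disk_irq": false, "tune_fstrim": true}

	merged := mergeFirstWins(chart, user)
	// The user's tune_disk_irq=false is dropped; tune_fstrim passes through.
	fmt.Println(merged["tune_disk_irq"], merged["tune_network"], merged["tune_fstrim"])
}
```

Keys the chart does not set (here the illustrative `tune_fstrim`) still come through from the user's config, which is consistent with the remaining flags staying reachable via config.rpk.tune_*.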

How the chroot path works

mount --bind /opt/redpanda /host/opt/redpanda
cp /host/redpanda_etc/redpanda.yaml /host/var/tmp/redpanda-tune.yaml
sed -i 's|^redpanda:|redpanda:\n  data_directory: /var/lib/redpanda/data|' /host/var/tmp/redpanda-tune.yaml
chroot /host /usr/bin/bash -c '
  nsenter -t 1 -n /opt/redpanda/bin/rpk redpanda tune all \
    --config /var/tmp/redpanda-tune.yaml \
    --node-tuner-state-path /tuner_state.yaml -v
  busctl call ... RestartUnit ss "irqbalance.service" "replace" \
    || pkill -f irqbalance || true
' || true

Three workarounds layered in:

1. data_directory injection. The chart-rendered redpanda.yaml omits redpanda.data_directory (the broker doesn't need it). rpk's disk tuners do need it, and rpk refuses to combine --dirs with --config. The script cp+seds a working copy into /var/tmp (because /tmp is not bind-mounted from the host) and points rpk at it.
2. busctl for irqbalance. systemctl can't traverse a chroot, so the script uses busctl against the host's systemd to restart irqbalance after rpk rewrites IRQ affinity. It falls back to pkill irqbalance for non-systemd hosts.
3. Exit-code tolerance. The script ends with `|| true ; exit 0` so that a single unsupported tuner (e.g. AWS lacks disk_write_cache) doesn't crashloop the init container, which is the failure mode today.
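For readers who want to see exactly what the sed expression in workaround 1 does to the config, the same substitution can be sketched in Go. This is an illustration only (the chart uses sed in the init script); `injectDataDirectory` is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"regexp"
)

// topLevelRedpanda matches a line that starts with the top-level
// "redpanda:" key, mirroring sed's ^redpanda: anchor.
var topLevelRedpanda = regexp.MustCompile(`(?m)^redpanda:`)

// injectDataDirectory inserts a data_directory entry directly under the
// top-level redpanda key, like the sed expression in the script above.
func injectDataDirectory(cfg, dir string) string {
	return topLevelRedpanda.ReplaceAllLiteralString(
		cfg, "redpanda:\n  data_directory: "+dir)
}

func main() {
	cfg := "redpanda:\n  seed_servers: []\nrpk:\n  tune_network: true\n"
	fmt.Print(injectDataDirectory(cfg, "/var/lib/redpanda/data"))
}
```

Only the working copy is modified; the config the broker reads is left untouched.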

Host bind mounts are intentionally per-directory (/sys /proc /etc /usr /lib /lib64 /dev /var /run) rather than whole-/. Mounting / into /host creates mount-loops with /opt/redpanda; CORE-13685 found this the hard way.

Validation

Run 1 — EKS 1.33 / Amazon Linux 2023 / kernel 6.12, Redpanda 26.1.6, single replica

Before (no chroot patch — current behavior):

aio_events             APPLIED=true
cpu                    APPLIED=true
clocksource            APPLIED=true
swappiness             APPLIED=true
transparent_hugepages  APPLIED=true
disk_irq               APPLIED=false  ERROR: directory '' does not exists
disk_nomerges          APPLIED=false  ERROR: directory '' does not exists
disk_scheduler         APPLIED=false  ERROR: directory '' does not exists
net                    APPLIED=false  ERROR: open /proc/sys/net/core/rps_sock_flow_entries
fstrim                 APPLIED=false  ERROR: dial unix /run/systemd/private

Init container CrashLoopBackOff (rpk exits non-zero).

After (apply_host_tuners: true):

aio_events             APPLIED=true
cpu                    APPLIED=true
clocksource            APPLIED=true
swappiness             APPLIED=true
transparent_hugepages  APPLIED=true
disk_irq               APPLIED=true   ← fixed
disk_nomerges          APPLIED=true   ← fixed
disk_scheduler         APPLIED=true   ← fixed
net                    APPLIED=true   ← fixed
disk_write_cache       SUPPORTED=false ("only supported in GCP")

Init exits 0, Redpanda broker comes up cleanly, admin API responds.

Run 2 — EKS 1.31 / AL2023 / kernel 6.1.168 ARM (Graviton4 m8gd.metal-24xl), Redpanda v25.3.4, 3 replicas

End-to-end run including OMB producer/consumer smoke test (details + artifacts). With apply_host_tuners: true as the only tuning knob on the CR, freshly-rolled broker pods showed:

TUNER                  APPLIED  ENABLED  SUPPORTED  ERROR
aio_events             true     true     true
disk_irq               false    true     true       open /proc/irq/0/smp_affinity: no such file or directory
disk_nomerges          true     true     true
disk_scheduler         true     true     true
disk_write_cache       false    false    false      Disk write cache tuner is only supported in GCP
net                    true     true     true

Three of four host-mode tuners APPLIED=true ENABLED=true. disk_irq ENABLED but failing on a pre-existing AL2023 kernel issue (open /proc/irq/0/smp_affinity: no such file or directory — rpk should skip IRQs without a writeable smp_affinity rather than fail; tracked separately). The init container does NOT crashloop on this failure — the exit-code tolerance keeps the broker booting.
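When repeating this validation, a small helper can flag which of the four host-mode tuners did not apply. The sketch below assumes the whitespace-separated TUNER / APPLIED / ENABLED / SUPPORTED / ERROR layout shown above; rpk's output format may differ between versions, and `missingHostTuners` is a hypothetical name:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hostModeTuners are the four tuners this PR targets.
var hostModeTuners = []string{"disk_irq", "disk_scheduler", "disk_nomerges", "net"}

// appliedTuners maps tuner name to its APPLIED column, assuming rows of
// the form "NAME APPLIED ENABLED SUPPORTED [ERROR...]".
func appliedTuners(output string) map[string]bool {
	applied := make(map[string]bool)
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) < 4 || f[0] == "TUNER" {
			continue // skip the header and anything that is not a tuner row
		}
		applied[f[0]] = f[1] == "true"
	}
	return applied
}

// missingHostTuners returns the host-mode tuners that are absent or not applied.
func missingHostTuners(output string) []string {
	applied := appliedTuners(output)
	var missing []string
	for _, name := range hostModeTuners {
		if !applied[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	sample := `TUNER                  APPLIED  ENABLED  SUPPORTED  ERROR
aio_events             true     true     true
disk_irq               false    true     true       open /proc/irq/0/smp_affinity: no such file or directory
disk_nomerges          true     true     true
disk_scheduler         true     true     true
net                    true     true     true`
	fmt.Println(missingHostTuners(sample))
}
```

On the Run 2 sample this reports only disk_irq, matching the result described above.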

OMB ran clean for 5 min @ 20 MB/s, 0 publish errors, p99 publish 26 ms.

Lifecycle of host tuning state

What happens when you flip apply_host_tuners (or tune_aio_events) back to false, and what about reboot?

On the next pod replacement, the chart re-renders the StatefulSet:

| `tune_aio_events` | `apply_host_tuners` | Init container rendered |
|---|---|---|
| `false` | (any) | none (no tuning init container at all) |
| `true` | `false` | the regular in-pod tuning container (pre-PR default) |
| `true` | `true` | the chroot init container + host bind mounts |
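The table reduces to a small decision function. The sketch below is illustrative only; `selectTuningInit` and the mode names are hypothetical and do not mirror the chart's actual rendering code:

```go
package main

import "fmt"

type tuningInit string

const (
	initNone   tuningInit = "none"   // no tuning init container
	initInPod  tuningInit = "in-pod" // regular in-pod tuning container
	initChroot tuningInit = "chroot" // chroot init container + host bind mounts
)

// selectTuningInit encodes the table: tune_aio_events gates the init
// container entirely, and apply_host_tuners picks the chroot variant.
func selectTuningInit(tuneAIOEvents, applyHostTuners bool) tuningInit {
	switch {
	case !tuneAIOEvents:
		return initNone
	case applyHostTuners:
		return initChroot
	default:
		return initInPod
	}
}

func main() {
	for _, aio := range []bool{false, true} {
		for _, host := range []bool{false, true} {
			fmt.Printf("tune_aio_events=%t apply_host_tuners=%t -> %s\n",
				aio, host, selectTuningInit(aio, host))
		}
	}
}
```

Note that apply_host_tuners has no effect on its own in this table: with tune_aio_events off, no tuning init container is rendered at all.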

So turning the flag off and rolling pods does remove the chroot path and the host bind mounts. It does not, however, un-tune the host. The chart has no "untune" step — every kernel-level write the tuners made stays in place until something else reverses it.

| Tuned state | Where it lives | Survives flag flip? | Survives host reboot? |
|---|---|---|---|
| `fs.aio-max-nr` | `/proc/sys/fs/aio-max-nr` | yes | no (sysctl resets at boot) |
| IRQ affinity (when `disk_irq` works) | `/proc/irq/N/smp_affinity` | yes | no |
| NVMe `scheduler` / `nomerges` | `/sys/block/nvmeXn1/queue/*` | yes | no (sysfs resets at boot) |
| `net.core.rps_sock_flow_entries` | sysctl | yes | no |
| NIC XPS masks | `/sys/class/net/ethN/queues/tx-N/xps_cpus` | yes | no |
| irqbalance unit state | systemd | yes (left as rpk left it) | depends on unit |
| `redpanda_node_tuner_state.yaml` | `/var/run/` (tmpfs) | yes | no (tmpfs wipes at boot) |

Practical consequences:

Turning the flag off on a long-running cluster does not regress the host's kernel state. Existing brokers keep using the kernel state the previous tuning applied. The host stays tuned for as long as it stays up.
On host reboot, all the above kernel state reverts to AL2023 defaults. If the flag is still on, the chroot init container re-tunes on next pod start — this is the design: the chart re-tunes every pod start, so reboots are self-healing. If the flag has been turned off, the host comes up un-tuned and the broker on it runs against default kernel settings.
The state file /var/run/redpanda_node_tuner_state.yaml is on tmpfs, so it's gone at every reboot and rpk re-runs all tuners on first pod start after boot (rather than skipping based on stale state). That's the right behavior for transient kernel state, but worth knowing if you're trying to reason about "did the tuner actually run this boot?"

If you need the host kernel state reverted right away (e.g. you turned the flag off mid-shift and want the change to take effect immediately), you have two options: reboot the node, or write a one-shot DaemonSet that resets the specific sysfs/sysctl values. The chart deliberately does not ship that DaemonSet. Host de-tuning is a destructive operation on shared kernel state and shouldn't be silently performed when a user flips a chart value.

Security posture

Opt-in only. The default tuning container is unchanged. Enabling apply_host_tuners requires the same trust level as tune_aio_events already does — privileged container, root user, hostPath volumes.

For secure k8s installations:

- OpenShift: bind the pod's ServiceAccount to an SCC that allows hostPath volumes and privileged: true. The built-in privileged SCC works. A custom SCC can be authored if narrower scope is required; the chart only mounts standard Linux directories (/sys /proc /etc /usr /lib /lib64 /dev /var /run) plus the tuner state file at /var/run/redpanda_node_tuner_state.yaml.
- Pod Security Admission: the namespace must be labeled pod-security.kubernetes.io/enforce: privileged (this is also required today for tune_aio_events).
- One pod per node: concurrent tuners race on the same kernel parameters. Users enabling this should also set a podAntiAffinity rule that disallows co-location of Redpanda pods on the same node. The value docs call this out.

Out of scope (deliberate)

Not exposing every rpk.tune_* flag as a first-class chart value. Today the chart exposes tune_aio_events, tune_clocksource, tune_ballast_file at the top level, and apply_host_tuners rolls up the four host-mode flags. The rest are reachable via config.rpk.tune_* if a user explicitly wants them on.
No --node-tuner-state-path plumbing on the redpanda main container. The state file persists across pod restarts on the same node (so re-runs are no-op) but the broker doesn't need to read it; the broker comes up fine without it. Can be added later if dedicated-mode reporting wants it.
disk_irq failure on AL2023 ARM kernels (e.g. m8gd.metal-24xl) is a core/rpk-side issue, not a chart issue. The chroot path exposes IRQ 0's missing smp_affinity to rpk, which then fails the whole tuner. Tracked separately — needs rpk to skip un-pinnable IRQs rather than error.

Test plan

task lint — passes
task generate — no diff after running
go test ./charts/redpanda/ -run TestTemplate — passes (new tuning-host-mode golden case includes the 4 default-enabled flags)
helm template with apply_host_tuners=true renders the expected init container + volumes + the 4 host-mode tune_*: true flags
helm template with default values is byte-identical to pre-PR output
EKS 1.33 / AL2023 / Redpanda 26.1.6 end-to-end verified (Run 1 above)
EKS 1.31 / AL2023 ARM (Graviton4) / Redpanda v25.3.4, 3-replica + OMB smoke test (Run 2 above)
CI Operator Test Suite (will run on push)
CI Acceptance Tests (will run on push)

Adds `tuning.apply_host_tuners` (default `false`). When enabled, the tuning init container builds a chroot to the host filesystem and runs `rpk redpanda tune all` inside the host's network namespace, so the tuners that previously failed inside the pod sandbox actually apply:

- disk_irq / disk_scheduler / disk_nomerges: need /sys/block visibility
- net: needs /proc/sys/net and host NICs
- fstrim: needs /run/systemd/private

Verified on EKS 1.33 / AL2023 / kernel 6.12 + Redpanda 26.1.6: post-patch, APPLIED=true for aio_events, cpu, clocksource, swappiness, transparent_hugepages, disk_irq, disk_nomerges, disk_scheduler, and net. The only tuner that stays unsupported is disk_write_cache (GCP-only). The redpanda broker comes up cleanly.

Three workarounds the chart layers in:

1. The rendered redpanda.yaml omits `redpanda.data_directory` (the broker doesn't need it). rpk's disk tuners do, and rpk refuses to combine --dirs with --config. The script cp+seds a working copy into /var/tmp (because /tmp is not bind-mounted from the host) and points rpk at it.
2. systemctl can't traverse a chroot, so the script uses busctl against the host's systemd to restart irqbalance after rpk rewrites IRQ affinity. Falls back to `pkill irqbalance` for non-systemd hosts.
3. `|| true ; exit 0` so a single unsupported tuner (e.g. AWS lacks disk_write_cache) doesn't crashloop the init container, which is what happens today the moment a user sets any of the disk tuner flags.

Host bind mounts are intentionally per-directory (/sys /proc /etc /usr /lib /lib64 /dev /var /run) rather than whole-/; the latter creates mount-loops with /opt/redpanda (CORE-13685 found this the hard way).

Security posture: opt-in only. Default is unchanged. Enabling requires the same trust level as `tune_aio_events` already does: privileged container, root user, hostPath volumes. On OpenShift bind a SCC that allows hostPath and privileged (the built-in `privileged` SCC works); on PSA clusters label the namespace `privileged`. Document warning in values.go: this MUST NOT be combined with multiple Redpanda pods per node, or concurrent tuners race on the same kernel parameters.

This change is chart-only; the operator v2 path picks it up via the Redpanda CR's `clusterSpec.tuning.apply_host_tuners` passthrough, no operator code change needed.

Closes K8S-101. Productionizes the chroot pattern from CORE-13685.

CI's TestHelmValuesCompat/clusterSpec failed because the new charts/redpanda Tuning.ApplyHostTuners field wasn't mirrored on the operator's v1alpha2 RedpandaClusterSpec.Tuning (and StretchTuning) — PartialValues serialized {"tuning":{"apply_host_tuners":false}} while the operator side serialized {"tuning":{}}. Adds the field to both structs, regenerates deepcopy, applyconfiguration, CRDs, and crd-docs. Also includes a protoc-gen-go-grpc v1.6.1 → v1.6.2 regen that CI's nix pin emits on `task generate` — required for the lint step's `git diff --exit-code` to pass.

…rs=true

ApplyHostTuners only controls which tuning init container template the chart renders (chroot vs. in-pod). It does not enable the per-tuner rpk flags those tuners are gated on inside redpanda.yaml.

End-to-end on m8gd.metal-24xl confirmed that flipping just apply_host_tuners=true produces "rpk redpanda tune all" output where every tuner except aio_events shows APPLIED=false / ENABLED=false, i.e. the chroot path is running but the disk_irq / disk_scheduler / disk_nomerges / net tuners that motivate the PR are silently skipped.

Auto-enable those four when ApplyHostTuners is on, so a user who sets the one chart knob the PR documents actually exercises the code path.

Override semantics are intentionally not provided: helmette.Merge is first-arg-wins, so config.rpk.tune_* set on the CR loses to the chart's defaults anyway. Calling that out in the docstring so users don't expect override.

Golden testdata for tuning-host-mode updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

david-yu · 2026-05-13T06:18:04Z

Cross-cloud validation: GKE + AKS

Tested the PR (HEAD b876709c) on GKE and AKS in addition to the existing EKS runs in the PR description. Same operator chart from this branch, same Redpanda v25.3.4, same minimal CR (tune_aio_events: true + apply_host_tuners: true, no config.rpk.tune_* overrides), single broker, default storage class on each cloud.

GKE — Ubuntu 24.04 / kernel 6.8.0-1049-gke / `n2-standard-8`

TUNER                  APPLIED  ENABLED  SUPPORTED  ERROR
aio_events             true     true     true
ballast_file           false    false    true
clocksource            false    false    true
coredump               false    false    true
cpu                    false    false    true
disk_irq               true     true     true       ← fixed by chroot
disk_nomerges          true     true     true       ← fixed by chroot
disk_scheduler         true     true     true       ← fixed by chroot
disk_write_cache       false    false    true
fstrim                 false    false    false      err=fork/exec /usr/bin/which: no such file or directory
net                    true     true     true       ← fixed by chroot
swappiness             false    false    true
transparent_hugepages  false    false    true

All four host-mode tuners APPLIED=true ENABLED=true. Init container exits 0, broker Ready in under 90s. Note that disk_write_cache reports SUPPORTED=true here (unlike on AWS, where rpk reports it as GCP-only); it is not enabled because the CR doesn't toggle it.

AKS — Ubuntu 22.04 / kernel 5.15.0-1110-azure / `Standard_D8s_v5`

TUNER                  APPLIED  ENABLED  SUPPORTED  ERROR
aio_events             true     true     true
ballast_file           false    false    true
clocksource            false    false    true
coredump               false    false    true
cpu                    false    false    true
disk_irq               true     true     true       ← fixed by chroot
disk_nomerges          true     true     true       ← fixed by chroot
disk_scheduler         true     true     true       ← fixed by chroot
disk_write_cache       false    false    false      Disk write cache tuner is only supported in GCP
fstrim                 false    false    false      err=fork/exec /usr/bin/which: no such file or directory
net                    true     true     true       ← fixed by chroot
swappiness             false    false    true
transparent_hugepages  false    false    true

Identical pattern. All four host-mode tuners apply cleanly. Broker Ready in ~50s.

Cross-cloud summary for the four PR-target tuners

| Tuner | EKS m8gd.metal-24xl (AL2023 ARM, kernel 6.1.168) | GKE Ubuntu 24.04 (kernel 6.8) | AKS Ubuntu 22.04 (kernel 5.15) |
|---|---|---|---|
| `disk_irq` | ❌ pre-existing kernel-level issue: `/proc/irq/0/smp_affinity: no such file or directory` | ✅ APPLIED | ✅ APPLIED |
| `disk_scheduler` | ✅ APPLIED | ✅ APPLIED | ✅ APPLIED |
| `disk_nomerges` | ✅ APPLIED | ✅ APPLIED | ✅ APPLIED |
| `net` | ✅ APPLIED | ✅ APPLIED | ✅ APPLIED |

3 of 4 work on every cloud; disk_irq only fails on AL2023 ARM (the AWS metal SKU). That failure is exposed by the chroot path but caused by AL2023 ARM not having a writeable smp_affinity for IRQ 0 — not a chart problem (tracked separately for rpk to skip un-pinnable IRQs rather than error).

Minor non-blocker

Both GKE and AKS show fstrim errors with fork/exec /usr/bin/which: no such file or directory. That's rpk's tuner using which to detect fstrim's presence and Ubuntu's minimal container image not having it. Doesn't affect any of the four host-mode tuners and fstrim is ENABLED=false anyway. Worth a follow-up rpk fix but not a PR 1522 issue.