charts/redpanda: opt-in chroot tuning init container for host tuners#1522
david-yu wants to merge 3 commits into
Adds `tuning.apply_host_tuners` (default `false`). When enabled, the
tuning init container builds a chroot to the host filesystem and runs
`rpk redpanda tune all` inside the host's network namespace, so the
tuners that previously failed inside the pod sandbox actually apply:
- disk_irq / disk_scheduler / disk_nomerges — need /sys/block visibility
- net — needs /proc/sys/net and host NICs
- fstrim — needs /run/systemd/private
Verified on EKS 1.33 / AL2023 / kernel 6.12 + Redpanda 26.1.6: post-
patch, APPLIED=true for aio_events, cpu, clocksource, swappiness,
transparent_hugepages, disk_irq, disk_nomerges, disk_scheduler, and
net. The only tuner that stays unsupported is disk_write_cache (GCP-
only). The redpanda broker comes up cleanly.
Three workarounds the chart layers in:
1. The rendered redpanda.yaml omits `redpanda.data_directory` (the
broker doesn't need it). rpk's disk tuners do, and rpk refuses to
combine --dirs with --config. The script cp+seds a working copy
into /var/tmp (because /tmp is not bind-mounted from the host)
and points rpk at it.
2. systemctl can't traverse a chroot, so the script uses busctl
against the host's systemd to restart irqbalance after rpk
rewrites IRQ affinity. Falls back to `pkill irqbalance` for
non-systemd hosts.
3. `|| true ; exit 0` so a single unsupported tuner (e.g. AWS lacks
disk_write_cache) doesn't crashloop the init container — which
is what happens today the moment a user sets any of the disk
tuner flags.
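Workaround 1 can be pictured with the sketch below. It is not the chart's script (which uses cp and sed in shell). The helper name `inject_data_directory`, the working-copy file name, and the two-space indentation under a top-level `redpanda:` key are assumptions made for illustration.

```python
import os
import shutil


def inject_data_directory(src, workdir, data_dir):
    """Copy src into workdir and add redpanda.data_directory if it is missing.

    Hypothetical helper: it mirrors the cp+sed idea, not the chart's script.
    Assumes a top-level 'redpanda:' key whose children use two-space indents.
    """
    dst = os.path.join(workdir, "redpanda.yaml")
    shutil.copyfile(src, dst)  # the "cp" step: the mounted config stays untouched
    with open(dst) as f:
        lines = f.read().splitlines()
    if any(line.strip().startswith("data_directory:") for line in lines):
        return dst  # already present, nothing to inject
    out = []
    injected = False
    for line in lines:
        out.append(line)
        # the "sed" step: add the key directly under the top-level section
        if line.rstrip() == "redpanda:" and not injected:
            out.append(f"  data_directory: {data_dir}")
            injected = True
    if not injected:
        raise ValueError("no top-level 'redpanda:' section found")
    with open(dst, "w") as f:
        f.write("\n".join(out) + "\n")
    return dst
```

The returned path is what a wrapper would hand to rpk via `--config`, with `workdir` set to a directory under `/var/tmp` as described above.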
Host bind mounts are intentionally per-directory (/sys /proc /etc /usr
/lib /lib64 /dev /var /run) rather than whole-/ — the latter creates
mount-loops with /opt/redpanda (CORE-13685 found this the hard way).
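A rough sketch of the per-directory layout, expressed as the Kubernetes `volumes` / `volumeMounts` entries it implies. The volume names and the `/host` chroot root are assumptions for illustration; the chart's actual names may differ.

```python
HOST_DIRS = ["/sys", "/proc", "/etc", "/usr", "/lib", "/lib64", "/dev", "/var", "/run"]


def host_bind_mounts(dirs=HOST_DIRS, chroot_root="/host"):
    """Build one hostPath volume and one volumeMount per host directory.

    Hypothetical helper: volume names and chroot_root are illustrative only.
    """
    volumes, mounts = [], []
    for d in dirs:
        if d == "/":
            # a whole-/ mount is exactly what the per-directory layout avoids
            raise ValueError("refusing to bind-mount the whole host root")
        name = "host-" + d.strip("/").replace("/", "-")
        volumes.append({"name": name, "hostPath": {"path": d}})
        mounts.append({"name": name, "mountPath": chroot_root + d})
    return volumes, mounts
```

Each host directory ends up under the chroot root at the same relative path, so a process chrooted there sees `/sys`, `/proc`, and the rest where it expects them.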
Security posture: opt-in only. Default is unchanged. Enabling requires
the same trust level as `tune_aio_events` already does — privileged
container, root user, hostPath volumes. On OpenShift bind a SCC that
allows hostPath and privileged (the built-in `privileged` SCC works);
on PSA clusters label the namespace `privileged`. A warning in
values.go documents that this MUST NOT be combined with multiple
Redpanda pods per node, or concurrent tuners race on the same kernel
parameters.
This change is chart-only; the operator v2 path picks it up via the
Redpanda CR's `clusterSpec.tuning.apply_host_tuners` passthrough, no
operator code change needed.
Closes K8S-101. Productionizes the chroot pattern from CORE-13685.
CI's TestHelmValuesCompat/clusterSpec failed because the new
charts/redpanda Tuning.ApplyHostTuners field wasn't mirrored on the
operator's v1alpha2 RedpandaClusterSpec.Tuning (and StretchTuning) —
PartialValues serialized {"tuning":{"apply_host_tuners":false}} while
the operator side serialized {"tuning":{}}.
Adds the field to both structs, regenerates deepcopy, applyconfiguration,
CRDs, and crd-docs.
Also includes a protoc-gen-go-grpc v1.6.1 → v1.6.2 regen that CI's nix
pin emits on `task generate` — required for the lint step's
`git diff --exit-code` to pass.
…rs=true
ApplyHostTuners only controls which tuning init container template the chart renders (chroot vs. in-pod). It does not enable the per-tuner rpk flags those tuners are gated on inside redpanda.yaml.
End-to-end on m8gd.metal-24xl confirmed that flipping just apply_host_tuners=true produces "rpk redpanda tune all" output where every tuner except aio_events shows APPLIED=false / ENABLED=false, i.e. the chroot path is running but the disk_irq / disk_scheduler / disk_nomerges / net tuners that motivate the PR are silently skipped.
Auto-enable those four when ApplyHostTuners is on, so a user who sets the one chart knob the PR documents actually exercises the code path.
Override semantics are intentionally not provided: helmette.Merge is first-arg-wins, so config.rpk.tune_* set on the CR loses to the chart's defaults anyway. Calling that out in the docstring so users don't expect override.
Golden testdata for tuning-host-mode updated to match.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cross-cloud validation: GKE + AKS
Tested the PR (HEAD GKE: Ubuntu 24.04 / kernel 6.8.0-1049-gke /

| Tuner | EKS m8gd.metal-24xl (AL2023 ARM, kernel 6.1.168) | GKE Ubuntu 24.04 (kernel 6.8) | AKS Ubuntu 22.04 (kernel 5.15) |
|---|---|---|---|
| disk_irq | ❌ pre-existing kernel-level issue: `/proc/irq/0/smp_affinity: no such file or directory` | ✅ APPLIED | ✅ APPLIED |
| disk_scheduler | ✅ APPLIED | ✅ APPLIED | ✅ APPLIED |
| disk_nomerges | ✅ APPLIED | ✅ APPLIED | ✅ APPLIED |
| net | ✅ APPLIED | ✅ APPLIED | ✅ APPLIED |

3 of 4 work on every cloud; disk_irq only fails on AL2023 ARM (the AWS metal SKU). That failure is exposed by the chroot path but caused by AL2023 ARM not having a writeable smp_affinity for IRQ 0 — not a chart problem (tracked separately for rpk to skip un-pinnable IRQs rather than error).
Minor non-blocker
Both GKE and AKS show fstrim errors with fork/exec /usr/bin/which: no such file or directory. That's rpk's tuner using which to detect fstrim's presence and Ubuntu's minimal container image not having it. Doesn't affect any of the four host-mode tuners and fstrim is ENABLED=false anyway. Worth a follow-up rpk fix but not a PR 1522 issue.
Setup notes (for reproducibility)
- Image: `pr1522-b876709c` built and pushed to ephemeral registries (GAR for GKE pull, ACR for AKS pull). Both arches (linux/amd64, linux/arm64) in the same manifest.
- CR: minimal (single replica, no TLS, no NodePort, no anti-affinity; single-node test).
- Namespace label: `pod-security.kubernetes.io/enforce: privileged` (required by the chart's tuning init container; same as `tune_aio_events: true` in pre-PR behavior).
- Clusters/ACR/GAR torn down post-test.
This PR is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Summary
Adds `tuning.apply_host_tuners` (default `false`). When enabled, the tuning init container builds a chroot to the host filesystem and runs `rpk redpanda tune all` inside the host's network namespace, so the tuners that previously failed inside the pod sandbox actually apply.
UX: `apply_host_tuners: true` is the one switch users need. The chart now default-enables `tune_disk_irq`, `tune_disk_scheduler`, `tune_disk_nomerges`, and `tune_network` in the rendered `rpk` config when the flag is on, so the tuners the chroot path exists to fix actually run without the user having to mirror them by hand in `config.rpk.tune_*`.
Closes K8S-101. Productionizes the chroot pattern from CORE-13685 (Stephan Dollberg's experiment).
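A minimal sketch of that default-enabling, assuming the first-argument-wins merge described in the commit message above. The function names are hypothetical, the dictionaries are flat for brevity, and this is not the chart's Go code.

```python
HOST_MODE_FLAGS = (
    "tune_disk_irq",
    "tune_disk_scheduler",
    "tune_disk_nomerges",
    "tune_network",
)


def merge_first_wins(first, second):
    """Keys present in `first` win; `second` only fills in missing keys."""
    merged = dict(second)
    merged.update(first)
    return merged


def rendered_rpk_section(apply_host_tuners, user_rpk):
    """Illustrative only: chart-provided values are passed first, so they win."""
    chart = {flag: True for flag in HOST_MODE_FLAGS} if apply_host_tuners else {}
    return merge_first_wins(chart, user_rpk)
```

With the flag on, `rendered_rpk_section(True, {"tune_network": False})` still yields `tune_network: True`, which is the "no per-tuner override" behavior the PR calls out.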
Why
The default tuning container runs `rpk redpanda tune all` inside the pod's own namespaces. That works for sysctl-based tuners (aio_events, swappiness, THP), but the disk and net tuners can't see host `/sys/block`, `/proc/sys/net`, or host NICs, so they error with:
…
and rpk exits non-zero, crashlooping the init container for any user who sets these flags.
What changes
- `charts/redpanda/values.go`: `Tuning.ApplyHostTuners` bool, default `false`. When `true`, `Tuning.Translate()` also sets `tune_disk_irq` / `tune_disk_scheduler` / `tune_disk_nomerges` / `tune_network` to `true` in the rendered `rpk` section so the chroot path actually has tuners to run. Docs cover SCC / PSA requirements, the one-pod-per-node constraint, and the first-arg-wins merge semantics.
- `charts/redpanda/statefulset.go`: `statefulSetInitContainerTuningOnHost` path that produces the chroot-mode init container. `StatefulSetVolumes` appends the host bind volumes only when the feature is enabled.
- `charts/redpanda/testdata/template-cases.txtar`: `tuning-host-mode` case + golden. Golden updated to include the four default-enabled flags.
- `.changes/unreleased/charts-redpanda-Added-20260512-host-tuners.yaml`
- `_statefulset.go.tpl`, `_values.go.tpl`, `values.schema.json`, `values_partial.gen.go`, golden txtar.
User-facing config (operator v2)
The chart is plumbed through `clusterSpec` on the v1alpha2 Redpanda CR. One flag, no follow-on: set `tuning.apply_host_tuners: true` under `clusterSpec`.
That alone fires the four host-mode tuners. The chart's RPK config merge is first-arg-wins, so a per-tuner override in `config.rpk.tune_*` loses to the chart's value, by design. Users who want a specific tuner off should leave `apply_host_tuners` empty and wire host tuning via their own DaemonSet.
How the chroot path works
Three workarounds layered in:
1. `data_directory` injection. The chart-rendered `redpanda.yaml` omits `redpanda.data_directory` (the broker doesn't need it). rpk's disk tuners do, and rpk refuses to combine `--dirs` with `--config`. The script `cp`+`sed`s a working copy into `/var/tmp` (because `/tmp` is not bind-mounted from the host) and points rpk at it.
2. `systemctl` can't traverse a chroot, so the script uses `busctl` against the host's systemd to restart irqbalance after rpk rewrites IRQ affinity. Falls back to `pkill irqbalance` for non-systemd hosts.
3. `|| true ; exit 0` so a single unsupported tuner (e.g. AWS lacks `disk_write_cache`) doesn't crashloop the init container, which is the failure mode today.
Host bind mounts are intentionally per-directory (`/sys /proc /etc /usr /lib /lib64 /dev /var /run`) rather than whole-`/`. Mounting `/` into `/host` creates mount-loops with `/opt/redpanda`; CORE-13685 found this the hard way.
Validation
Run 1 — EKS 1.33 / Amazon Linux 2023 / kernel 6.12, Redpanda 26.1.6, single replica
Before (no chroot patch — current behavior):
Init container CrashLoopBackOff (rpk exits non-zero).
After (`apply_host_tuners: true`):
Init exits 0, Redpanda broker comes up cleanly, admin API responds.
Run 2 — EKS 1.31 / AL2023 / kernel 6.1.168 ARM (Graviton4 m8gd.metal-24xl), Redpanda v25.3.4, 3 replicas
End-to-end run including OMB producer/consumer smoke test (details + artifacts). With `apply_host_tuners: true` as the only tuning knob on the CR, freshly-rolled broker pods showed:
Three of four host-mode tuners `APPLIED=true ENABLED=true`. `disk_irq` is ENABLED but fails on a pre-existing AL2023 kernel issue (`open /proc/irq/0/smp_affinity: no such file or directory`; rpk should skip IRQs without a writeable `smp_affinity` rather than fail; tracked separately). The init container does NOT crashloop on this failure: the exit-code tolerance keeps the broker booting.
OMB ran clean for 5 min @ 20 MB/s, 0 publish errors, p99 publish 26 ms.
Lifecycle of host tuning state
What happens when you flip `apply_host_tuners` (or `tune_aio_events`) back to `false`, and what about reboot?
On the next pod replacement, the chart re-renders the StatefulSet:
[Table: what the chart renders for each combination of `tune_aio_events` and `apply_host_tuners`]
So turning the flag off and rolling pods does remove the chroot path and the host bind mounts. It does not, however, un-tune the host. The chart has no "untune" step; every kernel-level write the tuners made stays in place until something else reverses it.
[Table: host state written by the tuners: `fs.aio-max-nr` (`/proc/sys/fs/aio-max-nr`); IRQ affinity when `disk_irq` works (`/proc/irq/N/smp_affinity`); `scheduler` / `nomerges` (`/sys/block/nvmeXn1/queue/*`); `net.core.rps_sock_flow_entries` and `/sys/class/net/ethN/queues/tx-N/xps_cpus`; `redpanda_node_tuner_state.yaml` (`/var/run/`, tmpfs)]
Practical consequences:
- `/var/run/redpanda_node_tuner_state.yaml` is on tmpfs, so it's gone at every reboot and rpk re-runs all tuners on first pod start after boot (rather than skipping based on stale state). That's the right behavior for transient kernel state, but worth knowing if you're trying to reason about "did the tuner actually run this boot?"
If you need to revert the host kernel state immediately (e.g. you turned the flag off mid-shift and want the change to take effect right away), you have to either reboot the node or write a one-shot DaemonSet that resets the specific sysfs/sysctl values. The chart deliberately does not ship that: host de-tuning is a destructive operation on shared kernel state and shouldn't be silently performed when a user flips a chart value.
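To check what is still set on a node after turning the flag off, one option is to snapshot a few of the values the tuners touch and compare before and after the roll. The sketch below is an illustration only: the helper name and the selection of files are assumptions, and it covers just the `aio-max-nr`, `scheduler`, and `nomerges` entries mentioned above.

```python
import glob
import os


def snapshot_host_tuning(root="/"):
    """Read a few tuner-touched kernel values; missing files are skipped.

    Hypothetical helper. `root` lets the same code run against a chroot
    (or a test fixture) instead of the live host filesystem.
    """
    patterns = [
        "proc/sys/fs/aio-max-nr",
        "sys/block/*/queue/scheduler",
        "sys/block/*/queue/nomerges",
    ]
    snap = {}
    for pattern in patterns:
        for path in sorted(glob.glob(os.path.join(root, pattern))):
            try:
                with open(path) as f:
                    value = f.read().strip()
            except OSError:
                continue  # unreadable entry: leave it out of the snapshot
            snap["/" + os.path.relpath(path, root)] = value
    return snap
```

Two snapshots that compare equal across a flag flip and pod roll are consistent with the statement above that the chart does not un-tune the host.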
Security posture
Opt-in only. The default tuning container is unchanged. Enabling `apply_host_tuners` requires the same trust level as `tune_aio_events` already does: privileged container, root user, hostPath volumes.
For secure k8s installations:
- OpenShift: bind an SCC that allows `hostPath` volumes and `privileged: true`. The built-in `privileged` SCC works. A custom SCC can be authored if narrower scope is required; the chart only mounts standard Linux directories (`/sys /proc /etc /usr /lib /lib64 /dev /var /run`) plus the tuner state file at `/var/run/redpanda_node_tuner_state.yaml`.
- PSA clusters: label the namespace `pod-security.kubernetes.io/enforce: privileged` (this is also required today for `tune_aio_events`).
Out of scope (deliberate)
- Every `rpk.tune_*` flag as a first-class chart value. Today the chart exposes `tune_aio_events`, `tune_clocksource`, `tune_ballast_file` at the top level, and `apply_host_tuners` rolls up the four host-mode flags. The rest are reachable via `config.rpk.tune_*` if a user explicitly wants them on.
- `--node-tuner-state-path` plumbing on the redpanda main container. The state file persists across pod restarts on the same node (so re-runs are no-op) but the broker doesn't need to read it; the broker comes up fine without it. Can be added later if dedicated-mode reporting wants it.
- The `disk_irq` failure on AL2023 ARM kernels (e.g. `m8gd.metal-24xl`) is a core/rpk-side issue, not a chart issue. The chroot path exposes IRQ 0's missing `smp_affinity` to rpk, which then fails the whole tuner. Tracked separately; needs `rpk` to skip un-pinnable IRQs rather than error.
Test plan
- `task lint`: passes
- `task generate`: no diff after running
- `go test ./charts/redpanda/ -run TestTemplate`: passes (new `tuning-host-mode` golden case includes the 4 default-enabled flags)
- `helm template` with `apply_host_tuners=true` renders the expected init container + volumes + the 4 host-mode `tune_*: true` flags
- `helm template` with default values is byte-identical to pre-PR output
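The last check can be scripted. The sketch below is one way to do it, not part of the PR: `render_diff` is a hypothetical helper that takes two already-captured `helm template` outputs as bytes and returns an empty list only when they are identical, line endings included.

```python
import difflib


def render_diff(before: bytes, after: bytes):
    """Return a unified diff of two rendered manifests; [] means identical.

    Lines keep their original endings, so a change from "\n" to "\r\n"
    or a dropped trailing newline still shows up as a difference.
    """
    old = before.decode().splitlines(keepends=True)
    new = after.decode().splitlines(keepends=True)
    return list(difflib.unified_diff(old, new, fromfile="pre-PR", tofile="post-PR"))
```

Feeding it the default-values render from the base branch and from this branch should return `[]` if the "default is unchanged" claim holds.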