charts/redpanda: opt-in chroot tuning init container for host tuners #1522

Draft
david-yu wants to merge 3 commits into main from feat/chart-host-tuners

@david-yu david-yu commented May 13, 2026

Contributor

Summary

Adds tuning.apply_host_tuners (default false). When enabled, the tuning init container builds a chroot to the host filesystem and runs rpk redpanda tune all inside the host's network namespace — the tuners that previously failed inside the pod sandbox actually apply.

UX: apply_host_tuners: true is the one switch users need. The chart now default-enables tune_disk_irq, tune_disk_scheduler, tune_disk_nomerges, and tune_network in the rendered rpk config when the flag is on, so the tuners the chroot path exists to fix actually run without the user having to mirror them by hand in config.rpk.tune_*.

Closes K8S-101. Productionizes the chroot pattern from CORE-13685 (Stephan Dollberg's experiment).

Why

The default tuning container runs rpk redpanda tune all inside the pod's own namespaces. That works for sysctl-based tuners (aio_events, swappiness, THP) but the disk and net tuners can't see host /sys/block, /proc/sys/net, or host NICs, so they error with:

disk_irq      ERROR: directory '' does not exists
net           ERROR: open /proc/sys/net/core/rps_sock_flow_entries: no such file

…and rpk exits non-zero, crashlooping the init container for any user who sets these flags.

What changes

| File | Change |
| --- | --- |
| charts/redpanda/values.go | New Tuning.ApplyHostTuners bool, default false. When true, Tuning.Translate() also sets tune_disk_irq/tune_disk_scheduler/tune_disk_nomerges/tune_network to true in the rendered rpk section so the chroot path actually has tuners to run. Docs cover SCC / PSA requirements, the one-pod-per-node constraint, and the first-arg-wins merge semantics. |
| charts/redpanda/statefulset.go | New statefulSetInitContainerTuningOnHost path that produces the chroot-mode init container. StatefulSetVolumes appends the host bind volumes only when the feature is enabled. |
| charts/redpanda/testdata/template-cases.txtar | New tuning-host-mode case + golden. Golden updated to include the four default-enabled flags. |
| .changes/unreleased/charts-redpanda-Added-20260512-host-tuners.yaml | Changelog entry. |
Generated: _statefulset.go.tpl, _values.go.tpl, values.schema.json, values_partial.gen.go, golden txtar.

User-facing config (operator v2)

The chart is plumbed through clusterSpec on the v1alpha2 Redpanda CR. One flag, no follow-on:

apiVersion: cluster.redpanda.com/v1alpha2
kind: Redpanda
spec:
  clusterSpec:
    tuning:
      tune_aio_events: true
      apply_host_tuners: true

That alone fires the four host-mode tuners. The chart's RPK config merge is first-arg-wins, so a per-tuner override in config.rpk.tune_* loses to the chart's value; this is by design. Users who want a specific tuner off should leave apply_host_tuners unset (it defaults to false) and wire host tuning via their own DaemonSet.
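
One way to confirm that the single switch really turns on the four flags is to grep the rendered output. The helper below is a minimal sketch, not part of the chart: check_host_tuner_flags is a hypothetical name, and it assumes the flags show up in the rendered redpanda.yaml as plain `tune_*: true` lines.

```shell
# check_host_tuner_flags: hypothetical helper, not shipped with the chart.
# Reads rendered YAML on stdin; succeeds only if all four host-mode flags
# are present and set to true.
check_host_tuner_flags() {
  local rendered flag missing=0
  rendered="$(mktemp)"
  cat > "$rendered"
  for flag in tune_disk_irq tune_disk_scheduler tune_disk_nomerges tune_network; do
    if ! grep -Eq "^[[:space:]]*${flag}:[[:space:]]*true[[:space:]]*$" "$rendered"; then
      echo "missing or not true: ${flag}" >&2
      missing=1
    fi
  done
  rm -f "$rendered"
  return "$missing"
}
```

For example, `helm template redpanda ./charts/redpanda --set tuning.tune_aio_events=true --set tuning.apply_host_tuners=true | check_host_tuner_flags` should exit 0 if the flags were rendered; the release name and chart path in that command are assumptions.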

How the chroot path works

mount --bind /opt/redpanda /host/opt/redpanda
cp /host/redpanda_etc/redpanda.yaml /host/var/tmp/redpanda-tune.yaml
sed -i 's|^redpanda:|redpanda:\n  data_directory: /var/lib/redpanda/data|' /host/var/tmp/redpanda-tune.yaml
chroot /host /usr/bin/bash -c '
  nsenter -t 1 -n /opt/redpanda/bin/rpk redpanda tune all \
    --config /var/tmp/redpanda-tune.yaml \
    --node-tuner-state-path /tuner_state.yaml -v
  busctl call ... RestartUnit ss "irqbalance.service" "replace" \
    || pkill -f irqbalance || true
' || true

Three workarounds layered in:

  1. data_directory injection. The chart-rendered redpanda.yaml omits redpanda.data_directory (the broker doesn't need it). rpk's disk tuners do, and rpk refuses to combine --dirs with --config. The script cp+seds a working copy into /var/tmp (because /tmp is not bind-mounted from the host) and points rpk at it.
  2. busctl for irqbalance. systemctl can't traverse a chroot, so the script uses busctl against the host's systemd to restart irqbalance after rpk rewrites IRQ affinity. Falls back to pkill irqbalance for non-systemd hosts.
  3. Exit-code tolerance. The script ends with || true ; exit 0, so a single unsupported tuner (e.g. disk_write_cache, which rpk reports as GCP-only and is therefore unsupported on AWS) doesn't crashloop the init container, which is the failure mode today.
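
The first workaround can be tried in isolation against a scratch file. This is a sketch under assumptions: inject_data_directory is a hypothetical helper, and it uses awk rather than the sed -i call in the script above so that it behaves the same on GNU and BSD userlands. It assumes the file has a top-level redpanda: key on a line of its own.

```shell
# inject_data_directory SRC DST DIR (hypothetical helper)
# Copies SRC to DST, adding "data_directory: DIR" directly under the first
# top-level "redpanda:" key. SRC is left untouched.
inject_data_directory() {
  local src="$1" dst="$2" data_dir="$3"
  awk -v dir="$data_dir" '
    { print }
    /^redpanda:[[:space:]]*$/ && !injected {
      print "  data_directory: " dir
      injected = 1
    }
  ' "$src" > "$dst"
}
```

Writing to a separate destination mirrors the script's choice of a working copy under /var/tmp: the chart-rendered redpanda.yaml stays as the broker expects it, and only rpk sees the modified file.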

Host bind mounts are intentionally per-directory (/sys /proc /etc /usr /lib /lib64 /dev /var /run) rather than whole-/. Mounting / into /host creates mount-loops with /opt/redpanda; CORE-13685 found this the hard way.

Validation

Run 1 — EKS 1.33 / Amazon Linux 2023 / kernel 6.12, Redpanda 26.1.6, single replica

Before (no chroot patch — current behavior):

aio_events             APPLIED=true
cpu                    APPLIED=true
clocksource            APPLIED=true
swappiness             APPLIED=true
transparent_hugepages  APPLIED=true
disk_irq               APPLIED=false  ERROR: directory '' does not exists
disk_nomerges          APPLIED=false  ERROR: directory '' does not exists
disk_scheduler         APPLIED=false  ERROR: directory '' does not exists
net                    APPLIED=false  ERROR: open /proc/sys/net/core/rps_sock_flow_entries
fstrim                 APPLIED=false  ERROR: dial unix /run/systemd/private

Init container CrashLoopBackOff (rpk exits non-zero).

After (apply_host_tuners: true):

aio_events             APPLIED=true
cpu                    APPLIED=true
clocksource            APPLIED=true
swappiness             APPLIED=true
transparent_hugepages  APPLIED=true
disk_irq               APPLIED=true   ← fixed
disk_nomerges          APPLIED=true   ← fixed
disk_scheduler         APPLIED=true   ← fixed
net                    APPLIED=true   ← fixed
disk_write_cache       SUPPORTED=false ("only supported in GCP")

Init exits 0, Redpanda broker comes up cleanly, admin API responds.

Run 2 — EKS 1.31 / AL2023 / kernel 6.1.168 ARM (Graviton4 m8gd.metal-24xl), Redpanda v25.3.4, 3 replicas

End-to-end run including OMB producer/consumer smoke test (details + artifacts). With apply_host_tuners: true as the only tuning knob on the CR, freshly-rolled broker pods showed:

TUNER                  APPLIED  ENABLED  SUPPORTED  ERROR
aio_events             true     true     true
disk_irq               false    true     true       open /proc/irq/0/smp_affinity: no such file or directory
disk_nomerges          true     true     true
disk_scheduler         true     true     true
disk_write_cache       false    false    false      Disk write cache tuner is only supported in GCP
net                    true     true     true

Three of the four host-mode tuners report APPLIED=true ENABLED=true. disk_irq is ENABLED but fails on a pre-existing AL2023 kernel issue (open /proc/irq/0/smp_affinity: no such file or directory); rpk should skip IRQs without a writeable smp_affinity rather than fail, which is tracked separately. The init container does NOT crashloop on this failure: the exit-code tolerance keeps the broker booting.
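
Because the init container tolerates tuner failures, a failing tuner such as disk_irq in this run only shows up in the rpk table. A small filter can surface those rows. This is a sketch: failed_enabled_tuners is a hypothetical name, and it assumes the whitespace-separated TUNER / APPLIED / ENABLED / SUPPORTED / ERROR layout shown above.

```shell
# failed_enabled_tuners (hypothetical helper)
# Reads the rpk tuner table on stdin and prints the name of every tuner
# that is ENABLED=true but APPLIED=false. The header row is skipped.
failed_enabled_tuners() {
  awk '$1 != "TUNER" && $2 == "false" && $3 == "true" { print $1 }'
}
```

Fed a saved copy of the table, for instance `failed_enabled_tuners < tune-output.txt` (the file name is illustrative), it prints one tuner name per line and nothing when every enabled tuner applied.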

OMB ran clean for 5 min @ 20 MB/s, 0 publish errors, p99 publish 26 ms.

Lifecycle of host tuning state

What happens when you flip apply_host_tuners (or tune_aio_events) back to false, and what about reboot?

On the next pod replacement, the chart re-renders the StatefulSet:

| tune_aio_events | apply_host_tuners | Init container rendered |
| --- | --- | --- |
| false | (any) | none (no tuning init container at all) |
| true | false | the regular in-pod tuning container (pre-PR default) |
| true | true | the chroot init container + host bind mounts |
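
The same decision, written out as a tiny function for readers who prefer code to tables. This is only an illustration of the rendering rule described here, not the chart's Go implementation; tuning_init_container_mode and its three return strings are made-up names.

```shell
# tuning_init_container_mode TUNE_AIO_EVENTS APPLY_HOST_TUNERS (illustrative)
# Prints which tuning init container would be rendered for the two values:
#   none    no tuning init container
#   in-pod  the regular in-pod tuning container
#   chroot  the chroot init container plus host bind mounts
tuning_init_container_mode() {
  local tune_aio_events="$1" apply_host_tuners="$2"
  if [ "$tune_aio_events" != "true" ]; then
    echo "none"
  elif [ "$apply_host_tuners" = "true" ]; then
    echo "chroot"
  else
    echo "in-pod"
  fi
}
```

Note that apply_host_tuners has no effect on its own in this rule: with tune_aio_events off, no tuning init container is rendered at all.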

So turning the flag off and rolling pods does remove the chroot path and the host bind mounts. It does not, however, un-tune the host. The chart has no "untune" step — every kernel-level write the tuners made stays in place until something else reverses it.

| Tuned state | Where it lives | Survives flag flip? | Survives host reboot? |
| --- | --- | --- | --- |
| fs.aio-max-nr | /proc/sys/fs/aio-max-nr | yes | no (sysctl resets at boot) |
| IRQ affinity (when disk_irq works) | /proc/irq/N/smp_affinity | yes | no |
| NVMe scheduler / nomerges | /sys/block/nvmeXn1/queue/* | yes | no (sysfs resets at boot) |
| net.core.rps_sock_flow_entries | sysctl | yes | no |
| NIC XPS masks | /sys/class/net/ethN/queues/tx-N/xps_cpus | yes | no |
| irqbalance unit state | systemd | yes (left as rpk left it) | depends on unit |
| redpanda_node_tuner_state.yaml | /var/run/ (tmpfs) | yes | no (tmpfs wipes at boot) |

Practical consequences:

  • Turning the flag off on a long-running cluster does not regress the host's kernel state. Existing brokers keep using the kernel state the previous tuning applied. The host stays tuned for as long as it stays up.
  • On host reboot, all the above kernel state reverts to AL2023 defaults. If the flag is still on, the chroot init container re-tunes on next pod start — this is the design: the chart re-tunes every pod start, so reboots are self-healing. If the flag has been turned off, the host comes up un-tuned and the broker on it runs against default kernel settings.
  • The state file /var/run/redpanda_node_tuner_state.yaml is on tmpfs, so it's gone at every reboot and rpk re-runs all tuners on first pod start after boot (rather than skipping based on stale state). That's the right behavior for transient kernel state, but worth knowing if you're trying to reason about "did the tuner actually run this boot?"
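
To answer the "did the tuner actually run this boot?" question directly, the kernel values themselves can be read back on the node. The following is a minimal sketch: show_tuned_state is a hypothetical helper, the optional root prefix is an assumption for running it from a pod that has the host mounted somewhere (or against a scratch directory), and it only covers NVMe devices and a few of the locations listed in the table.

```shell
# show_tuned_state [ROOT] (hypothetical helper)
# Prints "path = value" for a handful of kernel settings the tuners write.
# ROOT is an optional prefix (e.g. a host mount point); files that do not
# exist or are not readable are silently skipped.
show_tuned_state() {
  local root="${1:-}" f
  for f in \
    "$root/proc/sys/fs/aio-max-nr" \
    "$root/proc/sys/net/core/rps_sock_flow_entries" \
    "$root"/sys/block/nvme*/queue/scheduler \
    "$root"/sys/block/nvme*/queue/nomerges; do
    if [ -r "$f" ]; then
      printf '%s = %s\n' "${f#"$root"}" "$(cat "$f")"
    fi
  done
}
```

Running it before and after a pod start, and comparing the two outputs, shows which values the init container actually changed on that node.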

If you need to revert the host kernel state immediately (e.g. you turned the flag off mid-shift and want the change to take effect right away), you have to either reboot the node or write a one-shot DaemonSet that resets the specific sysfs/sysctl values. The chart deliberately does not ship that: host de-tuning is a destructive operation on shared kernel state and shouldn't be silently performed when a user flips a chart value.
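
The chart ships nothing like this; what follows is only a sketch of what the body of such a one-shot job could look like. restore_kernel_values is a hypothetical helper, and the path/value pairs are assumed to come from a snapshot you took yourself before tuning. No default values are implied here.

```shell
# restore_kernel_values [ROOT] (hypothetical helper)
# Reads "PATH VALUE" lines on stdin and writes VALUE to ROOT+PATH.
# Blank lines and lines starting with '#' are ignored. Returns non-zero if
# any target was missing or not writable, after processing all lines.
restore_kernel_values() {
  local root="${1:-}" path value rc=0
  while read -r path value; do
    case "$path" in ''|'#'*) continue ;; esac
    if [ -w "$root$path" ]; then
      printf '%s\n' "$value" > "$root$path"
    else
      echo "skipping (not writable): $path" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

It needs the same privileges as the tuning container to write these files, and the same one-pod-per-node caution applies: two writers on one node race on the same kernel parameters.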

Security posture

Opt-in only. The default tuning container is unchanged. Enabling apply_host_tuners requires the same trust level as tune_aio_events already does — privileged container, root user, hostPath volumes.

For secure k8s installations:

  • OpenShift: bind the pod's ServiceAccount to an SCC that allows hostPath volumes and privileged: true. The built-in privileged SCC works. A custom SCC can be authored if narrower scope is required; the chart only mounts standard Linux directories (/sys /proc /etc /usr /lib /lib64 /dev /var /run) plus the tuner state file at /var/run/redpanda_node_tuner_state.yaml.
  • Pod Security Admission: namespace must be labeled pod-security.kubernetes.io/enforce: privileged (this is also required today for tune_aio_events).
  • One pod per node. Concurrent tuners race on the same kernel parameters. Users enabling this should also set a podAntiAffinity rule that disallows co-location of Redpanda pods on the same node. The value docs call this out.
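
For the Pod Security Admission item, the namespace label can be applied with kubectl. A minimal sketch follows; label_namespace_privileged is a hypothetical wrapper, and the KUBECTL override is an assumption added so the command can be dry-run (for example with KUBECTL=echo) before touching a cluster.

```shell
# label_namespace_privileged NAMESPACE (hypothetical wrapper)
# Sets the Pod Security Admission enforce level of NAMESPACE to "privileged".
# Set KUBECTL to substitute another binary, e.g. KUBECTL=echo for a dry run.
label_namespace_privileged() {
  local ns="$1"
  "${KUBECTL:-kubectl}" label namespace "$ns" \
    pod-security.kubernetes.io/enforce=privileged --overwrite
}
```

`KUBECTL=echo label_namespace_privileged redpanda` prints the arguments that would be passed to kubectl without contacting the API server; the namespace name redpanda is just an example.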

Out of scope (deliberate)

  • Not exposing every rpk.tune_* flag as a first-class chart value. Today the chart exposes tune_aio_events, tune_clocksource, tune_ballast_file at the top level, and apply_host_tuners rolls up the four host-mode flags. The rest are reachable via config.rpk.tune_* if a user explicitly wants them on.
  • No --node-tuner-state-path plumbing on the redpanda main container. The state file persists across pod restarts on the same node (so re-runs are no-op) but the broker doesn't need to read it; the broker comes up fine without it. Can be added later if dedicated-mode reporting wants it.
  • disk_irq failure on AL2023 ARM kernels (e.g. m8gd.metal-24xl) is a core/rpk-side issue, not a chart issue. The chroot path exposes IRQ 0's missing smp_affinity to rpk, which then fails the whole tuner. Tracked separately — needs rpk to skip un-pinnable IRQs rather than error.

Test plan

  • task lint — passes
  • task generate — no diff after running
  • go test ./charts/redpanda/ -run TestTemplate — passes (new tuning-host-mode golden case includes the 4 default-enabled flags)
  • helm template with apply_host_tuners=true renders the expected init container + volumes + the 4 host-mode tune_*: true flags
  • helm template with default values is byte-identical to pre-PR output
  • EKS 1.33 / AL2023 / Redpanda 26.1.6 end-to-end verified (Run 1 above)
  • EKS 1.31 / AL2023 ARM (Graviton4) / Redpanda v25.3.4, 3-replica + OMB smoke test (Run 2 above)
  • CI Operator Test Suite (will run on push)
  • CI Acceptance Tests (will run on push)

Adds `tuning.apply_host_tuners` (default `false`). When enabled, the
tuning init container builds a chroot to the host filesystem and runs
`rpk redpanda tune all` inside the host's network namespace, so the
tuners that previously failed inside the pod sandbox actually apply:

- disk_irq / disk_scheduler / disk_nomerges — need /sys/block visibility
- net — needs /proc/sys/net and host NICs
- fstrim — needs /run/systemd/private

Verified on EKS 1.33 / AL2023 / kernel 6.12 + Redpanda 26.1.6: post-
patch, APPLIED=true for aio_events, cpu, clocksource, swappiness,
transparent_hugepages, disk_irq, disk_nomerges, disk_scheduler, and
net. The only tuner that stays unsupported is disk_write_cache (GCP-
only). The redpanda broker comes up cleanly.

Three workarounds the chart layers in:

  1. The rendered redpanda.yaml omits `redpanda.data_directory` (the
     broker doesn't need it). rpk's disk tuners do, and rpk refuses to
     combine --dirs with --config. The script cp+seds a working copy
     into /var/tmp (because /tmp is not bind-mounted from the host)
     and points rpk at it.

  2. systemctl can't traverse a chroot, so the script uses busctl
     against the host's systemd to restart irqbalance after rpk
     rewrites IRQ affinity. Falls back to `pkill irqbalance` for
     non-systemd hosts.

  3. `|| true ; exit 0` so a single unsupported tuner (e.g. AWS lacks
     disk_write_cache) doesn't crashloop the init container — which
     is what happens today the moment a user sets any of the disk
     tuner flags.

Host bind mounts are intentionally per-directory (/sys /proc /etc /usr
/lib /lib64 /dev /var /run) rather than whole-/ — the latter creates
mount-loops with /opt/redpanda (CORE-13685 found this the hard way).

Security posture: opt-in only. Default is unchanged. Enabling requires
the same trust level as `tune_aio_events` already does — privileged
container, root user, hostPath volumes. On OpenShift bind a SCC that
allows hostPath and privileged (the built-in `privileged` SCC works);
on PSA clusters label the namespace `privileged`. Document warning in
values.go: this MUST NOT be combined with multiple Redpanda pods per
node, or concurrent tuners race on the same kernel parameters.

This change is chart-only; the operator v2 path picks it up via the
Redpanda CR's `clusterSpec.tuning.apply_host_tuners` passthrough, no
operator code change needed.

Closes K8S-101. Productionizes the chroot pattern from CORE-13685.
david-yu and others added 2 commits May 12, 2026 21:07
CI's TestHelmValuesCompat/clusterSpec failed because the new
charts/redpanda Tuning.ApplyHostTuners field wasn't mirrored on the
operator's v1alpha2 RedpandaClusterSpec.Tuning (and StretchTuning) —
PartialValues serialized {"tuning":{"apply_host_tuners":false}} while
the operator side serialized {"tuning":{}}.

Adds the field to both structs, regenerates deepcopy, applyconfiguration,
CRDs, and crd-docs.

Also includes a protoc-gen-go-grpc v1.6.1 → v1.6.2 regen that CI's nix
pin emits on `task generate` — required for the lint step's
`git diff --exit-code` to pass.
…rs=true

ApplyHostTuners only controls which tuning init container template the
chart renders (chroot vs. in-pod). It does not enable the per-tuner rpk
flags those tuners are gated on inside redpanda.yaml. End-to-end on
m8gd.metal-24xl confirmed that flipping just apply_host_tuners=true
produces "rpk redpanda tune all" output where every tuner except
aio_events shows APPLIED=false / ENABLED=false — i.e. the chroot path is
running but the disk_irq / disk_scheduler / disk_nomerges / net tuners
that motivate the PR are silently skipped.

Auto-enable those four when ApplyHostTuners is on, so a user who sets
the one chart knob the PR documents actually exercises the code path.
Override semantics are intentionally not provided: helmette.Merge is
first-arg-wins, so config.rpk.tune_* set on the CR loses to the chart's
defaults anyway. Calling that out in the docstring so users don't expect
override.

Golden testdata for tuning-host-mode updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@david-yu

Contributor Author

Cross-cloud validation: GKE + AKS

Tested the PR (HEAD b876709c) on GKE and AKS in addition to the existing EKS runs in the PR description. Same operator chart from this branch, same Redpanda v25.3.4, same minimal CR (tune_aio_events: true + apply_host_tuners: true, no config.rpk.tune_* overrides), single broker, default storage class on each cloud.

GKE — Ubuntu 24.04 / kernel 6.8.0-1049-gke / n2-standard-8

TUNER                  APPLIED  ENABLED  SUPPORTED  ERROR
aio_events             true     true     true
ballast_file           false    false    true
clocksource            false    false    true
coredump               false    false    true
cpu                    false    false    true
disk_irq               true     true     true       ← fixed by chroot
disk_nomerges          true     true     true       ← fixed by chroot
disk_scheduler         true     true     true       ← fixed by chroot
disk_write_cache       false    false    true
fstrim                 false    false    false      err=fork/exec /usr/bin/which: no such file or directory
net                    true     true     true       ← fixed by chroot
swappiness             false    false    true
transparent_hugepages  false    false    true

All four host-mode tuners APPLIED=true ENABLED=true. Init container exits 0, broker Ready in under 90s. Note that disk_write_cache reports SUPPORTED=true here (unlike on AWS, where rpk reports the tuner as GCP-only); it is not enabled because the CR doesn't toggle it.

AKS — Ubuntu 22.04 / kernel 5.15.0-1110-azure / Standard_D8s_v5

TUNER                  APPLIED  ENABLED  SUPPORTED  ERROR
aio_events             true     true     true
ballast_file           false    false    true
clocksource            false    false    true
coredump               false    false    true
cpu                    false    false    true
disk_irq               true     true     true       ← fixed by chroot
disk_nomerges          true     true     true       ← fixed by chroot
disk_scheduler         true     true     true       ← fixed by chroot
disk_write_cache       false    false    false      Disk write cache tuner is only supported in GCP
fstrim                 false    false    false      err=fork/exec /usr/bin/which: no such file or directory
net                    true     true     true       ← fixed by chroot
swappiness             false    false    true
transparent_hugepages  false    false    true

Identical pattern. All four host-mode tuners apply cleanly. Broker Ready in ~50s.

Cross-cloud summary for the four PR-target tuners

| Tuner | EKS m8gd.metal-24xl (AL2023 ARM, kernel 6.1.168) | GKE Ubuntu 24.04 (kernel 6.8) | AKS Ubuntu 22.04 (kernel 5.15) |
| --- | --- | --- | --- |
| disk_irq | ❌ pre-existing kernel-level issue: /proc/irq/0/smp_affinity: no such file or directory | ✅ APPLIED | ✅ APPLIED |
| disk_scheduler | ✅ APPLIED | ✅ APPLIED | ✅ APPLIED |
| disk_nomerges | ✅ APPLIED | ✅ APPLIED | ✅ APPLIED |
| net | ✅ APPLIED | ✅ APPLIED | ✅ APPLIED |

3 of 4 work on every cloud; disk_irq only fails on AL2023 ARM (the AWS metal SKU). That failure is exposed by the chroot path but caused by AL2023 ARM not having a writeable smp_affinity for IRQ 0 — not a chart problem (tracked separately for rpk to skip un-pinnable IRQs rather than error).

Minor non-blocker

Both GKE and AKS show fstrim errors with fork/exec /usr/bin/which: no such file or directory. That's rpk's tuner using which to detect fstrim's presence and Ubuntu's minimal container image not having it. Doesn't affect any of the four host-mode tuners and fstrim is ENABLED=false anyway. Worth a follow-up rpk fix but not a PR 1522 issue.

Setup notes (for reproducibility)

  • Image: pr1522-b876709c built and pushed to ephemeral registries (GAR for GKE pull, ACR for AKS pull). Both arches (linux/amd64, linux/arm64) in the same manifest.
  • CR: minimal — single replica, no TLS, no NodePort, no anti-affinity (single-node test).
  • Namespace label: pod-security.kubernetes.io/enforce: privileged (required by chart's tuning init container; same as tune_aio_events: true in PR pre-change behavior).
  • Clusters/ACR/GAR torn down post-test.

@github-actions


This PR is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions Bot added the stale label May 19, 2026
@david-yu david-yu removed the stale label May 19, 2026
@github-actions github-actions Bot added the stale label May 25, 2026
@david-yu david-yu removed the stale label May 25, 2026
@github-actions github-actions Bot added the stale label May 31, 2026
@david-yu david-yu removed the stale label May 31, 2026
@github-actions github-actions Bot added the stale label Jun 6, 2026
@david-yu david-yu removed the stale label Jun 6, 2026
@github-actions github-actions Bot added the stale label Jun 12, 2026
@david-yu david-yu removed the stale label Jun 14, 2026
@github-actions github-actions Bot added the stale label Jun 20, 2026