
Commit 4977e3f

k8s: manage Caddy ingress image via spec (so-p3p) (#749)
Closes so-p3p:

- New spec key `caddy-ingress-image`: on fresh install, deploys Caddy with this image; on subsequent `deployment start`, patches the running Caddy Deployment if the image differs. Defaults to the manifest's hardcoded image when absent.
- When the spec key is absent, SO does **not** touch a running Caddy — avoids silently reverting an image set out-of-band (ansible playbook, another deployment's spec).
- `strategy: Recreate` on the Caddy Deployment manifest (required — hostPort 80/443 deadlocks rolling updates).
- Reconcile runs under both `--perform-cluster-management` and the default `--skip-cluster-management` (it's a k8s-API patch, not a cluster-lifecycle op).
- Image is templated by container name rather than by string match, so the spec override wins regardless of what the shipped manifest hardcodes.
- Cluster-scoped caveat documented: `caddy-system` is shared across deployments, so the last `deployment start` that sets the key wins for everyone.
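The reconcile gating described above can be sketched as a small predicate. This is a hypothetical helper for illustration, not the commit's code (the real logic lives inline in `_setup_cluster` in `deploy_k8s.py`):

```python
from typing import Optional


def should_patch_caddy(
    spec_image: Optional[str],
    running_image: Optional[str],
) -> bool:
    """Decide whether `deployment start` should patch the Caddy image.

    - Spec key absent (None): never touch a running Caddy, so an image
      set out-of-band is not silently reverted.
    - Ingress not running (running_image None): nothing to patch; the
      fresh-install path handles that case.
    - Otherwise patch only when the images actually differ.
    """
    if spec_image is None or running_image is None:
        return False
    return running_image != spec_image


# Spec absent: never patch, even if the cluster runs a custom image
assert not should_patch_caddy(None, "ghcr.io/acme/caddy:v9")
# Spec set and different: patch
assert should_patch_caddy("ghcr.io/acme/caddy:v2", "ghcr.io/acme/caddy:v1")
# Spec set and already matching: no-op
assert not should_patch_caddy("ghcr.io/acme/caddy:v1", "ghcr.io/acme/caddy:v1")
```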
1 parent 421b83c commit 4977e3f

7 files changed

Lines changed: 166 additions & 2 deletions


.pebbles/events.jsonl

Lines changed: 2 additions & 0 deletions
```diff
@@ -52,3 +52,5 @@
 {"type":"status_update","timestamp":"2026-04-21T05:57:12.928842469Z","issue_id":"so-n1n","payload":{"status":"closed"}}
 {"type":"comment","timestamp":"2026-04-21T06:08:13.933886638Z","issue_id":"so-ad7","payload":{"body":"Fixed in PR #744 (cf8b7533). get_services() now includes the maintenance pod in the container-ports map so its per-pod Service is built and available for the Ingress swap."}}
 {"type":"status_update","timestamp":"2026-04-21T06:08:14.457815115Z","issue_id":"so-ad7","payload":{"status":"closed"}}
+{"type":"update","timestamp":"2026-04-21T09:00:47.364859946Z","issue_id":"so-p3p","payload":{"description":"## Problem\n\nThe Caddy ingress controller image is hardcoded in `ingress-caddy-kind-deploy.yaml`, with no mechanism to update it short of cluster recreation or manual `kubectl patch`. laconic-so should: (1) allow spec.yml to specify a Caddy image, (2) support updating the Caddy image as part of `deployment start`, (3) set `strategy: Recreate` on the Caddy Deployment since hostPort pods can't rolling-update.\n\n## Resolution\n\n- New spec key `caddy-ingress-image`. Fresh install uses it (fallback: manifest default). On subsequent `deployment start`, if the spec key is set and the running Caddy image differs, SO patches the Deployment and waits for rollout.\n- Spec key absent =\u003e SO does **not** touch a running Caddy, to avoid silently reverting images set out-of-band (ansible playbook, another deployment's spec).\n- `strategy: Recreate` added to the Caddy Deployment manifest.\n- Reconcile runs under both `--perform-cluster-management` and the default `--skip-cluster-management` (it's a plain k8s-API patch, not a cluster lifecycle op).\n- Image substitution locates the container by name instead of string-matching the shipped default, so the spec override wins regardless of what the manifest hardcodes.\n- Cluster-scoped caveat: `caddy-system` is shared across deployments; last `deployment start` that sets the key wins for everyone. Documented in `deployment_patterns.md`."}}
+{"type":"status_update","timestamp":"2026-04-21T09:00:47.745675131Z","issue_id":"so-p3p","payload":{"status":"closed"}}
```

docs/deployment_patterns.md

Lines changed: 35 additions & 0 deletions
````diff
@@ -202,6 +202,41 @@ with a `DeployerException` pointing at the `namespace:` spec
 override. Catches operator-error cases where the same deployment dir
 is effectively registered twice.
 
+### Caddy ingress image lifecycle
+
+The Caddy ingress controller lives in the cluster-scoped
+`caddy-system` namespace and is installed on first `deployment start`.
+Its image is configurable per deployment:
+
+```yaml
+# spec.yml
+caddy-ingress-image: ghcr.io/laconicnetwork/caddy-ingress:v1.2.3
+```
+
+Two cases, intentionally different:
+
+- **Spec key set**: on first install the manifest is templated with
+  this image. On subsequent `deployment start`, if the running Caddy
+  Deployment's image differs, laconic-so patches it and waits for the
+  rollout. The Deployment uses `strategy: Recreate` (hostPort 80/443
+  blocks rolling updates from ever completing), so expect ~10–30s of
+  ingress downtime while the old pod terminates and the new one
+  starts.
+- **Spec key absent**: on first install the manifest's hardcoded
+  default (`ghcr.io/laconicnetwork/caddy-ingress:latest`) is used.
+  On subsequent `deployment start`, laconic-so does **not** touch the
+  running Caddy Deployment. This matters when the image was set
+  out-of-band (via an ansible playbook, or by another deployment's
+  spec that's since been removed) — a silent revert to the default
+  would be worse than doing nothing. If you want to go back to the
+  default image, set `caddy-ingress-image` to it explicitly.
+
+**Cluster-scoped caveat**: `caddy-system` is shared by every
+deployment on the cluster. Setting `caddy-ingress-image` in any one
+deployment's spec rolls the controller for all of them — last
+`deployment start` wins. Treat it as a cluster-level knob; keep the
+value consistent across the deployments sharing a cluster.
+
 ## Volume Persistence in k8s-kind
 
 k8s-kind has 3 storage layers:
````
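The two-case behavior and the last-writer-wins caveat from the docs above can be modeled with a toy simulation. Everything here is hypothetical illustration: the dict stands in for shared `caddy-system` cluster state, and `deployment_start` compresses laconic-so's install/reconcile paths into a few lines:

```python
from typing import Optional


def deployment_start(cluster: dict, spec_image: Optional[str]) -> None:
    """Toy model of how successive `deployment start` runs affect the
    single shared Caddy image in the cluster-scoped caddy-system
    namespace (not the real implementation)."""
    if cluster.get("caddy_image") is None:
        # Fresh install: use the spec value, else the manifest default.
        cluster["caddy_image"] = (
            spec_image or "ghcr.io/laconicnetwork/caddy-ingress:latest"
        )
    elif spec_image is not None and spec_image != cluster["caddy_image"]:
        # Reconcile: only an explicit spec value patches a running Caddy;
        # an absent key leaves out-of-band images alone.
        cluster["caddy_image"] = spec_image


cluster = {}
deployment_start(cluster, None)      # fresh install -> manifest default
deployment_start(cluster, "img:v2")  # deployment A pins v2 -> rolls Caddy
deployment_start(cluster, None)      # deployment B, no key -> untouched
assert cluster["caddy_image"] == "img:v2"
deployment_start(cluster, "img:v3")  # last writer wins for everyone
assert cluster["caddy_image"] == "img:v3"
```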

stack_orchestrator/constants.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -48,5 +48,7 @@
 high_memlock_spec_filename = "high-memlock-spec.json"
 acme_email_key = "acme-email"
 kind_mount_root_key = "kind-mount-root"
+caddy_ingress_image_key = "caddy-ingress-image"
+default_caddy_ingress_image = "ghcr.io/laconicnetwork/caddy-ingress:latest"
 external_services_key = "external-services"
 ca_certificates_key = "ca-certificates"
```

stack_orchestrator/data/k8s/components/ingress/ingress-caddy-kind-deploy.yaml

Lines changed: 5 additions & 0 deletions
```diff
@@ -160,6 +160,11 @@ metadata:
     app.kubernetes.io/component: controller
 spec:
   replicas: 1
+  # Recreate is required: the pod binds hostPort 80/443, which a
+  # RollingUpdate would try to double-claim during cutover (new pod
+  # pending until old pod exits — never exits, rollout deadlocks).
+  strategy:
+    type: Recreate
   selector:
     matchLabels:
       app.kubernetes.io/name: caddy-ingress-controller
```

stack_orchestrator/deploy/k8s/deploy_k8s.py

Lines changed: 18 additions & 0 deletions
```diff
@@ -34,6 +34,7 @@
 )
 from stack_orchestrator.deploy.k8s.helpers import (
     install_ingress_for_kind,
+    update_caddy_ingress_image,
     wait_for_ingress_in_kind,
     is_ingress_running,
 )
@@ -880,18 +881,35 @@ def _setup_cluster(self):
             check_mounts_compatible(existing, kind_config)
         self.connect_api()
         self._ensure_namespace()
+        caddy_image = self.cluster_info.spec.get_caddy_ingress_image()
+        # Fresh-install path: gated on cluster lifecycle ownership
+        # because install_ingress_for_kind also seeds caddy-system
+        # (namespace, secrets restore, cert-backup CronJob).
         if self.is_kind() and not self.skip_cluster_management:
             if not is_ingress_running():
                 install_ingress_for_kind(
                     self.cluster_info.spec.get_acme_email(),
                     self.cluster_info.spec.get_kind_mount_root(),
+                    caddy_image=caddy_image,
                 )
                 wait_for_ingress_in_kind()
             if self.cluster_info.spec.get_unlimited_memlock():
                 _create_runtime_class(
                     constants.high_memlock_runtime,
                     constants.high_memlock_runtime,
                 )
+        # Reconcile Caddy image whenever the operator explicitly set
+        # it in spec, regardless of cluster lifecycle ownership —
+        # --skip-cluster-management (the default) shouldn't prevent
+        # a routine k8s-API-level patch of a running Deployment.
+        # Spec absent => don't touch: the operator may have set the
+        # image out-of-band (ansible playbook, prior explicit spec on
+        # a different deployment) and a silent revert would be worse
+        # than doing nothing. caddy-system is cluster-scoped, so
+        # whichever deployment's spec sets the image last wins.
+        if self.is_kind() and caddy_image is not None and is_ingress_running():
+            if update_caddy_ingress_image(caddy_image):
+                wait_for_ingress_in_kind()
 
     def _create_ingress(self):
         """Create or update Ingress with TLS certificate lookup."""
```

stack_orchestrator/deploy/k8s/helpers.py

Lines changed: 86 additions & 2 deletions
```diff
@@ -466,7 +466,9 @@ def wait_for_ingress_in_kind():
 
 
 def install_ingress_for_kind(
-    acme_email: str = "", kind_mount_root: Optional[str] = None
+    acme_email: str = "",
+    kind_mount_root: Optional[str] = None,
+    caddy_image: Optional[str] = None,
 ):
     api_client = client.ApiClient()
     ingress_install = os.path.abspath(
@@ -477,7 +479,7 @@ def install_ingress_for_kind(
     if opts.o.debug:
         print("Installing Caddy ingress controller in kind cluster")
 
-    # Template the YAML with email before applying
+    # Template the YAML with email and image before applying
     with open(ingress_install) as f:
        yaml_content = f.read()
 
@@ -488,6 +490,27 @@ def install_ingress_for_kind(
 
     yaml_objects = list(yaml.safe_load_all(yaml_content))
 
+    # Override the Caddy container's image when a spec value is set.
+    # Works regardless of what's hardcoded in the manifest — we locate
+    # the container by name and overwrite its image field, rather than
+    # relying on a string match of the default.
+    if caddy_image:
+        for obj in yaml_objects:
+            if not obj:
+                continue
+            if (
+                obj.get("kind") == "Deployment"
+                and obj.get("metadata", {}).get("name")
+                == "caddy-ingress-controller"
+            ):
+                for c in (
+                    obj["spec"]["template"]["spec"].get("containers") or []
+                ):
+                    if c.get("name") == "caddy-ingress-controller":
+                        c["image"] = caddy_image
+                if opts.o.debug:
+                    print(f"Configured Caddy image: {caddy_image}")
+
     # Split: apply everything except the Caddy controller Deployment first,
     # so the namespace + secrets exist before the pod can start and read its
     # secret_store. Race-free: Caddy has no way to see the cluster until
@@ -530,6 +553,67 @@ def _is_caddy_deployment(o):
     _install_caddy_cert_backup(api_client, kind_mount_root)
 
 
+def update_caddy_ingress_image(caddy_image: str) -> bool:
+    """Patch the running Caddy ingress Deployment to a new image.
+
+    No-op if the live Deployment already runs the requested image.
+    Returns True if a patch was applied, False otherwise.
+
+    Caddy lives in the cluster-scoped `caddy-system` namespace, so
+    this affects every deployment sharing the cluster. The
+    `strategy: Recreate` in the Deployment manifest handles the
+    hostPort-80/443 handoff; expect ~10-30s of ingress downtime while
+    the old pod terminates and the new one starts.
+    """
+    apps_api = client.AppsV1Api()
+    try:
+        dep = apps_api.read_namespaced_deployment(
+            name="caddy-ingress-controller", namespace="caddy-system"
+        )
+    except ApiException as e:
+        if e.status == 404:
+            if opts.o.debug:
+                print(
+                    "Caddy ingress Deployment not found; nothing to "
+                    "update (install path handles fresh clusters)"
+                )
+            return False
+        raise
+
+    containers = dep.spec.template.spec.containers or []
+    current = containers[0].image if containers else None
+    if current == caddy_image:
+        if opts.o.debug:
+            print(f"Caddy image already at {caddy_image}; no update needed")
+        return False
+
+    print(
+        f"Updating Caddy ingress image: {current} -> {caddy_image} "
+        "(expect brief ingress downtime)"
+    )
+    patch = {
+        "spec": {
+            "template": {
+                "spec": {
+                    "containers": [
+                        {
+                            "name": "caddy-ingress-controller",
+                            "image": caddy_image,
+                            "imagePullPolicy": "Always",
+                        }
+                    ]
+                }
+            }
+        }
+    }
+    apps_api.patch_namespaced_deployment(
+        name="caddy-ingress-controller",
+        namespace="caddy-system",
+        body=patch,
+    )
+    return True
+
+
 def load_images_into_kind(kind_cluster_name: str, image_set: Set[str]):
     for image in image_set:
         result = _run_command(
```
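The by-name override in `install_ingress_for_kind` can be exercised without a cluster, since it is a pure walk over parsed manifest objects. Below is a self-contained sketch mirroring that logic on plain dicts (the pared-down manifest and its non-default hardcoded image are invented for the demonstration):

```python
def override_caddy_image(yaml_objects: list, caddy_image: str) -> None:
    """Overwrite the Caddy container's image, locating the container by
    name rather than string-matching a default image (mirrors the
    install-path templating in the diff above)."""
    for obj in yaml_objects:
        if not obj:
            continue
        if (
            obj.get("kind") == "Deployment"
            and obj.get("metadata", {}).get("name") == "caddy-ingress-controller"
        ):
            for c in obj["spec"]["template"]["spec"].get("containers") or []:
                if c.get("name") == "caddy-ingress-controller":
                    c["image"] = caddy_image


# Pared-down manifest with a non-default hardcoded image: the override
# still wins, because nothing depends on what the manifest ships.
manifest = [
    {"kind": "Namespace", "metadata": {"name": "caddy-system"}},
    {
        "kind": "Deployment",
        "metadata": {"name": "caddy-ingress-controller"},
        "spec": {"template": {"spec": {"containers": [
            {"name": "caddy-ingress-controller", "image": "something/else:v0"}
        ]}}},
    },
]
override_caddy_image(manifest, "ghcr.io/acme/caddy:v1.2.3")
containers = manifest[1]["spec"]["template"]["spec"]["containers"]
assert containers[0]["image"] == "ghcr.io/acme/caddy:v1.2.3"
```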

stack_orchestrator/deploy/spec.py

Lines changed: 18 additions & 0 deletions
```diff
@@ -304,6 +304,24 @@ def get_kind_mount_root(self) -> typing.Optional[str]:
         """
         return self.obj.get(constants.kind_mount_root_key)
 
+    def get_caddy_ingress_image(self) -> typing.Optional[str]:
+        """Return the Caddy ingress controller image override, or None.
+
+        Returns None (not the default image) when the spec key is
+        absent. That distinction matters: the install path falls back
+        to the hardcoded default so there's always *some* image to
+        deploy, while the update-on-reuse path treats None as "operator
+        didn't ask to touch Caddy" and skips the patch — avoiding
+        silent reverts of an image set out-of-band (e.g. via an
+        ansible playbook or a prior deployment's spec).
+
+        Cluster-scoped: the Caddy ingress lives in the shared
+        `caddy-system` namespace, so setting this key in any
+        deployment's spec rolls the controller for every deployment
+        using the cluster.
+        """
+        return self.obj.get(constants.caddy_ingress_image_key)
+
     def get_maintenance_service(self) -> typing.Optional[str]:
         """Return maintenance-service value (e.g. 'dumpster-maintenance:8000') or None.
```
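The None-vs-default distinction the docstring insists on can be shown in a few lines. A minimal sketch with a plain dict standing in for `Spec.obj` (the key and default values are the ones this commit adds to `constants.py`; the function here is a stand-in, not the real method):

```python
# Values from this commit's constants.py
caddy_ingress_image_key = "caddy-ingress-image"
default_caddy_ingress_image = "ghcr.io/laconicnetwork/caddy-ingress:latest"


def get_caddy_ingress_image(spec_obj: dict):
    """Stand-in for Spec.get_caddy_ingress_image: None when absent."""
    return spec_obj.get(caddy_ingress_image_key)


with_key = {"caddy-ingress-image": "ghcr.io/acme/caddy:v2"}
without_key = {}

assert get_caddy_ingress_image(with_key) == "ghcr.io/acme/caddy:v2"
# Absent => None, NOT the default: the update path reads None as
# "don't touch Caddy", while only the install path applies the
# fallback so there's always some image to deploy.
assert get_caddy_ingress_image(without_key) is None
install_image = get_caddy_ingress_image(without_key) or default_caddy_ingress_image
assert install_image == default_caddy_ingress_image
```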