
Commit 1334900

so-o2o: detect etcd image dynamically + diagnose whitelist cleanup bugs (#745)
Replaces the hardcoded `gcr.io/etcd-development/etcd:v3.5.9` in `_clean_etcd_keeping_certs` with a dynamic ref captured from the running Kind node via `crictl`, persisted to `{backup_dir}/etcd-image.txt`, and reused on subsequent cleanup runs. It self-adapts to Kind upgrades, with no version table to maintain. Testing on Kind v0.32 / etcd 3.6 surfaced two additional bugs in the whitelist cleanup that this PR does **not** fix (see the so-o2o comments): (a) the restore step pipes raw protobuf values through bash `echo`, corrupting binary bytes; (b) the whitelist omits the cluster-admin RBAC binding, ServiceAccounts, and bootstrap tokens needed by kubeadm's pre-addon health check. This merges the narrow fix plus the diagnosis trail; a follow-up branch will replace the etcd-surgery approach with a kubectl-level Caddy secret backup/restore.
Parent: cf8b753

2 files changed

Lines changed: 83 additions & 2 deletions


.pebbles/events.jsonl

Lines changed: 3 additions & 0 deletions
@@ -41,3 +41,6 @@
{"type":"close","timestamp":"2026-04-16T06:24:42.153940477Z","issue_id":"so-076.2","payload":{}}
{"type":"create","timestamp":"2026-04-16T07:26:56.820142001Z","issue_id":"so-ad7","payload":{"description":"_restart_with_maintenance in deployment.py patches Ingress backends to point at the maintenance Service, but that Service is never created. get_services() in cluster_info.py only builds per-pod ClusterIP Services for pods referenced by http-proxy routes (cluster_info.py:991-992 'if not ports_set: continue'). The maintenance pod has no http-proxy route by design, so no Service is built for it.\n\nResult: during a restart with maintenance-service configured, the Ingress points to a non-existent Service. Caddy has no valid backend, connection fails, users see 'site cannot be reached' instead of the maintenance page. Cryovial logs correctly report the swap happened.\n\n_resolve_service_name_for_container (cluster_info.py:183) and get_services() (cluster_info.py:945) operate on inconsistent premises — the resolver assumes every pod has a {app_name}-{pod_name}-service; the builder only creates one for http-proxy-referenced pods.\n\nFix: create_services() should also build a Service for the container named by spec's maintenance-service: key.","priority":"3","title":"Maintenance swap routes Ingress to non-existent Service","type":"bug"}}
{"type":"create","timestamp":"2026-04-16T08:21:00.832961223Z","issue_id":"so-b9a","payload":{"description":"_resolve_service_name_for_container (cluster_info.py:183) mechanically returns {app_name}-{pod_name}-service for any container, with no awareness of whether get_services() actually built that Service. get_services() only builds Services for pods referenced by http-proxy or maintenance-service.\n\nCurrent callers happen to be safe: get_ingress() only passes http-proxy containers, _restart_with_maintenance passes the maintenance container (covered by so-ad7). But any future caller that passes a container outside {http-proxy ∪ maintenance-service} gets a ghost Service name and silent failure.\n\nFix direction (when a third caller emerges): either teach the resolver to return None / raise when the Service wasn't built, or make get_services() build a per-pod Service unconditionally for every pod with compose ports, aligning structure with the resolver's assumption.","priority":"4","title":"Service-name resolver and builder operate on inconsistent premises","type":"bug"}}
+{"type":"comment","timestamp":"2026-04-16T13:36:20.150833128Z","issue_id":"so-o2o","payload":{"body":"Reproduced and partially diagnosed locally. Original 'backup not persisting' framing turns out to be inaccurate — the host bind-mount works fine and the cleanup function runs end-to-end. The actual bug is downstream of those.\n\nWhat we confirmed:\n- The etcd extraMount at \u003cdeployment_dir\u003e/data/cluster-backups/\u003cid\u003e/etcd is honored. After 'kind delete', the host-side data persists (16MB db file, snap files intact, owned by root mode 0700).\n- _clean_etcd_keeping_certs (helpers.py:120-279) actually runs to completion. Evidence: timestamped 'member.backup-YYYYMMDD-HHMMSS' dirs accumulate (created at line 257-260, the last step before the swap-in).\n\nWhat actually breaks:\n- After cleanup + 'kind create cluster', kubeadm init fails. kube-apiserver never opens :6443 ('connection refused' loop until kubeadm gives up). kubelet itself is healthy.\n- Hypothesis (high confidence, not yet proven by inspecting an etcd container log): version skew. Cleanup uses gcr.io/etcd-development/etcd:v3.5.9 (helpers.py:148) which produces v3.5-format on-disk data. Kind v0.32 ships kindest/node:v1.35.1 with etcd v3.6.x, which can't read v3.5-format data and crashes — apiserver can't reach it.\n- Diagnostic that nails the version skew: moving the persisted etcd dir aside ('mv etcd etcd.away') and re-running 'start --perform-cluster-management' succeeds cleanly. With persisted-etcd present, fails. So the cleanup output is what breaks the new cluster.\n\nWhy prod hasn't hit this: woodburn runs kind v0.20.0 (kindest/node:v1.27.x with etcd v3.5.x) — compatible with the v3.5.9 cleanup image. Bug is dormant there until kind is bumped.\n\nWhat we do NOT know:\n- Whether Caddy certs would actually survive a successful recreate. Cluster never came up after cleanup, so we couldn't inspect /registry/secrets/caddy-system in the new etcd. The cleanup function's whitelist preserves them in theory, but end-to-end preservation is unverified.\n\nWhat's also broken regardless of root cause:\n- _clean_etcd_keeping_certs gates ALL its diagnostic prints on opts.o.debug (lines 141, 145, 274, 278) and returns False silently on failure. With a normal (non-debug) run, the operator gets zero indication that cleanup attempted, succeeded, or failed. Silent failure was 90% of why this took so long to diagnose.\n\nFix direction:\n1. Source etcdctl/etcdutl from the same kindest/node image kind is using, so on-disk format always matches what the cluster will boot with. Self-adapts to kind upgrades.\n2. Make failure messages unconditional prints, not gated on debug.\n3. After (1), re-test cert preservation end-to-end and update findings."}}
+{"type":"comment","timestamp":"2026-04-16T14:34:14.74327248Z","issue_id":"so-o2o","payload":{"body":"Resolved direction: replace the etcd-level cleanup mechanism with a kubectl-level Caddy secret backup/restore in SO.\n\nWhy the layer change: the etcd approach (current code, or even a cleaner snapshot save/restore) shares one fundamental problem — restoring an entire etcd snapshot resurrects ALL prior cluster state, conflicts with the new cluster's bootstrap, and forces us into a maintained whitelist (kindnet/coredns/etc.). It also couples SO to etcd binary versions. Working at the k8s/kubectl layer instead gives us a selective, version-portable, operator-readable backup that exactly matches what we want: a small set of Caddy Secrets.\n\nKey insight from the use case: Caddy's secret_store reuses any valid cert it finds in its store on startup. ACME challenges only fire for domains it has no cert for. So if we restore the right Secrets BEFORE Caddy comes up on a new cluster, no Let's Encrypt traffic happens, no rate limit risk, no TLS gap, and new domains added later still provision normally.\n\nProposed spec API:\n network:\n caddy-cert-backup:\n path: ~/.credentials/caddy-certs/ # operator-owned host dir; presence enables\n\nLifecycle:\n1. destroy_cluster, before 'kind delete', if path configured:\n kubectl get secrets -n caddy-system -l manager=caddy -o yaml \u003e \u003cpath\u003e/caddy-secrets.yaml\n (loud failure if cluster gone or kubectl errors)\n2. install_ingress_for_kind, after manifest applied:\n kubectl apply -f \u003cpath\u003e/caddy-secrets.yaml if it exists\n\nWhat this lets us delete:\n- _clean_etcd_keeping_certs (~150 LoC of docker-in-docker shell-in-Python)\n- _get_etcd_host_path_from_kind_config\n- etcd extraMount setup in _generate_kind_mounts\n- Hardcoded gcr.io/etcd-development/etcd:v3.5.9 image\n- The whitelist that needs maintenance per kind version\n- All silent-failure paths\n\nWhat woodburn can simplify after the SO fix:\n- inject_caddy_certs.yml and the kubectl-backup half of cluster_recreate.yml become redundant\n- extract-caddy-certs.sh (auger) stays as a disaster-recovery tool for cases where the in-SO backup didn't run\n\nPros vs. alternatives:\n- Operator-readable PEM YAML, can be git-committed / sops-encrypted / mirrored to S3\n- Zero version coupling\n- Selective by design (only what we want)\n- Lifecycle hooks are atomic with cluster lifecycle, no remembering to run a separate backup playbook\n- ~30 LoC in SO, no docker-in-docker\n\nBranch and implementation pending operator decision to proceed."}}
+{"type":"comment","timestamp":"2026-04-17T08:13:32.753112339Z","issue_id":"so-o2o","payload":{"body":"Tested the version-detection fix (commit 832ab66d) locally. Fix works for its scope but surfaces two more bugs downstream. Current approach is broken at the architectural level, not just one-bug-fixable.\n\nWhat 832ab66d does: captures etcd image ref from crictl after cluster create, writes to {backup_dir}/etcd-image.txt, reads it on subsequent cleanup runs. Self-adapts to Kind upgrades. No more hardcoded v3.5.9. Confirmed locally: etcd-image.txt is written after first create, cleanup on second start uses it, member.backup-YYYYMMDD-HHMMSS dir is produced (proves cleanup ran end-to-end).\n\nWhat still fails after version fix: kubeadm init on cluster recreate. apiserver comes up but returns:\n- 403 Forbidden: User \"kubernetes-admin\" cannot get path /livez\n- 500: Body was not decodable ... json: cannot unmarshal array into Go value of type struct\n- eventually times out waiting for apiserver /livez\n\nTwo new bugs behind those:\n\n(a) Restore step corrupts binary values. In _clean_etcd_keeping_certs the restore loop is:\n key=$(echo $encoded | base64 -d | jq -r .key | base64 -d)\n val=$(echo $encoded | base64 -d | jq -r .value | base64 -d)\n echo \"$val\" | /backup/etcdctl put \"$key\"\nk8s stores objects as protobuf. Piping raw protobuf through bash variable expansion + echo mangles non-printable bytes, truncates at null bytes, and appends a trailing newline. Explains the \"cannot unmarshal\" from apiserver — the kubernetes Service/Endpoints objects in /registry are corrupted on re-put.\n\n(b) Whitelist is too narrow. We keep only /registry/secrets/caddy-system and the /registry/services entries for kubernetes. Everything else is deleted — including /registry/clusterrolebindings (cluster-admin is gone), /registry/serviceaccounts, /registry/secrets/kube-system (bootstrap tokens), RBAC roles, apiserver's auth config. Explains the 403 for kubernetes-admin — cluster-admin binding doesn't exist yet and kubeadm's pre-addon health check can't authorize.\n\nFixing (a) would mean rewriting the restore step to not use shell piping — either use a proper etcdctl-based Go tool, or write directly to the on-disk snapshot format. Fixing (b) means exhaustively whitelisting everything kubeadm/apiserver bootstrapping needs — a moving target across k8s versions. Both together are a significant undertaking for the actual requirement (\"keep 4 Caddy secrets across cluster recreate\").\n\nDecision: merge 832ab66d for the narrow version-detection fix + diagnosis trail, then implement the kubectl-level backup/restore on a separate branch. The etcd approach is not salvageable at reasonable cost."}}

stack_orchestrator/deploy/k8s/helpers.py

Lines changed: 80 additions & 2 deletions
@@ -17,8 +17,10 @@
 from kubernetes.client.exceptions import ApiException
 import os
 from pathlib import Path
+import shlex
 import subprocess
 import re
+import time
 from typing import Set, Mapping, List, Optional, cast
 import yaml

@@ -117,6 +119,66 @@ def _get_etcd_host_path_from_kind_config(config_file: str) -> Optional[str]:
     return None


+def _etcd_image_ref_path(etcd_path: str) -> Path:
+    """Location of the persisted etcd image reference file."""
+    return Path(etcd_path).parent / "etcd-image.txt"
+
+
+def _capture_etcd_image(cluster_name: str, etcd_path: str) -> bool:
+    """Persist the etcd image ref from a running Kind cluster.
+
+    Kind runs etcd as a static pod via containerd inside the node container.
+    We query crictl to discover which etcd image the current Kind version
+    uses, then write it alongside the etcd backup so future
+    ``_clean_etcd_keeping_certs`` calls use a matching version (avoiding
+    on-disk format skew between etcd releases).
+    """
+    node_name = f"{cluster_name}-control-plane"
+    query_cmd = (
+        f"docker exec {node_name} crictl images 2>/dev/null "
+        "| awk '/etcd/ {print $1\":\"$2; exit}'"
+    )
+    image_ref = ""
+    for _ in range(15):
+        result = subprocess.run(query_cmd, shell=True, capture_output=True, text=True)
+        image_ref = result.stdout.strip()
+        if image_ref:
+            break
+        time.sleep(1)
+
+    if not image_ref:
+        print(f"Warning: could not capture etcd image ref from {node_name}")
+        return False
+
+    image_file = _etcd_image_ref_path(etcd_path)
+    write_cmd = (
+        f"docker run --rm -v {image_file.parent}:/work alpine:3.19 "
+        f"sh -c 'echo {shlex.quote(image_ref)} > /work/{image_file.name}'"
+    )
+    result = subprocess.run(write_cmd, shell=True, capture_output=True, text=True)
+    if result.returncode != 0:
+        print(f"Warning: failed to write {image_file}: {result.stderr}")
+        return False
+
+    if opts.o.debug:
+        print(f"Captured etcd image: {image_ref} -> {image_file}")
+    return True
+
+
+def _read_etcd_image_ref(etcd_path: str) -> Optional[str]:
+    """Read etcd image ref persisted by a prior cluster create."""
+    image_file = _etcd_image_ref_path(etcd_path)
+    read_cmd = (
+        f"docker run --rm -v {image_file.parent}:/work:ro alpine:3.19 "
+        f"cat /work/{image_file.name}"
+    )
+    result = subprocess.run(read_cmd, shell=True, capture_output=True, text=True)
+    if result.returncode != 0:
+        return None
+    ref = result.stdout.strip()
+    return ref or None
+
+
 def _clean_etcd_keeping_certs(etcd_path: str) -> bool:
     """Clean persisted etcd, keeping only TLS certificates.

@@ -142,10 +204,20 @@ def _clean_etcd_keeping_certs(etcd_path: str) -> bool:
         print(f"No etcd snapshot at {db_path}, skipping cleanup")
         return False

+    etcd_image = _read_etcd_image_ref(etcd_path)
+    if not etcd_image:
+        print(
+            f"Warning: etcd data at {etcd_path} but no image ref file "
+            f"({_etcd_image_ref_path(etcd_path)}); skipping cleanup"
+        )
+        return False
+
     if opts.o.debug:
-        print(f"Cleaning persisted etcd at {etcd_path}, keeping only TLS certs")
+        print(
+            f"Cleaning persisted etcd at {etcd_path} using {etcd_image}, "
+            "keeping only TLS certs"
+        )

-    etcd_image = "gcr.io/etcd-development/etcd:v3.5.9"
     temp_dir = "/tmp/laconic-etcd-cleanup"

     # Whitelist: prefixes to KEEP - everything else gets deleted.
@@ -306,6 +378,12 @@ def create_cluster(name: str, config_file: str):
     result = _run_command(f"kind create cluster --name {name} --config {config_file}")
     if result.returncode != 0:
         raise DeployerException(f"kind create cluster failed: {result}")
+
+    # Persist the etcd image ref so future _clean_etcd_keeping_certs calls
+    # use a version that matches the on-disk format kind is writing now.
+    if etcd_path:
+        _capture_etcd_image(name, etcd_path)
+
     return name
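For reference, the kubectl-level replacement proposed in the so-o2o comments could look roughly like the following. This is a sketch under the assumptions stated in that thread: the `caddy-system` namespace, the `manager=caddy` label selector, and the `caddy-secrets.yaml` filename come from the proposal, while the function names here are invented for illustration, not taken from merged code.

```python
import subprocess
from pathlib import Path

# Filename proposed in the so-o2o thread for the operator-owned backup dir.
BACKUP_FILE = "caddy-secrets.yaml"


def backup_cmd() -> list[str]:
    """kubectl command that dumps the Caddy-managed Secrets as YAML."""
    return [
        "kubectl", "get", "secrets",
        "-n", "caddy-system",    # namespace assumed from the proposal
        "-l", "manager=caddy",   # label selector assumed from the proposal
        "-o", "yaml",
    ]


def backup_caddy_secrets(path: Path) -> bool:
    """Run in destroy_cluster before 'kind delete'; fail loudly."""
    result = subprocess.run(backup_cmd(), capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Warning: Caddy secret backup failed: {result.stderr}")
        return False
    (path / BACKUP_FILE).write_text(result.stdout)
    return True


def restore_caddy_secrets(path: Path) -> None:
    """Run in install_ingress_for_kind after the manifest is applied,
    so Caddy finds valid certs at startup and fires no ACME challenges."""
    backup = path / BACKUP_FILE
    if backup.exists():
        subprocess.run(["kubectl", "apply", "-f", str(backup)], check=True)
```

Because the backup is plain YAML produced by `kubectl get -o yaml` and restored with `kubectl apply`, it stays selective, version-portable, and operator-readable, which is the point of moving off the etcd layer.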