chore(pebbles): close so-o2o (#747)

prathamesh0 · claude · web-flow · commit eb4704b563d5 · 2026-04-17T17:40:12.000+05:30
Implementation shipped in PR #746. Woodburn migration (one-shot host-kubectl export to seed the backup file) completed manually. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/.pebbles/events.jsonl b/.pebbles/events.jsonl
@@ -44,3 +44,5 @@
 {"type":"comment","timestamp":"2026-04-16T13:36:20.150833128Z","issue_id":"so-o2o","payload":{"body":"Reproduced and partially diagnosed locally. Original 'backup not persisting' framing turns out to be inaccurate — the host bind-mount works fine and the cleanup function runs end-to-end. The actual bug is downstream of those.\n\nWhat we confirmed:\n- The etcd extraMount at \u003cdeployment_dir\u003e/data/cluster-backups/\u003cid\u003e/etcd is honored. After 'kind delete', the host-side data persists (16MB db file, snap files intact, owned by root mode 0700).\n- _clean_etcd_keeping_certs (helpers.py:120-279) actually runs to completion. Evidence: timestamped 'member.backup-YYYYMMDD-HHMMSS' dirs accumulate (created at line 257-260, the last step before the swap-in).\n\nWhat actually breaks:\n- After cleanup + 'kind create cluster', kubeadm init fails. kube-apiserver never opens :6443 ('connection refused' loop until kubeadm gives up). kubelet itself is healthy.\n- Hypothesis (high confidence, not yet proven by inspecting an etcd container log): version skew. Cleanup uses gcr.io/etcd-development/etcd:v3.5.9 (helpers.py:148) which produces v3.5-format on-disk data. Kind v0.32 ships kindest/node:v1.35.1 with etcd v3.6.x, which can't read v3.5-format data and crashes — apiserver can't reach it.\n- Diagnostic that nails the version skew: moving the persisted etcd dir aside ('mv etcd etcd.away') and re-running 'start --perform-cluster-management' succeeds cleanly. With persisted-etcd present, fails. So the cleanup output is what breaks the new cluster.\n\nWhy prod hasn't hit this: woodburn runs kind v0.20.0 (kindest/node:v1.27.x with etcd v3.5.x) — compatible with the v3.5.9 cleanup image. Bug is dormant there until kind is bumped.\n\nWhat we do NOT know:\n- Whether Caddy certs would actually survive a successful recreate. Cluster never came up after cleanup, so we couldn't inspect /registry/secrets/caddy-system in the new etcd. The cleanup function's whitelist preserves them in theory, but end-to-end preservation is unverified.\n\nWhat's also broken regardless of root cause:\n- _clean_etcd_keeping_certs gates ALL its diagnostic prints on opts.o.debug (lines 141, 145, 274, 278) and returns False silently on failure. With a normal (non-debug) run, the operator gets zero indication that cleanup attempted, succeeded, or failed. Silent failure was 90% of why this took so long to diagnose.\n\nFix direction:\n1. Source etcdctl/etcdutl from the same kindest/node image kind is using, so on-disk format always matches what the cluster will boot with. Self-adapts to kind upgrades.\n2. Make failure messages unconditional prints, not gated on debug.\n3. After (1), re-test cert preservation end-to-end and update findings."}}
 {"type":"comment","timestamp":"2026-04-16T14:34:14.74327248Z","issue_id":"so-o2o","payload":{"body":"Resolved direction: replace the etcd-level cleanup mechanism with a kubectl-level Caddy secret backup/restore in SO.\n\nWhy the layer change: the etcd approach (current code, or even a cleaner snapshot save/restore) shares one fundamental problem — restoring an entire etcd snapshot resurrects ALL prior cluster state, conflicts with the new cluster's bootstrap, and forces us into a maintained whitelist (kindnet/coredns/etc.). It also couples SO to etcd binary versions. Working at the k8s/kubectl layer instead gives us a selective, version-portable, operator-readable backup that exactly matches what we want: a small set of Caddy Secrets.\n\nKey insight from the use case: Caddy's secret_store reuses any valid cert it finds in its store on startup. ACME challenges only fire for domains it has no cert for. So if we restore the right Secrets BEFORE Caddy comes up on a new cluster, no Let's Encrypt traffic happens, no rate limit risk, no TLS gap, and new domains added later still provision normally.\n\nProposed spec API:\n  network:\n    caddy-cert-backup:\n      path: ~/.credentials/caddy-certs/   # operator-owned host dir; presence enables\n\nLifecycle:\n1. destroy_cluster, before 'kind delete', if path configured:\n   kubectl get secrets -n caddy-system -l manager=caddy -o yaml \u003e \u003cpath\u003e/caddy-secrets.yaml\n   (loud failure if cluster gone or kubectl errors)\n2. install_ingress_for_kind, after manifest applied:\n   kubectl apply -f \u003cpath\u003e/caddy-secrets.yaml if it exists\n\nWhat this lets us delete:\n- _clean_etcd_keeping_certs (~150 LoC of docker-in-docker shell-in-Python)\n- _get_etcd_host_path_from_kind_config\n- etcd extraMount setup in _generate_kind_mounts\n- Hardcoded gcr.io/etcd-development/etcd:v3.5.9 image\n- The whitelist that needs maintenance per kind version\n- All silent-failure paths\n\nWhat woodburn can simplify after the SO fix:\n- inject_caddy_certs.yml and the kubectl-backup half of cluster_recreate.yml become redundant\n- extract-caddy-certs.sh (auger) stays as a disaster-recovery tool for cases where the in-SO backup didn't run\n\nPros vs. alternatives:\n- Operator-readable PEM YAML, can be git-committed / sops-encrypted / mirrored to S3\n- Zero version coupling\n- Selective by design (only what we want)\n- Lifecycle hooks are atomic with cluster lifecycle, no remembering to run a separate backup playbook\n- ~30 LoC in SO, no docker-in-docker\n\nBranch and implementation pending operator decision to proceed."}}
 {"type":"comment","timestamp":"2026-04-17T08:13:32.753112339Z","issue_id":"so-o2o","payload":{"body":"Tested the version-detection fix (commit 832ab66d) locally. Fix works for its scope but surfaces two more bugs downstream. Current approach is broken at the architectural level, not just one-bug-fixable.\n\nWhat 832ab66d does: captures etcd image ref from crictl after cluster create, writes to {backup_dir}/etcd-image.txt, reads it on subsequent cleanup runs. Self-adapts to Kind upgrades. No more hardcoded v3.5.9. Confirmed locally: etcd-image.txt is written after first create, cleanup on second start uses it, member.backup-YYYYMMDD-HHMMSS dir is produced (proves cleanup ran end-to-end).\n\nWhat still fails after version fix: kubeadm init on cluster recreate. apiserver comes up but returns:\n- 403 Forbidden: User \"kubernetes-admin\" cannot get path /livez\n- 500: Body was not decodable ... json: cannot unmarshal array into Go value of type struct\n- eventually times out waiting for apiserver /livez\n\nTwo new bugs behind those:\n\n(a) Restore step corrupts binary values. In _clean_etcd_keeping_certs the restore loop is:\n    key=$(echo $encoded | base64 -d | jq -r .key | base64 -d)\n    val=$(echo $encoded | base64 -d | jq -r .value | base64 -d)\n    echo \"$val\" | /backup/etcdctl put \"$key\"\nk8s stores objects as protobuf. Piping raw protobuf through bash variable expansion + echo mangles non-printable bytes, truncates at null bytes, and appends a trailing newline. Explains the \"cannot unmarshal\" from apiserver — the kubernetes Service/Endpoints objects in /registry are corrupted on re-put.\n\n(b) Whitelist is too narrow. We keep only /registry/secrets/caddy-system and the /registry/services entries for kubernetes. Everything else is deleted — including /registry/clusterrolebindings (cluster-admin is gone), /registry/serviceaccounts, /registry/secrets/kube-system (bootstrap tokens), RBAC roles, apiserver's auth config. Explains the 403 for kubernetes-admin — cluster-admin binding doesn't exist yet and kubeadm's pre-addon health check can't authorize.\n\nFixing (a) would mean rewriting the restore step to not use shell piping — either use a proper etcdctl-based Go tool, or write directly to the on-disk snapshot format. Fixing (b) means exhaustively whitelisting everything kubeadm/apiserver bootstrapping needs — a moving target across k8s versions. Both together are a significant undertaking for the actual requirement (\"keep 4 Caddy secrets across cluster recreate\").\n\nDecision: merge 832ab66d for the narrow version-detection fix + diagnosis trail, then implement the kubectl-level backup/restore on a separate branch. The etcd approach is not salvageable at reasonable cost."}}
+{"type":"comment","timestamp":"2026-04-17T11:04:26.542659482Z","issue_id":"so-o2o","payload":{"body":"Shipped in PR #746. Etcd-persistence approach replaced with a kubectl-level Caddy Secret backup/restore gated on kind-mount-root.\n\nSummary of what landed:\n- components/ingress/caddy-cert-backup.yaml: SA/Role/RoleBinding + CronJob (alpine/kubectl:1.35.3) firing every 5min, writes {kind-mount-root}/caddy-cert-backup/caddy-secrets.yaml via atomic tmp+rename.\n- install_ingress_for_kind splits into 3 phases: pre-Deployment manifests → _restore_caddy_certs (kubectl apply from backup file) → Caddy Deployment → _install_caddy_cert_backup. Caddy pod can't exist until phase 3, so certs are always in place before secret_store startup.\n- Deleted _clean_etcd_keeping_certs, _get_etcd_host_path_from_kind_config, _capture_etcd_image, _read_etcd_image_ref, _etcd_image_ref_path and the etcd+PKI block in _generate_kind_mounts.\n- No new spec keys.\n\nTest coverage in tests/k8s-deploy/run-deploy-test.sh: install assertion after first --perform-cluster-management start, plus full E2E (seed fake manager=caddy Secret → trigger CronJob → verify backup file → stop/start --perform-cluster-management for cluster recreate → assert secret restored with matching decoded value).\n\nWoodburn migration: one-shot host-kubectl export to seed {kind-mount-root}/caddy-cert-backup/caddy-secrets.yaml was done manually on the running cluster (the in-cluster CronJob couldn't reach the host because the /srv/kind → /mnt extraMount was staged in kind-config.yml but never applied to the running cluster — it was added after cluster creation). File is in place for the eventual cluster recreate."}}
+{"type":"close","timestamp":"2026-04-17T11:04:26.999711375Z","issue_id":"so-o2o","payload":{}}