
Commit 421b83c

k8s: shared-cluster safety checks and deployment-id decoupling (#748)
- **Kind extraMount compatibility**: fail fast at `deployment start` when a new deployment's mounts don't match the running cluster; warn when the first cluster is created without a `kind-mount-root` umbrella; replace the cryptic `ConfigException` with readable errors when the cluster is missing
- **Auto-ConfigMap for file-level host-path compose volumes** (so-7fc): `../config/foo.sh:/opt/foo.sh`-style binds become per-namespace ConfigMaps at deploy start instead of aliasing via the kind extraMount chain. `deploy create` rejects `:rw`, subdirs, and over-budget sources. Deployment-dir layout unchanged
- **Namespace ownership**: stamp the namespace with `laconic.com/deployment-dir` on create; fail loudly if another deployment tries to land in the same namespace. Pre-existing namespaces adopt ownership on next start
- **deployment-id / cluster-id decoupling**: split the two roles (kube context vs resource-name prefix) into separate `deployment.yml` fields. Backward-compat fallback keeps existing resource names stable
- Close stale pebbles `so-n1n` and `so-ad7`
1 parent eb4704b commit 421b83c

10 files changed

Lines changed: 772 additions & 65 deletions


.pebbles/events.jsonl

Lines changed: 6 additions & 0 deletions
@@ -46,3 +46,9 @@
{"type":"comment","timestamp":"2026-04-17T08:13:32.753112339Z","issue_id":"so-o2o","payload":{"body":"Tested the version-detection fix (commit 832ab66d) locally. Fix works for its scope but surfaces two more bugs downstream. Current approach is broken at the architectural level, not just one-bug-fixable.\n\nWhat 832ab66d does: captures etcd image ref from crictl after cluster create, writes to {backup_dir}/etcd-image.txt, reads it on subsequent cleanup runs. Self-adapts to Kind upgrades. No more hardcoded v3.5.9. Confirmed locally: etcd-image.txt is written after first create, cleanup on second start uses it, member.backup-YYYYMMDD-HHMMSS dir is produced (proves cleanup ran end-to-end).\n\nWhat still fails after version fix: kubeadm init on cluster recreate. apiserver comes up but returns:\n- 403 Forbidden: User \"kubernetes-admin\" cannot get path /livez\n- 500: Body was not decodable ... json: cannot unmarshal array into Go value of type struct\n- eventually times out waiting for apiserver /livez\n\nTwo new bugs behind those:\n\n(a) Restore step corrupts binary values. In _clean_etcd_keeping_certs the restore loop is:\n    key=$(echo $encoded | base64 -d | jq -r .key | base64 -d)\n    val=$(echo $encoded | base64 -d | jq -r .value | base64 -d)\n    echo \"$val\" | /backup/etcdctl put \"$key\"\nk8s stores objects as protobuf. Piping raw protobuf through bash variable expansion + echo mangles non-printable bytes, truncates at null bytes, and appends a trailing newline. Explains the \"cannot unmarshal\" from apiserver — the kubernetes Service/Endpoints objects in /registry are corrupted on re-put.\n\n(b) Whitelist is too narrow. We keep only /registry/secrets/caddy-system and the /registry/services entries for kubernetes. Everything else is deleted — including /registry/clusterrolebindings (cluster-admin is gone), /registry/serviceaccounts, /registry/secrets/kube-system (bootstrap tokens), RBAC roles, apiserver's auth config. Explains the 403 for kubernetes-admin — cluster-admin binding doesn't exist yet and kubeadm's pre-addon health check can't authorize.\n\nFixing (a) would mean rewriting the restore step to not use shell piping — either use a proper etcdctl-based Go tool, or write directly to the on-disk snapshot format. Fixing (b) means exhaustively whitelisting everything kubeadm/apiserver bootstrapping needs — a moving target across k8s versions. Both together are a significant undertaking for the actual requirement (\"keep 4 Caddy secrets across cluster recreate\").\n\nDecision: merge 832ab66d for the narrow version-detection fix + diagnosis trail, then implement the kubectl-level backup/restore on a separate branch. The etcd approach is not salvageable at reasonable cost."}}
{"type":"comment","timestamp":"2026-04-17T11:04:26.542659482Z","issue_id":"so-o2o","payload":{"body":"Shipped in PR #746. Etcd-persistence approach replaced with a kubectl-level Caddy Secret backup/restore gated on kind-mount-root.\n\nSummary of what landed:\n- components/ingress/caddy-cert-backup.yaml: SA/Role/RoleBinding + CronJob (alpine/kubectl:1.35.3) firing every 5min, writes {kind-mount-root}/caddy-cert-backup/caddy-secrets.yaml via atomic tmp+rename.\n- install_ingress_for_kind splits into 3 phases: pre-Deployment manifests → _restore_caddy_certs (kubectl apply from backup file) → Caddy Deployment → _install_caddy_cert_backup. Caddy pod can't exist until phase 3, so certs are always in place before secret_store startup.\n- Deleted _clean_etcd_keeping_certs, _get_etcd_host_path_from_kind_config, _capture_etcd_image, _read_etcd_image_ref, _etcd_image_ref_path and the etcd+PKI block in _generate_kind_mounts.\n- No new spec keys.\n\nTest coverage in tests/k8s-deploy/run-deploy-test.sh: install assertion after first --perform-cluster-management start, plus full E2E (seed fake manager=caddy Secret → trigger CronJob → verify backup file → stop/start --perform-cluster-management for cluster recreate → assert secret restored with matching decoded value).\n\nWoodburn migration: one-shot host-kubectl export to seed {kind-mount-root}/caddy-cert-backup/caddy-secrets.yaml was done manually on the running cluster (the in-cluster CronJob couldn't reach the host because the /srv/kind → /mnt extraMount was staged in kind-config.yml but never applied to the running cluster — it was added after cluster creation). File is in place for the eventual cluster recreate."}}
{"type":"close","timestamp":"2026-04-17T11:04:26.999711375Z","issue_id":"so-o2o","payload":{}}
{"type":"create","timestamp":"2026-04-20T13:14:26.312724048Z","issue_id":"so-7fc","payload":{"description":"## Problem\n\nFile-level host-path compose volumes (e.g. `../config/foo.sh:/opt/foo.sh`) were synthesized into a kind extraMount + k8s hostPath PV chain with a sanitized containerPath (`/mnt/host-path-\u003csanitized\u003e`).\n\n- On kind: two deployments of the same stack sharing a cluster collide at that containerPath — kind only honors the first deployment's bind, so subsequent deployments' pods silently read the first's file. No error, no warning.\n- On real k8s: the same code emits `hostPath: /mnt/host-path-*` but nothing populates that path on worker nodes — effectively broken.\n\nFile-level host-path binds are conceptually k8s ConfigMaps. The `snowballtools-base-backend` stack already uses the ConfigMap-backed named-volume pattern manually; this issue is to make that automatic for all stacks.\n\n## Resolution\n\nImplemented on branch `feat/so-b86-auto-configmap-host-path` (commit `cb84388d`), stacked on top of `feat/kind-mount-invariant-check`.\n\n**No deployment-dir file rewriting.** Compose files, spec.yml, and `{deployment_dir}/config/\u003cpod\u003e/` are untouched — trivially diffable against stack source, no synthetic volume names. ConfigMaps are materialized at deploy start and visible only in k8s (`kubectl get cm -n \u003cns\u003e`).\n\n### Deploy create — validation only\n\n| Source shape | Behavior |\n|---|---|\n| Single file | Accepted |\n| Flat directory, no subdirs, ≤ ~700 KiB | Accepted |\n| Directory with subdirs | `DeployerException` — guidance: embed in image / split configmaps / initContainer |\n| File or directory \u003e ~700 KiB | `DeployerException` — ConfigMap budget (accounts for base64 + metadata) |\n| `:rw` on any host-path bind | `DeployerException` — use a named volume for writable data |\n\n### Deploy start — k8s object generation\n\n- `cluster_info.get_configmaps()` walks pod + job compose volumes and emits a `V1ConfigMap` per host-path bind (deduped by sanitized name), content read from `{deployment_dir}/config/\u003cpod\u003e/\u003cfile\u003e`.\n- `volumes_for_pod_files` emits `V1ConfigMapVolumeSource` instead of `V1HostPathVolumeSource` for host-path binds.\n- `volume_mounts_for_service` stats the source and sets `V1VolumeMount.sub_path` to the filename when source is a regular file.\n- `_generate_kind_mounts` no longer emits `/mnt/host-path-*` extraMounts — ConfigMap path bypasses the kind node FS entirely.\n\n### Transition\n\nThe `/mnt/host-path-*` skip in `check_mounts_compatible` is retained as a transition tolerance for deployments created before this change. Test coverage in `tests/k8s-deploy/run-deploy-test.sh` asserts host-path ConfigMaps exist in the namespace, compose/spec in deployment dir unchanged, and no `/mnt/host-path-*` entries in kind-config.yml.","priority":"2","title":"File-level host-path compose volumes alias across deployments sharing a kind cluster","type":"bug"}}
{"type":"status_update","timestamp":"2026-04-20T13:14:26.833816262Z","issue_id":"so-7fc","payload":{"status":"closed"}}
{"type":"comment","timestamp":"2026-04-21T05:57:12.476299839Z","issue_id":"so-n1n","payload":{"body":"Already merged: 929bdab8 is an ancestor of origin/main; all four extraMount emit sites in helpers.py carry `propagation: HostToContainer` (umbrella, per-volume named, per-volume host-path, high-memlock spec)."}}
{"type":"status_update","timestamp":"2026-04-21T05:57:12.928842469Z","issue_id":"so-n1n","payload":{"status":"closed"}}
{"type":"comment","timestamp":"2026-04-21T06:08:13.933886638Z","issue_id":"so-ad7","payload":{"body":"Fixed in PR #744 (cf8b7533). get_services() now includes the maintenance pod in the container-ports map so its per-pod Service is built and available for the Ingress swap."}}
{"type":"status_update","timestamp":"2026-04-21T06:08:14.457815115Z","issue_id":"so-ad7","payload":{"status":"closed"}}

docs/deployment_patterns.md

Lines changed: 138 additions & 1 deletion
@@ -164,6 +164,44 @@ To stop a single deployment without affecting the cluster:
laconic-so deployment --dir my-deployment stop --skip-cluster-management
```

Stacks sharing a cluster must agree on mount topology. See
[Volume Persistence in k8s-kind](#volume-persistence-in-k8s-kind).

### cluster-id vs deployment-id

Each deployment's `deployment.yml` carries two identifiers with
different roles:

- **`cluster-id`** — which kind cluster this deployment attaches to.
  Used for the kube-config context name (`kind-{cluster-id}`) and for
  kind lifecycle ops. Inherited from the running cluster at
  `deploy create` time when one exists; freshly generated otherwise.
  Shared across every deployment that joins the same cluster.
- **`deployment-id`** — this particular deployment's identity.
  Generated fresh on every `deploy create` and never inherited. Flows
  into `app_name`, the prefix on every k8s resource name this
  deployment creates (PVs, ConfigMaps, Deployments, PVCs, …). Distinct
  per deployment even when the cluster is shared.

The split prevents silent resource-name collisions between
deployments sharing a cluster: two deployments of the same stack,
or any two deployments that happen to declare a volume with the same
name, still produce distinct `{deployment-id}-{vol}` PV names.

**Backward compatibility**: `deployment.yml` files written before the
`deployment-id` field existed fall back to using `cluster-id` as the
deployment-id. Existing resource names stay stable across this
upgrade — no PV renames, no re-bind, no data orphaning. The next
`deploy create` writes both fields going forward.

**Namespace ownership**: on top of distinct resource names, SO stamps
the k8s namespace with a `laconic.com/deployment-dir` annotation on
first creation. A subsequent `deployment start` from a different
deployment directory that would land in the same namespace fails
with a `DeployerException` pointing at the `namespace:` spec
override. Catches operator-error cases where the same deployment dir
is effectively registered twice.
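The namespace-ownership flow (stamp on first creation, adopt when unstamped, fail loudly on mismatch) can be sketched in a few lines. This is an illustrative model, not the SO source: `claim_namespace` and the plain-dict `annotations` stand in for the real k8s API calls, and the error text is invented; only the `laconic.com/deployment-dir` key and the `DeployerException` name come from the change itself.

```python
# Illustrative sketch of the namespace-ownership check; claim_namespace
# and the in-memory "annotations" dict are hypothetical stand-ins.
OWNER_KEY = "laconic.com/deployment-dir"


class DeployerException(Exception):
    pass


def claim_namespace(annotations: dict, deployment_dir: str) -> None:
    """Stamp ownership on first use; adopt if unstamped; reject a mismatch."""
    owner = annotations.get(OWNER_KEY)
    if owner is None:
        # First creation, or a pre-existing namespace: adopt ownership.
        annotations[OWNER_KEY] = deployment_dir
    elif owner != deployment_dir:
        raise DeployerException(
            f"namespace already owned by deployment dir {owner}; "
            "use a distinct 'namespace:' override in the spec"
        )


ns: dict = {}
claim_namespace(ns, "/srv/deployments/stack-a")  # stamps ownership
claim_namespace(ns, "/srv/deployments/stack-a")  # same dir: no-op
try:
    claim_namespace(ns, "/srv/deployments/stack-b")  # different dir: fails
except DeployerException as e:
    print("rejected:", e)
```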

## Volume Persistence in k8s-kind

k8s-kind has 3 storage layers:
@@ -172,7 +210,9 @@ k8s-kind has 3 storage layers:
- **Kind Node**: A Docker container simulating a k8s node
- **Pod Container**: Your workload

-For k8s-kind, volumes with paths are mounted from Docker Host → Kind Node → Pod via extraMounts.
+Volumes with paths are mounted from Docker Host → Kind Node → Pod via kind
+`extraMounts`. Kind applies `extraMounts` only at cluster creation — they
+cannot be added to a running cluster.

| spec.yml volume | Storage Location | Survives Pod Restart | Survives Cluster Restart |
|-----------------|------------------|---------------------|-------------------------|
@@ -200,3 +240,100 @@ Empty-path volumes appear persistent because they survive pod restarts (data lives
in Kind Node container). However, this data is lost when the kind cluster is
recreated. This "false persistence" has caused data loss when operators assumed
their data was safe.

### Shared Clusters: Use `kind-mount-root`

Because kind `extraMounts` can only be set at cluster creation, the first
deployment to start locks in the mount topology. Later deployments that
declare new `extraMounts` have them silently ignored — their PVs fall
through to the kind node's overlay filesystem and lose data on cluster
destroy.

The fix is an umbrella mount. Set `kind-mount-root` in the spec, pointing
at a host directory all stacks will share:

```yaml
# spec.yml
kind-mount-root: /srv/kind

volumes:
  my-data: /srv/kind/my-stack/data  # visible at /mnt/my-stack/data in-node
```

SO emits a single `extraMount` (`<kind-mount-root>` → `/mnt`). Any new
host subdirectory under the root is visible in the node immediately — no
cluster recreate needed to add stacks.

**All stacks sharing a cluster must agree on `kind-mount-root`** and keep
their host paths under it.
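The umbrella-mount path translation can be sketched as follows. This is a hypothetical helper, not the SO implementation; the only facts taken from the doc are the `<kind-mount-root>` → `/mnt` mapping and the rule that host paths must sit under the root.

```python
from pathlib import PurePosixPath


def in_node_path(kind_mount_root: str, host_path: str) -> str:
    """Map a host path under kind-mount-root to its in-node path under /mnt."""
    root = PurePosixPath(kind_mount_root)
    path = PurePosixPath(host_path)
    if not path.is_relative_to(root):
        raise ValueError(f"{host_path} is not under kind-mount-root {root}")
    # Replace the kind-mount-root prefix with the node-side mount point.
    return str(PurePosixPath("/mnt") / path.relative_to(root))


print(in_node_path("/srv/kind", "/srv/kind/my-stack/data"))  # /mnt/my-stack/data
```

A path outside the root raises, mirroring the requirement that all stacks keep their host paths under the shared `kind-mount-root`.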

### Mount Compatibility Enforcement

`laconic-so deployment start` validates mount topology:

- **On first cluster creation** without an umbrella mount: prints a
  warning (future stacks may require a full recreate to add mounts).
- **On cluster reuse**: compares the new deployment's `extraMounts`
  against the live mounts on the control-plane container. Any mismatch
  (wrong host path, or mount missing) fails the deploy.
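The cluster-reuse comparison reduces to a dictionary diff. A minimal sketch, assuming mounts are modeled as `containerPath` to `hostPath` maps; the real `check_mounts_compatible` inspects the live control-plane container, and the helper name here is invented.

```python
def find_mount_mismatches(wanted: dict, live: dict) -> list:
    """Return one message per extraMount that is missing or points elsewhere."""
    problems = []
    for container_path, host_path in wanted.items():
        if container_path not in live:
            problems.append(f"mount missing from cluster: {container_path}")
        elif live[container_path] != host_path:
            problems.append(
                f"wrong host path for {container_path}: "
                f"cluster has {live[container_path]}, deployment wants {host_path}"
            )
    return problems


live = {"/mnt": "/srv/kind"}
assert find_mount_mismatches({"/mnt": "/srv/kind"}, live) == []     # compatible
assert len(find_mount_mismatches({"/mnt": "/data"}, live)) == 1     # wrong host path
assert len(find_mount_mismatches({"/opt/x": "/srv/x"}, live)) == 1  # missing mount
```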

### Static files in compose volumes → auto-ConfigMap

Compose volumes that bind a host file or flat directory into a container
(e.g. `../config/test/script.sh:/opt/run.sh`) are used to inject static
content that ships with the stack. k8s doesn't have a native notion of
this — the canonical way to inject static content is a ConfigMap.

At `deploy start`, laconic-so auto-generates a namespace-scoped
ConfigMap per host-path compose volume (deduped by source) and mounts
it into the pod instead of routing the bind through the kind node:

| Source shape | Behavior |
|---|---|
| Single file | ConfigMap with one key (the filename); pod mount uses `subPath` so the single key lands at the compose target path |
| Flat directory (no subdirs, ≤ ~700 KiB) | ConfigMap with one key per file; pod mount exposes all keys at the target path |
| Directory with subdirs, or over budget | Rejected at `deploy create` — embed in the container image, split into multiple ConfigMaps, or use an initContainer |
| `:rw` on any host-path bind | Rejected at `deploy create` — use a named volume with a spec-configured host path for writable data |

The deployment dir layout is unchanged: compose files stay verbatim and
`spec.yml` is not rewritten. Source files remain under
`{deployment_dir}/config/{pod}/` (as copied by `deploy create`); the
ConfigMap is built from them at deploy start and no kind extraMount is
emitted for these paths.

This works identically on kind and real k8s (ConfigMaps are
cluster-native; no node-side landing pad required), and two deployments
of the same stack sharing a cluster get their own per-namespace
ConfigMaps — no aliasing.
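The `deploy create` validation rules in the table can be sketched as one function. This is hypothetical code, not the laconic-so source: the exact budget arithmetic and messages differ, and the real checks raise `DeployerException` rather than `ValueError`.

```python
import os

# ~700 KiB, leaving headroom under the 1 MiB ConfigMap cap for base64 + metadata.
CONFIGMAP_BUDGET = 700 * 1024


def validate_host_path_bind(source: str, mode: str = "ro") -> None:
    """Reject bind sources that cannot become a single ConfigMap."""
    if mode == "rw":
        raise ValueError(":rw bind: use a named volume for writable data")
    if os.path.isfile(source):
        if os.path.getsize(source) > CONFIGMAP_BUDGET:
            raise ValueError("file exceeds ConfigMap budget")
        return
    total = 0
    for entry in os.scandir(source):
        if entry.is_dir():
            raise ValueError(
                "directory has subdirs: embed in the image, split into "
                "multiple configmaps, or use an initContainer"
            )
        total += entry.stat().st_size
    if total > CONFIGMAP_BUDGET:
        raise ValueError("directory exceeds ConfigMap budget")
```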

### Writable / generated data → named volume + host path

For volumes the workload *writes to* (databases, ledgers, caches, logs),
use a named volume backed by a spec-configured host path under
`kind-mount-root`:

```yaml
# compose
volumes:
  - my-data:/var/lib/foo

# spec.yml
kind-mount-root: /srv/kind
volumes:
  my-data: /srv/kind/my-stack/data
```

Works on both kind (via the umbrella mount) and real k8s (operator
provisions `/srv/kind/my-stack/data` on each node).

### Migrating an Existing Cluster

If a cluster was created without an umbrella mount and you need to add a
stack that requires new host-path mounts, the cluster must be recreated:

1. Back up ephemeral state (DBs, caches) from PVs that lack host mounts —
   these are in the kind node overlay FS and do not survive `kind delete`.
2. Update every stack's spec to set a shared `kind-mount-root` and place
   host paths under it.
3. Stop all deployments, destroy the cluster, recreate it by starting any
   stack (umbrella now active), and restore state.

stack_orchestrator/constants.py

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@
k8s_kind_deploy_type = "k8s-kind"
k8s_deploy_type = "k8s"
cluster_id_key = "cluster-id"
deployment_id_key = "deployment-id"
kube_config_key = "kube-config"
deploy_to_key = "deploy-to"
network_key = "network"

stack_orchestrator/deploy/deployment_context.py

Lines changed: 27 additions & 0 deletions
@@ -26,6 +26,7 @@
class DeploymentContext:
    deployment_dir: Path
    id: str
    deployment_id: str
    spec: Spec
    stack: Stack

@@ -48,8 +49,27 @@ def get_compose_file(self, name: str):
        return self.get_compose_dir() / f"docker-compose-{name}.yml"

    def get_cluster_id(self):
        """Identifier of the kind cluster this deployment attaches to.

        Shared across deployments that join the same kind cluster. Used
        for the kube-config context name (`kind-{cluster-id}`) and for
        kind cluster lifecycle ops.
        """
        return self.id

    def get_deployment_id(self):
        """Identifier of this particular deployment's k8s resources.

        Distinct per deployment even when multiple deployments share a
        cluster. Used as compose_project_name → app_name → prefix for
        all k8s resource names (PVs, ConfigMaps, Deployments, …).

        Backward compat: for deployment.yml files written before this
        field existed, falls back to cluster-id so existing on-disk
        resource names remain stable (no PV renames, no re-bind).
        """
        return self.deployment_id

    def init(self, dir: Path):
        self.deployment_dir = dir.absolute()
        self.spec = Spec()

@@ -60,6 +80,12 @@ def init(self, dir: Path):
        if deployment_file_path.exists():
            obj = get_yaml().load(open(deployment_file_path, "r"))
            self.id = obj[constants.cluster_id_key]
            # Fallback to cluster-id for deployments created before the
            # deployment-id field was introduced. Keeps existing resource
            # names stable across this upgrade.
            self.deployment_id = obj.get(
                constants.deployment_id_key, self.id
            )
        # Handle the case of a legacy deployment with no file
        # Code below is intended to match the output from _make_default_cluster_name()
        # TODO: remove when we no longer need to support legacy deployments

@@ -68,6 +94,7 @@ def init(self, dir: Path):
            unique_cluster_descriptor = f"{path},{self.get_stack_file()},None,None"
            hash = hashlib.md5(unique_cluster_descriptor.encode()).hexdigest()[:16]
            self.id = f"{constants.cluster_name_prefix}{hash}"
            self.deployment_id = self.id

    def modify_yaml(self, file_path: Path, modifier_func):
        """Load a YAML, apply a modification function, and write it back."""
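The backward-compat fallback in `init` above is just `dict.get` with the cluster-id as default. A standalone demonstration (the literal keys come from `deployment.yml`; the ID values are made up):

```python
# deployment.yml written before the deployment-id field existed:
legacy = {"cluster-id": "laconic-abc123"}
# deployment.yml written by a current `deploy create`:
current = {"cluster-id": "laconic-abc123", "deployment-id": "laconic-def456"}


def deployment_id(obj: dict) -> str:
    # Same shape as the init() fallback: prefer deployment-id, else cluster-id.
    return obj.get("deployment-id", obj["cluster-id"])


assert deployment_id(legacy) == "laconic-abc123"   # legacy: resource names stable
assert deployment_id(current) == "laconic-def456"  # new: distinct per deployment
```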
