Problem
etcd auto-recovery (#4503) does not detect pods stuck in init containers during initial HostedCluster bootstrap. The cluster remained stuck for 70+ minutes with no recovery triggered.
- etcd-0 was stuck in 'PodInitializing': its 'ensure-dns' init container was blocked on DNS resolution.
- Both etcd-1 and etcd-2 could not reach etcd-0 (TLS rejected the IP-based connections, since the peer certs only carry DNS SANs).
- A manual 'kubectl delete' of the stuck pod fixed it immediately; a sketch of that check is below.
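For reference, this is roughly the check one ends up doing by hand before running 'kubectl delete'. The sketch below is illustrative only (not HyperShift's actual recovery code); it assumes client-go types, and the 10-minute threshold and example pod are made up. It flags a pod that has sat in Pending with its Initialized condition False, matching the etcd-0 status further down:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// stuckInInit reports whether a pod has been Pending with its Initialized
// condition False (reason ContainersNotInitialized) for longer than threshold.
func stuckInInit(pod *corev1.Pod, threshold time.Duration, now time.Time) bool {
	if pod.Status.Phase != corev1.PodPending {
		return false
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodInitialized &&
			cond.Status == corev1.ConditionFalse &&
			cond.Reason == "ContainersNotInitialized" &&
			now.Sub(cond.LastTransitionTime.Time) > threshold {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical pod mirroring the etcd-0 status below: Initialized went False ~70 minutes ago.
	pod := &corev1.Pod{
		Status: corev1.PodStatus{
			Phase: corev1.PodPending,
			Conditions: []corev1.PodCondition{{
				Type:               corev1.PodInitialized,
				Status:             corev1.ConditionFalse,
				Reason:             "ContainersNotInitialized",
				LastTransitionTime: metav1.NewTime(time.Now().Add(-70 * time.Minute)),
			}},
		},
	}
	fmt.Println(stuckInInit(pod, 10*time.Minute, time.Now())) // true
}
```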
Logs
{"level":"warn","ts":"2026-01-29T11:56:49.557821Z","caller":"etcdserver/cluster_util.go:294","msg":"failed to reach the peer URL","address":"https://etcd-0.etcd-discovery..svc:2380/version","error":"dial tcp: lookup etcd-0.etcd-discovery..svc on 172.30.0.10:53: no such host"}
{"level":"warn","ts":"2026-01-29T11:56:45.128004Z","caller":"embed/config_logging.go:161","msg":"rejected connection on peer endpoint","remote-addr":"10.130.5.160:39932","ip-addresses":[],"dns-names":[".etcd-discovery..svc",".etcd-discovery..svc.cluster.local","127.0.0.1","::1"],"error":"tls: "10.130.5.160" does not match any of DNSNames"}Results in the status
status:
  conditions:
  - lastTransitionTime: "2026-01-29T11:56:46Z"
    message: 'containers with incomplete status: [ensure-dns reset-member]'
    reason: ContainersNotInitialized
    status: "False"
    type: Initialized
  phase: Pending
  initContainerStatuses:
  - name: ensure-dns
    ready: false
    state:
      waiting:
        reason: PodInitializing

Note: 3 other HostedClusters were created simultaneously and succeeded. This appears to be an intermittent race condition during parallel pod bootstrap.
The current recovery only monitors etcd endpoint health checks, not init containers hanging during bootstrap.
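One possible shape for the missing check (a sketch only, not the existing #4503 code): alongside the endpoint health probes, scan the etcd pods and delete any pod stuck in init past a threshold, letting the StatefulSet recreate it, just as the manual 'kubectl delete' did. The function name, the namespace argument, the 'app=etcd' label selector, and the threshold are assumptions for illustration:

```go
package recovery

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// recoverStuckEtcdPods deletes etcd pods that have sat in Pending with
// Initialized=False for longer than stuckThreshold, so the StatefulSet
// controller recreates them and the init containers get a fresh start.
func recoverStuckEtcdPods(ctx context.Context, client kubernetes.Interface, namespace string, stuckThreshold time.Duration) error {
	// The label selector is an assumption; adjust to however the etcd pods are actually labeled.
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: "app=etcd"})
	if err != nil {
		return err
	}
	now := time.Now()
	for i := range pods.Items {
		pod := &pods.Items[i]
		if pod.Status.Phase != corev1.PodPending {
			continue
		}
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodInitialized &&
				cond.Status == corev1.ConditionFalse &&
				now.Sub(cond.LastTransitionTime.Time) > stuckThreshold {
				// Deleting the pod gives ensure-dns another chance to resolve the peer DNS name.
				if err := client.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
					return err
				}
				break
			}
		}
	}
	return nil
}
```

Guarding this with a threshold well above normal init time should keep it from interfering with pods that are merely slow to start.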