etcd pod stuck in PodInitializing not detected by auto-recovery #7602

@Matopst

Problem

etcd auto-recovery (#4503) does not detect pods stuck in init containers during initial HostedCluster bootstrap. The cluster remained stuck for 70+ minutes with no recovery triggered.

  • etcd-0 was stuck in 'PodInitializing': the 'ensure-dns' init container was blocked on DNS resolution.
  • etcd-1 and etcd-2 could not reach etcd-0 (TLS rejected IP connections because the certs only have DNS SANs).
  • A manual 'kubectl delete' fixed it immediately.

Logs

{"level":"warn","ts":"2026-01-29T11:56:49.557821Z","caller":"etcdserver/cluster_util.go:294","msg":"failed to reach the peer URL","address":"https://etcd-0.etcd-discovery..svc:2380/version","error":"dial tcp: lookup etcd-0.etcd-discovery..svc on 172.30.0.10:53: no such host"}
{"level":"warn","ts":"2026-01-29T11:56:45.128004Z","caller":"embed/config_logging.go:161","msg":"rejected connection on peer endpoint","remote-addr":"10.130.5.160:39932","ip-addresses":[],"dns-names":[".etcd-discovery..svc",".etcd-discovery..svc.cluster.local","127.0.0.1","::1"],"error":"tls: \"10.130.5.160\" does not match any of DNSNames"}

Resulting pod status:

status:
  conditions:
  - lastTransitionTime: "2026-01-29T11:56:46Z"
    message: 'containers with incomplete status: [ensure-dns reset-member]'
    reason: ContainersNotInitialized
    status: "False"
    type: Initialized
  phase: Pending
  initContainerStatuses:
  - name: ensure-dns
    ready: false
    state:
      waiting:
        reason: PodInitializing

Note: 3 other HostedClusters were created simultaneously and succeeded. This appears to be an intermittent race condition during parallel pod bootstrap.

The current recovery only monitors etcd endpoint health checks, so it does not detect init containers that hang.
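As a rough illustration of the missing check, the sketch below flags a pod that has been Pending with Initialized=False for longer than a timeout, using only the status fields shown above. It is written in Python for brevity and is not the existing recovery logic; the function names and the 10-minute threshold are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold; this issue does not propose a specific value.
INIT_TIMEOUT = timedelta(minutes=10)


def parse_k8s_time(value: str) -> datetime:
    """Parse a Kubernetes timestamp such as 2026-01-29T11:56:46Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def _completed(init_status: dict) -> bool:
    """An init container is done once it has terminated with exit code 0."""
    terminated = init_status.get("state", {}).get("terminated")
    return terminated is not None and terminated.get("exitCode") == 0


def stuck_init_containers(status: dict, now: datetime,
                          timeout: timedelta = INIT_TIMEOUT) -> list:
    """Return the names of unfinished init containers if the pod has been
    Pending with Initialized=False for at least `timeout`, else []."""
    if status.get("phase") != "Pending":
        return []
    for cond in status.get("conditions", []):
        if cond.get("type") == "Initialized" and cond.get("status") == "False":
            since = parse_k8s_time(cond["lastTransitionTime"])
            if now - since < timeout:
                return []  # still within the grace period
            return [s["name"]
                    for s in status.get("initContainerStatuses", [])
                    if not _completed(s)]
    return []


# Pod status copied from this issue (only the fields shown above).
SAMPLE_STATUS = {
    "phase": "Pending",
    "conditions": [{
        "type": "Initialized",
        "status": "False",
        "reason": "ContainersNotInitialized",
        "lastTransitionTime": "2026-01-29T11:56:46Z",
    }],
    "initContainerStatuses": [{
        "name": "ensure-dns",
        "ready": False,
        "state": {"waiting": {"reason": "PodInitializing"}},
    }],
}

# 70 minutes after the last transition, matching the observed hang.
print(stuck_init_containers(SAMPLE_STATUS, parse_k8s_time("2026-01-29T13:06:46Z")))
# prints ['ensure-dns']
```

A non-empty result could be treated as a recovery trigger alongside the endpoint health checks, for example by deleting the pod as was done manually here. Whether deletion is always safe during initial bootstrap is a separate question that this sketch does not address.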

Related

  • #4503 - Current fix (doesn't cover this scenario)
  • #4354 / #4475 - Reverted approach
  • #1985 - Added ensure-dns init container
