etcd pod stuck in PodInitializing not detected by auto-recovery

## Problem

etcd auto-recovery ([#4503](https://github.com/openshift/hypershift/pull/4503)) does not detect pods stuck in init containers during initial HostedCluster bootstrap. Cluster remained stuck for 70+ minutes with no recovery triggered. 

- etcd-0 stuck in 'PodInitializing' - 'ensure-dns' init container blocked on DNS resolution.
- both [etcd-1](https://issues.redhat.com//browse/etcd-1) and [etcd-2](https://issues.redhat.com//browse/etcd-2) couldn't reach etcd-0 (TLS rejected IP connections since certs only have DNS SANs) 
- Manual 'kubectl delete' command fixed it immediately

## Logs
```json
{"level":"warn","ts":"2026-01-29T11:56:49.557821Z","caller":"etcdserver/cluster_util.go:294","msg":"failed to reach the peer URL","address":"https://etcd-0.etcd-discovery..svc:2380/version","error":"dial tcp: lookup etcd-0.etcd-discovery..svc on 172.30.0.10:53: no such host"}
{"level":"warn","ts":"2026-01-29T11:56:45.128004Z","caller":"embed/config_logging.go:161","msg":"rejected connection on peer endpoint","remote-addr":"10.130.5.160:39932","ip-addresses":[],"dns-names":[".etcd-discovery..svc",".etcd-discovery..svc.cluster.local","127.0.0.1","::1"],"error":"tls: "10.130.5.160" does not match any of DNSNames"}
```
Results in the status
```yaml
status:                                                                                                                                                                                                                                                                                                                  
  conditions:                                                                                                                                                                                                                                                                                                            
  - lastTransitionTime: "2026-01-29T11:56:46Z"                                                                                                                                                                                                                                                                           
    message: 'containers with incomplete status: [ensure-dns reset-member]'                                                                                                                                                                                                                                              
    reason: ContainersNotInitialized                                                                                                                                                                                                                                                                                     
    status: "False"                                                                                                                                                                                                                                                                                                      
    type: Initialized                                                                                                                                                                                                                                                                                                    
  phase: Pending                                                                                                                                                                                                                                                                                                         
  initContainerStatuses:                                                                                                                                                                                                                                                                                                 
  - name: ensure-dns                                                                                                                                                                                                                                                                                                     
    ready: false                                                                                                                                                                                                                                                                                                         
    state:                                                                                                                                                                                                                                                                                                               
      waiting:                                                                                                                                                                                                                                                                                                           
        reason: PodInitializing   
```

**Note:** 3 other HostedClusters were created simultaneously and succeeded. This appears to be an intermittent race condition during parallel pod boostrap

Current recovery only monitors etcd endpoint health checks, not init container hangs

### Related
- [#4503](https://github.com/openshift/hypershift/pull/4503) - Current fix (doesn't cover this scenario)                                                                                                                                                                                                                 
- [#4354](https://github.com/openshift/hypershift/pull/4354) / [#4475](https://github.com/openshift/hypershift/pull/4475) - Reverted approach                                                                                                                                                                            
- [#1985](https://github.com/openshift/hypershift/pull/1985) - Added `ensure-dns` init container

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

etcd pod stuck in PodInitializing not detected by auto-recovery #7602

Problem

Logs

Related

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

etcd pod stuck in PodInitializing not detected by auto-recovery #7602

Description

Problem

Logs

Related

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions