# Spire Server Pods Stuck in Pod Initializing

## Description

There is a known issue where, if the Cray Spire servers start while Cray Spire Postgres is not ready, the pod becomes stuck in a `PodInitializing` state and cannot be restarted.

Starting the Cray Spire server when its Postgres cluster is not ready causes the pod to crash. However, because of how the Cray Spire servers are registered, when the main process crashes a loop in a side process is never killed. As a result, the pod crashes but its running container is never cleaned up, so the pod cannot be restarted. The only way to recover is to restart containerd on the node, which forcefully cleans up the running container.
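The leftover container can often be observed directly on the affected node. The following is a minimal sketch, assuming `crictl` is available on the node and that the container name contains `spire-server`; adjust the name filter to match the actual deployment.

```bash
# Run on the node hosting the stuck pod.
# List containers (including exited ones) whose name matches "spire-server".
# The name filter is an assumption for illustration.
crictl ps -a --name spire-server
```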

## Symptoms

* The `cray-spire-server` pods may be in a `PodInitializing` state.

* The `cray-spire-agent` pods may be in an `Init:CrashLoopBackOff` state.

* Services may fail to acquire tokens from the `cray-spire-server`.

* The `cray-spire-server` pods contain the following error in their logs:

    time="2024-10-25T10:13:50Z" level=info msg="Opening SQL database" db_type=postgres subsystem_name=built-in_plugin.sql
    time="2024-10-25T10:13:50Z" level=error msg="Fatal run error" error="datastore-sql: dial tcp: lookup cray-spire-postgres-pooler.spire.svc.cluster.local: no such host"
    time="2024-10-25T10:13:50Z" level=error msg="Server crashed" error="datastore-sql: dial tcp: lookup cray-spire-postgres-pooler.spire.svc.cluster.local: no such host"
    
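To search for this error without opening the full logs, the pod's containers can be queried from a node with `kubectl` access. This is a sketch only; logs may be empty or unavailable while the pod is stuck in `PodInitializing`.

```bash
# Search all containers of cray-spire-server-0 for the fatal datastore error.
kubectl logs -n spire cray-spire-server-0 --all-containers | grep -i "Fatal run error"
```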

## Solution

### Apply workaround

1. Find the node that the first Cray Spire server is attempting to start on.

    ```bash
    kubectl get pods -n spire -o wide | grep cray-spire-server-0
    ```

    Example output:

    ```text
    cray-spire-server-0                                0/2     PodInitializing         0                5h32m   10.34.0.129   ncn-w004   <none>           <none>
    ```
2. Verify that Postgres is running.

    ```bash
    kubectl get pods -n spire | grep cray-spire-postgres
    ```
3. Delete the pod.

    ```bash
    kubectl delete pod -n spire cray-spire-server-0
    ```
4. SSH to the node where the pod was running and restart containerd.

    ```bash
    ssh ncn-w004 systemctl restart containerd
    ```
5. Check that the `cray-spire-server` pods started up.

    ```bash
    kubectl get pods -n spire
    ```
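
Optionally, the recovery can be watched until the server pod reports ready. This is a minimal sketch, assuming the pod is named `cray-spire-server-0`; the timeout value is arbitrary.

```bash
# Wait up to 5 minutes for cray-spire-server-0 to report Ready.
kubectl wait --for=condition=Ready pod/cray-spire-server-0 -n spire --timeout=300s

# Then confirm the remaining Spire pods are Running.
kubectl get pods -n spire | grep cray-spire
```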