There is a known issue where, if the Cray Spire servers are started while Cray Spire Postgres is not ready, the pod becomes stuck in a `PodInitializing` state and cannot be restarted.

Starting the Cray Spire server when its Postgres cluster is not ready causes the main server process to crash. However, because of how the Cray Spire servers are registered, when the main process crashes a side process is left running in a loop that never gets killed. As a result, the pod crashes but is never cleaned up so that it can be restarted. There is no way to manually restart that pod without restarting `containerd` to forcefully clean up the running container.
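The stuck container can be confirmed directly on the affected node. The following is a minimal sketch, assuming `crictl` is available on the worker node and that the container name contains `spire-server`:

```bash
# List all containers (including exited ones) whose names match spire-server.
# A container that lingers here after the pod has crashed indicates this issue.
crictl ps --all --name spire-server
```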
- The `cray-spire-server` pods may be in a `PodInitializing` state.
- The `cray-spire-agent` pods may be in an `Init:CrashLoopBackOff` state.
- Services may fail to acquire tokens from the `cray-spire-server`.
- The `cray-spire-server` pods contain the following error in the logs:

  ```text
  time="2024-10-25T10:13:50Z" level=info msg="Opening SQL database" db_type=postgres subsystem_name=built-in_plugin.sql
  time="2024-10-25T10:13:50Z" level=error msg="Fatal run error" error="datastore-sql: dial tcp: lookup cray-spire-postgres-pooler.spire.svc.cluster.local: no such host"
  time="2024-10-25T10:13:50Z" level=error msg="Server crashed" error="datastore-sql: dial tcp: lookup cray-spire-postgres-pooler.spire.svc.cluster.local: no such host"
  ```
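  The logs can be retrieved with a command along these lines; this is a sketch, and it assumes the server container inside the pod is named `spire-server`:

  ```bash
  # Fetch logs from the spire-server container; add --previous if the
  # container has already crashed and restarted.
  kubectl logs -n spire cray-spire-server-0 -c spire-server
  ```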
- Find the node that the first Cray Spire server is attempting to start on.

  ```bash
  kubectl get pods -n spire -o wide | grep cray-spire-server-0
  ```

  Output example:

  ```text
  cray-spire-server-0    0/2    PodInitializing    0    5h32m    10.34.0.129    ncn-w004    <none>    <none>
  ```
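  The node name can also be captured directly for use in the later steps; a small sketch (the `NODE` variable is a hypothetical helper, and the node will vary by system):

  ```bash
  # Read the scheduled node name straight from the pod spec.
  NODE=$(kubectl get pod -n spire cray-spire-server-0 -o jsonpath='{.spec.nodeName}')
  echo "${NODE}"
  ```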
- Verify that Postgres is running.

  ```bash
  kubectl get pods -n spire | grep cray-spire-postgres
  ```
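  If the pods are up but cluster health is still in question, the operator's view of the cluster can also be checked. This is a sketch that assumes the cluster is managed by a Postgres operator exposing the `postgresql` custom resource:

  ```bash
  # The STATUS column should report Running for a healthy cluster.
  kubectl get postgresql -n spire cray-spire-postgres
  ```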
- Delete the pod.

  ```bash
  kubectl delete pod -n spire cray-spire-server-0
  ```
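  Note that `cray-spire-server-0` appears to be managed by a StatefulSet, so a replacement pod is normally created right away; because the crashed container is never cleaned up, the replacement can remain stuck until `containerd` is restarted in the next step:

  ```bash
  # Confirm the state of the replacement pod; it may still show PodInitializing.
  kubectl get pods -n spire | grep cray-spire-server-0
  ```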
- SSH to the node it was running on and restart `containerd`.

  ```bash
  ssh ncn-w004
  systemctl restart containerd
  ```
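  After the restart, it can be worth confirming that `containerd` came back up cleanly before returning to Kubernetes; a minimal check:

  ```bash
  # Should print "active" once the service has restarted.
  systemctl is-active containerd
  ```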
- Check that the `cray-spire-server` started up.

  ```bash
  kubectl get pods -n spire
  ```
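  Based on the earlier example output, a healthy `cray-spire-server` pod should report `2/2` containers `Running`, and the `cray-spire-agent` pods should leave the `Init:CrashLoopBackOff` state. To follow the recovery as it happens:

  ```bash
  # Re-run the pod listing every two seconds until the pods settle.
  watch "kubectl get pods -n spire"
  ```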