When deploying the final NCN, at times etcd may fail to rejoin the etcd cluster.
This procedure provides steps to recover from this issue.
- This procedure requires root privileges.
- The etcd cluster on master NCNs has two healthy members.
-
Identify unhealthy member.
Run
etcdctl member list
on each master node.etcdctl --endpoints https://127.0.0.1:2379 --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt member list
Example output from healthy master node (assuming run from
ncn-m002
):60a0d077eb0db20f, started, ncn-m001, https://10.252.1.4:2380, https://10.252.1.4:2379,https://127.0.0.1:2379, false b0dd65d7036d6932, started, ncn-m003, https://10.252.1.6:2380, https://10.252.1.6:2379,https://127.0.0.1:2379, false c0d7b0944e709721, started, ncn-m002, https://10.252.1.5:2380, https://10.252.1.5:2379,https://127.0.0.1:2379, false
Example output from unhealthy master node (assuming run from
ncn-m001
):{"level":"warn","ts":"2023-03-06T17:44:25.725Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00022e000/#initially=[https://127.0.0.1:2379]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""} Error: context deadline exceeded
Given the above,
ncn-m001
is unhealthy, and the remainder of these steps provide an example of how to remove and re-addncn-m001
back into the etcd cluster. -
Stop etcd on the unhealthy NCN (
ncn-m001
is used as an example):systemctl stop etcd
-
Remove and re-add the unhealthy member from the cluster on a healthy NCN (
ncn-m002
in this example):Determine the
member id
,name
(same as NCN name), andpeer-urls
from current member list from output above:60a0d077eb0db20f, started, ncn-m001, https://10.252.1.4:2380, https://10.252.1.4:2379,https://127.0.0.1:2379, false ^^^^^^^^^^^^^^^^ ^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^ member id name peer-urls
Using the values above, remove the member:
etcdctl --endpoints https://127.0.0.1:2379 --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt member remove 60a0d077eb0db20f
Example output:
Member 60a0d077eb0db20f removed from cluster f1c6e6ee71e931c3
Then re-add the member:
etcdctl --endpoints https://127.0.0.1:2379 --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt member add ncn-m001 --peer-urls=https://10.252.1.4:2380
Example output:
Member be55f20f284cbc1b added to cluster f1c6e6ee71e931c3 ETCD_NAME="ncn-m001" ETCD_INITIAL_CLUSTER="ncn-m003=https://10.252.1.6:2380,ncn-m001=https://10.252.1.4:2380,ncn-m002=https://10.252.1.5:2380" ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.252.1.4:2380" ETCD_INITIAL_CLUSTER_STATE="existing"
Member list should now show
ncn-m001
asunstarted
:etcdctl --endpoints https://127.0.0.1:2379 --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt member list
Example output:
c0d7b0944e709721, started, ncn-m002, https://10.252.1.5:2380, https://10.252.1.5:2379,https://127.0.0.1:2379, false b0dd65d7036d6932, started, ncn-m003, https://10.252.1.6:2380, https://10.252.1.6:2379,https://127.0.0.1:2379, false be55f20f284cbc1b, unstarted, , https://10.252.1.4:2380, , false
-
Remove etcd member data on unhealthy NCN (
ncn-m001
):rm -rf /var/lib/etcd/member
-
Set the etcd service to start as
existing
(ncn-m001
):sed -i 's/new/existing/' /etc/systemd/system/etcd.service /srv/cray/resources/common/etcd/etcd.service systemctl daemon-reload
-
Start etcd on the unhealthy NCN (
ncn-m001
):systemctl start etcd
-
Member list on all three masters should now show
ncn-m001
back in the cluster with a new member id:etcdctl --endpoints https://127.0.0.1:2379 --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt member list
Example output:
b0dd65d7036d6932, started, ncn-m003, https://10.252.1.6:2380, https://10.252.1.6:2379,https://127.0.0.1:2379, false be55f20f284cbc1b, started, ncn-m001, https://10.252.1.4:2380, https://10.252.1.4:2379,https://127.0.0.1:2379, false c0d7b0944e709721, started, ncn-m002, https://10.252.1.5:2380, https://10.252.1.5:2379,https://127.0.0.1:2379, false
-
At this point, etcd is healthy. If the NCN has yet to join the K8S cluster, running the following script should join it now that etcd is healthy:
/srv/cray/scripts/common/kubernetes-cloudinit.sh