Description
Asked in https://kubernetes.slack.com/archives/CQSNS615F/p1639651250121300, recommended to bring here.
I’m seeing an issue where my Solr cloud (3 replicas) doesn’t recover when cert-manager renews the certificate. It seems to be tied to my updateStrategy.
What I’m seeing:
- The certificate is renewed by cert-manager at 60 days (expiry is 90 days, the default for Let's Encrypt), and the restartOnTLSSecretUpdate setting starts to restart the cluster:
    solrTLS:
      restartOnTLSSecretUpdate: true
      pkcs12Secret:
        name: solr-tls
        key: keystore.p12
      keyStorePasswordSecret:
        name: pkcs12-keystore-password
        key: password-key
- The settings below ensure only 1 pod restarts at a time, so we still have 2 active pods to serve requests:
    updateStrategy:
      managed:
        maxPodsUnavailable: 1
        maxShardReplicasUnavailable: 1
- After the 1 pod restarts, its collections are shown as having a “Down” status. In the logs I see:
    2021-12-16 10:15:14.746 ERROR (recoveryExecutor-11-thread-3-processing-n:pod-1.default:443_solr x:transaction_10_shard1_replica_n5 c:transaction_10 s:shard1 r:core_node6) [c:transaction_10 s:shard1 r:core_node6 x:transaction_10_shard1_replica_n5] o.a.s.c.RecoveryStrategy Failed to connect leader https://pod-2.default:443/solr on recovery, try again
- The cluster is stuck in this state, as the updateStrategy settings won’t allow the other 2 pods to restart while the 1 pod’s collections are “Down”.
  The error seems to indicate that the pod that has restarted can’t communicate with the pods that haven’t restarted yet, presumably because of the certificates.
- A (bad) “workaround” I found is that if I set the below, then all 3 pods restart and come up healthy. But this obviously has the downside of a short downtime.
    updateStrategy:
      managed:
        maxPodsUnavailable: 3
        maxShardReplicasUnavailable: 3

I don’t think the active certs would have expired, as Let’s Encrypt certs have a 90-day duration and they are renewed at 60 days (the cert-manager default is to renew at 2/3 of the duration). See https://cert-manager.io/docs/usage/certificate/#renewal & https://cert-manager.io/docs/faq/#if-renewbefore-or-duration-is-not-defined-what-will-be-the-default-value
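To make the renewal arithmetic concrete, here is a small Python sketch of the 2/3 rule described above. The function name and the issue date are hypothetical; the date was picked only so that the renewal lands on the day of the log line. It is not cert-manager code, just the same calculation.

```python
from datetime import datetime, timedelta, timezone


def default_renewal_time(not_before: datetime, not_after: datetime) -> datetime:
    """Renewal time under the 2/3-of-duration default (hypothetical helper)."""
    duration = not_after - not_before
    return not_before + duration * 2 / 3


# Assumed issue date; a 90-day certificate as issued by Let's Encrypt.
issued = datetime(2021, 10, 17, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)
renew_at = default_renewal_time(issued, expires)

print(renew_at.date())            # day the renewal is triggered
print((expires - renew_at).days)  # days of validity left on the old cert
```

Under these assumptions the old certificate still has 30 days of validity left at renewal time, so plain expiry would not explain the failure.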
I wonder if the act of cert-manager renewing the certificate invalidates the active one, though? I can’t find this specifically in the docs. If so, that would be a problem. It would also explain why I still saw this problem when triggering a manual renewal using the cert-manager cmctl CLI (https://cert-manager.io/docs/usage/cmctl/#renew). If this is the case, we would need restartOnTLSSecretUpdate to be able to ignore the updateStrategy.managed.max*Unavailable settings.
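To illustrate why the rollout gets stuck, and what ignoring the limit would change, here is a simplified Python model. It is only a sketch of the behaviour I observed, not the operator's actual implementation: the `Pod` class, the function name, and the pod name pod-0 are hypothetical, and it only models maxPodsUnavailable (not maxShardReplicasUnavailable).

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    up_to_date: bool  # already restarted with the renewed TLS secret
    healthy: bool     # all of its replicas are active


def pods_allowed_to_restart(pods, max_pods_unavailable):
    """Names of out-of-date pods that may restart now (simplified model)."""
    unavailable = sum(1 for p in pods if not p.healthy)
    budget = max(0, max_pods_unavailable - unavailable)
    candidates = [p.name for p in pods if not p.up_to_date and p.healthy]
    return candidates[:budget]


# State after the first restart: pod-1 is back but its replicas are "Down".
stuck = [
    Pod("pod-0", up_to_date=False, healthy=True),
    Pod("pod-1", up_to_date=True, healthy=False),
    Pod("pod-2", up_to_date=False, healthy=True),
]

# With a budget of 1, the down pod uses it up and nothing else may restart.
print(pods_allowed_to_restart(stuck, max_pods_unavailable=1))

# With a budget of 3 (the workaround), the remaining pods may restart too.
print(pods_allowed_to_restart(stuck, max_pods_unavailable=3))
```

In this model the first call returns an empty list, which matches the stuck state, and the second returns the two remaining pods.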
Sort of related issue: cert-manager/cert-manager#1168, but Solr has restartOnTLSSecretUpdate for our pods.