auto certificate renewal with restartOnTLSSecretUpdate and cert-manager fails #390

@nosvalds

I asked about this in https://kubernetes.slack.com/archives/CQSNS615F/p1639651250121300 and was advised to bring it here.

I’m seeing an issue where my Solr cloud (3 replicas) doesn’t recover when cert-manager renews the certificate. It seems to be tied to my updateStrategy settings.
What I’m seeing:

  • cert-manager renews the certificate at 60 days (expiry is 90 days, the default for Let’s Encrypt), and the restartOnTLSSecretUpdate setting starts to restart the cluster:
solrTLS:
    restartOnTLSSecretUpdate: true
    pkcs12Secret:
      name: solr-tls
      key: keystore.p12
    keyStorePasswordSecret:
      name: pkcs12-keystore-password
      key: password-key
  • The settings below ensure that only 1 pod restarts at a time, so 2 active pods remain to serve requests:
  updateStrategy:
    managed:
      maxPodsUnavailable: 1
      maxShardReplicasUnavailable: 1

After the first pod restarts, its collections are shown with a “Down” status. In the logs I see:

2021-12-16 10:15:14.746 ERROR (recoveryExecutor-11-thread-3-processing-n:pod-1.default:443_solr x:transaction_10_shard1_replica_n5 c:transaction_10 s:shard1 r:core_node6) [c:transaction_10 s:shard1 r:core_node6 x:transaction_10_shard1_replica_n5] o.a.s.c.RecoveryStrategy Failed to connect leader https://pod-2.default:443/solr on recovery, try again
  • The cluster is stuck in this state, because the updateStrategy settings won’t allow the other 2 pods to restart while the first pod’s collections are “Down”.
    The error seems to indicate that the pod that has restarted can’t communicate with the pods that haven’t restarted yet, presumably because of the certificates.

  • A (bad) “workaround” I found is that if I set the values below, all 3 pods restart and come up healthy. This obviously has the downside of a short downtime.

updateStrategy:
    managed:
      maxPodsUnavailable: 3
      maxShardReplicasUnavailable: 3
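
To make the stall and the workaround concrete, here is a toy model of what I observe. It is only a sketch of the symptoms, not the operator's actual algorithm: the function name, the three states, and the rule that a restarted pod only recovers once no pod with the old certificate remains are all my assumptions, and maxShardReplicasUnavailable is ignored.

```python
def simulate_rollout(num_pods: int, max_pods_unavailable: int) -> list[str]:
    """Toy model of the observed restart behaviour (hypothetical, not operator code).

    Pod states:
      'old'     = still running with the previous certificate
      'down'    = restarted with the new certificate, replicas cannot recover
      'healthy' = restarted and recovered
    """
    states = ["old"] * num_pods
    while "old" in states:
        # Pods that are already down count against the unavailability budget.
        budget = max_pods_unavailable - states.count("down")
        if budget <= 0:
            break  # no further restarts allowed: the rollout is stuck

        # Restart as many not-yet-restarted pods as the budget allows.
        restarted = 0
        for i, state in enumerate(states):
            if state == "old" and restarted < budget:
                states[i] = "down"
                restarted += 1

        # Assumption: restarted pods recover only when every pod has restarted.
        if "old" not in states:
            states = ["healthy"] * num_pods
    return states


if __name__ == "__main__":
    print(simulate_rollout(3, 1))  # my original settings
    print(simulate_rollout(3, 3))  # the workaround
```

With a budget of 1 the model stops after the first restart, matching the stuck state above; with a budget of 3 every pod restarts together and recovers, matching the workaround.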

I don’t think the active certs had expired: Let’s Encrypt certs have a 90-day duration and they were renewed at 60 days (the cert-manager default is to renew at 2/3 of the duration). See https://cert-manager.io/docs/usage/certificate/#renewal and https://cert-manager.io/docs/faq/#if-renewbefore-or-duration-is-not-defined-what-will-be-the-default-value
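
The renewal timing can be checked with a quick calculation. This is a minimal sketch that assumes the default described above (renew at 2/3 of the duration, i.e. a renewBefore of 1/3); the function name and the issue date are made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def renewal_time(not_before: datetime,
                 duration: timedelta,
                 renew_before: Optional[timedelta] = None) -> datetime:
    """Return when renewal is attempted for a certificate.

    Assumes that, when renew_before is not given, renewal happens at 2/3
    of the certificate duration (renew_before = duration / 3).
    """
    if renew_before is None:
        renew_before = duration / 3
    return not_before + duration - renew_before


if __name__ == "__main__":
    issued = datetime(2021, 1, 1, tzinfo=timezone.utc)  # hypothetical issue date
    duration = timedelta(days=90)                       # Let's Encrypt duration
    renew_at = renewal_time(issued, duration)
    expires_at = issued + duration
    print("renewal attempted:", renew_at.isoformat())
    print("old cert expires: ", expires_at.isoformat())
    print("overlap:          ", expires_at - renew_at)
```

Under these assumptions the old certificate still has 30 days left at renewal time, so expiry alone should not explain the failure.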

I wonder, though, whether the act of cert-manager renewing the certificate invalidates the active one. I can’t find this stated specifically in the docs. If so, that would be a problem. It would also explain why I still saw this problem when triggering a manual renewal with the cert-manager cmctl CLI (https://cert-manager.io/docs/usage/cmctl/#renew). If this is the case, restartOnTLSSecretUpdate would need to be able to ignore the updateStrategy.managed.max*Unavailable settings.

A somewhat related issue is cert-manager/cert-manager#1168, although in our case Solr already has restartOnTLSSecretUpdate to restart the pods.
