auto certificate renewal with restartOnTLSSecretUpdate and cert-manager fails #390

Asked in https://kubernetes.slack.com/archives/CQSNS615F/p1639651250121300, recommended to bring here.

I’m seeing an issue where my SolrCloud cluster (3 replicas) doesn’t recover when cert-manager renews the certificate. It seems to be tied to my `updateStrategy`.
What I’m seeing:

- cert-manager renews the certificate at 60 days (expiry is 90 days, the default for Let’s Encrypt), and the `restartOnTLSSecretUpdate` setting starts restarting the cluster

```yaml
solrTLS:  
    restartOnTLSSecretUpdate: true
    pkcs12Secret:
      name: solr-tls
      key: keystore.p12
    keyStorePasswordSecret:
      name: pkcs12-keystore-password
      key: password-key
```

- The settings below ensure that only 1 pod restarts at a time, so we still have 2 active pods to serve requests

```yaml
  updateStrategy:
    managed:
      maxPodsUnavailable: 1
      maxShardReplicasUnavailable: 1
```
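For context, here is a minimal sketch of how the two fragments above might sit together in a single SolrCloud resource. It is not my full manifest: the resource name is a placeholder, the `apiVersion` assumes a recent Solr Operator release, and `method: Managed` is written out explicitly as an assumption about how the `managed` options get applied.

```yaml
apiVersion: solr.apache.org/v1beta1   # assumes a recent Solr Operator release
kind: SolrCloud
metadata:
  name: example                       # placeholder name
spec:
  replicas: 3
  solrTLS:
    restartOnTLSSecretUpdate: true    # restart pods when the TLS secret changes
    pkcs12Secret:
      name: solr-tls
      key: keystore.p12
    keyStorePasswordSecret:
      name: pkcs12-keystore-password
      key: password-key
  updateStrategy:
    method: Managed                   # assumed; lets the operator drive the restarts
    managed:
      maxPodsUnavailable: 1           # at most one pod down at a time
      maxShardReplicasUnavailable: 1  # at most one replica of any shard down
```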

After the first pod restarts, its collections are shown with a “Down” status, and in the logs I see:

```text
2021-12-16 10:15:14.746 ERROR (recoveryExecutor-11-thread-3-processing-n:pod-1.default:443_solr x:transaction_10_shard1_replica_n5 c:transaction_10 s:shard1 r:core_node6) [c:transaction_10 s:shard1 r:core_node6 x:transaction_10_shard1_replica_n5] o.a.s.c.RecoveryStrategy Failed to connect leader https://pod-2.default:443/solr on recovery, try again
```

- The cluster is stuck in this state, as the `updateStrategy` settings won’t allow the other 2 pods to restart while the first pod’s collections are “Down”.
The error seems to indicate that the pod that has restarted can’t communicate with the pods that haven’t restarted yet, presumably because of the certificates.

- A (bad) “workaround” I found is that if I set the values below, all 3 pods restart at once and come up healthy. But this obviously has the downside of a short downtime.

```yaml
updateStrategy:
    managed:
      maxPodsUnavailable: 3
      maxShardReplicasUnavailable: 3
```

I don’t think the active certs would have expired, as Let’s Encrypt certs have a 90-day duration and they were renewed at 60 days (cert-manager’s default is to renew at 2/3 of the duration). See https://cert-manager.io/docs/usage/certificate/#renewal and https://cert-manager.io/docs/faq/#if-renewbefore-or-duration-is-not-defined-what-will-be-the-default-value
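To make the timing concrete, the setup described here corresponds roughly to a cert-manager `Certificate` like the sketch below. The certificate name, DNS name, and issuer name are placeholders, not my real values. The `renewBefore: 720h` line only spells out the default behaviour (30 days before a 90-day expiry, i.e. renewal at day 60); the 90-day lifetime itself comes from Let’s Encrypt.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: solr-tls-cert             # placeholder name
spec:
  secretName: solr-tls            # the Secret referenced by solrTLS.pkcs12Secret
  renewBefore: 720h               # 30 days before expiry = day 60 of a 90-day cert
  dnsNames:
    - solr.example.com            # placeholder hostname
  issuerRef:
    name: letsencrypt-prod        # placeholder issuer name
    kind: ClusterIssuer
  keystores:
    pkcs12:
      create: true                # adds keystore.p12 to the Secret
      passwordSecretRef:
        name: pkcs12-keystore-password
        key: password-key
```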

I wonder if the act of cert-manager renewing the certificate invalidates the active one, though? I can’t find this stated specifically in the docs. If so, that would be a problem. It would also explain why I still saw this problem when triggering a manual renewal with the cert-manager `cmctl` CLI (https://cert-manager.io/docs/usage/cmctl/#renew). If this is the case, we would need `restartOnTLSSecretUpdate` to be able to ignore the `updateStrategy.managed.max*Unavailable` settings.

Sort of related issue: https://github.com/jetstack/cert-manager/issues/1168, but Solr has `restartOnTLSSecretUpdate` for our pods.


