During the initial deployment of the production namespace to the Kubernetes cluster, @jsnshrmn made configuration changes to the load balancers outside of kubectl and did not afterwards update the certificates specified in the Kubernetes service templates.
Weeks later, after reorganizing the service templates, @jsnshrmn redeployed all services, which applied the incorrect certificates to the staging and production endpoints.
All times in CDT.
During the initial deployment of the production namespace to the Kubernetes cluster, we (@danielballan, @jsnshrmn, @Mr0grog) hit a snag getting appropriate TLS certificates provisioned, due to our DNS configuration. While troubleshooting the issue, @jsnshrmn carelessly adds a permanent exception in his browser to allow the invalid certificate.
We realize that @danielballan had manually obtained the appropriate certificates in advance. @jsnshrmn manually applies those certificates to the load balancers in the AWS console, but does not update the certificates specified in the Kubernetes service templates to match. Services are operational, but the conditions for failure have been set for the next deployment of services.
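For context, a Kubernetes Service of type LoadBalancer on AWS pins the certificate the ELB serves via an annotation in the service template, so a template that still references the old certificate will silently revert any change made by hand in the console on the next deploy. A minimal sketch, with a hypothetical service name and placeholder certificate ARN:

```yaml
# Sketch of a Service template that pins the ELB certificate.
# The service name and certificate ARN are placeholders, not the
# actual values from this incident.
apiVersion: v1
kind: Service
metadata:
  name: web-monitoring-api        # hypothetical name
  annotations:
    # Whatever ARN is listed here gets applied to the ELB listener on every
    # `kubectl apply`, overwriting changes made by hand in the AWS console.
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: arn:aws:acm:us-east-1:123456789012:certificate/EXAMPLE
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
spec:
  type: LoadBalancer
  ports:
    - port: 443
      targetPort: 8080
```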
Weeks later, @jsnshrmn deploys the reorganized service configuration to staging (and the incorrect certs along with it), and performs a sanity check by visiting the staging UI and API endpoints in his browser. This looks fine on the surface: his browser has a permanent exception allowing the invalid certificate, and he isn't looking very closely at the security status of the endpoints.
@jsnshrmn deploys the reorganized service configuration to production (and the incorrect certs along with it), and performs the same sanity check by visiting the production UI and API endpoints in his browser. It looks fine for the same reason.
Sentry begins sending alerts because versionista-scraper is encountering SSL errors while connecting to the production API endpoint.
@jsnshrmn reports the issue on Slack and begins investigating, suspecting that it is related to the service template changes but confused by the apparent functional status of the endpoints in his browser. Upon running curl against the endpoints, he discovers the source of the false negative in his earlier checks. He removes the certificate exceptions from his browser.
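A check from outside the browser avoids any stored certificate exceptions. Something along these lines (the hostname is a placeholder, not the actual production endpoint) would have surfaced the mismatch immediately:

```sh
# Placeholder hostname; substitute the real staging/production endpoint.
HOST=api.example.org

# curl performs full certificate verification by default and fails loudly
# on an invalid or mismatched certificate.
curl -vI "https://${HOST}/" 2>&1 | grep -i -E 'SSL|certificate|subject|issuer'

# openssl shows exactly which certificate the load balancer is serving.
echo | openssl s_client -connect "${HOST}:443" -servername "${HOST}" 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates
```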
@jsnshrmn manually updates the certificates specified for the staging and production load balancers to the correct values. Services are restored.
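For reference, the same fix can be scripted rather than clicked through in the console. A sketch assuming classic ELBs (the type created by Kubernetes Services of type LoadBalancer on AWS) and a placeholder name and ARN:

```sh
# Placeholder load balancer name and certificate ARN.
LB_NAME=a1b2c3d4e5f6g7h8       # hypothetical ELB name created by Kubernetes
CERT_ARN=arn:aws:acm:us-east-1:123456789012:certificate/EXAMPLE

# Swap the certificate on the HTTPS listener of a classic ELB.
aws elb set-load-balancer-listener-ssl-certificate \
  --load-balancer-name "${LB_NAME}" \
  --load-balancer-port 443 \
  --ssl-certificate-id "${CERT_ARN}"
```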
@jsnshrmn corrects the certificates specified in the staging service template and redeploys. He verifies that they are truly correct after deployment.
@jsnshrmn corrects the certificates specified in the production service template and redeploys. He verifies that they are truly correct after deployment. He copies the amended service configurations back to Keybase and reports in on Slack.
Incident resolved.
All infrastructure operated as documented.
@jsnshrmn did not use a reasonable operational procedure during the initial deployment of the production namespace, nor did he adequately test the success of changes to the service configuration templates. Additionally, there were no automated post-deployment tests or endpoint monitoring to alert anyone to the issue until versionista-scraper ran and failed later that evening (a minimal check along the lines of the sketch after this list would have sufficed). If any of these factors had not been in play, this error would have:
- never happened, or
- been identified immediately after the staging deployment, or
- been identified immediately after the production deployment, instead of many hours later
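A minimal sketch of such a check, using hypothetical endpoint hostnames; run after each deploy (or on a schedule), it fails whenever an endpoint serves a certificate that does not validate:

```sh
#!/usr/bin/env bash
# Post-deployment TLS sanity check. The hostnames are placeholders for the
# staging and production UI/API endpoints; the real list belongs in config.
set -u
ENDPOINTS="staging-ui.example.org staging-api.example.org ui.example.org api.example.org"

status=0
for host in ${ENDPOINTS}; do
  # curl verifies the certificate chain and hostname by default; any TLS
  # problem (wrong cert, expired cert, name mismatch) makes it exit non-zero.
  if curl --silent --show-error --head "https://${host}/" > /dev/null; then
    echo "OK   ${host}"
  else
    echo "FAIL ${host}"
    status=1
  fi
done
exit ${status}
```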
- @jsnshrmn will identify and implement a basic service deployment test procedure
- @jsnshrmn will identify and implement a basic service monitoring and alerting process for our staging and production endpoints
- @jsnshrmn