ATC issues after a database failover #9314
Replies: 3 comments 1 reply
-
I suspect some minor flaps in the database might also have caused this, and I am trying to confirm. In any case, does the ATC cache anything for fly execute/workers? Other builds seem to be fine. I wonder if our endpoint switched due to the failover and the old database endpoint stayed cached somewhere on the webs. I still do not understand why this would cause errors only for certain endpoints, though.
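One way to test the "endpoint switched and something kept the old one" theory is to record what the database hostname resolves to before and after a failover. The snippet below is only a minimal sketch of that check. It says nothing about how the ATC itself resolves or pools connections, and the hostname in the usage comment is a made-up placeholder, not something from our deployment.

```python
from __future__ import annotations

import socket


def resolve(host: str, port: int = 5432) -> set[str]:
    """Return the set of IP addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def endpoint_changed(before: set[str], after: set[str]) -> bool:
    """True when none of the previously seen addresses is still being served."""
    return bool(before) and before.isdisjoint(after)


# Usage (hypothetical hostname, run from a web VM):
#   before = resolve("db.example.internal")   # record this ahead of a failover
#   ... failover happens ...
#   after = resolve("db.example.internal")
#   print(endpoint_changed(before, after))
```

If this reports a change while the webs keep failing until they are restarted, that would at least point toward something long-lived on the web side still holding the old address.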
-
Okay, so the root cause was a DB failover. Since nothing changed in the configuration, I still wonder why only those specific endpoints had problems.
-
A good start might be something similar to #2209.
-
Occasionally, though rather rarely, we have seen some issues with the Concourse ATC that we are still not able to reproduce well enough to open an issue in the community.
Using fly ws to investigate, the worker is there, in a healthy and running state.
The mitigation/fix is to restart the web VMs (since it is a bosh deployment), which resolves this, but I still suspect there is an underlying bug we are missing. I tried collecting some data, but since a lot of it is confidential, I can share specific chunks if/when requested, so that I can redact the parts that might not be okay to upload.
The idea of this discussion was to see if other people have seen something similar and could provide some insights from their end. Could be bosh specific as well, although it looks like an application issue.
Some useful info:
Runtime: guardian (currently switching to containerd, which might fix this magically if the issue is somehow not with the ATC)
Env: bosh
Version: v7.14.1