With GoCD Version: 24.2.0 and Kubernetes Elastic Agent plugin: 4.1.0-541, agents are killed after 10 minutes, even when the job has not finished execution yet. #422
Comments
What does the job console show for the job? Did the pod register and start running tasks? Did it get rescheduled? What do the plugin-specific logs on the server say? (There is a log file specific to the plugin in the logs folder, not just the server log file.) What does Kubernetes say happened to the pod? Did Kubernetes kill the pod or the container? Is this a new setup, or did something change before this started happening? (Cluster upgrade? GoCD upgrade?)
kublet-output.log
plugin-output.log
Thanks!
You should be able to GPG-encrypt with my key at https://github.com/chadlwilson/chadlwilson/blob/main/gpg-public-key.asc if you'd prefer, but it's usually better to just redact. The bits I am interested in are the "yellow" stuff regarding the server's view of what happened to the job, the job allocation to a pod, and the timestamps, really - not the specific task details while it is running.

OK, so what I can see from that is that the pod was deleted by an API call based on
So the working theory here is perhaps that the GoCD server EKS elastic agent plugin deleted it? I'm not sure what other things could be doing that. That log for
That is 45 seconds later noting that it didn't receive a ping, but that is presumably after the pod has been deleted/killed, so possibly just the plugin cleaning up. However, this does seem a bit weird: if the plugin decided to delete the pod due to non-response, it shouldn't still be waiting for a ping. Hmm.

What do the logs for the actual agent pod say, up until it is deleted/killed/stopped? You can redact the task-specific lines if you say it is running tasks, but it should at least tell us what was happening and whether there were any comms issues back to the GoCD server that might cause it to lose contact, or whether it ran out of memory or something. Is there anything in the kube events for the pod, or the lifecycle changes for the pod, that looks interesting? e.g. OOMKilled events, health or liveness check failures, or pod sandbox problems? (Some of those would normally be in the kubelet log too, but I can't recall off the top of my head, so excuse any ignorance.)

As a sanity check:
Hi Chad,

Thank you for your detailed analysis and the pointers. After a thorough investigation, we found the root cause of the problem. The issue was due to both the dev and prod GoCD servers using the same namespace in the cluster, causing conflicts and unexpected pod deletions. Now everything looks like it is working as expected.

Thanks for your help!

Best regards,
Yes, that makes sense. This could be better supported via the ideas at #118
Now we've encountered a new issue where the containers exceed their memory limits despite the resource constraints set in the Kubernetes configuration. No matter what we do (setting a limit at the namespace level, on the containers, etc.), the memory usage continues to exceed the specified limits. I suspect this issue may be related to the Java application (Tomcat).
I assume you're launching Tomcat inside a test or task? Java apps inside containers are tricky, especially with forked processes. While Java processes can be 'container aware' (Java 11+), when there are multiple such processes they are not aware of each other and may collectively request excessive memory. You'll need to control their heap usage when you launch them (see the sketch below).

This isn't really specific to GoCD; it's more the nature of multi-process containers, with extra complexity for heap-based (garbage-collected) applications and the way the JVM allocates (and doesn't always release) heap memory. There are some resources online for tuning containers for JVM ergonomics, but it depends a lot on your application and JVM usage.
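To make that concrete, here is a minimal sketch (not GoCD- or plugin-specific, assuming Java 11+ where container awareness is on by default) that you could run inside the agent container to see what limits each JVM actually derives. The class name and flag values are illustrative assumptions, not recommendations for your workload:

```java
// Prints what this JVM thinks its resource ceiling is when run inside a
// container. Each forked JVM in the same pod gets its own ceiling like this,
// independently of its siblings, which is how the pod total can exceed the
// Kubernetes memory limit.
public class ContainerMemoryCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long maxHeapMiB = rt.maxMemory() / (1024 * 1024);
        System.out.println("Available processors: " + rt.availableProcessors());
        System.out.println("Max heap (MiB):       " + maxHeapMiB);
        // With no explicit -Xmx, the default max heap is typically ~25% of the
        // container memory limit (default MaxRAMPercentage). To cap each
        // process explicitly, launch it with something like:
        //   java -XX:MaxRAMPercentage=50 ...   (percentage of the container limit)
        //   java -Xmx512m ...                  (absolute cap)
        // or export JAVA_TOOL_OPTIONS so forked JVMs pick the setting up too.
    }
}
```

Compiling and running this once with the defaults and once with an explicit `-Xmx`/`-XX:MaxRAMPercentage` should show whether your Tomcat and any forked test JVMs are each sizing themselves against the whole container limit.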
We have identified the issue. There were duplicate metrics from different services left over from monitoring tests. This caused confusion in Grafana when summing the metrics, making it appear that there was a problem with the limits. Thanks again!
We are experiencing an issue where agents are terminated after 10 minutes while jobs are still running. The setup is as follows:
Steps Taken:
EKS-Cluster-Profile:
Despite these adjustments, the issue persists. We would appreciate any guidance or suggestions for further troubleshooting.