rule-evaluator doesn't get updated alertmanager pod ipv4s #866
Hey @parkedwards - thanks for reaching out, and apologies for the delayed response. When it comes to BYO alertmanagers, the rule-evaluator uses the same underlying Kubernetes service discovery that Prometheus does. Specifically, our stack takes the same approach as prometheus-operator and uses endpoint-based service discovery to find the target to post alerts to. Now, if the alertmanager pod is rescheduled, presumably its IP changes and the corresponding Endpoints object is updated to match. Would you be able to check the state of the corresponding Endpoints in your cluster?
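For reference, one way to check is kubectl -n monitoring-alertmanager get endpoints <service> -o yaml; shortly after a reschedule, a healthy Endpoints object should already list the new pod IP. A rough sketch of what to look for, with illustrative names, port, and address (the actual service name and port in this setup may differ):

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: alertmanager                 # illustrative Service name
  namespace: monitoring-alertmanager
subsets:
  - addresses:
      - ip: 10.8.2.15                # should be the *current* Alertmanager pod IP
        targetRef:
          kind: Pod
          name: alertmanager-0
    ports:
      - name: web                    # named port that endpoint-based discovery matches on
        port: 9093
        protocol: TCP
```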
Hi @pintohutch, I just restarted the alert manager, and the current endpoints are:
Here are the logs from the rule evaluator:
It's still pointing at the previous endpoints (from before the restart). The issue persists until we restart the rule evaluator.
Interesting. Hey @volkanakcora - could you provide the rule-evaluator config so we can sanity check this on our side?
Hi @pintohutch, Please find our configs:
And here's our alert manager config:
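For orientation, a minimal Alertmanager configuration generally has the following shape; the route and receiver here are placeholders rather than the actual config from this setup:

```yaml
# Minimal, illustrative alertmanager.yml; real routing/receivers will differ.
route:
  receiver: default
  group_by: ['alertname']
  group_wait: 30s
  repeat_interval: 4h
receivers:
  - name: default
    webhook_configs:
      - url: http://example.invalid/alert-hook   # placeholder receiver endpoint
```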
Let me know if you need further info.
Hey @volkanakcora - do you have self-monitoring enabled? I'm curious what the Alertmanager discovery metric reports over time.
Before the Alert Manager restart:
Restarting the Alert Manager: kubectl -n monitoring-alertmanager rollout restart statefulset alertmanager
After the Alert Manager restart (no change, also during the restart):
Restarting the Rule Evaluator: kubectl rollout restart -n gmp-system deployment rule-evaluator
After the Rule Evaluator restart (new metrics):
Hey @volkanakcora - thanks for posting the metric details above. However, I'm more interested in the graph, which you posted in the subsequent comment. I would suspect that in the broken case the line stays at 4 (i.e. the discovery isn't toggled for some reason).
I haven't personally used Argo before. What does this do, and why do you think it may fix the issue?
Hi @pintohutch, that's actually the expected behavior, right? When the pod count drops to 3 and recovers to 4, it seems like ArgoCD's intervention worked as intended (similar to a statefulset restart, but through ArgoCD's GitOps workflow). In my initial comment I mentioned rescheduling the pods (via the command line) while the rule evaluator itself remained unchanged, so I think that original problem still stands, if I understood it correctly.
Oh wait - I actually think your debug logs from your second comment hold the clue. Specifically:
The discovery manager is not able to keep up with service discovery events (changes) in this case. Is your rule-evaluator resource-starved in any way? We would actually be able to track the delay through discovery metrics, but we don't enable those in our rule-evaluator (we should!) - I filed #973 to do that.
Hi @pintohutch, these are the logs/errors we consistently get after every alert manager rescheduling. I'm not sure whether the rule-evaluator has a resource issue; I had considered that as well, but didn't change anything on the rule evaluator side. Please see the last 7 days of metrics for the rule evaluator: Do you suggest we increase the rule evaluator resources and test again? Volkan.
Yes please, but it also depends on what limits you currently have.
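For context, a requests/limits bump on a Deployment such as rule-evaluator is usually expressed as a patch of the following shape, applied with something like kubectl -n gmp-system patch deployment rule-evaluator --patch-file patch.yaml. The container name and values below are purely illustrative, and the gmp-operator may reconcile manual edits to the workloads it manages:

```yaml
# Illustrative patch only: shows the shape of a resource bump,
# not the actual rule-evaluator manifest or recommended values.
spec:
  template:
    spec:
      containers:
        - name: evaluator        # hypothetical container name; check the actual Deployment
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              memory: 1Gi
```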
Hi @bwplotka,
Hi @bwplotka, @pintohutch, I have boosted the resources for the rule evaluator, alert manager, and gmp-operator. However, the result is still the same:
Restarting the entire alert manager application by deleting and recreating the statefulset seems to resolve the issue, but it's not a guaranteed fix. Sometimes, even this approach fails.
OK, thanks for trying and letting us know, @volkanakcora. It looks like you're running managed collection on GKE. Are you ok if we take a look from our side to help debug? We can open an internal ticket to track the support work. I wonder if it's related to prometheus/prometheus#13676...
Hi @pintohutch, that's OK for us, thank you. It could be related; I'm looking into it.
hello - we're currently using Managed Prometheus and a self-hosted Alertmanager deployment. This has been functioning properly for over a year. We're currently on this version of rule-evaluator:
Our rule-evaluator sends events to a self-managed Alertmanager statefulset, which lives in a separate namespace. We configure this via the OperatorConfig CRD:
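For reference, the alerting section of an OperatorConfig that points the rule-evaluator at a self-managed Alertmanager typically has this shape; the service name, namespace, and port below are placeholders rather than the values from this report:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
rules:
  alerting:
    alertmanagers:
      - name: alertmanager                  # Service fronting the self-managed Alertmanager
        namespace: monitoring-alertmanager  # Alertmanager runs in its own namespace
        port: 9093                          # service port the rule-evaluator should target
```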
In the last month or so, we've noticed that the rule-evaluator will be unable to resolve the downstream Alertmanager address after the Alertmanager pod is rescheduled. From there, we'll see the rule-evaluator log this out:
This can go on for an hour - we have pages set up to notify us when the rule-evaluator stops pinging Alertmanager through a custom heartbeat rule. The only way to resolve this is by restarting the rule-evaluator deployment.
This suggests that the rule-evaluator is not reconciling downstream IP addresses after startup, since we provide the k8s DNS components in the OperatorConfig for the Alertmanager receiver.
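Regarding the custom heartbeat rule mentioned above: an always-firing watchdog alert, whose absence at the receiving end indicates a broken alerting path, can be expressed with the GMP Rules CRD roughly as follows (names, labels, and namespace are illustrative):

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  namespace: monitoring-alertmanager   # illustrative; Rules is namespace-scoped
  name: heartbeat
spec:
  groups:
    - name: heartbeat
      interval: 30s
      rules:
        - alert: Watchdog
          expr: vector(1)               # always firing; silence downstream means the path is broken
          labels:
            severity: none
          annotations:
            description: Heartbeat alert verifying the rule-evaluator to Alertmanager path.
```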