[Design] Dynamic resource usage for GMP operator #793
Comments
Previous issue with rules-evaluator: I see that the rules-evaluator VPA hasn't moved either. I can live without VPAs, but having the webhook fail closed for gmp-operator was excruciating.
Thanks for reporting, @clearclaw, and I apologize for the frustration. With the operator being such a mission-critical binary in managed collection, we can also explore failing open for our webhooks. Would that have helped in this situation more than a VPA specifically?
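For context, "failing open" comes down to the failurePolicy field on the operator's admission webhook configuration. A minimal sketch is below; the object name, webhook name, service path, and rules are placeholders rather than the operator's actual generated configuration, and manual edits to an operator-managed object may be reconciled away.

```yaml
# Sketch only: illustrates failurePolicy: Ignore ("fail open").
# Names, paths, and rules are placeholders, not the operator's real config.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: gmp-operator-example                 # hypothetical name
webhooks:
- name: validate.podmonitorings.example.com  # hypothetical name
  admissionReviewVersions: ["v1"]
  sideEffects: None
  # Ignore = fail open: if the webhook endpoint is unreachable, the API server
  # admits the request instead of rejecting it (Fail = fail closed).
  failurePolicy: Ignore
  clientConfig:
    service:
      name: gmp-operator                 # assumed webhook service name
      namespace: gmp-system
      path: /validate/podmonitorings     # hypothetical path
      port: 443
  rules:
  - apiGroups: ["monitoring.googleapis.com"]
    apiVersions: ["v1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["podmonitorings"]
```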
Yes, in that deploys would have worked despite the failure. Being unable to, e.g., ship a critical security fix due to GMP indigestion is not right.

Additionally, the lack of default observability/alerting around GMP remains a concern (gmp-operator or rules-evaluator previously here, but there are more components). If the observability stack fails, all the alarms should be going off. It would be really nice to see/have default Rules objects for all of GMP. (Why are there no PodMonitoring objects for the GMP processes?)
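For illustration, a PodMonitoring for the operator itself could look roughly like the sketch below. The pod labels and the metrics port name are assumptions to verify against the actual gmp-operator Deployment in gmp-system; they are not confirmed here.

```yaml
# Sketch only: scrape the operator's own metrics with a PodMonitoring.
# The label selector and port are assumptions; verify against the deployed
# gmp-operator pods before using.
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: gmp-operator
  namespace: gmp-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: gmp-operator   # assumed pod label
  endpoints:
  - port: metrics                            # assumed metrics port name
    interval: 30s
```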
+1, some way of handling this would be important; let's prioritize it. VPA/limits is one thing, but perhaps we should additionally prioritize a code-optimization phase, to check for low-hanging fruit and start queueing things up (trading latency for memory).

For managed collection we don't do this ourselves, because we have our own alerting pipeline (though it is oriented toward fleet-wide and unknown situations). On top of that, we historically didn't deploy those self-observability resources by default, because everyone on GKE would need to pay for the extra metrics. We ship some example configuration (https://github.com/GoogleCloudPlatform/prometheus-engine/blob/main/examples/self-pod-monitoring.yaml) anyone can deploy on GKE Standard, though it is missing the operator (#1123). To sum up, we have a bit to improve here; thanks for reporting and for the ideas. Perhaps it's time to add that self-monitoring feature/option to OperatorConfig, but likely opt-in.
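On the VPA idea, a minimal sketch of what that could look like for the operator is below, assuming the Deployment is named gmp-operator in the gmp-system namespace and that the VPA CRDs/controller are available in the cluster (on GKE, vertical Pod autoscaling must be enabled). The min/max bounds are illustrative only.

```yaml
# Sketch only: let a VPA adjust the operator's memory requests over time.
# Assumes a gmp-operator Deployment in gmp-system and an installed VPA controller.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: gmp-operator
  namespace: gmp-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gmp-operator
  updatePolicy:
    updateMode: "Auto"              # recreate pods with updated requests
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      controlledResources: ["memory"]
      minAllowed:
        memory: 32Mi                # illustrative bounds only
      maxAllowed:
        memory: 2Gi
```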
Making self-observation opt-in is sensible. Something as simple as a doc page or reference pointing at reference pod monitors and possible alert rules under https://g.co/kgs/dDdszEW would help a lot. In short, make it easy for people to do good things.

-- JCL
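As a starting point for such reference alert rules, a hedged sketch of a Rules object is below. The job label value is an assumption that depends on how the self-monitoring PodMonitoring above is named; the threshold and duration are placeholders.

```yaml
# Sketch only: alert when no operator metrics have been scraped recently.
# The job label value is an assumption (GMP derives it from the PodMonitoring name).
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: gmp-self-alerts
  namespace: gmp-system
spec:
  groups:
  - name: gmp-self-observability
    interval: 30s
    rules:
    - alert: GMPOperatorAbsent
      expr: absent(up{job="gmp-operator"})
      for: 5m
      labels:
        severity: critical
      annotations:
        description: No gmp-operator metrics scraped for 5 minutes.
```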
The gmp-operator exposes Prometheus metrics that can be helpful to debug issues with managed collection. This is also the case with the managed alertmanager. We have examples of how to scrape metrics from other components, but not the operator or the alertmanager. So here we provide examples to supplement the self-monitoring exporter documented in https://cloud.google.com/stackdriver/docs/managed-prometheus/exporters/prometheus?hl=en. Partly addresses #793. Signed-off-by: Danny Clark <[email protected]>
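Along the same lines, a sketch for scraping the managed alertmanager might look like the following. The label selector and port are assumptions to verify against the alertmanager StatefulSet in gmp-system.

```yaml
# Sketch only: scrape the managed alertmanager's metrics.
# Label selector and port are assumptions; check the actual StatefulSet.
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: alertmanager
  namespace: gmp-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: alertmanager   # assumed pod label
  endpoints:
  - port: 9093                               # assumed web/metrics port
    interval: 30s
```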
Can we find ways to avoid OOM crashes in the gmp-operator? Maybe using a VPA?
Acceptance criteria: