OCPBUGS-62703: Relax duplicate events detection for Prometheus #30372
base: main
Conversation
Overrides the duplicate-event limit for Prometheus readiness probe errors during upgrades. Prometheus needs some time to wind down when it is terminated (see [1]), and during that window the kubelet keeps emitting readiness probe error events, which the duplicated-events check flags as pathological. This change tolerates those events up to a limit of 100 repetitions.

[1]: https://github.com/prometheus-operator/prometheus-operator/blob/d0ae00fdedc656a5a1a290d9839b84d860f15428/pkg/prometheus/common.go#L56-L59

Signed-off-by: Pranshu Srivastava <[email protected]>
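For context, here is a minimal, self-contained sketch of what a duplicate-event matcher with a repetition budget could look like. The type and field names (`pathologicalEventMatcher`, `repeatThreshold`, `locatorKey`) are illustrative assumptions, not the exact API of origin's pathological-events test library; the namespace, StatefulSet name, and the 100-repetition limit come from this PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// locatorKey identifies one dimension of an event locator (namespace, statefulset, ...).
type locatorKey string

const (
	locatorNamespaceKey   locatorKey = "namespace"
	locatorStatefulSetKey locatorKey = "statefulset"
)

// pathologicalEventMatcher is an assumed shape for a matcher that tolerates a
// known-noisy event up to a repetition budget instead of failing the test run.
type pathologicalEventMatcher struct {
	name               string
	locatorKeyRegexes  map[locatorKey]*regexp.Regexp
	messageReasonRegex *regexp.Regexp
	// repeatThreshold is how many duplicate events are tolerated before the
	// interval is flagged; the PR description mentions a limit of 100.
	repeatThreshold int
}

func main() {
	m := pathologicalEventMatcher{
		name: "PrometheusReadinessProbeErrorsDuringUpgrades",
		locatorKeyRegexes: map[locatorKey]*regexp.Regexp{
			locatorNamespaceKey:   regexp.MustCompile(`^openshift-monitoring$`),
			locatorStatefulSetKey: regexp.MustCompile(`^prometheus-k8s$`),
		},
		messageReasonRegex: regexp.MustCompile(`^Unhealthy$`),
		repeatThreshold:    100,
	}
	fmt.Printf("matcher %q tolerates up to %d duplicate events\n", m.name, m.repeatThreshold)
}
```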
@rexagod: This pull request references Jira Issue OCPBUGS-62703, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: rexagod The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
name: "PrometheusReadinessProbeErrorsDuringUpgrades", | ||
locatorKeyRegexes: map[monitorapi.LocatorKey]*regexp.Regexp{ | ||
monitorapi.LocatorNamespaceKey: regexp.MustCompile(`^` + statefulSetNamespace + `$`), | ||
monitorapi.LocatorStatefulSetKey: regexp.MustCompile(`^` + statefulSetName + `$`), |
I'm using a `StatefulSet` locator key here since that's how we deploy Prometheus. However, the errors are reported pod-wise (see below), so I'm wondering if I should change this, as well as the `testIntervals` conditional below, to work with `Pod` keys instead of `StatefulSet` ones?
event happened 25 times, something is wrong: namespace/openshift-monitoring node/worker-0 pod/prometheus-k8s-0 hmsg/357171899f - reason/Unhealthy Readiness probe errored: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1 (12:24:14Z) result=reject }
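If the matcher were keyed on pods instead, the name regex would need to cover the ordinal-suffixed replicas (prometheus-k8s-0, prometheus-k8s-1, ...) reported in intervals like the one above. A minimal sketch of such a pod-name match, assuming the same StatefulSet name; this is illustrative only and does not use the monitorapi locator API:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Pods of the "prometheus-k8s" StatefulSet carry an ordinal suffix,
	// e.g. "prometheus-k8s-0" in the rejected interval above.
	statefulSetName := "prometheus-k8s"
	podNameRegex := regexp.MustCompile(`^` + regexp.QuoteMeta(statefulSetName) + `-\d+$`)

	for _, pod := range []string{"prometheus-k8s-0", "prometheus-k8s-1", "alertmanager-main-0"} {
		fmt.Printf("%-22s matches: %v\n", pod, podNameRegex.MatchString(pod))
	}
}
```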
I'll trigger a small number of upgrade jobs to see if this works as is.
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance 3
@rexagod: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7fbee5c0-a7bd-11f0-983c-93024592481b-0
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance
@rexagod: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e0d1c020-a809-11f0-8a17-a2fb9ee66422-0
@rexagod: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.