OCPBUGS-62703: Relax duplicate events detection for Prometheus #30372
base: main
Conversation
Overrides the duplicate-event limit for Prometheus readiness probe errors during upgrades. Prometheus needs some time to wind down when it is terminated (see [1]), and during that window the kubelet keeps emitting readiness probe error events, which the duplicated-events check flags as pathological. This change tolerates those events up to a limit of 100 repetitions.

[1]: https://github.com/prometheus-operator/prometheus-operator/blob/d0ae00fdedc656a5a1a290d9839b84d860f15428/pkg/prometheus/common.go#L56-L59

Signed-off-by: Pranshu Srivastava <[email protected]>
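For context, here is a minimal, self-contained sketch of what a duplicate-event matcher with a repetition budget could look like. The type and field names (`pathologicalEventMatcher`, `repeatThreshold`, `locatorKey`) are illustrative assumptions, not the exact API of origin's pathological-events test library; the namespace, StatefulSet name, and the 100-repetition limit come from this PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// locatorKey identifies one dimension of an event locator (namespace, statefulset, ...).
type locatorKey string

const (
	locatorNamespaceKey   locatorKey = "namespace"
	locatorStatefulSetKey locatorKey = "statefulset"
)

// pathologicalEventMatcher is an assumed shape for a matcher that tolerates a
// known-noisy event up to a repetition budget instead of failing the test run.
type pathologicalEventMatcher struct {
	name               string
	locatorKeyRegexes  map[locatorKey]*regexp.Regexp
	messageReasonRegex *regexp.Regexp
	// repeatThreshold is how many duplicate events are tolerated before the
	// interval is flagged; the PR description mentions a limit of 100.
	repeatThreshold int
}

func main() {
	m := pathologicalEventMatcher{
		name: "PrometheusReadinessProbeErrorsDuringUpgrades",
		locatorKeyRegexes: map[locatorKey]*regexp.Regexp{
			locatorNamespaceKey:   regexp.MustCompile(`^openshift-monitoring$`),
			locatorStatefulSetKey: regexp.MustCompile(`^prometheus-k8s$`),
		},
		messageReasonRegex: regexp.MustCompile(`^Unhealthy$`),
		repeatThreshold:    100,
	}
	fmt.Printf("matcher %q tolerates up to %d duplicate events\n", m.name, m.repeatThreshold)
}
```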
@rexagod: This pull request references Jira Issue OCPBUGS-62703, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: rexagod The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
name: "PrometheusReadinessProbeErrorsDuringUpgrades", | ||
locatorKeyRegexes: map[monitorapi.LocatorKey]*regexp.Regexp{ | ||
monitorapi.LocatorNamespaceKey: regexp.MustCompile(`^` + statefulSetNamespace + `$`), | ||
monitorapi.LocatorStatefulSetKey: regexp.MustCompile(`^` + statefulSetName + `$`), |
I'm using a `StatefulSet` locator key here since that's how we deploy Prometheus. However, the errors are reported pod-wise (see below), so I'm wondering if I should change this, as well as the `testIntervals` conditional below, to work with `Pod` keys instead of `StatefulSet` ones?
event happened 25 times, something is wrong: namespace/openshift-monitoring node/worker-0 pod/prometheus-k8s-0 hmsg/357171899f - reason/Unhealthy Readiness probe errored: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1 (12:24:14Z) result=reject }
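If the matcher were keyed on pods instead, the name regex would need to cover the ordinal-suffixed replicas (prometheus-k8s-0, prometheus-k8s-1, ...) reported in intervals like the one above. A minimal sketch of such a pod-name match, assuming the same StatefulSet name; this is illustrative only and does not use the monitorapi locator API:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Pods of the "prometheus-k8s" StatefulSet carry an ordinal suffix,
	// e.g. "prometheus-k8s-0" in the rejected interval above.
	statefulSetName := "prometheus-k8s"
	podNameRegex := regexp.MustCompile(`^` + regexp.QuoteMeta(statefulSetName) + `-\d+$`)

	for _, pod := range []string{"prometheus-k8s-0", "prometheus-k8s-1", "alertmanager-main-0"} {
		fmt.Printf("%-22s matches: %v\n", pod, podNameRegex.MatchString(pod))
	}
}
```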
I'll trigger a small number of upgrade jobs to see if this works as is.
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance 3
@rexagod: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7fbee5c0-a7bd-11f0-983c-93024592481b-0
/payload-job periodic-ci-openshift-release-master-nightly-4.21-e2e-agent-ha-dualstack-conformance
@rexagod: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e0d1c020-a809-11f0-8a17-a2fb9ee66422-0
@rexagod: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.