NO-JIRA: add recommended etcd threshold for alerts #1495
base: main
Conversation
@tjungblu: This pull request explicitly references no jira issue. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough: Adjusts etcd alert thresholds and definitions, adds a new etcdHighFsyncDurations alert, removes older fsync alerts from the PrometheusRule, adds the new alert to the jsonnet exclusion list, and updates a dependency SHA in jsonnetfile.lock.json.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes. Pre-merge checks and finishing touches: ✅ Passed checks (3 passed).
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: tjungblu. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@tjungblu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Force-pushed from f2f7037 to 41821e3 (Compare)
/payload blocking
/payload 4.21 nightly blocking
@tjungblu: trigger 13 job(s) of type blocking for the nightly release of OCP 4.21
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/da014db0-af48-11f0-92d5-626d0ee0248b-0
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (4)
- jsonnet/custom.libsonnet (2 hunks)
- jsonnet/jsonnetfile.lock.json (1 hunks)
- jsonnet/main.jsonnet (1 hunks)
- manifests/0000_90_etcd-operator_03_prometheusrule.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- jsonnet/jsonnetfile.lock.json
- jsonnet/custom.libsonnet
🔇 Additional comments (2)
jsonnet/main.jsonnet (1)
9-9: LGTM! Alert exclusion added in correct alphabetical order. The addition of etcdHighFsyncDurations to the exclusion list correctly maintains alphabetical ordering and aligns with the new alert definition in the PrometheusRule.
manifests/0000_90_etcd-operator_03_prometheusrule.yaml (1)
109-109: Verify the 25ms commit threshold is intentional and appropriate for all platforms.
The web search confirms your concern: industry-standard warning thresholds are 0.25s (250ms) for etcd commit p99 latency. The operational guidance from etcd documentation recommends aiming for a backend/apply p99 of ~25ms, which is an operational target, not an alert threshold.
Your PR sets the alert at 0.025s (25ms), aligning with the operational target rather than industry-standard alert thresholds. This 10x reduction below common practice could generate frequent alerts on healthy systems; for comparison, a sketch of the standard-threshold rule follows the checklist below.
Please confirm:
- This threshold was intentionally chosen (not an accidental unit conversion)
- It has been validated to avoid excessive false-positives across AWS, Azure, GCP, bare metal, and other supported platforms
- Whether this aligns with OpenShift product requirements or customer SLOs
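For comparison, here is a minimal sketch of what a backend-commit rule would look like at the commonly cited 0.25s warning threshold. This is not taken from the PR: the alert name, label selector, and `for` window are assumptions modeled on the fsync rule shown further down.

```yaml
# Hypothetical comparison rule, not the PR's change: uses the commonly cited
# 0.25s p99 backend-commit warning threshold instead of the 0.025s target.
- alert: etcdHighCommitDurations
  annotations:
    description: 'etcd cluster "{{ $labels.job }}": 99th percentile commit durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
    summary: etcd cluster 99th percentile commit durations are too high.
  expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.25
  for: 10m
  labels:
    severity: warning
```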
- alert: etcdHighFsyncDurations
  annotations:
    description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md
    summary: etcd cluster 99th percentile fsync durations are too high.
  expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.010
  for: 10m
  labels:
    severity: critical
🧩 Analysis chain
Critical severity with 10ms fsync threshold may cause excessive alerts.
The new etcdHighFsyncDurations alert uses a 10ms threshold with critical severity. This is extremely aggressive compared to:
- The removed alerts which used 0.5s and 1s thresholds
- Typical etcd recommendations of 100ms-1s for fsync warnings
- Storage performance characteristics on cloud providers and network-attached storage
While 10ms is appropriate for local NVMe SSDs, many OpenShift deployments use cloud block storage or other configurations where 10ms fsync at 99th percentile is unrealistic. This could result in:
- Frequent critical pages causing alert fatigue
- Operational burden investigating false positives
- Difficulty distinguishing real storage issues from infrastructure limitations
Please verify:
- This threshold has been validated not to cause excessive alerts across all supported platforms (a query for spot-checking current p99 fsync latency is sketched after this list)
- Whether this should remain at critical severity or be downgraded to warning
- Consider the etcdGRPCRequestsSlow alert (lines 90-104) which uses platform-specific thresholds (1.5s for Azure, 1s for others) - should fsync also account for platform differences?
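One way to check the headroom before merging is to look at the current p99 WAL fsync latency per instance on representative clusters. The snippet below is a hypothetical recording rule for that purpose; the rule name is made up and it is not part of the PR.

```yaml
# Hypothetical recording rule (name made up) for spot-checking current p99 WAL
# fsync latency per instance before settling on an alert threshold.
- record: instance:etcd_disk_wal_fsync_duration_seconds:p99_5m
  expr: histogram_quantile(0.99, sum by (instance, le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])))
```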
Check current etcd fsync recommendations:
🌐 Web query:
What are the recommended fsync duration thresholds for etcd alerts?
💡 Result:
Recommended thresholds (based on common etcd / Prometheus alerting rules):
- Healthy target: 99th‑percentile etcd WAL fsync duration < ~0.01 s (10 ms). [1][2]
- Warning alert: 99th‑percentile fsync duration ≥ 0.5 s (500 ms) for a sustained window (typ. 2–10 min). [1][3][4]
- Critical/stricter alert: some rules use ≥ 1 s (1000 ms) for more severe alerting. [5][4]
Typical PromQL used:
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5 (warning). [1][3]
Sources: [1] Prometheus Operator etcd runbook; [2] KubeSphere/OpenShift guidance; [3] community alert rule collections (awesome-prometheus-alerts); [4] common example rules (etcd fsync 0.5s / commit 0.25s).
Severity miscategorized: 10ms fsync alert should be warning, not critical.
Based on etcd alerting standards, a 10ms (0.01s) threshold represents the healthy performance target for 99th percentile fsync—not a degradation threshold. The actual problem is the severity level:
Recommended thresholds:
- Healthy target: < 10ms
- Warning: ≥ 0.5s
- Critical: ≥ 1s
The current alert fires as critical whenever fsync slightly exceeds the healthy baseline, which will cause excessive paging and alert fatigue.
Recommended fix: Either downgrade this alert to warning severity, or restructure with tiered alerts (warning at 0.5s, critical at 1s) to match standard etcd alerting practices. Also verify whether this should account for platform differences, similar to the etcdGRPCRequestsSlow alert (lines 90-104).
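One possible shape for the tiered variant, sketched here rather than taken from the PR; the second alert name is a placeholder, and annotations/runbook URLs are omitted for brevity.

```yaml
# Sketch of tiered fsync alerts (warning at 0.5s, critical at 1s), reusing the
# expression pattern from the rule above. The "Critical" alert name is a placeholder.
- alert: etcdHighFsyncDurations
  expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.5
  for: 10m
  labels:
    severity: warning
- alert: etcdHighFsyncDurationsCritical
  expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 1
  for: 10m
  labels:
    severity: critical
```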
🤖 Prompt for AI Agents
In manifests/0000_90_etcd-operator_03_prometheusrule.yaml around lines 113-121,
the etcdHighFsyncDurations alert uses a 0.01s threshold but is labeled severity:
critical; change this to severity: warning OR replace with two tiered alerts
(one alert for warning at expr >= 0.5s and another for critical at expr >= 1s
using the same histogram_quantile(rate(...)[5m]) pattern), and ensure
labels/annotations match the existing etcdGRPCRequestsSlow structure (lines
90-104) to keep platform differences consistent.
Setting the thresholds to the KCS recommended values.