
Conversation

@tjungblu
Contributor

@tjungblu tjungblu commented Oct 9, 2025

Setting the thresholds to the KCS recommended values.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 9, 2025
@openshift-ci-robot

@tjungblu: This pull request explicitly references no jira issue.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from dusk125 and ironcladlou October 9, 2025 12:39
@coderabbitai

coderabbitai bot commented Oct 9, 2025

Walkthrough

Adjusts etcd alert thresholds and definitions, adds a new etcdHighFsyncDurations alert, removes older fsync alerts from the PrometheusRule, adds the new alert to the jsonnet exclusion list, and updates a dependency SHA in jsonnetfile.lock.json.

Changes

Cohort / File(s) and summary of changes:

  • Prometheus alerts: etcd thresholds and rules (jsonnet/custom.libsonnet, manifests/0000_90_etcd-operator_03_prometheusrule.yaml): Lowered the etcdHighCommitDurations threshold from > 0.5 to > 0.025 and added a new etcdHighFsyncDurations alert (p99 of etcd_disk_wal_fsync_duration_seconds_bucket > 0.010, for: 10m, severity: critical) with annotations and a runbook URL; removed the older fsync alert variants from the PrometheusRule. A sketch of the lowered commit-duration rule follows this list.
  • Jsonnet wiring: alert exclusions (jsonnet/main.jsonnet): Added etcdHighFsyncDurations to the excludedAlerts list; no other logic changes.
  • Dependencies lockfile update (jsonnet/jsonnetfile.lock.json): Updated a dependency entry SHA (the first dependency's version changed); lockfile sums and other fields unchanged.
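For context, here is a minimal sketch of what the lowered commit-duration rule could look like in the PrometheusRule. The exact rule in the manifest is not quoted in this thread, so the metric name, label matcher, "for" window, and severity below are assumptions modeled on the fsync rule shown later in the review:

  # Sketch only: assumes the upstream etcd_disk_backend_commit_duration_seconds_bucket
  # metric and the same job matcher as the fsync rule; the shipped rule may differ.
  - alert: etcdHighCommitDurations
    annotations:
      summary: etcd cluster 99th percentile commit durations are too high.
    expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.025
    for: 10m              # assumed; not confirmed by the diff quoted here
    labels:
      severity: warning   # assumed; not confirmed by the diff quoted here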

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title Check: ✅ Passed. The pull request title "add recommended etcd threshold for alerts" directly and accurately reflects the main objective of the changeset: updating alert thresholds to recommended values (lowering etcdHighCommitDurations from > 0.5 to > 0.025, adding a new etcdHighFsyncDurations alert at > 0.010, and related configuration updates). The title is concise, clear, and specific enough for a teammate to understand the primary change without being overly detailed.
  • Description Check: ✅ Passed. The pull request description "Setting the thresholds to the KCS recommended values" is directly related to the changeset. It conveys the purpose of the changes, updating alert thresholds to the values recommended in the Red Hat KCS (Knowledge-Centered Service) guidance, which aligns with the actual modifications made to the etcd alert configuration across multiple files.
  • Docstring Coverage: ✅ Passed. No functions found in the changes; docstring coverage check skipped.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Contributor

openshift-ci bot commented Oct 9, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 9, 2025
@openshift-ci
Contributor

openshift-ci bot commented Oct 9, 2025

@tjungblu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/e2e-aws-ovn-single-node (commit f2f7037, required): /test e2e-aws-ovn-single-node
  • ci/prow/unit (commit f2f7037, required): /test unit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 10, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 22, 2025
@tjungblu
Contributor Author

/payload blocking

@openshift-ci
Contributor

openshift-ci bot commented Oct 22, 2025

@tjungblu: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@tjungblu
Contributor Author

/payload 4.21 nightly blocking

@openshift-ci
Contributor

openshift-ci bot commented Oct 22, 2025

@tjungblu: trigger 13 job(s) of type blocking for the nightly release of OCP 4.21

  • periodic-ci-openshift-release-master-ci-4.21-e2e-aws-upgrade-ovn-single-node
  • periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-upgrade-fips
  • periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-master-ci-4.21-upgrade-from-stable-4.20-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.21-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-1of2
  • periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial-2of2
  • periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview
  • periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview-serial-1of3
  • periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview-serial-2of3
  • periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-techpreview-serial-3of3
  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ipi-ovn-bm
  • periodic-ci-openshift-release-master-nightly-4.21-e2e-metal-ipi-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/da014db0-af48-11f0-92d5-626d0ee0248b-0


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between f2f7037 and 41821e3.

📒 Files selected for processing (4)
  • jsonnet/custom.libsonnet (2 hunks)
  • jsonnet/jsonnetfile.lock.json (1 hunks)
  • jsonnet/main.jsonnet (1 hunks)
  • manifests/0000_90_etcd-operator_03_prometheusrule.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • jsonnet/jsonnetfile.lock.json
  • jsonnet/custom.libsonnet
🔇 Additional comments (2)
jsonnet/main.jsonnet (1)

9-9: LGTM! Alert exclusion added in correct alphabetical order.

The addition of etcdHighFsyncDurations to the exclusion list correctly maintains alphabetical ordering and aligns with the new alert definition in the PrometheusRule.

manifests/0000_90_etcd-operator_03_prometheusrule.yaml (1)

109-109: Verify the 25ms commit threshold is intentional and appropriate for all platforms.

The web search confirms your concern: industry standard warning thresholds are 0.25s (250ms) for etcd commit p99 latency. The operational guidance from etcd documentation recommends aiming for backend/apply p99 ~25ms, which is an operational target, not an alert threshold.

Your PR sets the alert at 0.025s (25ms)—aligning with the operational target rather than industry-standard alert thresholds. This 10x reduction below common practice could generate frequent alerts on healthy systems.

Please confirm:

  1. This threshold was intentionally chosen (not an accidental unit conversion)
  2. It has been validated to avoid excessive false-positives across AWS, Azure, GCP, bare metal, and other supported platforms
  3. Whether this aligns with OpenShift product requirements or customer SLOs
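One way to gauge this before merging is to check, over a representative window on existing clusters, how often p99 commit latency already exceeds the proposed 25ms threshold. A hypothetical PromQL spot check, assuming the upstream etcd_disk_backend_commit_duration_seconds_bucket metric:

  # Fraction of 5-minute windows in the last 7 days in which p99 backend commit
  # latency exceeded 25ms, per etcd instance. Values near 0 suggest the lowered
  # threshold would stay quiet; values near 1 suggest it would fire almost constantly.
  avg_over_time(
    (
      histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > bool 0.025
    )[7d:5m]
  )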

Comment on lines +113 to +121
  - alert: etcdHighFsyncDurations
    annotations:
      description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
      runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md
      summary: etcd cluster 99th percentile fsync durations are too high.
    expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.010
    for: 10m
    labels:
      severity: critical

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Critical severity with 10ms fsync threshold may cause excessive alerts.

The new etcdHighFsyncDurations alert uses a 10ms threshold with critical severity. This is extremely aggressive compared to:

  • The removed alerts which used 0.5s and 1s thresholds
  • Typical etcd recommendations of 100ms-1s for fsync warnings
  • Storage performance characteristics on cloud providers and network-attached storage

While 10ms is appropriate for local NVMe SSDs, many OpenShift deployments use cloud block storage or other configurations where 10ms fsync at 99th percentile is unrealistic. This could result in:

  • Frequent critical pages causing alert fatigue
  • Operational burden investigating false positives
  • Difficulty distinguishing real storage issues from infrastructure limitations

Please verify:

  1. This threshold has been validated not to cause excessive alerts across all supported platforms
  2. Whether this should remain at critical severity or be downgraded to warning
  3. Consider the etcdGRPCRequestsSlow alert (lines 90-104) which uses platform-specific thresholds (1.5s for Azure, 1s for others) - should fsync also account for platform differences?

Check current etcd fsync recommendations:


🌐 Web query:

What are the recommended fsync duration thresholds for etcd alerts?

💡 Result:

Recommended thresholds (based on common etcd / Prometheus alerting rules):

  • Healthy target: 99th‑percentile etcd WAL fsync duration < ~0.01 s (10 ms). [1][2]
  • Warning alert: 99th‑percentile fsync duration ≥ 0.5 s (500 ms) for a sustained window (typ. 2–10 min). [1][3][4]
  • Critical/stricter alert: some rules use ≥ 1 s (1000 ms) for more severe alerting. [5][4]

Typical PromQL used:
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5 (warning). [1][3]

Sources: [1] Prometheus Operator etcd runbook; [2] KubeSphere/OpenShift guidance; [3] community alert rule collections (awesome-prometheus-alerts); [4] common example rules (etcd fsync 0.5s / commit 0.25s).


Severity miscategorized: 10ms fsync alert should be warning, not critical.

Based on etcd alerting standards, a 10ms (0.01s) threshold represents the healthy performance target for 99th percentile fsync—not a degradation threshold. The actual problem is the severity level:

Recommended thresholds:

  • Healthy target: < 10ms
  • Warning: ≥ 0.5s
  • Critical: ≥ 1s

The current alert fires as critical whenever fsync slightly exceeds the healthy baseline, which will cause excessive paging and alert fatigue.

Recommended fix: Either downgrade this alert to warning severity, or restructure with tiered alerts (warning at 0.5s, critical at 1s) to match standard etcd alerting practices. Also verify whether this should account for platform differences, similar to the etcdGRPCRequestsSlow alert (lines 90-104).

🤖 Prompt for AI Agents
In manifests/0000_90_etcd-operator_03_prometheusrule.yaml around lines 113-121,
the etcdHighFsyncDurations alert uses a 0.01s threshold but is labeled severity:
critical; change this to severity: warning OR replace with two tiered alerts
(one alert for warning at expr >= 0.5s and another for critical at expr >= 1s
using the same histogram_quantile(rate(...)[5m]) pattern), and ensure
labels/annotations match the existing etcdGRPCRequestsSlow structure (lines
90-104) to keep platform differences consistent.
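For illustration, a minimal sketch of the tiered restructuring suggested above (warning at 0.5s, critical at 1s), reusing the annotation pattern from the rule under review. This is not what the PR ships, and the name of the stricter alert is hypothetical:

  - alert: etcdHighFsyncDurations
    annotations:
      description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
      summary: etcd cluster 99th percentile fsync durations are too high.
    expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.5
    for: 10m
    labels:
      severity: warning
  - alert: etcdHighFsyncDurationsCritical   # hypothetical name for the stricter tier
    annotations:
      description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
      summary: etcd cluster 99th percentile fsync durations are critically high.
    expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 1
    for: 10m
    labels:
      severity: critical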
