NO-JIRA: add recommended etcd threshold for alerts #1495
base: main
Conversation
@tjungblu: This pull request explicitly references no jira issue. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough: Adjusts etcd alert thresholds and definitions, adds a new etcdHighFsyncDurations alert, removes older fsync alerts from the PrometheusRule, adds the new alert to the jsonnet exclusion list, and updates a dependency SHA in jsonnetfile.lock.json.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes. Pre-merge checks and finishing touches: ✅ Passed checks (3 passed).
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: tjungblu. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@tjungblu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Force-pushed from f2f7037 to 41821e3 (Compare)
/payload blocking
/payload 4.21 nightly blocking
@tjungblu: trigger 13 job(s) of type blocking for the nightly release of OCP 4.21
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/da014db0-af48-11f0-92d5-626d0ee0248b-0
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (4)
- jsonnet/custom.libsonnet (2 hunks)
- jsonnet/jsonnetfile.lock.json (1 hunks)
- jsonnet/main.jsonnet (1 hunks)
- manifests/0000_90_etcd-operator_03_prometheusrule.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- jsonnet/jsonnetfile.lock.json
- jsonnet/custom.libsonnet
🔇 Additional comments (2)
jsonnet/main.jsonnet (1)
9-9: LGTM! Alert exclusion added in correct alphabetical order. The addition of etcdHighFsyncDurations to the exclusion list correctly maintains alphabetical ordering and aligns with the new alert definition in the PrometheusRule.
manifests/0000_90_etcd-operator_03_prometheusrule.yaml (1)
109-109: Verify the 25ms commit threshold is intentional and appropriate for all platforms.
The web search confirms your concern: industry-standard warning thresholds are 0.25s (250ms) for etcd commit p99 latency. The operational guidance from etcd documentation recommends aiming for a backend/apply p99 of ~25ms, which is an operational target, not an alert threshold.
Your PR sets the alert at 0.025s (25ms), aligning with the operational target rather than industry-standard alert thresholds. This 10x reduction below common practice could generate frequent alerts on healthy systems; for comparison, a sketch of the standard-threshold rule follows the checklist below.
Please confirm:
- This threshold was intentionally chosen (not an accidental unit conversion)
- It has been validated to avoid excessive false-positives across AWS, Azure, GCP, bare metal, and other supported platforms
- Whether this aligns with OpenShift product requirements or customer SLOs
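For comparison, here is a minimal sketch of what a backend-commit rule would look like at the commonly cited 0.25s warning threshold. This is not taken from the PR: the alert name, label selector, and `for` window are assumptions modeled on the fsync rule shown further down.

```yaml
# Hypothetical comparison rule, not the PR's change: uses the commonly cited
# 0.25s p99 backend-commit warning threshold instead of the 0.025s target.
- alert: etcdHighCommitDurations
  annotations:
    description: 'etcd cluster "{{ $labels.job }}": 99th percentile commit durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
    summary: etcd cluster 99th percentile commit durations are too high.
  expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.25
  for: 10m
  labels:
    severity: warning
```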
- alert: etcdHighFsyncDurations
  annotations:
    description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md
    summary: etcd cluster 99th percentile fsync durations are too high.
  expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.010
  for: 10m
  labels:
    severity: critical
🧩 Analysis chain
Critical severity with 10ms fsync threshold may cause excessive alerts.
The new etcdHighFsyncDurations alert uses a 10ms threshold with critical severity. This is extremely aggressive compared to:
- The removed alerts which used 0.5s and 1s thresholds
- Typical etcd recommendations of 100ms-1s for fsync warnings
- Storage performance characteristics on cloud providers and network-attached storage
While 10ms is appropriate for local NVMe SSDs, many OpenShift deployments use cloud block storage or other configurations where 10ms fsync at 99th percentile is unrealistic. This could result in:
- Frequent critical pages causing alert fatigue
- Operational burden investigating false positives
- Difficulty distinguishing real storage issues from infrastructure limitations
Please verify:
- This threshold has been validated not to cause excessive alerts across all supported platforms (a query for spot-checking current p99 fsync latency is sketched after this list)
- Whether this should remain at critical severity or be downgraded to warning
- Consider the etcdGRPCRequestsSlow alert (lines 90-104) which uses platform-specific thresholds (1.5s for Azure, 1s for others) - should fsync also account for platform differences?
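One way to check the headroom before merging is to look at the current p99 WAL fsync latency per instance on representative clusters. The snippet below is a hypothetical recording rule for that purpose; the rule name is made up and it is not part of the PR.

```yaml
# Hypothetical recording rule (name made up) for spot-checking current p99 WAL
# fsync latency per instance before settling on an alert threshold.
- record: instance:etcd_disk_wal_fsync_duration_seconds:p99_5m
  expr: histogram_quantile(0.99, sum by (instance, le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])))
```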
Check current etcd fsync recommendations:
🌐 Web query:
What are the recommended fsync duration thresholds for etcd alerts?
💡 Result:
Recommended thresholds (based on common etcd / Prometheus alerting rules):
- Healthy target: 99th‑percentile etcd WAL fsync duration < ~0.01 s (10 ms). [1][2]
- Warning alert: 99th‑percentile fsync duration ≥ 0.5 s (500 ms) for a sustained window (typ. 2–10 min). [1][3][4]
- Critical/stricter alert: some rules use ≥ 1 s (1000 ms) for more severe alerting. [5][4]
Typical PromQL used:
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5 (warning). [1][3]
Sources: [1] Prometheus Operator etcd runbook; [2] KubeSphere/OpenShift guidance; [3] community alert rule collections (awesome-prometheus-alerts); [4] common example rules (etcd fsync 0.5s / commit 0.25s).
Severity miscategorized: 10ms fsync alert should be warning, not critical.
Based on etcd alerting standards, a 10ms (0.01s) threshold represents the healthy performance target for 99th percentile fsync—not a degradation threshold. The actual problem is the severity level:
Recommended thresholds:
- Healthy target: < 10ms
- Warning: ≥ 0.5s
- Critical: ≥ 1s
The current alert fires as critical whenever fsync slightly exceeds the healthy baseline, which will cause excessive paging and alert fatigue.
Recommended fix: Either downgrade this alert to warning severity, or restructure with tiered alerts (warning at 0.5s, critical at 1s) to match standard etcd alerting practices. Also verify whether this should account for platform differences, similar to the etcdGRPCRequestsSlow alert (lines 90-104).
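One possible shape for the tiered variant, sketched here rather than taken from the PR; the second alert name is a placeholder, and annotations/runbook URLs are omitted for brevity.

```yaml
# Sketch of tiered fsync alerts (warning at 0.5s, critical at 1s), reusing the
# expression pattern from the rule above. The "Critical" alert name is a placeholder.
- alert: etcdHighFsyncDurations
  expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.5
  for: 10m
  labels:
    severity: warning
- alert: etcdHighFsyncDurationsCritical
  expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 1
  for: 10m
  labels:
    severity: critical
```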
🤖 Prompt for AI Agents
In manifests/0000_90_etcd-operator_03_prometheusrule.yaml around lines 113-121,
the etcdHighFsyncDurations alert uses a 0.01s threshold but is labeled severity:
critical; change this to severity: warning OR replace with two tiered alerts
(one alert for warning at expr >= 0.5s and another for critical at expr >= 1s
using the same histogram_quantile(rate(...)[5m]) pattern), and ensure
labels/annotations match the existing etcdGRPCRequestsSlow structure (lines
90-104) to keep platform differences consistent.
Setting the thresholds to the KCS recommended values.