15 changes: 14 additions & 1 deletion jsonnet/custom.libsonnet
@@ -61,7 +61,7 @@
},
{
alert: 'etcdHighCommitDurations',
expr: 'histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.5',
expr: 'histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.025',
'for': '10m',
labels: {
severity: 'warning',
@@ -71,6 +71,19 @@
summary: 'etcd cluster 99th percentile commit durations are too high.',
},
},
{
alert: 'etcdHighFsyncDurations',
expr: 'histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.010',
'for': '10m',
labels: {
severity: 'critical',
},
annotations: {
description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.',
summary: 'etcd cluster 99th percentile fsync durations are too high.',
runbook_url: 'https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md'
},
},
{
alert: 'etcdHighNumberOfFailedGRPCRequests',
expr: |||
2 changes: 1 addition & 1 deletion jsonnet/jsonnetfile.lock.json
@@ -8,7 +8,7 @@
"subdir": "contrib/mixin"
}
},
"version": "e4d6a05f8f1ec972384e24a83c420f707a6644f2",
"version": "48f735d13e83dea5b1901ea627eddd75d05ee8d5",
"sum": "XmXkOCriQIZmXwlIIFhqlJMa0e6qGWdxZD+ZDYaN0Po="
},
{
2 changes: 1 addition & 1 deletion jsonnet/main.jsonnet
@@ -6,7 +6,7 @@ local promRules = if std.objectHasAll(etcdMixin, 'prometheusRules') then etcdMix

// Exclude rules that are either OpenShift specific or do not work for OpenShift.
// List should be ordered!
local excludedAlerts = ['etcdDatabaseQuotaLowSpace', 'etcdGRPCRequestsSlow', 'etcdHighCommitDurations', 'etcdHighNumberOfFailedGRPCRequests', 'etcdHighNumberOfLeaderChanges', 'etcdInsufficientMembers', 'etcdMembersDown'];
local excludedAlerts = ['etcdDatabaseQuotaLowSpace', 'etcdGRPCRequestsSlow', 'etcdHighCommitDurations', 'etcdHighFsyncDurations', 'etcdHighNumberOfFailedGRPCRequests', 'etcdHighNumberOfLeaderChanges', 'etcdInsufficientMembers', 'etcdMembersDown'];
local excludeRules = std.map(
function(group) group {
rules: std.filter(
33 changes: 10 additions & 23 deletions manifests/0000_90_etcd-operator_03_prometheusrule.yaml
@@ -39,28 +39,6 @@ spec:
for: 15m
labels:
severity: warning
- alert: etcdHighFsyncDurations
annotations:
description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md
summary: etcd cluster 99th percentile fsync durations are too high.
expr: |
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
> 0.5
for: 10m
labels:
severity: warning
- alert: etcdHighFsyncDurations
annotations:
description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md
summary: etcd cluster 99th percentile fsync durations are too high.
expr: |
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
> 1
for: 10m
labels:
severity: critical
- alert: etcdExcessiveDatabaseGrowth
annotations:
description: 'etcd cluster "{{ $labels.job }}": Predicting running out of disk space in the next four hours, based on write observations within the past four hours on etcd instance {{ $labels.instance }}, please check as it might be disruptive.'
@@ -128,10 +106,19 @@ spec:
annotations:
description: 'etcd cluster "{{ $labels.job }}": 99th percentile commit durations {{ $value }}s on etcd instance {{ $labels.instance }}.'
summary: etcd cluster 99th percentile commit durations are too high.
expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.5
expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.025
for: 10m
labels:
severity: warning
- alert: etcdHighFsyncDurations
annotations:
description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md
summary: etcd cluster 99th percentile fsync durations are too high.
expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.010
for: 10m
labels:
severity: critical
Comment on lines +113 to +121

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Critical severity with 10ms fsync threshold may cause excessive alerts.

The new etcdHighFsyncDurations alert uses a 10ms threshold with critical severity. This is extremely aggressive compared to:

  • The removed alerts, which used 0.5s and 1s thresholds
  • Typical etcd recommendations of 100ms-1s for fsync warnings
  • Storage performance characteristics on cloud providers and network-attached storage

While 10ms is appropriate for local NVMe SSDs, many OpenShift deployments use cloud block storage or other configurations where 10ms fsync at 99th percentile is unrealistic. This could result in:

  • Frequent critical pages causing alert fatigue
  • Operational burden investigating false positives
  • Difficulty distinguishing real storage issues from infrastructure limitations

Please verify:

  1. This threshold has been validated not to cause excessive alerts across all supported platforms
  2. Whether this should remain at critical severity or be downgraded to warning
  3. Consider the etcdGRPCRequestsSlow alert (lines 90-104) which uses platform-specific thresholds (1.5s for Azure, 1s for others) - should fsync also account for platform differences?

Check current etcd fsync recommendations:


🌐 Web query:

What are the recommended fsync duration thresholds for etcd alerts?

💡 Result:

Recommended thresholds (based on common etcd / Prometheus alerting rules):

  • Healthy target: 99th‑percentile etcd WAL fsync duration < ~0.01 s (10 ms). [1][2]
  • Warning alert: 99th‑percentile fsync duration ≥ 0.5 s (500 ms) for a sustained window (typ. 2–10 min). [1][3][4]
  • Critical/stricter alert: some rules use ≥ 1 s (1000 ms) for more severe alerting. [5][4]

Typical PromQL used:
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5 (warning). [1][3]

Sources: [1] Prometheus Operator etcd runbook; [2] KubeSphere/OpenShift guidance; [3] community alert rule collections (awesome-prometheus-alerts); [4] common example rules (etcd fsync 0.5s / commit 0.25s).


Severity miscategorized: 10ms fsync alert should be warning, not critical.

Based on etcd alerting standards, a 10ms (0.01s) threshold represents the healthy performance target for 99th percentile fsync—not a degradation threshold. The actual problem is the severity level:

Recommended thresholds:

  • Healthy target: < 10ms
  • Warning: ≥ 0.5s
  • Critical: ≥ 1s

The current alert fires as critical whenever fsync slightly exceeds the healthy baseline, which will cause excessive paging and alert fatigue.

Recommended fix: Either downgrade this alert to warning severity, or restructure with tiered alerts (warning at 0.5s, critical at 1s) to match standard etcd alerting practices. Also verify whether this should account for platform differences, similar to the etcdGRPCRequestsSlow alert (lines 90-104).
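
For concreteness, a tiered pair along the lines suggested here would look essentially like the two rules this PR removes; the 0.5s warning / 1s critical thresholds below are taken from those removed rules, and this is only an illustrative sketch, not a vetted replacement:

- alert: etcdHighFsyncDurations
  annotations:
    description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md
    summary: etcd cluster 99th percentile fsync durations are too high.
  # Warning tier: sustained p99 WAL fsync latency above 0.5s (threshold from the removed rule)
  expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.5
  for: 10m
  labels:
    severity: warning
- alert: etcdHighFsyncDurations
  annotations:
    description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations are {{ $value }}s on etcd instance {{ $labels.instance }}.'
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-etcd-operator/etcdHighFsyncDurations.md
    summary: etcd cluster 99th percentile fsync durations are too high.
  # Critical tier: sustained p99 WAL fsync latency above 1s (threshold from the removed rule)
  expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 1
  for: 10m
  labels:
    severity: critical

If the tiered approach is adopted, the alert definition in jsonnet/custom.libsonnet would presumably need the same change, since this PR keeps that file and the generated manifest in sync.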

🤖 Prompt for AI Agents
In manifests/0000_90_etcd-operator_03_prometheusrule.yaml around lines 113-121,
the etcdHighFsyncDurations alert uses a 0.01s threshold but is labeled severity:
critical; change this to severity: warning OR replace with two tiered alerts
(one alert for warning at expr >= 0.5s and another for critical at expr >= 1s
using the same histogram_quantile(rate(...)[5m]) pattern), and ensure
labels/annotations match the existing etcdGRPCRequestsSlow structure (lines
90-104) to keep platform differences consistent.

- alert: etcdHighNumberOfFailedGRPCRequests
annotations:
description: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'