mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager --id="anonymous"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager --id="eks-dev"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager --id="eks-exemplo"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager/api/v2/alerts --id="anonymous"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager/api/v2/alerts --id="eks-dev"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager/api/v2/alerts --id="eks-exemplo"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager/api/v2/rules --id="anonymous"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager/api/v2/rules --id="eks-dev"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager/api/v2/rules --id="eks-exemplo"
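For reference, mimirtool expects --address to be the Mimir base URL only; it appends the API paths itself, so pointing it at /alertmanager/api/v2/... paths will not work. A hedged sketch of the intended invocations, using the address and tenant IDs from this report (alertmanager-config.yaml is a hypothetical local file; the commands are echoed so the sketch stays runnable even where mimirtool is not installed):

```shell
# Values taken from this report; adjust for your environment.
MIMIR_ADDR="http://127.0.0.1:8080"   # base URL only: mimirtool appends the API path itself
TENANT="anonymous"                   # the tenant ID (X-Scope-OrgID) to operate on

# Fetch the current Alertmanager configuration for one tenant:
echo "mimirtool alertmanager get --address=${MIMIR_ADDR} --id=${TENANT}"

# Load a new configuration (alertmanager-config.yaml is a hypothetical local file):
echo "mimirtool alertmanager load alertmanager-config.yaml --address=${MIMIR_ADDR} --id=${TENANT}"

# Recording/alerting rules go through the separate 'rules' command family, not 'alertmanager':
echo "mimirtool rules list --address=${MIMIR_ADDR} --id=${TENANT}"
```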
How do I find out which tenants exist, and which tenant ID to use?
Expected behavior
Fetch the current Alertmanager configuration with mimirtool alertmanager get, and load a new configuration with mimirtool alertmanager load.
Environment
Infrastructure: Kubernetes (EKS)
Deployment tool: Helm
Additional Context
How can I load the Alertmanager settings?
How can I obtain the list of tenants?
The configuration in values.yaml is not imported.
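On the tenants question: Mimir has no API that lists tenants; a tenant exists implicitly as whatever X-Scope-OrgID value senders use (anonymous is the default when multi-tenancy is disabled). One way to enumerate tenants that have written data is to list the top-level prefixes of the blocks storage bucket, since each tenant's blocks live under a prefix named after its ID. A minimal sketch, assuming a plain key listing (the keys below are made up):

```shell
# Hypothetical object-store listing: each tenant's blocks sit under a
# top-level prefix named after its X-Scope-OrgID.
keys="anonymous/01H.../meta.json
eks-dev/01H.../index
eks-dev/01H.../chunks/000001
eks-exemplo/01H.../meta.json"

# The tenant IDs are the unique first path segments.
printf '%s\n' "$keys" | cut -d/ -f1 | sort -u
```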
prometheusRule:
annotations: {}
enabled: true
groups:
- name: mimir_dev_alerts
rules:
- alert: MimirIngesterUnhealthy
annotations:
message: Mimir cluster {{ $labels.cluster }}/{{ $labels.namespace }} has
{{ printf "%f" $value }} unhealthy ingester(s).
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterunhealthy
expr: |
min by (cluster, namespace) (cortex_ring_members{state="Unhealthy", name="ingester"}) > 0
for: 15m
labels:
severity: critical
- alert: MimirRequestErrors
annotations:
message: |
The route {{ $labels.route }} in {{ $labels.cluster }}/{{ $labels.namespace }} is experiencing {{ printf "%.2f" $value }}% errors.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrequesterrors
expr: |
# The following 5xx errors are considered non-errors:
# - 529: used by distributor rate limiting (using 529 instead of 429 to let the client retry)
# - 598: used by GEM gateway when the client is very slow to send the request and the gateway times out reading the request body
(
sum by (cluster, namespace, job, route) (rate(cortex_request_duration_seconds_count{status_code=~"5..",status_code!~"529|598",route!~"ready|debug_pprof"}[1m]))
/
sum by (cluster, namespace, job, route) (rate(cortex_request_duration_seconds_count{route!~"ready|debug_pprof"}[1m]))
) * 100 > 1
for: 15m
labels:
severity: critical
- alert: MimirRequestLatency
annotations:
message: |
{{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrequestlatency
expr: |
cluster_namespace_job_route:cortex_request_duration_seconds:99quantile{route!~"metrics|/frontend.Frontend/Process|ready|/schedulerpb.SchedulerForFrontend/FrontendLoop|/schedulerpb.SchedulerForQuerier/QuerierLoop|debug_pprof"}
>
2.5
for: 15m
labels:
severity: warning
- alert: MimirInconsistentRuntimeConfig
annotations:
message: |
An inconsistent runtime config file is used across cluster {{ $labels.cluster }}/{{ $labels.namespace }}.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirinconsistentruntimeconfig
expr: |
count(count by(cluster, namespace, job, sha256) (cortex_runtime_config_hash)) without(sha256) > 1
for: 1h
labels:
severity: critical
- alert: MimirBadRuntimeConfig
annotations:
message: |
{{ $labels.job }} failed to reload runtime config.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirbadruntimeconfig
expr: |
# The metric value is reset to 0 on error while reloading the config at runtime.
cortex_runtime_config_last_reload_successful == 0
for: 5m
labels:
severity: critical
- alert: MimirFrontendQueriesStuck
annotations:
message: |
There are {{ $value }} queued up queries in {{ $labels.cluster }}/{{ $labels.namespace }} {{ $labels.job }}.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirfrontendqueriesstuck
expr: |
sum by (cluster, namespace, job) (min_over_time(cortex_query_frontend_queue_length[1m])) > 0
for: 5m
labels:
severity: critical
- alert: MimirSchedulerQueriesStuck
annotations:
message: |
There are {{ $value }} queued up queries in {{ $labels.cluster }}/{{ $labels.namespace }} {{ $labels.job }}.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirschedulerqueriesstuck
expr: |
sum by (cluster, namespace, job) (min_over_time(cortex_query_scheduler_queue_length[1m])) > 0
for: 7m
labels:
severity: critical
- alert: MimirCacheRequestErrors
annotations:
message: |
The cache {{ $labels.name }} used by Mimir {{ $labels.cluster }}/{{ $labels.namespace }} is experiencing {{ printf "%.2f" $value }}% errors for {{ $labels.operation }} operation.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimircacherequesterrors
expr: |
(
sum by(cluster, namespace, name, operation) (
rate(thanos_memcached_operation_failures_total[1m])
or
rate(thanos_cache_operation_failures_total[1m])
)
/
sum by(cluster, namespace, name, operation) (
rate(thanos_memcached_operations_total[1m])
or
rate(thanos_cache_operations_total[1m])
)
) * 100 > 5
for: 5m
labels:
severity: warning
- alert: MimirIngesterRestarts
annotations:
message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace
}} has restarted {{ printf "%.2f" $value }} times in the last 30 mins.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterrestarts
expr: |
(
sum by(cluster, namespace, pod) (
increase(kube_pod_container_status_restarts_total{container=~"(ingester|mimir-write)"}[30m])
)
>= 2
)
and
(
count by(cluster, namespace, pod) (cortex_build_info) > 0
)
labels:
severity: warning
- alert: MimirKVStoreFailure
annotations:
message: |
Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} is failing to talk to the KV store {{ $labels.kv_name }}.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirkvstorefailure
expr: |
(
sum by(cluster, namespace, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~"2.+"}[1m]))
/
sum by(cluster, namespace, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count[1m]))
)
# We want to get alerted only in case there's a constant failure.
== 1
for: 5m
labels:
severity: critical
- alert: MimirMemoryMapAreasTooHigh
annotations:
message: '{{ $labels.job }}/{{ $labels.pod }} has a number of mmap-ed areas
close to the limit.'
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirmemorymapareastoohigh
expr: |
process_memory_map_areas{job=~".*/(ingester.*|cortex|mimir|mimir-write.*|store-gateway.*|cortex|mimir|mimir-backend.*)"} / process_memory_map_areas_limit{job=~".*/(ingester.*|cortex|mimir|mimir-write.*|store-gateway.*|cortex|mimir|mimir-backend.*)"} > 0.8
for: 5m
labels:
severity: critical
- alert: MimirIngesterInstanceHasNoTenants
annotations:
message: Mimir ingester {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace
}} has no tenants assigned.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterinstancehasnotenants
expr: |
(min by(cluster, namespace, pod) (cortex_ingester_memory_users) == 0)
and on (cluster, namespace)
# Only if there are more timeseries than would be expected due to continuous testing load
(
( # Classic storage timeseries
sum by(cluster, namespace) (cortex_ingester_memory_series)
/
max by(cluster, namespace) (cortex_distributor_replication_factor)
)
or
( # Ingest storage timeseries
sum by(cluster, namespace) (
max by(ingester_id, cluster, namespace) (
label_replace(cortex_ingester_memory_series,
"ingester_id", "$1",
"pod", ".*-([0-9]+)$"
)
)
)
)
) > 100000
for: 1h
labels:
severity: warning
- alert: MimirRulerInstanceHasNoRuleGroups
annotations:
message: Mimir ruler {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace
}} has no rule groups assigned.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrulerinstancehasnorulegroups
expr: |
# Alert on ruler instances in microservices mode that have no rule groups assigned,
min by(cluster, namespace, pod) (cortex_ruler_managers_total{pod=~"(.*mimir-)?ruler.*"}) == 0
# but only if other ruler instances of the same cell do have rule groups assigned
and on (cluster, namespace)
(max by(cluster, namespace) (cortex_ruler_managers_total) > 0)
# and there are more than two instances overall
and on (cluster, namespace)
(count by (cluster, namespace) (cortex_ruler_managers_total) > 2)
for: 1h
labels:
severity: warning
- alert: MimirIngestedDataTooFarInTheFuture
annotations:
message: Mimir ingester {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace
}} has ingested samples with timestamps more than 1h in the future.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesteddatatoofarinthefuture
expr: |
max by(cluster, namespace, pod) (
cortex_ingester_tsdb_head_max_timestamp_seconds - time()
and
cortex_ingester_tsdb_head_max_timestamp_seconds > 0
) > 60*60
for: 5m
labels:
severity: warning
- alert: MimirStoreGatewayTooManyFailedOperations
annotations:
message: Mimir store-gateway in {{ $labels.cluster }}/{{ $labels.namespace
}} is experiencing {{ $value | humanizePercentage }} errors while doing
{{ $labels.operation }} on the object storage.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirstoregatewaytoomanyfailedoperations
expr: |
sum by(cluster, namespace, operation) (rate(thanos_objstore_bucket_operation_failures_total{component="store-gateway"}[1m])) > 0
for: 5m
labels:
severity: warning
- alert: MimirRingMembersMismatch
annotations:
message: |
Number of members in Mimir ingester hash ring does not match the expected number in {{ $labels.cluster }}/{{ $labels.namespace }}.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirringmembersmismatch
expr: |
(
avg by(cluster, namespace) (sum by(cluster, namespace, pod) (cortex_ring_members{name="ingester",job=~".*/(ingester.*|cortex|mimir|mimir-write.*)",job!~".*/(ingester.*-partition)"}))
!= sum by(cluster, namespace) (up{job=~".*/(ingester.*|cortex|mimir|mimir-write.*)",job!~".*/(ingester.*-partition)"})
)
and
(
count by(cluster, namespace) (cortex_build_info) > 0
)
for: 15m
labels:
component: ingester
severity: warning
- name: mimir_dev_instance_limits_alerts
rules:
- alert: MimirIngesterReachingSeriesLimit
annotations:
message: |
Ingester {{ $labels.job }}/{{ $labels.pod }} has reached {{ $value | humanizePercentage }} of its series limit.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterreachingserieslimit
expr: |
(
(cortex_ingester_memory_series / ignoring(limit) cortex_ingester_instance_limits{limit="max_series"})
and ignoring (limit)
(cortex_ingester_instance_limits{limit="max_series"} > 0)
) > 0.8
for: 3h
labels:
severity: warning
- alert: MimirIngesterReachingSeriesLimit
annotations:
message: |
Ingester {{ $labels.job }}/{{ $labels.pod }} has reached {{ $value | humanizePercentage }} of its series limit.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterreachingserieslimit
expr: |
(
(cortex_ingester_memory_series / ignoring(limit) cortex_ingester_instance_limits{limit="max_series"})
and ignoring (limit)
(cortex_ingester_instance_limits{limit="max_series"} > 0)
) > 0.9
for: 5m
labels:
severity: critical
- alert: MimirIngesterReachingTenantsLimit
annotations:
message: |
Ingester {{ $labels.job }}/{{ $labels.pod }} has reached {{ $value | humanizePercentage }} of its tenant limit.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterreachingtenantslimit
expr: |
(
(cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit="max_tenants"})
and ignoring (limit)
(cortex_ingester_instance_limits{limit="max_tenants"} > 0)
) > 0.7
for: 5m
labels:
severity: warning
- alert: MimirIngesterReachingTenantsLimit
annotations:
message: |
Ingester {{ $labels.job }}/{{ $labels.pod }} has reached {{ $value | humanizePercentage }} of its tenant limit.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterreachingtenantslimit
expr: |
(
(cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit="max_tenants"})
and ignoring (limit)
(cortex_ingester_instance_limits{limit="max_tenants"} > 0)
) > 0.8
for: 5m
labels:
severity: critical
- alert: MimirReachingTCPConnectionsLimit
annotations:
message: |
Mimir instance {{ $labels.job }}/{{ $labels.pod }} has reached {{ $value | humanizePercentage }} of its TCP connections limit for {{ $labels.protocol }} protocol.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirreachingtcpconnectionslimit
expr: |
cortex_tcp_connections / cortex_tcp_connections_limit > 0.8 and
cortex_tcp_connections_limit > 0
for: 5m
labels:
severity: critical
- alert: MimirDistributorReachingInflightPushRequestLimit
annotations:
message: |
Distributor {{ $labels.job }}/{{ $labels.pod }} has reached {{ $value | humanizePercentage }} of its inflight push request limit.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirdistributorreachinginflightpushrequestlimit
expr: |
(
(cortex_distributor_inflight_push_requests / ignoring(limit) cortex_distributor_instance_limits{limit="max_inflight_push_requests"})
and ignoring (limit)
(cortex_distributor_instance_limits{limit="max_inflight_push_requests"} > 0)
) > 0.8
for: 5m
labels:
severity: critical
- name: mimir_dev-rollout-alerts
rules:
- alert: MimirRolloutStuck
annotations:
message: |
The {{ $labels.rollout_group }} rollout is stuck in {{ $labels.cluster }}/{{ $labels.namespace }}.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrolloutstuck
expr: |
(
max without (revision) (
sum without(statefulset) (label_replace(kube_statefulset_status_current_revision, "rollout_group", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?"))
unless
sum without(statefulset) (label_replace(kube_statefulset_status_update_revision, "rollout_group", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?"))
)
*
(
sum without(statefulset) (label_replace(kube_statefulset_replicas, "rollout_group", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?"))
!=
sum without(statefulset) (label_replace(kube_statefulset_status_replicas_updated, "rollout_group", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?"))
)
) and (
changes(sum without(statefulset) (label_replace(kube_statefulset_status_replicas_updated, "rollout_group", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?"))[15m:1m])
==
0
)
* on(cluster, namespace) group_left max by(cluster, namespace) (cortex_build_info)
for: 30m
labels:
severity: warning
workload_type: statefulset
- alert: MimirRolloutStuck
annotations:
message: |
The {{ $labels.rollout_group }} rollout is stuck in {{ $labels.cluster }}/{{ $labels.namespace }}.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrolloutstuck
expr: |
(
sum without(deployment) (label_replace(kube_deployment_spec_replicas, "rollout_group", "$1", "deployment", "(.*?)(?:-zone-[a-z])?"))
!=
sum without(deployment) (label_replace(kube_deployment_status_replicas_updated, "rollout_group", "$1", "deployment", "(.*?)(?:-zone-[a-z])?"))
) and (
changes(sum without(deployment) (label_replace(kube_deployment_status_replicas_updated, "rollout_group", "$1", "deployment", "(.*?)(?:-zone-[a-z])?"))[15m:1m])
==
0
)
* on(cluster, namespace) group_left max by(cluster, namespace) (cortex_build_info)
for: 30m
labels:
severity: warning
workload_type: deployment
- alert: RolloutOperatorNotReconciling
annotations:
message: |
Rollout operator is not reconciling the rollout group {{ $labels.rollout_group }} in {{ $labels.cluster }}/{{ $labels.namespace }}.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#rolloutoperatornotreconciling
expr: |
max by(cluster, namespace, rollout_group) (time() - rollout_operator_last_successful_group_reconcile_timestamp_seconds) > 600
for: 5m
labels:
severity: critical
- name: mimir_dev-provisioning
rules:
- alert: MimirAllocatingTooMuchMemory
annotations:
message: |
Instance {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} is using too much memory.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirallocatingtoomuchmemory
expr: |
(
# We use RSS instead of working set memory because of the ingester's extensive usage of mmap.
# See: https://github.com/grafana/mimir/issues/2466
container_memory_rss{container=~"(ingester|mimir-write|mimir-backend)"}
/
( container_spec_memory_limit_bytes{container=~"(ingester|mimir-write|mimir-backend)"} > 0 )
)
# Match only Mimir namespaces.
* on(cluster, namespace) group_left max by(cluster, namespace) (cortex_build_info)
> 0.65
for: 15m
labels:
severity: warning
- alert: MimirAllocatingTooMuchMemory
annotations:
message: |
Instance {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} is using too much memory.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirallocatingtoomuchmemory
expr: |
(
# We use RSS instead of working set memory because of the ingester's extensive usage of mmap.
# See: https://github.com/grafana/mimir/issues/2466
container_memory_rss{container=~"(ingester|mimir-write|mimir-backend)"}
/
( container_spec_memory_limit_bytes{container=~"(ingester|mimir-write|mimir-backend)"} > 0 )
)
# Match only Mimir namespaces.
* on(cluster, namespace) group_left max by(cluster, namespace) (cortex_build_info)
> 0.8
for: 15m
labels:
severity: critical
- name: ruler_alerts
rules:
#labels:
# release: prometheus
mimirAlerts: true
mimirRules: true
namespace: mimir-distributed-dev
serviceMonitor:
enabled: true
#labels:
# release: prometheus
metadata-cache:
enabled: true
Describe the bug
I can't load the settings for the Mimir Alertmanager. The documentation instructs loading them with mimirtool alertmanager load, but it does not work for me.
To Reproduce
Steps to reproduce the behavior:
Doc - https://grafana.com/docs/mimir/latest/references/architecture/components/alertmanager/
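Note that mimirtool alertmanager load expects a plain Alertmanager configuration file, not the helm values.yaml. A minimal sketch of such a file (the receiver name and webhook URL are placeholders, not from this report):

```yaml
# alertmanager-config.yaml -- a minimal per-tenant Alertmanager configuration.
# Loaded with: mimirtool alertmanager load alertmanager-config.yaml --address=http://127.0.0.1:8080 --id=<tenant>
route:
  receiver: default-receiver
  group_by: [alertname]
receivers:
  - name: default-receiver
    webhook_configs:
      - url: http://example.local/hook   # placeholder endpoint
```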