
Config is not imported Prometheus/Alertmanager #9085

Open
flowramps opened this issue Aug 23, 2024 · 1 comment

Describe the bug

I can't load the configuration for the Mimir Alertmanager. According to the documentation, it should be loaded with mimirtool alertmanager load.

To Reproduce

Steps to reproduce the behavior:

  1. I port-forwarded the service: kubectl port-forward svc/mimir-dev-alertmanager-headless 8080:8080 -n mimir-distributed-dev
  2. Commands used to reproduce:

Doc - https://grafana.com/docs/mimir/latest/references/architecture/components/alertmanager/

mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager --id="anonymous"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager --id="eks-dev"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager --id="eks-exemplo"

mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager/api/v2/alerts --id="anonymous"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager/api/v2/alerts --id="eks-dev"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager/api/v2/alerts --id="eks-exemplo"

mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager/api/v2/rules --id="anonymous"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager/api/v2/rules --id="eks-dev"
mimirtool alertmanager get --address=http://127.0.0.1:8080/alertmanager/api/v2/rules --id="eks-exemplo"
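Note that mimirtool expects --address to be the base URL of the Mimir Alertmanager API and appends the config endpoints itself, so pointing it at sub-paths like /api/v2/alerts will not work. A minimal sketch of the intended calls, assuming the port-forward from step 1 and a default install (paths not verified against this deployment):

```shell
# mimirtool derives the full endpoint from the base address; the --id flag
# is sent as the X-Scope-OrgID header identifying the tenant.
mimirtool alertmanager get --address=http://127.0.0.1:8080 --id=anonymous

# Roughly equivalent raw request against the per-tenant config API
# (path assumes a default install):
curl -H "X-Scope-OrgID: anonymous" http://127.0.0.1:8080/api/v1/alerts
```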


  3. How do I find out what the tenant ID is?
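On the tenant question: tenant IDs in Mimir are simply the X-Scope-OrgID values that clients send; with multi-tenancy (auth) disabled, everything is stored under the synthetic tenant "anonymous". One way to discover which tenants exist, sketched under the assumption that blocks are stored in an S3 bucket (the bucket name below is a placeholder):

```shell
# Each tenant gets its own prefix in the blocks storage bucket, so listing
# the top level of the bucket shows the tenant IDs in use.
# "my-mimir-blocks" is a placeholder for the bucket configured in the chart.
aws s3 ls s3://my-mimir-blocks/
```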

Expected behavior

The existing configuration can be retrieved with mimirtool alertmanager get, and a new configuration can be loaded successfully.

Environment

  • Infrastructure: Kubernetes (EKS)
  • Deployment tool: Helm

Additional Context

  1. How can I load the configuration?

  2. How can I obtain the list of tenants?

  3. The following values.yaml fragment is not imported:

 prometheusRule:
    annotations: {}
    enabled: true
    groups:
    - name: mimir_dev_alerts
      rules:
      - alert: MimirIngesterUnhealthy
        annotations:
          message: Mimir cluster {{ $labels.cluster }}/{{ $labels.namespace }} has
            {{ printf "%f" $value }} unhealthy ingester(s).
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterunhealthy
        expr: |
          min by (cluster, namespace) (cortex_ring_members{state="Unhealthy", name="ingester"}) > 0
        for: 15m
        labels:
          severity: critical
      - alert: MimirRequestErrors
        annotations:
          message: |
            The route {{ $labels.route }} in {{ $labels.cluster }}/{{ $labels.namespace }} is experiencing {{ printf "%.2f" $value }}% errors.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrequesterrors
        expr: |
          # The following 5xx errors considered as non-error:
          # - 529: used by distributor rate limiting (using 529 instead of 429 to let the client retry)
          # - 598: used by GEM gateway when the client is very slow to send the request and the gateway times out reading the request body
          (
            sum by (cluster, namespace, job, route) (rate(cortex_request_duration_seconds_count{status_code=~"5..",status_code!~"529|598",route!~"ready|debug_pprof"}[1m]))
            /
            sum by (cluster, namespace, job, route) (rate(cortex_request_duration_seconds_count{route!~"ready|debug_pprof"}[1m]))
          ) * 100 > 1
        for: 15m
        labels:
          severity: critical
      - alert: MimirRequestLatency
        annotations:
          message: |
            {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrequestlatency
        expr: |
          cluster_namespace_job_route:cortex_request_duration_seconds:99quantile{route!~"metrics|/frontend.Frontend/Process|ready|/schedulerpb.SchedulerForFrontend/FrontendLoop|/schedulerpb.SchedulerForQuerier/QuerierLoop|debug_pprof"}
             >
          2.5
        for: 15m
        labels:
          severity: warning
      - alert: MimirInconsistentRuntimeConfig
        annotations:
          message: |
            An inconsistent runtime config file is used across cluster {{ $labels.cluster }}/{{ $labels.namespace }}.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirinconsistentruntimeconfig
        expr: |
          count(count by(cluster, namespace, job, sha256) (cortex_runtime_config_hash)) without(sha256) > 1
        for: 1h
        labels:
          severity: critical
      - alert: MimirBadRuntimeConfig
        annotations:
          message: |
            {{ $labels.job }} failed to reload runtime config.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirbadruntimeconfig
        expr: |
          # The metric value is reset to 0 on error while reloading the config at runtime.
          cortex_runtime_config_last_reload_successful == 0
        for: 5m
        labels:
          severity: critical
      - alert: MimirFrontendQueriesStuck
        annotations:
          message: |
            There are {{ $value }} queued up queries in {{ $labels.cluster }}/{{ $labels.namespace }} {{ $labels.job }}.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirfrontendqueriesstuck
        expr: |
          sum by (cluster, namespace, job) (min_over_time(cortex_query_frontend_queue_length[1m])) > 0
        for: 5m
        labels:
          severity: critical
      - alert: MimirSchedulerQueriesStuck
        annotations:
          message: |
            There are {{ $value }} queued up queries in {{ $labels.cluster }}/{{ $labels.namespace }} {{ $labels.job }}.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirschedulerqueriesstuck
        expr: |
          sum by (cluster, namespace, job) (min_over_time(cortex_query_scheduler_queue_length[1m])) > 0
        for: 7m
        labels:
          severity: critical
      - alert: MimirCacheRequestErrors
        annotations:
          message: |
            The cache {{ $labels.name }} used by Mimir {{ $labels.cluster }}/{{ $labels.namespace }} is experiencing {{ printf "%.2f" $value }}% errors for {{ $labels.operation }} operation.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimircacherequesterrors
        expr: |
          (
            sum by(cluster, namespace, name, operation) (
              rate(thanos_memcached_operation_failures_total[1m])
              or
              rate(thanos_cache_operation_failures_total[1m])
            )
            /
            sum by(cluster, namespace, name, operation) (
              rate(thanos_memcached_operations_total[1m])
              or
              rate(thanos_cache_operations_total[1m])
            )
          ) * 100 > 5
        for: 5m
        labels:
          severity: warning
      - alert: MimirIngesterRestarts
        annotations:
          message: Mimir {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace
            }} has restarted {{ printf "%.2f" $value }} times in the last 30 mins.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterrestarts
        expr: |
          (
            sum by(cluster, namespace, pod) (
              increase(kube_pod_container_status_restarts_total{container=~"(ingester|mimir-write)"}[30m])
            )
            >= 2
          )
          and
          (
            count by(cluster, namespace, pod) (cortex_build_info) > 0
          )
        labels:
          severity: warning
      - alert: MimirKVStoreFailure
        annotations:
          message: |
            Mimir {{ $labels.pod }} in  {{ $labels.cluster }}/{{ $labels.namespace }} is failing to talk to the KV store {{ $labels.kv_name }}.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirkvstorefailure
        expr: |
          (
            sum by(cluster, namespace, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count{status_code!~"2.+"}[1m]))
            /
            sum by(cluster, namespace, pod, status_code, kv_name) (rate(cortex_kv_request_duration_seconds_count[1m]))
          )
          # We want to get alerted only in case there's a constant failure.
          == 1
        for: 5m
        labels:
          severity: critical
      - alert: MimirMemoryMapAreasTooHigh
        annotations:
          message: '{{ $labels.job }}/{{ $labels.pod }} has a number of mmap-ed areas
            close to the limit.'
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirmemorymapareastoohigh
        expr: |
          process_memory_map_areas{job=~".*/(ingester.*|cortex|mimir|mimir-write.*|store-gateway.*|cortex|mimir|mimir-backend.*)"} / process_memory_map_areas_limit{job=~".*/(ingester.*|cortex|mimir|mimir-write.*|store-gateway.*|cortex|mimir|mimir-backend.*)"} > 0.8
        for: 5m
        labels:
          severity: critical
      - alert: MimirIngesterInstanceHasNoTenants
        annotations:
          message: Mimir ingester {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace
            }} has no tenants assigned.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterinstancehasnotenants
        expr: |
          (min by(cluster, namespace, pod) (cortex_ingester_memory_users) == 0)
          and on (cluster, namespace)
          # Only if there are more timeseries than would be expected due to continuous testing load
          (
            ( # Classic storage timeseries
              sum by(cluster, namespace) (cortex_ingester_memory_series)
              /
              max by(cluster, namespace) (cortex_distributor_replication_factor)
            )
            or
            ( # Ingest storage timeseries
              sum by(cluster, namespace) (
                max by(ingester_id, cluster, namespace) (
                  label_replace(cortex_ingester_memory_series,
                    "ingester_id", "$1",
                    "pod", ".*-([0-9]+)$"
                  )
                )
              )
            )
          ) > 100000
        for: 1h
        labels:
          severity: warning
      - alert: MimirRulerInstanceHasNoRuleGroups
        annotations:
          message: Mimir ruler {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace
            }} has no rule groups assigned.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrulerinstancehasnorulegroups
        expr: |
          # Alert on ruler instances in microservices mode that have no rule groups assigned,
          min by(cluster, namespace, pod) (cortex_ruler_managers_total{pod=~"(.*mimir-)?ruler.*"}) == 0
          # but only if other ruler instances of the same cell do have rule groups assigned
          and on (cluster, namespace)
          (max by(cluster, namespace) (cortex_ruler_managers_total) > 0)
          # and there are more than two instances overall
          and on (cluster, namespace)
          (count by (cluster, namespace) (cortex_ruler_managers_total) > 2)
        for: 1h
        labels:
          severity: warning
      - alert: MimirIngestedDataTooFarInTheFuture
        annotations:
          message: Mimir ingester {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace
            }} has ingested samples with timestamps more than 1h in the future.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesteddatatoofarinthefuture
        expr: |
          max by(cluster, namespace, pod) (
              cortex_ingester_tsdb_head_max_timestamp_seconds - time()
              and
              cortex_ingester_tsdb_head_max_timestamp_seconds > 0
          ) > 60*60
        for: 5m
        labels:
          severity: warning
      - alert: MimirStoreGatewayTooManyFailedOperations
        annotations:
          message: Mimir store-gateway in {{ $labels.cluster }}/{{ $labels.namespace
            }} is experiencing {{ $value | humanizePercentage }} errors while doing
            {{ $labels.operation }} on the object storage.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirstoregatewaytoomanyfailedoperations
        expr: |
          sum by(cluster, namespace, operation) (rate(thanos_objstore_bucket_operation_failures_total{component="store-gateway"}[1m])) > 0
        for: 5m
        labels:
          severity: warning
      - alert: MimirRingMembersMismatch
        annotations:
          message: |
            Number of members in Mimir ingester hash ring does not match the expected number in {{ $labels.cluster }}/{{ $labels.namespace }}.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirringmembersmismatch
        expr: |
          (
            avg by(cluster, namespace) (sum by(cluster, namespace, pod) (cortex_ring_members{name="ingester",job=~".*/(ingester.*|cortex|mimir|mimir-write.*)",job!~".*/(ingester.*-partition)"}))
            != sum by(cluster, namespace) (up{job=~".*/(ingester.*|cortex|mimir|mimir-write.*)",job!~".*/(ingester.*-partition)"})
          )
          and
          (
            count by(cluster, namespace) (cortex_build_info) > 0
          )
        for: 15m
        labels:
          component: ingester
          severity: warning
    - name: mimir_dev_instance_limits_alerts
      rules:
      - alert: MimirIngesterReachingSeriesLimit
        annotations:
          message: |
            Ingester {{ $labels.job }}/{{ $labels.pod }} has reached {{ $value | humanizePercentage }} of its series limit.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterreachingserieslimit
        expr: |
          (
              (cortex_ingester_memory_series / ignoring(limit) cortex_ingester_instance_limits{limit="max_series"})
              and ignoring (limit)
              (cortex_ingester_instance_limits{limit="max_series"} > 0)
          ) > 0.8
        for: 3h
        labels:
          severity: warning
      - alert: MimirIngesterReachingSeriesLimit
        annotations:
          message: |
            Ingester {{ $labels.job }}/{{ $labels.pod }} has reached {{ $value | humanizePercentage }} of its series limit.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterreachingserieslimit
        expr: |
          (
              (cortex_ingester_memory_series / ignoring(limit) cortex_ingester_instance_limits{limit="max_series"})
              and ignoring (limit)
              (cortex_ingester_instance_limits{limit="max_series"} > 0)
          ) > 0.9
        for: 5m
        labels:
          severity: critical
      - alert: MimirIngesterReachingTenantsLimit
        annotations:
          message: |
            Ingester {{ $labels.job }}/{{ $labels.pod }} has reached {{ $value | humanizePercentage }} of its tenant limit.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterreachingtenantslimit
        expr: |
          (
              (cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit="max_tenants"})
              and ignoring (limit)
              (cortex_ingester_instance_limits{limit="max_tenants"} > 0)
          ) > 0.7
        for: 5m
        labels:
          severity: warning
      - alert: MimirIngesterReachingTenantsLimit
        annotations:
          message: |
            Ingester {{ $labels.job }}/{{ $labels.pod }} has reached {{ $value | humanizePercentage }} of its tenant limit.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterreachingtenantslimit
        expr: |
          (
              (cortex_ingester_memory_users / ignoring(limit) cortex_ingester_instance_limits{limit="max_tenants"})
              and ignoring (limit)
              (cortex_ingester_instance_limits{limit="max_tenants"} > 0)
          ) > 0.8
        for: 5m
        labels:
          severity: critical
      - alert: MimirReachingTCPConnectionsLimit
        annotations:
          message: |
            Mimir instance {{ $labels.job }}/{{ $labels.pod }} has reached {{ $value | humanizePercentage }} of its TCP connections limit for {{ $labels.protocol }} protocol.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirreachingtcpconnectionslimit
        expr: |
          cortex_tcp_connections / cortex_tcp_connections_limit > 0.8 and
          cortex_tcp_connections_limit > 0
        for: 5m
        labels:
          severity: critical
      - alert: MimirDistributorReachingInflightPushRequestLimit
        annotations:
          message: |
            Distributor {{ $labels.job }}/{{ $labels.pod }} has reached {{ $value | humanizePercentage }} of its inflight push request limit.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirdistributorreachinginflightpushrequestlimit
        expr: |
          (
              (cortex_distributor_inflight_push_requests / ignoring(limit) cortex_distributor_instance_limits{limit="max_inflight_push_requests"})
              and ignoring (limit)
              (cortex_distributor_instance_limits{limit="max_inflight_push_requests"} > 0)
          ) > 0.8
        for: 5m
        labels:
          severity: critical
    - name: mimir_dev-rollout-alerts
      rules:
      - alert: MimirRolloutStuck
        annotations:
          message: |
            The {{ $labels.rollout_group }} rollout is stuck in {{ $labels.cluster }}/{{ $labels.namespace }}.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrolloutstuck
        expr: |
          (
            max without (revision) (
              sum without(statefulset) (label_replace(kube_statefulset_status_current_revision, "rollout_group", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?"))
                unless
              sum without(statefulset) (label_replace(kube_statefulset_status_update_revision, "rollout_group", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?"))
            )
              *
            (
              sum without(statefulset) (label_replace(kube_statefulset_replicas, "rollout_group", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?"))
                !=
              sum without(statefulset) (label_replace(kube_statefulset_status_replicas_updated, "rollout_group", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?"))
            )
          ) and (
            changes(sum without(statefulset) (label_replace(kube_statefulset_status_replicas_updated, "rollout_group", "$1", "statefulset", "(.*?)(?:-zone-[a-z])?"))[15m:1m])
              ==
            0
          )
          * on(cluster, namespace) group_left max by(cluster, namespace) (cortex_build_info)
        for: 30m
        labels:
          severity: warning
          workload_type: statefulset
      - alert: MimirRolloutStuck
        annotations:
          message: |
            The {{ $labels.rollout_group }} rollout is stuck in {{ $labels.cluster }}/{{ $labels.namespace }}.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrolloutstuck
        expr: |
          (
            sum without(deployment) (label_replace(kube_deployment_spec_replicas, "rollout_group", "$1", "deployment", "(.*?)(?:-zone-[a-z])?"))
              !=
            sum without(deployment) (label_replace(kube_deployment_status_replicas_updated, "rollout_group", "$1", "deployment", "(.*?)(?:-zone-[a-z])?"))
          ) and (
            changes(sum without(deployment) (label_replace(kube_deployment_status_replicas_updated, "rollout_group", "$1", "deployment", "(.*?)(?:-zone-[a-z])?"))[15m:1m])
              ==
            0
          )
          * on(cluster, namespace) group_left max by(cluster, namespace) (cortex_build_info)
        for: 30m
        labels:
          severity: warning
          workload_type: deployment
      - alert: RolloutOperatorNotReconciling
        annotations:
          message: |
            Rollout operator is not reconciling the rollout group {{ $labels.rollout_group }} in {{ $labels.cluster }}/{{ $labels.namespace }}.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#rolloutoperatornotreconciling
        expr: |
          max by(cluster, namespace, rollout_group) (time() - rollout_operator_last_successful_group_reconcile_timestamp_seconds) > 600
        for: 5m
        labels:
          severity: critical
    - name: mimir_dev-provisioning
      rules:
      - alert: MimirAllocatingTooMuchMemory
        annotations:
          message: |
            Instance {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} is using too much memory.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirallocatingtoomuchmemory
        expr: |
          (
            # We use RSS instead of working set memory because of the ingester's extensive usage of mmap.
            # See: https://github.com/grafana/mimir/issues/2466
            container_memory_rss{container=~"(ingester|mimir-write|mimir-backend)"}
              /
            ( container_spec_memory_limit_bytes{container=~"(ingester|mimir-write|mimir-backend)"} > 0 )
          )
          # Match only Mimir namespaces.
          * on(cluster, namespace) group_left max by(cluster, namespace) (cortex_build_info)
          > 0.65
        for: 15m
        labels:
          severity: warning
      - alert: MimirAllocatingTooMuchMemory
        annotations:
          message: |
            Instance {{ $labels.pod }} in {{ $labels.cluster }}/{{ $labels.namespace }} is using too much memory.
          runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirallocatingtoomuchmemory
        expr: |
          (
            # We use RSS instead of working set memory because of the ingester's extensive usage of mmap.
            # See: https://github.com/grafana/mimir/issues/2466
            container_memory_rss{container=~"(ingester|mimir-write|mimir-backend)"}
              /
            ( container_spec_memory_limit_bytes{container=~"(ingester|mimir-write|mimir-backend)"} > 0 )
          )
          # Match only Mimir namespaces.
          * on(cluster, namespace) group_left max by(cluster, namespace) (cortex_build_info)
          > 0.8
        for: 15m
        labels:
          severity: critical
    - name: ruler_alerts
      rules:
    #labels:
    #  release: prometheus
    mimirAlerts: true
    mimirRules: true
    namespace: mimir-distributed-dev
  serviceMonitor:
    enabled: true
    #labels:
    #  release: prometheus
metadata-cache:
  enabled: true
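Regarding item 3 ("Values.yaml is not imported"): as far as I understand the mimir-distributed chart, the prometheusRule section above renders a PrometheusRule custom resource for the Prometheus Operator (meta-monitoring); it is not loaded into Mimir's own ruler, which only knows about rules uploaded through its API (e.g. via mimirtool rules load). A quick check, assuming the Prometheus Operator CRDs are installed:

```shell
# If the chart rendered the rules, a PrometheusRule object should exist:
kubectl -n mimir-distributed-dev get prometheusrules
```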

@flowramps (Author)

Update

I was able to load rules and alerts with the following commands:

  1. kubectl port-forward svc/mimir-dev-ruler 8080:8080 -n mimir-distributed-dev
    mimirtool rules load rules.yaml --address=http://127.0.0.1:8080/ --id="anonymous"
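To confirm the upload, the same endpoint can be queried back; a sketch using mimirtool's read commands under the same port-forward:

```shell
# List the rule namespaces/groups stored for the tenant:
mimirtool rules list --address=http://127.0.0.1:8080/ --id=anonymous

# Print the full rule definitions:
mimirtool rules print --address=http://127.0.0.1:8080/ --id=anonymous
```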


  2. kubectl port-forward svc/mimir-dev-alertmanager-headless 8080:8080 -n mimir-distributed-dev
    mimirtool alertmanager load alertmanager-config.yaml alerts.yaml --address=http://127.0.0.1:8080/ --id="anonymous"

ATTENTION:

I can access the API and view the created rules, and one of them has an active alert, but the alerts never arrive in my Alertmanager!
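When rules fire in the ruler but nothing shows up in the Alertmanager, a common cause is that the ruler is not pointed at Mimir's internal Alertmanager (the alertmanager_url setting in the ruler config). Two checks, sketched assuming the port-forward above and default component names:

```shell
# 1. Verify the ruler's Alertmanager URL in the rendered Mimir config
#    (it should point at the internal alertmanager service):
kubectl -n mimir-distributed-dev get configmap -o yaml | grep -i alertmanager_url

# 2. Ask the tenant's Alertmanager which alerts it has actually received:
curl -H "X-Scope-OrgID: anonymous" \
  http://127.0.0.1:8080/alertmanager/api/v2/alerts
```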


mimirtool alertmanager load alertmanager-config.yaml alerts.yaml

  1. Another point I was unable to figure out: how can I validate the configuration after applying the alerts.yaml file?

  2. I can view the alertmanager-config.yaml file in the UI and confirm that it was applied.
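For item 1, two ways to validate after applying, sketched under the same assumptions as above:

```shell
# Round-trip: fetch the stored config back and compare it with the file:
mimirtool alertmanager get --address=http://127.0.0.1:8080/ --id=anonymous

# Check the file's syntax locally with amtool, which ships with the
# upstream Prometheus Alertmanager:
amtool check-config alertmanager-config.yaml
```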


If anyone knows how I should proceed, I would be very grateful; it would complete my configuration and my understanding of the ecosystem.
