Hi
In a setup with Prometheus 2.26.0 in a k8s cluster and a central Alertmanager 0.21.0 on another cluster, alerts are created on Prometheus and stay active the whole time, but on Alertmanager they keep flapping, being opened and closed.
Here is a log of an alert I created that is always active, together with an ngrep capture of the traffic:
T 2021/05/18 12:07:39.996663 172.26.55.170:14564 -> 172.27.33.232:80 [AP]
POST /api/v2/alerts HTTP/1.1..Host: alertmanager-internal.ops.example.com...User-Agent: Prometheus/2.26.0..Content-Length: 910..Content-Type: application/json....
[{"annotations":{"action":"Check the server logs ","dashboard":"https://grafana.example.com/","description":"172.26.55.170:9009 test","impact":"Services may ","priority":"P5","summary":"test ignore"},"endsAt"
:"2021-05-18T12:11:39.988Z","startsAt":"2021-05-18T03:27:39.988Z","generatorURL":"https://prometheus.%7D+%3D%3D+1%29\u0026g0.tab=1",
"labels":{"address":"172.26.55.170:9009","alert_target":"HostsMetrics 172.26.55.170:9009","alertname":"Instance_test","instance":"172.26.55.170:9009","job":"HostsMetrics","kcluster":"k8s.example.internal",
"monitor":"exporter-metrics","replica":"ip-172-26-55-170.live","severity":"page"}}]
level=debug ts=2021-05-18T12:07:39.996Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Instance_test[4a706ed][active]
T 2021/05/18 12:08:55.000341 172.26.55.170:36204 -> 172.27.33.232:80 [AP]
POST /api/v2/alerts HTTP/1.1..Host: alertmanager-internal.ops.example.com...User-Agent: Prometheus/2.26.0..Content-Length: 910..Content-Type: application/json....
[{"annotations":{"action":"Check the server logs ","dashboard":"https://grafana.example.com/","description":"172.26.55.170:9009 test","impact":"Services may ","priority":"P5","summary":"test ignore"},"endsAt"
:"2021-05-18T12:12:54.988Z","startsAt":"2021-05-18T03:27:39.988Z","generatorURL":"https://prometheus.%7D+%3D%3D+1%29\u0026g0.tab=1",
"labels":{"address":"172.26.55.170:9009","alert_target":"HostsMetrics 172.26.55.170:9009","alertname":"Instance_test","instance":"172.26.55.170:9009","job":"HostsMetrics","kcluster":"k8s.example.internal",
"monitor":"exporter-metrics","replica":"ip-172-26-55-170.live","severity":"page"}}]
level=debug ts=2021-05-18T12:08:55.013Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Instance_test[4a706ed][active]
T 2021/05/18 12:10:09.995726 172.26.55.170:12648 -> 172.27.33.232:80 [AP]
POST /api/v2/alerts HTTP/1.1..Host: alertmanager-internal.ops.example.com...User-Agent: Prometheus/2.26.0..Content-Length: 910..Content-Type: application/json....
[{"annotations":{"action":"Check the server logs ","dashboard":"https://grafana.example.com/","description":"172.26.55.170:9009 test","impact":"Services may ","priority":"P5","summary":"test ignore"},"endsAt"
:"2021-05-18T12:14:09.988Z","startsAt":"2021-05-18T03:27:39.988Z","generatorURL":"https://prometheus.%7D+%3D%3D+1%29\u0026g0.tab=1",
"labels":{"address":"172.26.55.170:9009","alert_target":"HostsMetrics 172.26.55.170:9009","alertname":"Instance_test","instance":"172.26.55.170:9009","job":"HostsMetrics","kcluster":"k8s.example.internal",
"monitor":"exporter-metrics","replica":"ip-172-26-55-170.live","severity":"page"}}]
level=debug ts=2021-05-18T12:10:09.997Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Instance_test[4a706ed][active]
level=debug ts=2021-05-18T12:10:10.005Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{severity=\"page\"}:{alert_target=\"HostsMetrics 172.26.55.170:9009\", alertname=\"Instance_test\"}" msg=flushing alerts=[Instance_test[4a706ed][active]]
level=debug ts=2021-05-18T12:10:11.314Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{severity=\"page\"}:{alert_target=\"HostsMetrics 172.26.55.170:9009\", alertname=\"Instance_test\"}" msg=flushing alerts=[Instance_test[4a706ed][active]]
T 2021/05/18 12:11:24.996825 172.26.55.170:36258 -> 172.27.33.232:80 [AP]
POST /api/v2/alerts HTTP/1.1..Host: alertmanager-internal.ops.example.com...User-Agent: Prometheus/2.26.0..Content-Length: 910..Content-Type: application/json....
[{"annotations":{"action":"Check the server logs ","dashboard":"https://grafana.example.com/","description":"172.26.55.170:9009 test","impact":"Services may ","priority":"P5","summary":"test ignore"},"endsAt"
:"2021-05-18T12:15:24.988Z","startsAt":"2021-05-18T03:27:39.988Z","generatorURL":"https://prometheus.%7D+%3D%3D+1%29\u0026g0.tab=1",
"labels":{"address":"172.26.55.170:9009","alert_target":"HostsMetrics 172.26.55.170:9009","alertname":"Instance_test","instance":"172.26.55.170:9009","job":"HostsMetrics","kcluster":"k8s.example.internal",
"monitor":"exporter-metrics","replica":"ip-172-26-55-170.live","severity":"page"}}]
level=debug ts=2021-05-18T12:11:24.997Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Instance_test[4a706ed][active]
T 2021/05/18 12:12:39.996392 172.26.55.170:12646 -> 172.27.33.232:80 [AP]
POST /api/v2/alerts HTTP/1.1..Host: alertmanager-internal.ops.example.com...User-Agent: Prometheus/2.26.0..Content-Length: 910..Content-Type: application/json....
[{"annotations":{"action":"Check the server logs ","dashboard":"https://grafana.example.com/","description":"172.26.55.170:9009 test","impact":"Services may ","priority":"P5","summary":"test ignore"},"endsAt"
:"2021-05-18T12:16:39.988Z","startsAt":"2021-05-18T03:27:39.988Z","generatorURL":"https://prometheus.%7D+%3D%3D+1%29\u0026g0.tab=1",
"labels":{"address":"172.26.55.170:9009","alert_target":"HostsMetrics 172.26.55.170:9009","alertname":"Instance_test","instance":"172.26.55.170:9009","job":"HostsMetrics","kcluster":"k8s.example.internal",
"monitor":"exporter-metrics","replica":"ip-172-26-55-170.live","severity":"page"}}]
level=debug ts=2021-05-18T12:12:39.996Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Instance_test[4a706ed][active]
T 2021/05/18 12:13:54.995927 172.26.55.170:48240 -> 172.27.33.232:80 [AP]
POST /api/v2/alerts HTTP/1.1..Host: alertmanager-internal.ops.example.com...User-Agent: Prometheus/2.26.0..Content-Length: 910..Content-Type: application/json....
[{"annotations":{"action":"Check the server logs ","dashboard":"https://grafana.example.com/","description":"172.26.55.170:9009 test","impact":"Services may ","priority":"P5","summary":"test ignore"},"endsAt"
:"2021-05-18T12:17:54.988Z","startsAt":"2021-05-18T03:27:39.988Z","generatorURL":"https://prometheus.%7D+%3D%3D+1%29\u0026g0.tab=1",
"labels":{"address":"172.26.55.170:9009","alert_target":"HostsMetrics 172.26.55.170:9009","alertname":"Instance_test","instance":"172.26.55.170:9009","job":"HostsMetrics","kcluster":"k8s.example.internal",
"monitor":"exporter-metrics","replica":"ip-172-26-55-170.live","severity":"page"}}]
level=debug ts=2021-05-18T12:13:54.997Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Instance_test[4a706ed][active]
level=debug ts=2021-05-18T12:13:54.997Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{severity=\"page\"}:{alert_target=\"HostsMetrics 172.26.55.170:9009\", alertname=\"Instance_test\"}" msg=flushing alerts=[Instance_test[4a706ed][active]]
T 2021/05/18 12:15:09.998517 172.26.55.170:12658 -> 172.27.33.232:80 [AP]
POST /api/v2/alerts HTTP/1.1..Host: alertmanager-internal.ops.example.com...User-Agent: Prometheus/2.26.0..Content-Length: 910..Content-Type: application/json....
[{"annotations":{"action":"Check the server logs ","dashboard":"https://grafana.example.com/","description":"172.26.55.170:9009 test","impact":"Services may ","priority":"P5","summary":"test ignore"},"endsAt"
:"2021-05-18T12:19:09.988Z","startsAt":"2021-05-18T03:27:39.988Z","generatorURL":"https://prometheus.%7D+%3D%3D+1%29\u0026g0.tab=1",
"labels":{"address":"172.26.55.170:9009","alert_target":"HostsMetrics 172.26.55.170:9009","alertname":"Instance_test","instance":"172.26.55.170:9009","job":"HostsMetrics","kcluster":"k8s.example.internal",
"monitor":"exporter-metrics","replica":"ip-172-26-55-170.live","severity":"page"}}]
level=debug ts=2021-05-18T12:15:09.999Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Instance_test[4a706ed][active]
level=debug ts=2021-05-18T12:15:10.005Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{severity=\"page\"}:{alert_target=\"HostsMetrics 172.26.55.170:9009\", alertname=\"Instance_test\"}" msg=flushing alerts=[Instance_test[4a706ed][active]]
level=debug ts=2021-05-18T12:15:11.314Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{severity=\"page\"}:{alert_target=\"HostsMetrics 172.26.55.170:9009\", alertname=\"Instance_test\"}" msg=flushing alerts=[Instance_test[4a706ed][resolved]]
Prometheus is re-sending the alert to Alertmanager roughly every 75s, each time with an endsAt about four minutes in the future, yet Alertmanager still marked the alert as resolved at 12:15:11!
I have been trying to debug this, but I'm unable to find why the alert is being resolved. Another Prometheus in the same Kubernetes cluster is working fine, which makes this even stranger.
Does anyone have a hint on what is going on here?
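In case it helps, the alert can also be inspected directly on the Alertmanager v2 API to see the startsAt/endsAt it has stored; a rough sketch of the query (same internal ingress host as in the capture above, jq only for readability):

# what does alertmanager itself currently have for this alert?
curl -s 'http://alertmanager-internal.ops.example.com/api/v2/alerts?filter=alertname%3D%22Instance_test%22' \
  | jq '.[] | {startsAt, endsAt, updatedAt, status}'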
Alertmanager config:
image: prom/alertmanager:v0.21.0
containers:
- args:
  - --config.file=/etc/alertmanager/config.yml
  - --web.external-url=https://alertmanager.example.com
  - --cluster.peer=alertmanager-all:9094
  - --log.level=debug
apiVersion: v1
data:
  config.yml: |+
    global:
      resolve_timeout: 5m
      http_config: {}
      smtp_hello: localhost
      smtp_require_tls: true
      pagerduty_url: https://events.pagerduty.com/v2/enqueue
      opsgenie_api_url: https://api.opsgenie.com/
      wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
      victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
    templates:
    - /etc/alertmanager/template/*.tmpl
    route:
      receiver: blackhole
      group_by:
      - alertname
      - cluster
      - service
      - alert_target
      routes:
      - receiver: blackhole
        match_re:
          alertname: (QuietHours|QuietDays|live2env_running)
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
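Since the replicas gossip via --cluster.peer=alertmanager-all:9094, the cluster status on each instance can also be checked through the v2 API; a sketch (the URL below goes through the ingress, so it may land on either replica):

# cluster status as seen by the instance this request lands on
curl -s http://alertmanager-internal.ops.example.com/api/v2/status | jq '.cluster'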
On the Prometheus side:
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    kcluster: k8s.example.internal
    monitor: exporter-metrics
alerting:
  alertmanagers:
  - follow_redirects: true
    scheme: http
    timeout: 10s
    api_version: v2
    static_configs:
    - targets:
      # nginx-ingress for alertmanager
      - alertmanager-internal.example.com.
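To rule out delivery errors on the Prometheus side, the notifier metrics can be checked on the Prometheus pod itself; a sketch, assuming the default port 9090:

# confirm notifications are being sent to alertmanager and not erroring or dropped
curl -s http://localhost:9090/metrics | grep -E 'prometheus_notifications_(sent|errors|dropped)_total'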
Thanks for the help,
Daniel