Hi
In a setup with Prometheus 2.26.0 in a k8s cluster and a central Alertmanager 0.21.0 on another cluster, alerts are created on Prometheus and stay active the whole time, but on Alertmanager they keep flapping, being opened and closed.
Here is a log of an alert I created that is always active, together with an ngrep capture of the traffic:
T 2021/05/18 12:07:39.996663 172.26.55.170:14564 -> 172.27.33.232:80 [AP]
POST /api/v2/alerts HTTP/1.1..Host: alertmanager-internal.ops.example.com...User-Agent: Prometheus/2.26.0..Content-Length: 910..Content-Type: application/json....
[{"annotations":{"action":"Check the server logs ","dashboard":"https://grafana.example.com/","description":"172.26.55.170:9009 test","impact":"Services may ","priority":"P5","summary":"test ignore"},"endsAt"
:"2021-05-18T12:11:39.988Z","startsAt":"2021-05-18T03:27:39.988Z","generatorURL":"https://prometheus.%7D+%3D%3D+1%29\u0026g0.tab=1",
"labels":{"address":"172.26.55.170:9009","alert_target":"HostsMetrics 172.26.55.170:9009","alertname":"Instance_test","instance":"172.26.55.170:9009","job":"HostsMetrics","kcluster":"k8s.example.internal",
"monitor":"exporter-metrics","replica":"ip-172-26-55-170.live","severity":"page"}}]
level=debug ts=2021-05-18T12:07:39.996Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Instance_test[4a706ed][active]
T 2021/05/18 12:08:55.000341 172.26.55.170:36204 -> 172.27.33.232:80 [AP]
POST /api/v2/alerts HTTP/1.1..Host: alertmanager-internal.ops.example.com...User-Agent: Prometheus/2.26.0..Content-Length: 910..Content-Type: application/json....
[{"annotations":{"action":"Check the server logs ","dashboard":"https://grafana.example.com/","description":"172.26.55.170:9009 test","impact":"Services may ","priority":"P5","summary":"test ignore"},"endsAt"
:"2021-05-18T12:12:54.988Z","startsAt":"2021-05-18T03:27:39.988Z","generatorURL":"https://prometheus.%7D+%3D%3D+1%29\u0026g0.tab=1",
"labels":{"address":"172.26.55.170:9009","alert_target":"HostsMetrics 172.26.55.170:9009","alertname":"Instance_test","instance":"172.26.55.170:9009","job":"HostsMetrics","kcluster":"k8s.example.internal",
"monitor":"exporter-metrics","replica":"ip-172-26-55-170.live","severity":"page"}}]
level=debug ts=2021-05-18T12:08:55.013Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Instance_test[4a706ed][active]
T 2021/05/18 12:10:09.995726 172.26.55.170:12648 -> 172.27.33.232:80 [AP]
POST /api/v2/alerts HTTP/1.1..Host: alertmanager-internal.ops.example.com...User-Agent: Prometheus/2.26.0..Content-Length: 910..Content-Type: application/json....
[{"annotations":{"action":"Check the server logs ","dashboard":"https://grafana.example.com/","description":"172.26.55.170:9009 test","impact":"Services may ","priority":"P5","summary":"test ignore"},"endsAt"
:"2021-05-18T12:14:09.988Z","startsAt":"2021-05-18T03:27:39.988Z","generatorURL":"https://prometheus.%7D+%3D%3D+1%29\u0026g0.tab=1",
"labels":{"address":"172.26.55.170:9009","alert_target":"HostsMetrics 172.26.55.170:9009","alertname":"Instance_test","instance":"172.26.55.170:9009","job":"HostsMetrics","kcluster":"k8s.example.internal",
"monitor":"exporter-metrics","replica":"ip-172-26-55-170.live","severity":"page"}}]
level=debug ts=2021-05-18T12:10:09.997Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Instance_test[4a706ed][active]
level=debug ts=2021-05-18T12:10:10.005Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{severity=\"page\"}:{alert_target=\"HostsMetrics 172.26.55.170:9009\", alertname=\"Instance_test\"}" msg=flushing alerts=[Instance_test[4a706ed][active]]
level=debug ts=2021-05-18T12:10:11.314Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{severity=\"page\"}:{alert_target=\"HostsMetrics 172.26.55.170:9009\", alertname=\"Instance_test\"}" msg=flushing alerts=[Instance_test[4a706ed][active]]
T 2021/05/18 12:11:24.996825 172.26.55.170:36258 -> 172.27.33.232:80 [AP]
POST /api/v2/alerts HTTP/1.1..Host: alertmanager-internal.ops.example.com...User-Agent: Prometheus/2.26.0..Content-Length: 910..Content-Type: application/json....
[{"annotations":{"action":"Check the server logs ","dashboard":"https://grafana.example.com/","description":"172.26.55.170:9009 test","impact":"Services may ","priority":"P5","summary":"test ignore"},"endsAt"
:"2021-05-18T12:15:24.988Z","startsAt":"2021-05-18T03:27:39.988Z","generatorURL":"https://prometheus.%7D+%3D%3D+1%29\u0026g0.tab=1",
"labels":{"address":"172.26.55.170:9009","alert_target":"HostsMetrics 172.26.55.170:9009","alertname":"Instance_test","instance":"172.26.55.170:9009","job":"HostsMetrics","kcluster":"k8s.example.internal",
"monitor":"exporter-metrics","replica":"ip-172-26-55-170.live","severity":"page"}}]
level=debug ts=2021-05-18T12:11:24.997Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Instance_test[4a706ed][active]
T 2021/05/18 12:12:39.996392 172.26.55.170:12646 -> 172.27.33.232:80 [AP]
POST /api/v2/alerts HTTP/1.1..Host: alertmanager-internal.ops.example.com...User-Agent: Prometheus/2.26.0..Content-Length: 910..Content-Type: application/json....
[{"annotations":{"action":"Check the server logs ","dashboard":"https://grafana.example.com/","description":"172.26.55.170:9009 test","impact":"Services may ","priority":"P5","summary":"test ignore"},"endsAt"
:"2021-05-18T12:16:39.988Z","startsAt":"2021-05-18T03:27:39.988Z","generatorURL":"https://prometheus.%7D+%3D%3D+1%29\u0026g0.tab=1",
"labels":{"address":"172.26.55.170:9009","alert_target":"HostsMetrics 172.26.55.170:9009","alertname":"Instance_test","instance":"172.26.55.170:9009","job":"HostsMetrics","kcluster":"k8s.example.internal",
"monitor":"exporter-metrics","replica":"ip-172-26-55-170.live","severity":"page"}}]
level=debug ts=2021-05-18T12:12:39.996Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Instance_test[4a706ed][active]
T 2021/05/18 12:13:54.995927 172.26.55.170:48240 -> 172.27.33.232:80 [AP]
POST /api/v2/alerts HTTP/1.1..Host: alertmanager-internal.ops.example.com...User-Agent: Prometheus/2.26.0..Content-Length: 910..Content-Type: application/json....
[{"annotations":{"action":"Check the server logs ","dashboard":"https://grafana.example.com/","description":"172.26.55.170:9009 test","impact":"Services may ","priority":"P5","summary":"test ignore"},"endsAt"
:"2021-05-18T12:17:54.988Z","startsAt":"2021-05-18T03:27:39.988Z","generatorURL":"https://prometheus.%7D+%3D%3D+1%29\u0026g0.tab=1",
"labels":{"address":"172.26.55.170:9009","alert_target":"HostsMetrics 172.26.55.170:9009","alertname":"Instance_test","instance":"172.26.55.170:9009","job":"HostsMetrics","kcluster":"k8s.example.internal",
"monitor":"exporter-metrics","replica":"ip-172-26-55-170.live","severity":"page"}}]
level=debug ts=2021-05-18T12:13:54.997Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Instance_test[4a706ed][active]
level=debug ts=2021-05-18T12:13:54.997Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{severity=\"page\"}:{alert_target=\"HostsMetrics 172.26.55.170:9009\", alertname=\"Instance_test\"}" msg=flushing alerts=[Instance_test[4a706ed][active]]
T 2021/05/18 12:15:09.998517 172.26.55.170:12658 -> 172.27.33.232:80 [AP]
POST /api/v2/alerts HTTP/1.1..Host: alertmanager-internal.ops.example.com...User-Agent: Prometheus/2.26.0..Content-Length: 910..Content-Type: application/json....
[{"annotations":{"action":"Check the server logs ","dashboard":"https://grafana.example.com/","description":"172.26.55.170:9009 test","impact":"Services may ","priority":"P5","summary":"test ignore"},"endsAt"
:"2021-05-18T12:19:09.988Z","startsAt":"2021-05-18T03:27:39.988Z","generatorURL":"https://prometheus.%7D+%3D%3D+1%29\u0026g0.tab=1",
"labels":{"address":"172.26.55.170:9009","alert_target":"HostsMetrics 172.26.55.170:9009","alertname":"Instance_test","instance":"172.26.55.170:9009","job":"HostsMetrics","kcluster":"k8s.example.internal",
"monitor":"exporter-metrics","replica":"ip-172-26-55-170.live","severity":"page"}}]
level=debug ts=2021-05-18T12:15:09.999Z caller=dispatch.go:138 component=dispatcher msg="Received alert" alert=Instance_test[4a706ed][active]
level=debug ts=2021-05-18T12:15:10.005Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{severity=\"page\"}:{alert_target=\"HostsMetrics 172.26.55.170:9009\", alertname=\"Instance_test\"}" msg=flushing alerts=[Instance_test[4a706ed][active]]
level=debug ts=2021-05-18T12:15:11.314Z caller=dispatch.go:473 component=dispatcher aggrGroup="{}/{severity=\"page\"}:{alert_target=\"HostsMetrics 172.26.55.170:9009\", alertname=\"Instance_test\"}" msg=flushing alerts=[Instance_test[4a706ed][resolved]]
Prometheus is re-sending the alert to Alertmanager roughly every 75s, each time with an endsAt about four minutes in the future, yet Alertmanager still marked the alert as resolved at 12:15:11!
I have been trying to debug this, but I'm unable to find why the alert is being resolved. Another Prometheus in the same Kubernetes cluster is working fine, which makes this even stranger.
Does anyone have a hint on what is going on here?
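In case it helps, the alert can also be inspected directly on the Alertmanager v2 API to see the startsAt/endsAt it has stored; a rough sketch of the query (same internal ingress host as in the capture above, jq only for readability):

# what does alertmanager itself currently have for this alert?
curl -s 'http://alertmanager-internal.ops.example.com/api/v2/alerts?filter=alertname%3D%22Instance_test%22' \
  | jq '.[] | {startsAt, endsAt, updatedAt, status}'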
Alertmanager config:
image: prom/alertmanager:v0.21.0
containers:
- args:
  - --config.file=/etc/alertmanager/config.yml
  - --web.external-url=https://alertmanager.example.com
  - --cluster.peer=alertmanager-all:9094
  - --log.level=debug
apiVersion: v1
data:
  config.yml: |+
    global:
      resolve_timeout: 5m
      http_config: {}
      smtp_hello: localhost
      smtp_require_tls: true
      pagerduty_url: https://events.pagerduty.com/v2/enqueue
      opsgenie_api_url: https://api.opsgenie.com/
      wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
      victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
    templates:
    - /etc/alertmanager/template/*.tmpl
    route:
      receiver: blackhole
      group_by:
      - alertname
      - cluster
      - service
      - alert_target
      routes:
      - receiver: blackhole
        match_re:
          alertname: (QuietHours|QuietDays|live2env_running)
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 1h
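Since the replicas gossip via --cluster.peer=alertmanager-all:9094, the cluster status on each instance can also be checked through the v2 API; a sketch (the URL below goes through the ingress, so it may land on either replica):

# cluster status as seen by the instance this request lands on
curl -s http://alertmanager-internal.ops.example.com/api/v2/status | jq '.cluster'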
On the Prometheus side:
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    kcluster: k8s.example.internal
    monitor: exporter-metrics
alerting:
  alertmanagers:
  - follow_redirects: true
    scheme: http
    timeout: 10s
    api_version: v2
    static_configs:
    - targets:
      # nginx-ingress for alertmanager
      - alertmanager-internal.example.com.
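To rule out delivery errors on the Prometheus side, the notifier metrics can be checked on the Prometheus pod itself; a sketch, assuming the default port 9090:

# confirm notifications are being sent to alertmanager and not erroring or dropped
curl -s http://localhost:9090/metrics | grep -E 'prometheus_notifications_(sent|errors|dropped)_total'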
Thanks for the help,
Daniel