I have a simple alert rule for certificate expiration, which goes like this (the metrics come from Telegraf):
groups:
  - name: ssl-alerts
    rules:
      - alert: cert_expiry_critical
        expr: x509_cert_expiry < 604800   # 604800 s = 7 days
        for: 60m
        annotations:
          summary: 'SSL/TLS certificate expires in less than a week'
          description: 'SSL/TLS certificate for {{ $labels.host }} expires in {{ .Value | humanizeDuration }}'
        labels:
          severity: "critical"
          env: test
Now I've realized that the alerts are too frequent, so I changed the for clause to 24h to reduce notification spam. The changed alert looks like this:
groups:
  - name: ssl-alerts
    rules:
      - alert: cert_expiry_critical
        expr: x509_cert_expiry < 604800
        for: 24h
        annotations:
          summary: 'SSL/TLS certificate expires in less than a week'
          description: 'SSL/TLS certificate for {{ $labels.host }} expires in {{ .Value | humanizeDuration }}'
        labels:
          severity: "critical"
          env: test
but I still receive alerts every 60m or so. I've already restarted the Prometheus and Alertmanager daemons, but to no avail. Can anyone explain why that is and how to properly delay repeated alerts in this case?
The for: in your alert rule just determines how long a condition has to be true before an alert starts firing. So it would delay the initial alert, but wouldn’t have anything to do with repeated notifications for alerts that keep firing.
Prometheus just generates alerts. It is up to Alertmanager to handle notifications, including decisions about whether and when to send additional notifications for alerts that have been firing for a while. So you will need to look at your Alertmanager config.
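The knobs that control this live on the Alertmanager route. Here is a minimal sketch, assuming a single receiver called default; the values are illustrative:

route:
  receiver: default
  group_wait: 30s        # how long to wait before sending the first notification for a new group
  group_interval: 5m     # how long to wait before notifying about new alerts added to an existing group
  repeat_interval: 4h    # how long to wait before re-sending a notification for an alert that is still firing
receivers:
  - name: default

It is repeat_interval that determines how often you get re-notified about an alert that keeps firing.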
Great, thanks for the explanation stuart - much clearer now.
It would be perfect if I could set repeat_interval per individual alert, but it seems that's not supported.
So, on the Alertmanager side, should I define an inhibit_rule for the alert above to make it fire less frequently? Is that a best practice? I would like to delay only this single alert without affecting others. Please let me know if there is a better approach.
Inhibition rules are something different - they mute notifications for some alerts entirely while another alert is firing. An example use case: if there is, say, a central database failure, having hundreds of alerts from each application would just obscure the actual issue and slow down the response, so you inhibit all the app alerts and just get the database alert.
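For completeness, such an inhibition rule would look roughly like this (the alert and label names are made up for illustration; the matchers syntax needs Alertmanager 0.22+, older versions use source_match/target_match instead):

inhibit_rules:
  - source_matchers:
      - alertname="DatabaseDown"   # while this alert is firing...
    target_matchers:
      - severity="warning"         # ...suppress notifications for these alerts
    equal: ['env']                 # but only when both share the same env label value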
For your use case you want an additional route with a custom repeat_interval value. You can then send alerts to that route via something like a label - set a "don't repeat" label in the alert rule and have the route match on that value.
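Roughly like this, assuming you add a label such as notify_interval: daily under labels: in the alert rule (the label name and the values here are just examples):

route:
  receiver: default
  repeat_interval: 4h                # default for everything else
  routes:
    - matchers:
        - notify_interval="daily"    # alerts carrying this label take this sub-route
      repeat_interval: 24h           # re-notify at most once a day for these

receivers:
  - name: default

The sub-route inherits the receiver from the parent route, so only the matcher and the repeat_interval need to be set on it.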
Thanks @stuart for clarifying that once again, but I couldn't make it work using an additional route; alerts are still being fired after 1h + group_interval. Here are my configs (values decreased from 24h to 3h for testing):