I have a simple alert rule for certificate expiration, which goes like this (the metrics come from Telegraf):
groups:
  - name: ssl-alerts
    rules:
      - alert: cert_expiry_critical
        expr: x509_cert_expiry < 604800   # 604800 s = 7 days
        for: 60m
        annotations:
          summary: 'SSL/TLS certificate expires in less than a week'
          description: 'SSL/TLS certificate for {{ $labels.host }} expires in {{ .Value | humanizeDuration }}'
        labels:
          severity: "critical"
          env: test
Now I've realized that the alerts are too frequent, so I changed the for clause to 24h to reduce notification spam. The changed alert looks like this:
groups:
  - name: ssl-alerts
    rules:
      - alert: cert_expiry_critical
        expr: x509_cert_expiry < 604800
        for: 24h
        annotations:
          summary: 'SSL/TLS certificate expires in less than a week'
          description: 'SSL/TLS certificate for {{ $labels.host }} expires in {{ .Value | humanizeDuration }}'
        labels:
          severity: "critical"
          env: test
but I still receive alerts every 60m or so. I've already restarted the Prometheus and Alertmanager daemons, but to no avail. Can anyone explain why that is and how to properly delay repeated alerts in this case?
The for: in your alert rule just determines how long a condition has to be true before an alert starts firing. So it would delay the initial alert, but wouldn’t have anything to do with repeated notifications for alerts that keep firing.
Prometheus just generates alerts. It is up to Alertmanager to handle notifications, including decisions about whether and when to send additional notifications for alerts that have been firing for a while. So you will need to look at your Alertmanager config.
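The knobs that control this live on the Alertmanager route. Here is a minimal sketch, assuming a single receiver called default; the values are illustrative:

route:
  receiver: default
  group_wait: 30s        # how long to wait before sending the first notification for a new group
  group_interval: 5m     # how long to wait before notifying about new alerts added to an existing group
  repeat_interval: 4h    # how long to wait before re-sending a notification for an alert that is still firing
receivers:
  - name: default

It is repeat_interval that determines how often you get re-notified about an alert that keeps firing.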
Great, thanks for the explanation stuart - much clearer now.
It would be perfect if I could set repeat_interval per individual alert, but it seems that's not supported.
So, on the Alertmanager side, should I define an inhibit_rule for the alert above to make it fire less frequently? Is that a best practice? I would like to delay only this single alert without affecting others. Please let me know if there is a better approach.
Inhibition rules are something different - they mute notifications for some alerts entirely while another alert is firing. An example use case: if there is, say, a central database failure, having hundreds of alerts from each application would just obscure the actual issue and slow down the response, so you inhibit all the app alerts and just get the database alert.
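For completeness, such an inhibition rule would look roughly like this (the alert and label names are made up for illustration; the matchers syntax needs Alertmanager 0.22+, older versions use source_match/target_match instead):

inhibit_rules:
  - source_matchers:
      - alertname="DatabaseDown"   # while this alert is firing...
    target_matchers:
      - severity="warning"         # ...suppress notifications for these alerts
    equal: ['env']                 # but only when both share the same env label value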
For your use case you want an additional route with a custom repeat_interval value. You can then send alerts to that route via something like a label - set a "don't repeat" label in the alert rule and have the route match on that value.
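Roughly like this, assuming you add a label such as notify_interval: daily under labels: in the alert rule (the label name and the values here are just examples):

route:
  receiver: default
  repeat_interval: 4h                # default for everything else
  routes:
    - matchers:
        - notify_interval="daily"    # alerts carrying this label take this sub-route
      repeat_interval: 24h           # re-notify at most once a day for these

receivers:
  - name: default

The sub-route inherits the receiver from the parent route, so only the matcher and the repeat_interval need to be set on it.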
Thanks @stuart for clarifying that once again, but I couldn't make it work using an additional route; alerts are still being fired after 1h + group_interval. Here are my configs (values decreased from 24h to 3h for testing):