Prometheus alert rule doesn't honor changed FOR clause

Hi all,

I have a simple alert rule for my certificate expiration which goes like this (metrics come from telegraf):

groups:
  - name: ssl-alerts
    rules:
      - alert: cert_expiry_critical
        expr: x509_cert_expiry < 604800
        for: 60m
        annotations:
          summary: 'SSL/TLS certificate expires in less than a week'
          description: 'SSL/TLS certificate for {{ $labels.host }} expires in {{ .Value | humanizeDuration }}'
        labels:
          severity: "critical"
          env: test

I have since realized that the alerts are too frequent, so I changed the FOR clause to 24h to cut down on notification spam. The changed alert looks like this:

groups:
  - name: ssl-alerts
    rules:
      - alert: cert_expiry_critical
        expr: x509_cert_expiry < 604800
        for: 24h
        annotations:
          summary: 'SSL/TLS certificate expires in less than a week'
          description: 'SSL/TLS certificate for {{ $labels.host }} expires in {{ .Value | humanizeDuration }}'
        labels:
          severity: "critical"
          env: test

but I still receive alerts every 60m or so. I have already restarted the prometheus and alertmanager daemons, but to no avail. Can anyone please explain why that is, and how to properly increase the delay between alerts in this case?

Thanks

The for: in your alert rule just determines how long a condition has to be true before an alert starts firing. So it would delay the initial alert, but wouldn’t have anything to do with repeated notifications for alerts that keep firing.

Prometheus just generates alerts. It is up to Alertmanager to handle any notifications, including decisions around if or when to send additional notifications for alerts that have been firing for a while. So you will need to look at your Alertmanager config.
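
As a rough illustration (a sketch, not your actual config; the receiver name is a placeholder), these are the route settings in alertmanager.yml that control when notifications go out:

```yaml
# alertmanager.yml (sketch; 'default-receiver' is a hypothetical name)
route:
  receiver: 'default-receiver'
  group_by: ['alertname']
  group_wait: 30s       # wait before sending the first notification for a new group
  group_interval: 5m    # wait before notifying about new alerts added to an existing group
  repeat_interval: 24h  # wait before re-sending a notification for alerts that are still firing

receivers:
  - name: 'default-receiver'   # notification integrations (Slack, email, ...) would go here
```

The for: value in the Prometheus rule only affects when the alert first reaches Alertmanager; repeat_interval is the setting that governs reminders after that.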

Great, thanks for the explanation stuart - much clearer now.

It would be perfect if I could set repeat_interval per individual alert, but it seems like that's not supported.

So, on the Alertmanager side, should I define an inhibit_rule for the alert above to make it fire less frequently? Is that best practice? I would like to delay only this one alert without affecting the others. Please let me know if there is a better approach.

Thanks again!

Inhibition rules are something different: they suppress notifications for an alert entirely while another alert is firing. An example use case would be a central database failure, where having hundreds of alerts from each application would just obscure the actual issue and slow down the response, so you inhibit all the app alerts and just get the database alert.
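
For reference, an inhibition rule for that database scenario might look roughly like the sketch below. The alert name and the tier and cluster labels are made up for illustration, so adjust them to whatever your alerts actually carry:

```yaml
# alertmanager.yml (sketch; all label names and values here are hypothetical)
inhibit_rules:
  - source_match:
      alertname: 'DatabaseDown'   # while this alert is firing ...
    target_match:
      tier: 'application'         # ... mute notifications for alerts carrying this label
    equal: ['cluster']            # but only when both alerts share the same cluster value
```

This mutes the target alerts completely while the source alert is active, which is why it is not the right tool for simply spacing out reminders.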

For your use case you want an additional route with a custom repeat_interval value. You can then send things to that route via something like a label - set a “don’t repeat” label in the alert and set the route matching for that value.
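
Something along these lines, as a minimal sketch. The slow_repeat label and the receiver name are made up for illustration; the first part goes in the alert rule file, the second in alertmanager.yml:

```yaml
# --- alert rule file: add a label that marks the alert for slow repeats ---
groups:
  - name: ssl-alerts
    rules:
      - alert: cert_expiry_critical
        expr: x509_cert_expiry < 604800
        for: 60m
        labels:
          severity: critical
          slow_repeat: 'yes'   # hypothetical label; quoted so YAML keeps it a string
---
# --- alertmanager.yml: child route that only overrides repeat_interval ---
route:
  receiver: 'default-receiver'   # hypothetical receiver name
  group_by: ['alertname']
  repeat_interval: 1h            # default for everything else
  routes:
    - match:
        slow_repeat: 'yes'       # exact string comparison against the alert's label
      receiver: 'default-receiver'
      repeat_interval: 24h       # reminders for matching alerts only once a day

receivers:
  - name: 'default-receiver'
```

Note that match compares label values as exact strings, so a value like a|b only matches the literal text a|b; use match_re if you need a regular expression. Alerts that match no child route are handled by the top-level route with its own repeat_interval.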

Thanks @stuart for clarifying that once again, but I couldn’t make it work using an additional route: alerts are still being sent after 1h + group_interval. Here are my configs (I decreased the values from 24h to 3h for testing):

alertmanager.yml

route:
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  group_by: ['alertname']
  receiver: 'slack-alerts'

  routes:
    # additional route for delaying alerts
    - match:
        env: test
        severity: warning|critical
        postpone: true 
        delay: 3h
      receiver: 'slack-alerts'
      repeat_interval: 3h
      continue: false

    - match:
        env: test
        severity: warning|critical
      receiver: 'slack-alerts'

receivers:
...

And the related prometheus.yml:

groups:
  - name: ssl-alerts
    rules:
      - alert: cert_expiry_critical
        expr: x509_cert_expiry < 604800
        for: 60m
        annotations:
          summary: 'SSL/TLS certificate expires in less than a week'
          description: 'SSL/TLS certificate for {{ $labels.san }} ({{ $labels.source }}) expires in {{ .Value | humanizeDuration }}'
        labels:
          severity: 'critical'
          postpone: true  # <- added
          delay: 3h      # <- added
          env: test 

Am I doing something wrong here?

Thanks!

Can someone help with this? Pretty please?