I am trying to use the following alert rules:
Alert 1) “Blackbox probe failed”
expr: probe_success == 0
Alert 2) “Blackbox slow probe”
expr: avg_over_time(probe_duration_seconds[1m]) > 1
The trouble is that every time a probe fails, I get the slow alert too. I would like to suppress the slow alert in that case, but avg_over_time() introduces a delay that makes this difficult, and I do not know enough about Prometheus yet.
I believe that when a Blackbox probe fails, its probe_duration_seconds equals the probe’s timeout, which would explain why a failure also pushes the average over the slow threshold.
I tried to modify the avg_over_time rule to leave out those periods where probe_success == 0:
probe_success and ( avg_over_time( probe_duration_seconds[30m] ) >= 3 )
But that fails to parse. I do not know enough to work this out with avg_over_time().
The idea is that only those values in probe_duration_seconds when probe_success == 1 should be taken into consideration.
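A minimal, untested sketch of that idea, assuming Prometheus 2.7 or later (needed for subqueries) and assuming both metrics carry identical labels for each target. The window and threshold are simply copied from my attempt above:

```yaml
# Sketch only: average the duration over successful probes, skipping failed ones.
# "and" keeps a duration sample only where probe_success == 1 at the same step;
# the [30m:] subquery re-evaluates that filtered vector across the last 30 minutes
# at the default evaluation interval.
expr: avg_over_time((probe_duration_seconds and probe_success == 1)[30m:]) >= 3
```

If every probe in the window failed, the inner expression returns nothing, so this rule would stay silent and leave the failure to Alert 1.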
Alternatively, we could completely ignore any 30m period in which a probe failed. After all, we want to measure performance only over periods where everything is working normally.
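A rough sketch of this second variant, again untested and assuming probe_success and probe_duration_seconds share the same label set per target:

```yaml
# Sketch only: fire on a slow average only when no probe failed in the same window.
# min_over_time(probe_success[30m]) == 1 holds only if every sample in the
# last 30 minutes was a success, so any failure masks the slow alert.
expr: >
  avg_over_time(probe_duration_seconds[30m]) >= 3
  and
  min_over_time(probe_success[30m]) == 1
```

The side effect is that the slow alert stays masked until a full 30 minutes have passed without a failure, which may or may not be acceptable.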
Can someone help write a proper alert rule to that effect?
Thanks in advance,