Exclude samples with probe_success == 0 from avg_over_time( probe_duration_seconds )

rdiez · March 2, 2022, 1:20pm

Hi all:

I am trying to use the following alert rules:

Alert 1) “Blackbox probe failed”
expr: probe_success == 0

Alert 2) “Blackbox slow probe”
expr: avg_over_time( probe_duration_seconds[1m] ) > 1

The trouble is, every time a probe fails, I am getting the slow alert too. I would like to mask that slow alert out, but the avg_over_time() is introducing a delay which makes it difficult. Besides, I do not know enough about Prometheus yet.

I believe that, when a black box probe fails, its probe_duration_seconds equals the probe’s timeout.

I tried to modify the avg_over_time rule to leave out those periods where probe_success == 0:

probe_success and ( avg_over_time( probe_duration_seconds[30m] ) >= 3 )

But that fails to parse. I do not know enough to work this out with avg_over_time().

The idea is that only those values in probe_duration_seconds when probe_success == 1 should be taken into consideration.
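A rough, untested sketch of that idea, assuming a subquery is acceptable here: filter probe_duration_seconds down to the moments where probe_success is 1, then average the filtered result. Note that a subquery re-evaluates the inner expression at a fixed step (the global evaluation interval when no resolution is given after the colon), so it averages resampled values rather than the raw scraped samples. The 30m window and the 3-second threshold are simply taken from the attempt above.

```promql
# Inner expression: keep a duration value only where the matching
# probe_success series (same labels, metric name ignored) equals 1.
# The [30m:] subquery then feeds those filtered values to avg_over_time().
avg_over_time(
  ( probe_duration_seconds and probe_success == 1 )[30m:]
) >= 3
```

If every probe in the window failed, the inner expression yields no values at all, so this expression returns nothing and the slow alert would not fire.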

Alternatively, we could ignore entirely any 30m period in which a probe failed. After all, we want to measure performance only during periods where everything is working normally.
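This stricter variant might be sketched as follows, again untested and reusing the 30m window and 3-second threshold from the attempt above. It assumes probe_success only ever takes the values 0 and 1, so a minimum of 1 over the window means no probe failed in it.

```promql
# Left side: average probe duration over the last 30 minutes.
# Right side: true only if probe_success never dropped to 0 in that window.
# "and" keeps the left-hand result only for series that also appear on the right.
avg_over_time( probe_duration_seconds[30m] ) >= 3
  and
min_over_time( probe_success[30m] ) == 1
```

A side effect of this form is that a single failed probe suppresses the slow alert until the failure ages out of the 30m window.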

Can someone help write a proper alert rule to that effect?

Thanks in advance,
rdiez
