I have some batch jobs running in three different regions. Each of them will:
- run every 30 minutes; run time varies between executions depending on various factors
- return 200 if there was no error during execution
- return a non-200 code when something went wrong, e.g. 300 for a network error, 400 for an API error, etc.*
* I have the full error-code-to-description mapping - it’s text based and can easily be converted to JSON, CSV, etc. if needed
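To illustrate the conversion I mean, here is a hypothetical sketch - I'm assuming one tab-separated "code description" pair per line, which may not match the real file format:

```python
import json

# Assumed format: one "code<TAB>description" pair per line.
def mapping_to_json(text):
    mapping = {}
    for line in text.strip().splitlines():
        code, desc = line.split("\t", 1)
        mapping[code.strip()] = desc.strip()
    return json.dumps(mapping)

sample = "200\tno error\n300\tnetwork error\n400\tAPI error"
print(mapping_to_json(sample))
# → {"200": "no error", "300": "network error", "400": "API error"}
```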
These metrics are being pushed to a PushGateway instance (a shared instance for all regions). Every 30 seconds, my Prometheus instance will scrape all these metrics.
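The push step in my exporter boils down to something like the sketch below, which talks to the PushGateway HTTP API directly (`PUT /metrics/job/<job>/<label>/<value>`); the gateway URL and helper names here are made up for illustration:

```python
import urllib.request

def build_push(gateway, job, region, run_type, metric, status):
    # region and run_type go in the URL as grouping labels, so pushes
    # from different regions don't overwrite each other.
    url = f"{gateway}/metrics/job/{job}/region/{region}/run_type/{run_type}"
    body = f"{metric} {status}\n"  # plain text exposition format
    return url, body

def push_status(gateway, job, region, run_type, metric, status):
    url, body = build_push(gateway, job, region, run_type, metric, status)
    req = urllib.request.Request(url, data=body.encode(), method="PUT")
    urllib.request.urlopen(req)  # requires a reachable PushGateway

url, body = build_push("http://pushgw:9091", "batch_job", "A", "fast",
                       "batch_job_a_status", 200)
print(url)
print(body, end="")
```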
# Example metrics for batch job A
batch_job_a_status{run_type="fast",instance="",job="batch_job",region="A"} 200
batch_job_a_status{run_type="fast",instance="",job="batch_job",region="B"} 300
batch_job_a_status{run_type="fast",instance="",job="batch_job",region="C"} 400
batch_job_a_status{run_type="full",instance="",job="batch_job",region="A"} 200
batch_job_a_status{run_type="full",instance="",job="batch_job",region="B"} 100
batch_job_a_status{run_type="full",instance="",job="batch_job",region="C"} 200
# Example metrics for batch job B
batch_job_b_status{run_type="fast",instance="",job="batch_job",region="A"} 300
batch_job_b_status{run_type="fast",instance="",job="batch_job",region="B"} 100
batch_job_b_status{run_type="fast",instance="",job="batch_job",region="C"} 200
batch_job_b_status{run_type="full",instance="",job="batch_job",region="A"} 400
batch_job_b_status{run_type="full",instance="",job="batch_job",region="B"} 200
batch_job_b_status{run_type="full",instance="",job="batch_job",region="C"} 300
Now the problem is: since my batch jobs are not long-running processes like a web server, where metrics are usually available at a /metrics endpoint, all of the metrics above are only available for a short period of time in my Prometheus. There are a lot of “gaps” or “blanks” in my Prometheus data, which blocks me from achieving what I want.
What I want:
- To be alerted when there are more than two failed runs in the past hour (grouped by run_type and region)
- To show (in the alert annotations area) what the recent errors are, based on the metric return values above
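For reference, the closest I could get on my own is a rule like the rough sketch below - but `count_over_time` counts scraped samples rather than runs, so with a 30-second scrape interval one failed run is counted many times, and the gaps make it worse:

```yaml
groups:
  - name: batch-job-alerts
    rules:
      - alert: BatchJobATooManyFailures
        # Rough sketch: counts non-200 samples over the last hour, which
        # is not literally "failed runs" given the 30s scrape interval.
        expr: count_over_time((batch_job_a_status != 200)[1h:]) > 2
        labels:
          severity: warning
        annotations:
          summary: >-
            batch_job_a failing in region {{ $labels.region }}
            (run_type {{ $labels.run_type }})
          description: "Non-200 samples in the past hour: {{ $value }}"
```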
Please tell me if this is possible and what the alert rules should look like. If it’s not possible with the current metric type and labels, please tell me the correct metric type to use and which labels should be added. Since I own and wrote the “exporter” part, I can change it.