I have some batch jobs running in three different regions. Each of them will:
- run every 30 minutes; run time varies between executions depending on various factors
- return 200 if there was no error during execution
- return a non-200 code when something went wrong, e.g. 300 for a network error, 400 for an API error, etc.*
* I have the full error-code-to-description mapping - it’s text based and can easily be converted to JSON, CSV, etc. if needed
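To illustrate the conversion I mean, here is a hypothetical sketch - I'm assuming one tab-separated "code description" pair per line, which may not match the real file format:

```python
import json

# Assumed format: one "code<TAB>description" pair per line.
def mapping_to_json(text):
    mapping = {}
    for line in text.strip().splitlines():
        code, desc = line.split("\t", 1)
        mapping[code.strip()] = desc.strip()
    return json.dumps(mapping)

sample = "200\tno error\n300\tnetwork error\n400\tAPI error"
print(mapping_to_json(sample))
# → {"200": "no error", "300": "network error", "400": "API error"}
```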
These metrics are being pushed to a PushGateway instance (a shared instance for all regions). Every 30 seconds, my Prometheus instance will scrape all these metrics.
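The push step in my exporter boils down to something like the sketch below, which talks to the PushGateway HTTP API directly (`PUT /metrics/job/<job>/<label>/<value>`); the gateway URL and helper names here are made up for illustration:

```python
import urllib.request

def build_push(gateway, job, region, run_type, metric, status):
    # region and run_type go in the URL as grouping labels, so pushes
    # from different regions don't overwrite each other.
    url = f"{gateway}/metrics/job/{job}/region/{region}/run_type/{run_type}"
    body = f"{metric} {status}\n"  # plain text exposition format
    return url, body

def push_status(gateway, job, region, run_type, metric, status):
    url, body = build_push(gateway, job, region, run_type, metric, status)
    req = urllib.request.Request(url, data=body.encode(), method="PUT")
    urllib.request.urlopen(req)  # requires a reachable PushGateway

url, body = build_push("http://pushgw:9091", "batch_job", "A", "fast",
                       "batch_job_a_status", 200)
print(url)
print(body, end="")
```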
# Example metrics for batch job A
batch_job_a_status{run_type="fast",instance="",job="batch_job",region="A"} 200
batch_job_a_status{run_type="fast",instance="",job="batch_job",region="B"} 300
batch_job_a_status{run_type="fast",instance="",job="batch_job",region="C"} 400
batch_job_a_status{run_type="full",instance="",job="batch_job",region="A"} 200
batch_job_a_status{run_type="full",instance="",job="batch_job",region="B"} 100
batch_job_a_status{run_type="full",instance="",job="batch_job",region="C"} 200
# Example metrics for batch job B
batch_job_b_status{run_type="fast",instance="",job="batch_job",region="A"} 300
batch_job_b_status{run_type="fast",instance="",job="batch_job",region="B"} 100
batch_job_b_status{run_type="fast",instance="",job="batch_job",region="C"} 200
batch_job_b_status{run_type="full",instance="",job="batch_job",region="A"} 400
batch_job_b_status{run_type="full",instance="",job="batch_job",region="B"} 200
batch_job_b_status{run_type="full",instance="",job="batch_job",region="C"} 300
Now the problem is: since my batch jobs are not long-running processes like a web server, where metrics are usually available at a /metrics endpoint, all of the metrics above are only available for a short period of time in my Prometheus. There are a lot of “gaps” or “blanks” in my Prometheus data, which blocks me from achieving what I want.
What I want:
- To be alerted when there are more than two failed runs in the past hour (grouped by run_type and region)
- To show (in the alert annotations area) what the recent errors are, based on the metric return values above
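For reference, the closest I could get on my own is a rule like the rough sketch below - but `count_over_time` counts scraped samples rather than runs, so with a 30-second scrape interval one failed run is counted many times, and the gaps make it worse:

```yaml
groups:
  - name: batch-job-alerts
    rules:
      - alert: BatchJobATooManyFailures
        # Rough sketch: counts non-200 samples over the last hour, which
        # is not literally "failed runs" given the 30s scrape interval.
        expr: count_over_time((batch_job_a_status != 200)[1h:]) > 2
        labels:
          severity: warning
        annotations:
          summary: >-
            batch_job_a failing in region {{ $labels.region }}
            (run_type {{ $labels.run_type }})
          description: "Non-200 samples in the past hour: {{ $value }}"
```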
Please tell me if this is possible and what the alert rules should look like. If it’s not possible with the current metric type and labels, please tell me the correct metric type to use and which labels should be added. Since I own and wrote the “exporter” part, I can change it.