Need help writing alert rules for batch jobs

I have some batch jobs running in three different regions. Each of them will:

  • run every 30 minutes; run time varies between executions, depending on various factors
  • return 200 if there was no error during execution
  • return a non-200 code when something went wrong, e.g. 300 for a network error, 400 for an API error, etc.*

* I have the full error-code-to-description mapping. It’s text based and can easily be converted to JSON, CSV, etc. if needed.
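For example, as JSON the mapping would look something like this (descriptions for the codes mentioned above):

{
  "200": "success",
  "300": "network error",
  "400": "API error"
}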

These metrics are pushed to a PushGateway instance (a single instance shared by all regions). Every 30 seconds, my Prometheus instance scrapes all of them.
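For completeness, the Prometheus side is a plain PushGateway target, roughly like this (simplified; the address is made up; honor_labels keeps the job and instance labels pushed by the batch jobs):

scrape_configs:
  - job_name: pushgateway
    honor_labels: true   # keep the labels pushed by the batch jobs
    scrape_interval: 30s
    static_configs:
      - targets: ["pushgateway.example.com:9091"]   # hypothetical address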

# Example metrics for batch job A
batch_job_a_status{run_type="fast",instance="",job="batch_job",region="A"} 200
batch_job_a_status{run_type="fast",instance="",job="batch_job",region="B"} 300
batch_job_a_status{run_type="fast",instance="",job="batch_job",region="C"} 400
batch_job_a_status{run_type="full",instance="",job="batch_job",region="A"} 200
batch_job_a_status{run_type="full",instance="",job="batch_job",region="B"} 100
batch_job_a_status{run_type="full",instance="",job="batch_job",region="C"} 200

# Example metrics for batch job B
batch_job_b_status{run_type="fast",instance="",job="batch_job",region="A"} 300
batch_job_b_status{run_type="fast",instance="",job="batch_job",region="B"} 100
batch_job_b_status{run_type="fast",instance="",job="batch_job",region="C"} 200
batch_job_b_status{run_type="full",instance="",job="batch_job",region="A"} 400
batch_job_b_status{run_type="full",instance="",job="batch_job",region="B"} 200
batch_job_b_status{run_type="full",instance="",job="batch_job",region="C"} 300

Now the problem: since my batch jobs are not long-running processes like a web server, where metrics are usually available at a /metrics endpoint, all the metrics above are only available in Prometheus for short periods of time. There are a lot of “gaps” or “blanks” in my Prometheus data, which blocks me from achieving what I want.

What I want:

  • To be alerted when there are more than two failed runs in the past hour, grouped by run_type and region (a rough sketch of what I have in mind follows this list)
  • To show (in the alert annotations) what the recent errors were, based on the metric values above
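To make the first point concrete, here is roughly the shape I have in mind, assuming I changed the exporter to push a failure counter instead (batch_job_failures_total is a hypothetical name, and I don’t know whether increase() behaves sensibly across the gaps):

groups:
  - name: batch-job-alerts
    rules:
      - alert: BatchJobRepeatedFailures
        # more than two failed runs in the past hour, per run_type and region
        expr: increase(batch_job_failures_total{job="batch_job"}[1h]) > 2
        labels:
          severity: warning
        annotations:
          summary: "Batch job failing in region {{ $labels.region }} (run_type {{ $labels.run_type }})"
          description: "{{ $value }} failed runs in the past hour."

Since increase() keeps the run_type and region labels, this should fire per combination, but it doesn’t cover the second point (surfacing the error descriptions), which is part of what I’m asking.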

Please tell me if this is possible and what the alert rules should look like. If it’s not possible with the current metric type and labels, please tell me the correct metric type I need to use and what labels should be added. Since I own and wrote the “exporter” part, I can change it.

FWIW, I’m using a Golang function to push metrics from each batch job to the PushGateway instance.
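A simplified sketch of that function, assuming the standard client_golang push package (the gateway address and names here are illustrative):

package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

// pushStatus reports the status code of one batch-job run to the PushGateway.
func pushStatus(gateway, metric, runType, region string, status float64) error {
	g := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: metric, // e.g. "batch_job_a_status"
		Help: "Status code of the most recent batch job run.",
	})
	g.Set(status)

	// run_type and region are grouping keys so that pushes from different
	// regions/run types don't overwrite each other on the shared gateway.
	return push.New(gateway, "batch_job").
		Grouping("run_type", runType).
		Grouping("region", region).
		Collector(g).
		Push()
}

func main() {
	if err := pushStatus("http://pushgateway.example.com:9091",
		"batch_job_a_status", "fast", "A", 200); err != nil {
		log.Fatal("push failed: ", err)
	}
}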

This is what I meant by “gaps” or “blanks”: a Prometheus graph of these metrics shows short bursts of samples separated by long empty stretches.