I have set up a CronJob whose container exits with a non-zero code. I have a PrometheusRule and an AlertmanagerConfig set up against this CronJob, but the alert is not firing as expected. The alerting is based on this example.
This is the CronJob definition:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: exitjob
  namespace: monitoring
spec:
  schedule: "*/4 * * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - command:
                - sh
                - exit
                - "1"
              image: alpine
              imagePullPolicy: Always
              name: main
          restartPolicy: Never
          terminationGracePeriodSeconds: 30
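For reference, the failed runs themselves can be confirmed with standard kubectl commands (the generated Job and Pod names carry a schedule-based suffix, so the exact names differ per run):

kubectl get jobs,pods -n monitoring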
And this is the PrometheusRule:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: failing-job-alert
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: kube-cron
      rules:
        - record: job:kube_job_status_start_time:max
          expr: |
            label_replace(
              label_replace(
                max(
                  kube_job_status_start_time
                  * ON(job_name, namespace) GROUP_RIGHT()
                  kube_job_owner{owner_name!=""}
                )
                BY (job_name, owner_name, namespace)
                == ON(owner_name) GROUP_LEFT()
                max(
                  kube_job_status_start_time
                  * ON(job_name, namespace) GROUP_RIGHT()
                  kube_job_owner{owner_name!=""}
                )
                BY (owner_name),
              "job", "$1", "job_name", "(.+)"),
            "cronjob", "$1", "owner_name", "(.+)")
          labels:
            severity: critical
            job: cron-failure
            namespace: monitoring
        - record: job:kube_job_status_failed:sum
          expr: |
            clamp_max(
              job:kube_job_status_start_time:max, 1)
            * ON(job, namespace) GROUP_LEFT()
            label_replace(
              label_replace(
                (kube_job_status_failed != 0),
              "job", "$1", "job_name", "(.+)"),
            "cronjob", "$1", "owner_name", "(.+)")
          labels:
            severity: critical
            job: cron-failure
            namespace: monitoring
        - alert: CronJobStatusFailed
          expr: |
            job_cronjob:kube_job_status_failed:sum
            * ON(job, namespace) GROUP_RIGHT()
            kube_cronjob_labels
            > 0
          labels:
            severity: critical
            job: cron-failure
            namespace: monitoring
          for: 1m
          annotations:
            summary: '{{ $labels.cronjob }} last run has failed {{ $value }} times.'
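To narrow down where the chain breaks, the pieces can be checked one by one in the Prometheus expression browser; these are just the raw metric and the record names taken from the rules above (the last one being the series the alert expression expects):

kube_job_status_failed{namespace="monitoring"} != 0
job:kube_job_status_start_time:max
job:kube_job_status_failed:sum
job_cronjob:kube_job_status_failed:sum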
And the associated AlertmanagerConfig:
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: cronjob-failure-receiver
  namespace: monitoring
  labels:
    release: prometheus
spec:
  route:
    groupBy: ['alertname']
    groupWait: 30s
    groupInterval: 2m
    repeatInterval: 2m
    receiver: cron-email
    routes:
      - matchers:
          - name: job
            value: cron-failure
        receiver: cron-email
  receivers:
    - name: cron-email
      emailConfigs:
        - to: 'etshuma@mycompany.com'
          from: 'devops@mycompany.com'
          smarthost: 'mail2.mycompany.com:25'
          requireTLS: false
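To confirm the AlertmanagerConfig is actually merged into the running configuration, the rendered file referenced in the Alertmanager logs below can be inspected inside the pod (the pod name here follows the usual kube-prometheus-stack naming and is an assumption on my side):

kubectl exec -n monitoring alertmanager-prometheus-kube-prometheus-alertmanager-0 -- \
  cat /etc/alertmanager/config_out/alertmanager.env.yaml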
I have checked the Alertmanager logs, but there is no error at all; they only show configuration reload events:
ts=2023-07-24T13:30:44.192Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2023-07-24T13:30:44.218Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2023-07-24T15:27:44.721Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2023-07-24T15:27:44.730Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2023-07-24T16:57:45.192Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
ts=2023-07-24T16:57:45.229Z caller=coordinator.go:126 level=info component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config_out/alertmanager.env.yaml
And in the Prometheus UI the alert shows as inactive.
The Prometheus logs, however, show an error:
ts=2023-07-25T07:16:15.741Z caller=manager.go:663 level=warn component="rule manager"
  file=/etc/prometheus/rules/prometheus-prometheus-kube-prometheus-prometheus-rulefiles-0/monitoring-failing-job-alert-ce5cf237-c0fd-40a1-b337-91896471802a.yaml
  group=kube-cron name=job:kube_job_status_start_time:max index=0
  msg="Evaluating rule failed"
  rule="record: job:kube_job_status_start_time:max\nexpr: label_replace(label_replace(max by (job_name, owner_name, namespace) (kube_job_status_start_time\n * on (job_name, namespace) group_right () kube_job_owner{owner_name!=\"\"}) == on\n (owner_name) group_left () max by (owner_name) (kube_job_status_start_time * on\n (job_name, namespace) group_right () kube_job_owner{owner_name!=\"\"}), \"job\", \"$1\",\n \"job_name\", \"(.+)\"), \"cronjob\", \"$1\", \"owner_name\", \"(.+)\")\nlabels:\n job: cron-failure\n namespace: monitoring\n severity: critical\n"
  err="vector contains metrics with the same labelset after applying rule labels"
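For completeness, this is the failing recording rule's expression as Prometheus normalized it, without the static labels block; evaluating it on its own in the expression browser should show which series end up sharing a labelset once the extra job/namespace labels from the rule are applied:

label_replace(
  label_replace(
    max by (job_name, owner_name, namespace) (
      kube_job_status_start_time
      * on (job_name, namespace) group_right () kube_job_owner{owner_name!=""}
    )
    == on (owner_name) group_left ()
    max by (owner_name) (
      kube_job_status_start_time
      * on (job_name, namespace) group_right () kube_job_owner{owner_name!=""}
    ),
    "job", "$1", "job_name", "(.+)"),
  "cronjob", "$1", "owner_name", "(.+)")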
What am I missing?