I have an existing CronJob for which I have set up alerts. The alerts work when the CronJob is scheduled in the 'monitoring' namespace. I am using the kube-prometheus stack, which is also deployed in the 'monitoring' namespace. When I schedule the CronJob in another namespace, named 'labs', the alert does not fire and I receive no email.
This is my configuration when I schedule the CronJob in the 'monitoring' namespace:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: failing-cronjob
  namespace: monitoring
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  schedule: "*/3 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: job-container
              image: busybox
              imagePullPolicy: IfNotPresent
              command: ["/bin/sh", "-c"]
              args:
                - exit 1
          restartPolicy: Never
          terminationGracePeriodSeconds: 10
      backoffLimit: 0
The PrometheusRule:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: failing-cron-rule
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: kube-cron
      rules:
        - record: job:kube_job_status_start_time:max
          expr: |
            label_replace(
              label_replace(
                max(
                  kube_job_status_start_time
                  * ON(job_name,namespace) GROUP_RIGHT()
                  kube_job_owner{owner_name!=""}
                )
                BY (job_name, owner_name, namespace)
                == ON(owner_name) GROUP_LEFT()
                max(
                  kube_job_status_start_time
                  * ON(job_name,namespace) GROUP_RIGHT()
                  kube_job_owner{owner_name!=""}
                )
                BY (owner_name),
                "job", "$1", "job_name", "(.+)"),
              "cronjob", "$1", "owner_name", "(.+)")
        - record: job:kube_job_status_failed:sum
          expr: |
            clamp_max(
              job:kube_job_status_start_time:max, 1)
            * ON(job) GROUP_LEFT()
            label_replace(
              label_replace(
                (kube_job_status_failed != 0),
                "job", "$1", "job_name", "(.+)"),
              "cronjob", "$1", "owner_name", "(.+)")
        - alert: CronJobStatusFailed
          expr: |
            job:kube_job_status_failed:sum
            * ON(cronjob, namespace) GROUP_LEFT()
            (kube_cronjob_spec_suspend == 0)
          labels:
            severity: critical
            job: cron-failure
            namespace: monitoring
          for: 1m
          annotations:
            summary: '{{ $labels.cronjob }} last run has failed {{ $value }} times.'
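For context, the two label_replace calls in the recording rules simply copy job_name into job and owner_name into cronjob, since the "(.+)" pattern with "$1" is a full copy of the source label. A minimal Python sketch of that rewrite, using hypothetical sample label values:

```python
import re

def label_replace(labels, dst, repl, src, regex):
    """Mimic PromQL label_replace: if labels[src] fully matches regex,
    set labels[dst] to repl with $1..$n expanded; otherwise return unchanged."""
    m = re.fullmatch(regex, labels.get(src, ""))
    if m is None:
        return labels
    labels = dict(labels)
    labels[dst] = re.sub(r"\$(\d+)", lambda g: m.group(int(g.group(1))), repl)
    return labels

# Hypothetical series labels as produced by kube_job_owner for a CronJob-owned Job.
series = {"job_name": "failing-cronjob-28412345",
          "owner_name": "failing-cronjob",
          "namespace": "labs"}
series = label_replace(series, "job", "$1", "job_name", "(.+)")
series = label_replace(series, "cronjob", "$1", "owner_name", "(.+)")
print(series["job"], series["cronjob"])
# failing-cronjob-28412345 failing-cronjob
```

Note the rewrite is namespace-agnostic: the namespace label is carried through untouched, so the same rules should in principle apply to Jobs in any namespace.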
And the AlertmanagerConfig:
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: cronjob-failure-receiver
  namespace: monitoring
  labels:
    release: prometheus
spec:
  route:
    groupBy: ['alertname']
    groupWait: 30s
    groupInterval: 2m
    repeatInterval: 2m
    receiver: cron-email
    routes:
      - matchers:
          - name: job
            value: cron-failure
        receiver: cron-email
  receivers:
    - name: cron-email
      emailConfigs:
        - to: 'user@mycompany.com'
          from: 'ops@mycompany.com'
          smarthost: 'mail2.mycompany.com:25'
          requireTLS: false
This configuration works: the alerts fire and the emails are delivered.
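For reference, the route above matches on the alert's labels only (job = cron-failure), so the namespace the CronJob runs in should not matter for routing. A toy sketch of that equality-matcher logic (hypothetical, not Alertmanager's actual code):

```python
def route_matches(alert_labels, matchers):
    """Toy version of an Alertmanager route's equality-matcher list:
    every matcher's name must be present with exactly that value."""
    return all(alert_labels.get(m["name"]) == m["value"] for m in matchers)

matchers = [{"name": "job", "value": "cron-failure"}]
print(route_matches({"job": "cron-failure", "namespace": "monitoring"}, matchers))  # True
print(route_matches({"job": "cron-failure", "namespace": "labs"}, matchers))        # True
print(route_matches({"job": "something-else"}, matchers))                           # False
```

In other words, if an alert with job=cron-failure reaches Alertmanager at all, this route should deliver it regardless of namespace.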
However, when I schedule the CronJob in the 'labs' namespace with the following configuration, no alert fires and no email is delivered:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: failing-cronjob
  namespace: labs
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  schedule: "*/3 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: job-container
              image: busybox
              imagePullPolicy: IfNotPresent
              command: ["/bin/sh", "-c"]
              args:
                - exit 1
          restartPolicy: Never
          terminationGracePeriodSeconds: 10
      backoffLimit: 0
and the resultant PrometheusRule:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: failing-cron-rule
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: kube-cron
      rules:
        - record: job:kube_job_status_start_time:max
          expr: |
            label_replace(
              label_replace(
                max(
                  kube_job_status_start_time
                  * ON(job_name,namespace) GROUP_RIGHT()
                  kube_job_owner{owner_name!=""}
                )
                BY (job_name, owner_name, namespace)
                == ON(owner_name) GROUP_LEFT()
                max(
                  kube_job_status_start_time
                  * ON(job_name,namespace) GROUP_RIGHT()
                  kube_job_owner{owner_name!=""}
                )
                BY (owner_name),
                "job", "$1", "job_name", "(.+)"),
              "cronjob", "$1", "owner_name", "(.+)")
        - record: job:kube_job_status_failed:sum
          expr: |
            clamp_max(
              job:kube_job_status_start_time:max, 1)
            * ON(job) GROUP_LEFT()
            label_replace(
              label_replace(
                (kube_job_status_failed != 0),
                "job", "$1", "job_name", "(.+)"),
              "cronjob", "$1", "owner_name", "(.+)")
        - alert: CronJobStatusFailed
          expr: |
            job:kube_job_status_failed:sum
            * ON(cronjob, namespace) GROUP_LEFT()
            (kube_cronjob_spec_suspend == 0)
          labels:
            severity: critical
            job: cron-failure
            namespace: labs
          for: 1m
          annotations:
            summary: '{{ $labels.cronjob }} last run has failed {{ $value }} times.'
So far it appears as if the PrometheusRule only matches Jobs running in the 'monitoring' namespace. I have checked both the Prometheus and Alertmanager logs, and there are no errors in either.
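Beyond the logs, one way to narrow this down is to query Prometheus directly and check whether the underlying kube-state-metrics series exist for the 'labs' namespace at all. A small helper for building the instant-query URL (hypothetical; assumes Prometheus has been port-forwarded to localhost:9090, e.g. with kubectl port-forward):

```python
import urllib.parse

# Hypothetical endpoint: assumes `kubectl port-forward` to the Prometheus pod/service.
PROM_URL = "http://localhost:9090"

def instant_query_url(promql, base=PROM_URL):
    """Build the URL for a Prometheus instant query (/api/v1/query)."""
    return base + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})

# If this query returns an empty result, kube-state-metrics is not exporting
# Job metrics for the 'labs' namespace at all, so no rule could ever match them.
url = instant_query_url('kube_job_status_failed{namespace="labs"}')
print(url)
# import urllib.request; print(urllib.request.urlopen(url).read())  # needs a live cluster
```

If the raw series do exist, the next thing to compare would be the output of the recording rules for each namespace.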
What am I missing?