I have an existing CronJob for which I have set up alerts. The alerts work when the CronJob is scheduled in the 'monitoring' namespace. I am using the kube-prometheus stack, which is also deployed in the 'monitoring' namespace. When I schedule the CronJob in another namespace, named 'labs', the alert does not fire and I receive no email.
This is my configuration when I schedule the CronJob in the 'monitoring' namespace:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: failing-cronjob
  namespace: monitoring
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  schedule: "*/3 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: job-container
              image: busybox
              imagePullPolicy: IfNotPresent
              command: ["/bin/sh", "-c"]
              args:
                - exit 1
          restartPolicy: Never
          terminationGracePeriodSeconds: 10
      backoffLimit: 0
The PrometheusRule:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: failing-cron-rule
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: kube-cron
      rules:
        - record: job:kube_job_status_start_time:max
          expr: |
            label_replace(
              label_replace(
                max(
                  kube_job_status_start_time
                  * ON(job_name,namespace) GROUP_RIGHT()
                  kube_job_owner{owner_name!=""}
                )
                BY (job_name, owner_name, namespace)
                == ON(owner_name) GROUP_LEFT()
                max(
                  kube_job_status_start_time
                  * ON(job_name,namespace) GROUP_RIGHT()
                  kube_job_owner{owner_name!=""}
                )
                BY (owner_name),
                "job", "$1", "job_name", "(.+)"),
              "cronjob", "$1", "owner_name", "(.+)")
        - record: job:kube_job_status_failed:sum
          expr: |
            clamp_max(
              job:kube_job_status_start_time:max, 1)
            * ON(job) GROUP_LEFT()
            label_replace(
              label_replace(
                (kube_job_status_failed != 0),
                "job", "$1", "job_name", "(.+)"),
              "cronjob", "$1", "owner_name", "(.+)")
        - alert: CronJobStatusFailed
          expr: |
            job:kube_job_status_failed:sum
            * ON(cronjob, namespace) GROUP_LEFT()
            (kube_cronjob_spec_suspend == 0)
          labels:
            severity: critical
            job: cron-failure
            namespace: monitoring
          for: 1m
          annotations:
            summary: '{{ $labels.cronjob }} last run has failed {{ $value }} times.'
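For context, the two label_replace calls in the recording rules simply copy job_name into job and owner_name into cronjob, since the "(.+)" pattern with "$1" is a full copy of the source label. A minimal Python sketch of that rewrite, using hypothetical sample label values:

```python
import re

def label_replace(labels, dst, repl, src, regex):
    """Mimic PromQL label_replace: if labels[src] fully matches regex,
    set labels[dst] to repl with $1..$n expanded; otherwise return unchanged."""
    m = re.fullmatch(regex, labels.get(src, ""))
    if m is None:
        return labels
    labels = dict(labels)
    labels[dst] = re.sub(r"\$(\d+)", lambda g: m.group(int(g.group(1))), repl)
    return labels

# Hypothetical series labels as produced by kube_job_owner for a CronJob-owned Job.
series = {"job_name": "failing-cronjob-28412345",
          "owner_name": "failing-cronjob",
          "namespace": "labs"}
series = label_replace(series, "job", "$1", "job_name", "(.+)")
series = label_replace(series, "cronjob", "$1", "owner_name", "(.+)")
print(series["job"], series["cronjob"])
# failing-cronjob-28412345 failing-cronjob
```

Note the rewrite is namespace-agnostic: the namespace label is carried through untouched, so the same rules should in principle apply to Jobs in any namespace.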
And the AlertmanagerConfig:
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: cronjob-failure-receiver
  namespace: monitoring
  labels:
    release: prometheus
spec:
  route:
    groupBy: ['alertname']
    groupWait: 30s
    groupInterval: 2m
    repeatInterval: 2m
    receiver: cron-email
    routes:
      - matchers:
          - name: job
            value: cron-failure
        receiver: cron-email
  receivers:
    - name: cron-email
      emailConfigs:
        - to: 'user@mycompany.com'
          from: 'ops@mycompany.com'
          smarthost: 'mail2.mycompany.com:25'
          requireTLS: false
This configuration works: the alerts fire and the emails are delivered.
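For reference, the route above matches on the alert's labels only (job = cron-failure), so the namespace the CronJob runs in should not matter for routing. A toy sketch of that equality-matcher logic (hypothetical, not Alertmanager's actual code):

```python
def route_matches(alert_labels, matchers):
    """Toy version of an Alertmanager route's equality-matcher list:
    every matcher's name must be present with exactly that value."""
    return all(alert_labels.get(m["name"]) == m["value"] for m in matchers)

matchers = [{"name": "job", "value": "cron-failure"}]
print(route_matches({"job": "cron-failure", "namespace": "monitoring"}, matchers))  # True
print(route_matches({"job": "cron-failure", "namespace": "labs"}, matchers))        # True
print(route_matches({"job": "something-else"}, matchers))                           # False
```

In other words, if an alert with job=cron-failure reaches Alertmanager at all, this route should deliver it regardless of namespace.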
However, when I schedule the CronJob in the 'labs' namespace with the following configuration, no alert fires and no email is delivered:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: failing-cronjob
  namespace: labs
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  schedule: "*/3 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: job-container
              image: busybox
              imagePullPolicy: IfNotPresent
              command: ["/bin/sh", "-c"]
              args:
                - exit 1
          restartPolicy: Never
          terminationGracePeriodSeconds: 10
      backoffLimit: 0
and the resultant PrometheusRule:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: failing-cron-rule
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: kube-cron
      rules:
        - record: job:kube_job_status_start_time:max
          expr: |
            label_replace(
              label_replace(
                max(
                  kube_job_status_start_time
                  * ON(job_name,namespace) GROUP_RIGHT()
                  kube_job_owner{owner_name!=""}
                )
                BY (job_name, owner_name, namespace)
                == ON(owner_name) GROUP_LEFT()
                max(
                  kube_job_status_start_time
                  * ON(job_name,namespace) GROUP_RIGHT()
                  kube_job_owner{owner_name!=""}
                )
                BY (owner_name),
                "job", "$1", "job_name", "(.+)"),
              "cronjob", "$1", "owner_name", "(.+)")
        - record: job:kube_job_status_failed:sum
          expr: |
            clamp_max(
              job:kube_job_status_start_time:max, 1)
            * ON(job) GROUP_LEFT()
            label_replace(
              label_replace(
                (kube_job_status_failed != 0),
                "job", "$1", "job_name", "(.+)"),
              "cronjob", "$1", "owner_name", "(.+)")
        - alert: CronJobStatusFailed
          expr: |
            job:kube_job_status_failed:sum
            * ON(cronjob, namespace) GROUP_LEFT()
            (kube_cronjob_spec_suspend == 0)
          labels:
            severity: critical
            job: cron-failure
            namespace: labs
          for: 1m
          annotations:
            summary: '{{ $labels.cronjob }} last run has failed {{ $value }} times.'
So far it appears as if the PrometheusRule only matches Jobs running in the 'monitoring' namespace. I have checked both the Prometheus and Alertmanager logs, and there are no errors in either.
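Beyond the logs, one way to narrow this down is to query Prometheus directly and check whether the underlying kube-state-metrics series exist for the 'labs' namespace at all. A small helper for building the instant-query URL (hypothetical; assumes Prometheus has been port-forwarded to localhost:9090, e.g. with kubectl port-forward):

```python
import urllib.parse

# Hypothetical endpoint: assumes `kubectl port-forward` to the Prometheus pod/service.
PROM_URL = "http://localhost:9090"

def instant_query_url(promql, base=PROM_URL):
    """Build the URL for a Prometheus instant query (/api/v1/query)."""
    return base + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})

# If this query returns an empty result, kube-state-metrics is not exporting
# Job metrics for the 'labs' namespace at all, so no rule could ever match them.
url = instant_query_url('kube_job_status_failed{namespace="labs"}')
print(url)
# import urllib.request; print(urllib.request.urlopen(url).read())  # needs a live cluster
```

If the raw series do exist, the next thing to compare would be the output of the recording rules for each namespace.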
What am I missing?