Hi Team,
I defined two alerting rules in my Prometheus rules file, one for CPU usage and one for memory usage, and set a for clause on each. When memory usage goes above 80%, the alert enters the Pending state but never moves to Firing, even after a long time. I have confirmed that the node's memory usage stays above 80% the whole time.
In the Prometheus global settings, scrape_interval and evaluation_interval are both 15s. The configuration of each component follows.
prometheus-server config file:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager.monitor.svc.cluster.local:9093"]
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - /config/rules/*.yaml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['127.0.0.1:9090']
alert rules config file:
groups:
  - name: "node_exporter"
    rules:
      - alert: "Memory usage is greater than 80%"
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 30s
        labels:
          severity: warning
          now: "{{ $value }}"
        annotations:
          description: "Server{{$labels.instance}} Memory usage is greater than 80%"
      - alert: "CPU usage is greater than 80%"
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[1m])) * 100) > 80
        for: 30s
        labels:
          severity: warning
          now: "{{ $value }}"
        annotations:
          description: "Server{{$labels.instance}} CPU usage is greater than 80%"
alertmanager config file:
global:
route:
  group_by: ['instance']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 1h
  receiver: 'webhook'
receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://prometheus-alert-center.monitor.svc.cluster.local:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxx'
        send_resolved: true
In the Prometheus UI, the memory alert stays in the Pending state.
What I don't understand is why, with a for clause configured in the alert rule, Prometheus never moves the alert from Pending to Firing. Any help would be appreciated.
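
One variant that may be worth testing, as a sketch rather than a confirmed fix: in the rules above, the `now` label is templated from `$value`. Labels are part of an alert's identity, so if the measured value changes between evaluations, each evaluation can produce a differently labelled alert and the `for` timer would start over every 15s. The fragment below is the same memory rule with the value moved into an annotation, which does not affect identity. The `current_value` annotation name is made up for illustration, and this assumes the changing label is in fact the cause here.

```yaml
groups:
  - name: "node_exporter"
    rules:
      - alert: "Memory usage is greater than 80%"
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 30s
        labels:
          # Only static labels here, so the alert keeps the same identity
          # across evaluations and the 30s "for" timer can elapse.
          severity: warning
        annotations:
          description: "Server{{$labels.instance}} Memory usage is greater than 80%"
          # Hypothetical annotation name; annotations may change freely
          # without resetting the Pending state.
          current_value: "{{ $value }}"
```

With evaluation_interval at 15s and for at 30s, an alert with stable labels would be expected to go from Pending to Firing after about two further evaluations. The same change would apply to the CPU rule.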