I’m currently using the following to monitor the temperature on Intel boxes:
alert: SystemTempIsHigh
expr: avg by(instance) (node_hwmon_temp_celsius{chip="platform_coretemp_0"}) > 60
for: 2m
This worked fine… until I introduced a Ryzen box. It turns out there is no coretemp chip on that platform. What does exist is the following:
$ sensors
amdgpu-pci-0600
Adapter: PCI adapter
vddgfx: 1.30 V
vddnb: 724.00 mV
edge: +45.0°C
PPT: 11.00 W
nvme-pci-0100
Adapter: PCI adapter
Composite: +51.9°C (low = -273.1°C, high = +75.8°C)
(crit = +84.8°C)
Sensor 1: +51.9°C (low = -273.1°C, high = +65261.8°C)
k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +58.4°C
$ curl -s http://localhost:9100/metrics | egrep 'node_hwmon_(chip_names|sensor_label|temp_celsius)'
# HELP node_hwmon_chip_names Annotation metric for human-readable chip names
# TYPE node_hwmon_chip_names gauge
node_hwmon_chip_names{chip="0000:00:08_1_0000:06:00_0",chip_name="amdgpu"} 1
node_hwmon_chip_names{chip="nvme_nvme0",chip_name="nvme"} 1
node_hwmon_chip_names{chip="pci0000:00_0000:00:18_3",chip_name="k10temp"} 1
# HELP node_hwmon_sensor_label Label for given chip and sensor
# TYPE node_hwmon_sensor_label gauge
node_hwmon_sensor_label{chip="0000:00:08_1_0000:06:00_0",label="PPT",sensor="power1"} 1
node_hwmon_sensor_label{chip="0000:00:08_1_0000:06:00_0",label="edge",sensor="temp1"} 1
node_hwmon_sensor_label{chip="0000:00:08_1_0000:06:00_0",label="vddgfx",sensor="in0"} 1
node_hwmon_sensor_label{chip="0000:00:08_1_0000:06:00_0",label="vddnb",sensor="in1"} 1
node_hwmon_sensor_label{chip="nvme_nvme0",label="Composite",sensor="temp1"} 1
node_hwmon_sensor_label{chip="nvme_nvme0",label="Sensor 1",sensor="temp2"} 1
node_hwmon_sensor_label{chip="pci0000:00_0000:00:18_3",label="Tctl",sensor="temp1"} 1
# HELP node_hwmon_temp_celsius Hardware monitor for temperature (input)
# TYPE node_hwmon_temp_celsius gauge
node_hwmon_temp_celsius{chip="0000:00:08_1_0000:06:00_0",sensor="temp1"} 42
node_hwmon_temp_celsius{chip="nvme_nvme0",sensor="temp1"} 48.85
node_hwmon_temp_celsius{chip="nvme_nvme0",sensor="temp2"} 48.85
node_hwmon_temp_celsius{chip="pci0000:00_0000:00:18_3",sensor="temp1"} 47.375
So node_hwmon_temp_celsius{chip="pci0000:00_0000:00:18_3",sensor="temp1"} seems to be what I want.
Rather than create a separate alert, is there a way to remap both the Intel temps (via the average coretemp as before, although I will probably move to package_id_0 if I can figure out the remapping) and the Ryzen temps (bonus: by looking up node_hwmon_chip_names and node_hwmon_sensor_label rather than hard-coding the chip ID) into a single unified metric called generic_cpu_temp, or whatever name is appropriate?
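To make the goal concrete, here is a rough, untested sketch of what I have in mind, written as a Prometheus rule file. The group name cpu_temp is made up, generic_cpu_temp is just the placeholder name from above, and I am assuming that two recording rules may write to the same metric name as long as their output series never collide (here they should not, since a given instance is either Intel or Ryzen). The Ryzen rule joins on the label and chip-name metrics shown in the output above instead of hard-coding the PCI path.

```yaml
groups:
  - name: cpu_temp            # hypothetical group name
    rules:
      # Intel: average of all coretemp sensors per instance (same as the current alert).
      - record: generic_cpu_temp
        expr: avg by (instance) (node_hwmon_temp_celsius{chip="platform_coretemp_0"})

      # Ryzen: keep only the sensor labelled "Tctl" on the chip named "k10temp".
      # Both helper metrics have the value 1, so multiplying leaves the temperature unchanged.
      - record: generic_cpu_temp
        expr: |
          avg by (instance) (
              node_hwmon_temp_celsius
            * on (instance, chip, sensor) group_left ()
              node_hwmon_sensor_label{label="Tctl"}
            * on (instance, chip) group_left ()
              node_hwmon_chip_names{chip_name="k10temp"}
          )

      # One alert for both platforms, same threshold and duration as before.
      - alert: SystemTempIsHigh
        expr: generic_cpu_temp > 60
        for: 2m
```

If something like this is the right approach, the Intel rule could presumably be switched to the package sensor with the same kind of join on node_hwmon_sensor_label, but I do not know the exact label value the exporter reports for it on my Intel boxes.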