How to unify Intel and AMD hwmon CPU temps?

sshaikh · March 21, 2024, 10:58pm

I’m currently using the following to monitor the temperature on Intel boxes:

alert: SystemTempIsHigh
expr: avg by(instance) (node_hwmon_temp_celsius{chip="platform_coretemp_0"}) > 60
for: 2m

This worked fine… until I introduced a Ryzen box. Turns out there are no coretemps on that platform. What does exist is the following:

$ sensors
amdgpu-pci-0600
Adapter: PCI adapter
vddgfx:        1.30 V
vddnb:       724.00 mV
edge:         +45.0°C
PPT:          11.00 W

nvme-pci-0100
Adapter: PCI adapter
Composite:    +51.9°C  (low  = -273.1°C, high = +75.8°C)
                       (crit = +84.8°C)
Sensor 1:     +51.9°C  (low  = -273.1°C, high = +65261.8°C)

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +58.4°C

$ curl -s http://localhost:9100/metrics | egrep 'node_hwmon_(chip_names|sensor_label|temp_celsius)'
# HELP node_hwmon_chip_names Annotation metric for human-readable chip names
# TYPE node_hwmon_chip_names gauge
node_hwmon_chip_names{chip="0000:00:08_1_0000:06:00_0",chip_name="amdgpu"} 1
node_hwmon_chip_names{chip="nvme_nvme0",chip_name="nvme"} 1
node_hwmon_chip_names{chip="pci0000:00_0000:00:18_3",chip_name="k10temp"} 1
# HELP node_hwmon_sensor_label Label for given chip and sensor
# TYPE node_hwmon_sensor_label gauge
node_hwmon_sensor_label{chip="0000:00:08_1_0000:06:00_0",label="PPT",sensor="power1"} 1
node_hwmon_sensor_label{chip="0000:00:08_1_0000:06:00_0",label="edge",sensor="temp1"} 1
node_hwmon_sensor_label{chip="0000:00:08_1_0000:06:00_0",label="vddgfx",sensor="in0"} 1
node_hwmon_sensor_label{chip="0000:00:08_1_0000:06:00_0",label="vddnb",sensor="in1"} 1
node_hwmon_sensor_label{chip="nvme_nvme0",label="Composite",sensor="temp1"} 1
node_hwmon_sensor_label{chip="nvme_nvme0",label="Sensor 1",sensor="temp2"} 1
node_hwmon_sensor_label{chip="pci0000:00_0000:00:18_3",label="Tctl",sensor="temp1"} 1
# HELP node_hwmon_temp_celsius Hardware monitor for temperature (input)
# TYPE node_hwmon_temp_celsius gauge
node_hwmon_temp_celsius{chip="0000:00:08_1_0000:06:00_0",sensor="temp1"} 42
node_hwmon_temp_celsius{chip="nvme_nvme0",sensor="temp1"} 48.85
node_hwmon_temp_celsius{chip="nvme_nvme0",sensor="temp2"} 48.85
node_hwmon_temp_celsius{chip="pci0000:00_0000:00:18_3",sensor="temp1"} 47.375

So node_hwmon_temp_celsius{chip="pci0000:00_0000:00:18_3",sensor="temp1"} seems to be what I want.

Rather than create a separate alert, is there a way to remap both the Intel temps and the Ryzen temps to a new unified metric called generic_cpu_temp or whatever is appropriate? For Intel I would use the average coretemp as before, although I will probably move to package_id_0 if I can figure out the remapping. For Ryzen, bonus points if the remapping looks up chip_names and sensor_labels rather than hard-coding the chip.

sshaikh · March 27, 2024, 4:17pm

I figured out a solution. First a recording rule (not strictly needed but I like to split things logically):

record: instance:node_hwmon_temp_named_celsius:multiplication
expr: label_replace(
    label_replace(
      node_hwmon_temp_celsius
        * ignoring(chip_name, sensor) group_left(chip_name) node_hwmon_chip_names,
      "chip_name", "unknown-chip-name", "chip_name", ""
    )
      * ignoring(label, chip_name) group_left(label) node_hwmon_sensor_label,
    "label", "unknown-temp-sensor-name", "label", ""
  )

This creates a new metric with both chip_names and sensor labels joined with the temperature metric. Then the rule is simple:
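
One way to check the join is a promtool unit test. The sketch below feeds in the three k10temp series from the Ryzen output earlier in the thread and expects a single joined series carrying chip_name and label. The file name rules.yml, the instance value "ryzen", and the assumption that the recording rule sits inside a standard groups: block in that file are illustrative and not from the original post.

```yaml
# test.yml (hypothetical name); run with: promtool test rules test.yml
rule_files:
  - rules.yml   # assumed to hold the recording rule inside a groups: block
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Raw temperature, as exposed by node_exporter on the Ryzen box.
      - series: 'node_hwmon_temp_celsius{instance="ryzen",chip="pci0000:00_0000:00:18_3",sensor="temp1"}'
        values: '47.375 47.375 47.375'
      # Annotation series that supplies chip_name.
      - series: 'node_hwmon_chip_names{instance="ryzen",chip="pci0000:00_0000:00:18_3",chip_name="k10temp"}'
        values: '1 1 1'
      # Annotation series that supplies the sensor label.
      - series: 'node_hwmon_sensor_label{instance="ryzen",chip="pci0000:00_0000:00:18_3",label="Tctl",sensor="temp1"}'
        values: '1 1 1'
    promql_expr_test:
      - expr: instance:node_hwmon_temp_named_celsius:multiplication
        eval_time: 2m
        exp_samples:
          # Both annotation series have value 1, so the product keeps the temperature.
          - labels: 'instance:node_hwmon_temp_named_celsius:multiplication{instance="ryzen",chip="pci0000:00_0000:00:18_3",chip_name="k10temp",label="Tctl",sensor="temp1"}'
            value: 47.375
```

Because both joins are inner joins, a temperature series with no matching chip name or sensor label would be dropped from the result rather than tagged as unknown, which a test like this makes easy to confirm on your own data.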

alert: SystemTempIsHigh
expr: (instance:node_hwmon_temp_named_celsius:multiplication{chip_name="coretemp",label=~".+ackage.+"}
  or instance:node_hwmon_temp_named_celsius:multiplication{chip_name="k10temp",label="Tctl"})
  > 60
for: 2m

I had to use a regex match on the package sensor label rather than an exact match because one of my exporters outputs “Package Id 0”, so this works until that is resolved upstream.
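
The original question also asked for a single unified metric. As a minimal sketch (not the configuration from this thread), the same selectors can be recorded under one name and the alert pointed at that name. The group name, the metric name instance:node_cpu_temp_celsius:max, and the use of max by (instance) are assumptions for illustration; max would report the hottest package on a multi-socket host, and avg could be substituted.

```yaml
groups:
  - name: cpu_temperature   # hypothetical group name
    rules:
      # Unified CPU temperature per instance (metric name is an assumption).
      - record: instance:node_cpu_temp_celsius:max
        expr: |
          max by (instance) (
            instance:node_hwmon_temp_named_celsius:multiplication{chip_name="coretemp",label=~".+ackage.+"}
            or
            instance:node_hwmon_temp_named_celsius:multiplication{chip_name="k10temp",label="Tctl"}
          )
      # The alert no longer needs to know which vendor the box uses.
      - alert: SystemTempIsHigh
        expr: instance:node_cpu_temp_celsius:max > 60
        for: 2m
```

Supporting another CPU vendor then means adding one more selector to the recording rule, while alerts and dashboards keep querying the same metric.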
