Troubleshoot CPU Busy Aggregate of a Cluster

gregGT · May 18, 2021, 4:00pm

Hello, I have a grid computing situation with hundreds to thousands of nodes in different clusters.

I’ve installed the Node Exporter Full Dashboard, and it includes a CPU Busy metric produced by this query. That query works fine to return the CPU Busy of any individual node.

(((count(count(node_cpu_seconds_total{job="$job"}) by (cpu))) - avg(sum by (mode)(irate(node_cpu_seconds_total{mode="idle",job="$job"}[5m])))) * 100) / count(count(node_cpu_seconds_total{job="$job"}) by (cpu))

However, it produces an incorrect CPU Busy value for the cluster as a whole: anywhere from -295% down to -1500%, depending on which job (cluster) I select to view. Each job in my prometheus.yml, with all of its associated node targets, corresponds to “job” in the query.

I’ve tried to modify this query in many ways to aggregate CPU Busy for the entire cluster (job), but the query fails.

Can anyone suggest how I can alter the query to accurately show the aggregated CPU Busy metric of all CPUs across a cluster (job)?

Thanks for any guidance.

gregGT · May 27, 2021, 8:15pm

Any help here?
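
One possible reading of the negative percentages, offered as a sketch rather than a confirmed diagnosis: when the query matches every node in a job, count(count(...) by (cpu)) still counts only the distinct cpu label values (roughly one node's CPU count, since each node reuses "0", "1", ...), while the idle term sums idle rates over every CPU on every node. The Python below just replays that arithmetic. The function names are hypothetical, and it assumes every node has the same CPU count and the same idle fraction.

```python
def dashboard_cpu_busy(nodes, cpus_per_node, idle_fraction):
    """Replay the arithmetic of the posted query when `job` matches many nodes.

    Assumption: all nodes are identical (same CPU count, same idle fraction).
    """
    # count(count(...) by (cpu)): number of distinct `cpu` label values.
    # Nodes reuse the same labels, so this equals a single node's CPU count.
    distinct_cpu_labels = cpus_per_node
    # avg(sum by (mode)(irate(...{mode="idle"}))): only one mode is selected,
    # so the avg covers a single series: the idle rate summed over all CPUs
    # on all nodes.
    summed_idle = nodes * cpus_per_node * idle_fraction
    return (distinct_cpu_labels - summed_idle) * 100 / distinct_cpu_labels


def cluster_cpu_busy(nodes, cpus_per_node, idle_fraction):
    """Busy percentage averaged over every CPU in the job."""
    total_cpus = nodes * cpus_per_node
    summed_idle = nodes * cpus_per_node * idle_fraction
    # Equivalent to 100 * (1 - average idle rate per CPU).
    return 100 * (1 - summed_idle / total_cpus)


if __name__ == "__main__":
    # One node: the dashboard arithmetic behaves as expected.
    print(dashboard_cpu_busy(1, 8, 0.5))
    # Five nodes: numerator and denominator no longer describe the same CPUs.
    print(dashboard_cpu_busy(5, 8, 0.9))
    # Averaging per CPU keeps the result between 0 and 100.
    print(cluster_cpu_busy(5, 8, 0.9))
```

If that reading is right, averaging the idle rate per CPU instead of subtracting a sum from a label count should stay within 0 to 100%. A commonly used PromQL form for this is 100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle",job="$job"}[5m]))), though it has not been checked against this particular setup.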
