Hello, I have a grid computing situation with hundreds to thousands of nodes in different clusters.
I’ve installed the Node Exporter Full Dashboard, which includes a CPU Busy panel driven by the query below. That query correctly returns CPU Busy for any individual node.
(((count(count(node_cpu_seconds_total{job="$job"}) by (cpu))) - avg(sum by (mode)(irate(node_cpu_seconds_total{mode="idle",job="$job"}[5m])))) * 100) / count(count(node_cpu_seconds_total{job="$job"}) by (cpu))
However, it produces incorrect CPU Busy values for the cluster as a whole: anywhere from -295% down to -1500%, depending on which job (cluster) I select. Each job in my prometheus.yml contains all the node targets of one cluster, and that job name is the “job” in the query.
I’ve tried to modify this query in many ways to aggregate CPU Busy for the entire cluster (job), but the query fails.
Can anyone suggest how to alter the query so that it accurately shows the aggregated CPU Busy across all CPUs in a cluster (job)?
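A likely cause, offered as a sketch rather than a confirmed diagnosis: with many nodes under one job, count(count(...) by (cpu)) counts distinct cpu label values (roughly the core count of a single node), while the idle sum runs over every CPU on every node, so the subtraction goes negative. Averaging the per-CPU idle rate would avoid that mismatch. The query below assumes the same $job dashboard variable and the standard node_exporter metric, and it has not been tested against this cluster.

```promql
# The idle rate of each CPU series is a fraction between 0 and 1.
# Averaging it over every CPU on every node in the job gives the
# cluster-wide idle fraction; 1 minus that is the busy fraction.
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle", job="$job"}[5m])))
```

Replacing avg(...) with avg by (instance) (...) should give the same figure per node, which makes it easy to compare against the per-node panel.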
Thanks for any guidance.