Prometheus for large clusters (about 1000 nodes)

plee · September 6, 2021, 9:13am

Hi,

We have a problem with a large cluster that runs only one Prometheus instance (for cost reasons).

Cluster: 1000+ nodes
Prometheus: 1 node with around 60 GB of memory (still gets OOM-killed when running queries from Kiali and Grafana)
Thanos: Thanos-Sidecar
Istio, Kiali, Grafana and Jaeger

Problem:
A large cluster (1000+ nodes) with a single Prometheus runs out of memory (OOM) because of the large volume of metrics.

Question:
We are planning to implement the design below:

Design approach:
Run multiple Prometheus instances, split by namespace and all using the same scrape configs, and point Grafana at thanos-query, which handles deduplication.
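
To make the split concrete, here is a rough Python sketch that builds one Prometheus config per namespace from a shared scrape job. It assumes one instance per namespace. The namespace names, the job name `kubernetes-pods`, and the external label name `prometheus_shard` are hypothetical and only for illustration. `kubernetes_sd_configs` with `namespaces.names` and `global.external_labels` are standard Prometheus config fields, but this has not been tested against our cluster.

```python
import json


def shard_config(namespace, job_name="kubernetes-pods"):
    """Build a Prometheus config dict for one namespace-scoped instance.

    The job name and the external label name are hypothetical placeholders.
    """
    return {
        "global": {
            # A distinct external label per instance lets Thanos tell
            # the series of each Prometheus apart.
            "external_labels": {"prometheus_shard": namespace},
        },
        "scrape_configs": [
            {
                "job_name": job_name,
                # Same scrape job everywhere; service discovery is
                # restricted to this instance's namespace.
                "kubernetes_sd_configs": [
                    {"role": "pod", "namespaces": {"names": [namespace]}}
                ],
            }
        ],
    }


# Hypothetical namespace names.
namespaces = ["team-a", "team-b"]
configs = {ns: shard_config(ns) for ns in namespaces}

# Print one generated config for inspection.
print(json.dumps(configs["team-a"], indent=2))
```

Each instance then only holds the series of its own namespace in memory, which is the point of the split. Querying across namespaces still has to go through thanos-query.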

Issue with this approach:
Kiali reads data directly from Prometheus, not from Thanos, so in this setup it would face multiple Prometheus instances and duplicate metrics.

Sadly, removing unnecessary metrics is not an option in the organization right now.

I'd appreciate any suggestions for a better approach.

ElkRom · March 18, 2022, 9:50am

Hi @plee,
How did you deal with this case?
