Hi,
We have a problem with a large cluster that uses only one Prometheus instance (for cost reasons).
Cluster: 1000+ nodes
Prometheus: 1 node with around 60 GB of memory (still gets OOM-killed when running queries from Kiali and Grafana)
Thanos: Thanos-Sidecar
Istio, Kiali, Grafana and Jaeger
Problem:
A large cluster (1000+ nodes) with a single Prometheus runs out of memory (OOM) because of the large volume of metrics.
Question:
We are planning to implement the design below:
Design approach:
Run multiple Prometheus instances, one per namespace, with the same scrape configs, and point Grafana at thanos-query (for deduplication).
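To make the deduplication part of this design concrete, here is a minimal Python sketch of the idea. It is only a conceptual model, not how Thanos is actually implemented. It assumes each Prometheus instance attaches a distinguishing external label (the name `prometheus_replica` is a hypothetical choice for illustration) and that thanos-query is told to treat that label as the replica label, so series that differ only in that label are merged into one.

```python
from typing import Dict, List

Labels = Dict[str, str]

def dedup_series(series: List[Labels],
                 replica_label: str = "prometheus_replica") -> List[Labels]:
    """Collapse series that differ only in the replica label.

    Conceptual model only: real deduplication also has to merge the
    samples of the duplicate series, which is not shown here.
    """
    seen = set()
    result = []
    for labels in series:
        # Drop the replica label, then use the remaining labels as identity.
        identity = {k: v for k, v in labels.items() if k != replica_label}
        key = frozenset(identity.items())
        if key not in seen:
            seen.add(key)
            result.append(identity)
    return result

# Hypothetical example: two instances scraped the same target in "shop",
# and one instance scraped a target in "billing".
scraped = [
    {"__name__": "istio_requests_total", "namespace": "shop",
     "prometheus_replica": "prom-0"},
    {"__name__": "istio_requests_total", "namespace": "shop",
     "prometheus_replica": "prom-1"},
    {"__name__": "istio_requests_total", "namespace": "billing",
     "prometheus_replica": "prom-0"},
]
merged = dedup_series(scraped)
```

In this sketch the two "shop" series collapse into one, which is the view Grafana would get through thanos-query. A client that reads from the individual Prometheus instances directly sees the unmerged list.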
Issue with this approach:
Kiali reads data directly from Prometheus (not Thanos), which in this case means multiple Prometheus instances and duplicated metrics.
Sadly, removing unnecessary metrics is not an option in the organization right now.
We would appreciate any suggestions for a better approach.