Hi! I’m using Prometheus in my homelab Kubernetes cluster and I noticed that it has quite high memory usage (~1GB). Yesterday I spent some time trying to optimise it by deleting metrics I wasn’t using, mostly default Kubernetes ones. I was able to achieve a significant reduction in what “Top 10 series count by metric names” reports (I also ran topk(20, count by (__name__, job)({__name__=~".+"})) from here), then deleted the metrics and even called the tombstone API to make sure everything actually gets cleaned up. This dropped the number of reported series per metric name quite a bit, and I expected to see a reduction in memory usage as a result.
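For reference, the deletion itself was roughly this (the prometheus-server:9090 address is just a placeholder for wherever my Prometheus is reachable, the admin API needs --web.enable-admin-api, and the selector is an example covering the metrics I dropped):

    # mark the matching series as deleted (-g stops curl from globbing the [] and {} in the URL)
    curl -g -X POST 'http://prometheus-server:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~"apiserver_.*|etcd_.*|scheduler_.*"}'
    # then remove the tombstoned data from disk
    curl -X POST 'http://prometheus-server:9090/api/v1/admin/tsdb/clean_tombstones'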
Specifically, I added this to my config for both the node and apiserver scrape jobs:

metric_relabel_configs:
  # These metrics use a lot of memory and I don't need them
  - source_labels: [ __name__ ]
    regex: 'apiserver_.*'
    action: drop
  - source_labels: [ __name__ ]
    regex: 'etcd_.*'
    action: drop
  - source_labels: [ __name__ ]
    regex: 'scheduler_.*'
    action: drop
So today, 24 hours later, I came back to check and Prometheus is using the exact same amount of memory. It reports far fewer time series with that query, yet memory usage still peaks at around 1GB. I can’t find any docs on this; it almost seems like Prometheus aims to hit that value.
Given the size of my setup, I would not expect memory requirements that make Prometheus the largest container in my small cluster by memory footprint.
What am I doing wrong that Prometheus grows so large?
Coincidentally, I’ve set the resource limit to 1Gi for prometheus-server, so I wondered if it reads that value and basically just takes whatever it’s allowed. I reduced the limit to 300Mi, but then it would crash the moment Grafana tried to query it for anything.
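In case it matters, the relevant part of the container spec looks roughly like this (how exactly it’s nested depends on how you deploy prometheus-server; these are just the values I tried):

    resources:
      limits:
        memory: 1Gi   # dropping this to 300Mi made it crash on the first Grafana query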
Any tips?