Reducing memory usage in homelab deployment

Hi! I’m using Prometheus in my homelab Kubernetes cluster and I noticed that it has quite high memory usage (~1GB). Yesterday I spent some time trying to optimise it by deleting metrics I wasn’t using, mostly default Kubernetes ones. I was able to achieve a significant reduction in what “Top 10 series count by metric names” reports (I also ran topk(20, count by (__name__, job)({__name__=~".+"})) from here), then deleted the existing metrics and even called the tombstone API to make sure they get cleaned up. This dropped the number of reported series per metric name quite a bit, so I expected to see a reduction in memory usage as a result.
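For reference, the deletion and tombstone cleanup I mean is roughly the following; just a sketch, assuming the TSDB admin API is enabled with --web.enable-admin-api and Prometheus listens on localhost:9090:

    # Delete the already-stored series for the metrics I dropped
    # (-g stops curl from interpreting the [] brackets).
    curl -g -X POST 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__=~"apiserver_.*|etcd_.*|scheduler_.*"}'
    # Then ask the TSDB to actually remove the tombstoned data from disk.
    curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'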

Specifically, I added this to my config for both the node and apiserver scrape jobs:



        metric_relabel_configs:
          # These metrics use a lot of memory and I don't need them
          - source_labels: [ __name__ ]
            regex: 'apiserver_.*'
            action: drop
          - source_labels: [ __name__ ]
            regex: 'etcd_.*'
            action: drop
          - source_labels: [ __name__ ]
            regex: 'scheduler_.*'
            action: drop

So today, 24h later, I came back to check and Prometheus is using the exact same amount of memory. It reports far fewer time series with that query, yet memory usage still peaks at up to 1GB. I can’t find any docs on this; it almost seems like Prometheus aims to hit that value.
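(For what it’s worth, the head series count can be cross-checked against the process memory using Prometheus’ own self-metrics; a rough sketch, assuming Prometheus is reachable on localhost:9090 and scrapes itself under a job called "prometheus":)

    # Number of series currently held in the in-memory head block.
    curl -s 'http://localhost:9090/api/v1/query' \
      --data-urlencode 'query=prometheus_tsdb_head_series'
    # Resident memory of the Prometheus process itself (the job label is an assumption).
    curl -s 'http://localhost:9090/api/v1/query' \
      --data-urlencode 'query=process_resident_memory_bytes{job="prometheus"}'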

Given the size of my setup, I would not expect memory requirements that make Prometheus the largest container in my small cluster by memory footprint.

What am I doing wrong that makes Prometheus grow so large?

Coincidentally, I’ve set the resource limit for prometheus-server to 1Gi, so I wondered whether it reads that value and basically just takes whatever it’s allowed. I reduced the limit to 300Mi, but then it crashed the moment Grafana tried to query it for anything.
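(For context, that limit change boils down to something like this; a sketch, assuming the deployment is called prometheus-server and lives in a monitoring namespace:)

    # Lower the memory limit on the Prometheus deployment (names are assumptions).
    kubectl -n monitoring set resources deployment/prometheus-server --limits=memory=300Mi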

Any tips?

Try running:

ps -eo pmem,pcpu,vsize,pid,cmd | sort -rnk 1 | head

to see what processes are using memory and how much.

I have Grafana running alongside Prometheus and a Python metrics collector for MikroTik, all in a 1GB LXC container.

root@grafana:~# ps -eo pmem,pcpu,vsize,pid,cmd | sort -rnk 1 | head
16.6  0.0 323856      48 /lib/systemd/systemd-journald
16.2  0.0 1611920    134 /usr/share/grafana/bin/grafana server --config=/etc/grafana/grafana.ini --pidfile=/run/grafana/grafana-server.pid --packaging=deb cfg:default.paths.logs=/var/log/grafana cfg:default.paths.data=/var/lib/grafana cfg:default.paths.plugins=/var/lib/grafana/plugins cfg:default.paths.provisioning=/etc/grafana/provisioning
14.4  0.0 3536668    137 /usr/local/bin/prometheus --config.file=/etc/prometheus/prometheus-updated.yml --storage.tsdb.retention.time=2y --storage.tsdb.path=/etc/prometheus/data
 5.7  0.0 707204     129 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-minion MultiMinionProcessManager MinionProcessManager
 2.1  0.0 550012     135 /root/mktxp/bin/python3 /root/mktxp/bin/mktxp export
 1.9  0.0 129176     116 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-minion
 1.8  0.0 1238884    138 /root/snmp_exporter-0.26.0.linux-amd64/snmp_exporter
 1.8  0.0 106188     119 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
 1.7  0.0 286504    6245 /usr/libexec/packagekitd
 1.3  0.0  32572     283 /root/mktxp/bin/python3 /root/mktxp/bin/mktxp export
root@grafana:~# free -mt
               total        used        free      shared  buff/cache   available
Mem:            1024         297          27           0         698         726
Swap:            512          19         492
Total:          1536         317         520
root@grafana:~#