Prometheus Agent Running Out of Memory

We rolled out Prometheus Agent to all of our lower (dev/stg) environments without problems, and remote write worked there as expected.

When we got to production, we started seeing an issue in 3 of our 6 regions: the agent's Kubernetes pod was running out of memory. We raised its memory several times, from 3 GB to 6 GB to 12 GB to 18 GB, but no matter how much we gave it, the pod consumed all of it. Eventually the agent pod started crash looping and we reverted to our old scrape method.

We were migrating service pods a few at a time so that they were scraped by the agent and shipped via remote write, and the more pods we migrated, the more memory the agent consumed.
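Since memory grew with each batch of migrated pods, one thing that could be checked is whether the growth is roughly linear in the number of active series, which would point to scrape load rather than a leak. Below is a minimal sketch of that check, not something we have run. It assumes you record (active series, resident memory) pairs after each migration batch, for example from the agent's own `prometheus_agent_active_series` and `process_resident_memory_bytes` metrics. The function names and the sample numbers are hypothetical placeholders, not our measurements.

```python
def fit_memory_vs_series(samples):
    """Least-squares fit of memory (bytes) against active series count.

    samples: list of (active_series, resident_memory_bytes) pairs,
             one pair per migration batch.
    Returns (bytes_per_series, baseline_bytes).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_x = sum(series for series, _ in samples) / n
    mean_y = sum(mem for _, mem in samples) / n
    var_x = sum((series - mean_x) ** 2 for series, _ in samples)
    if var_x == 0:
        raise ValueError("all samples have the same series count")
    cov = sum((series - mean_x) * (mem - mean_y) for series, mem in samples)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def projected_memory_bytes(samples, target_series):
    """Extrapolate memory use to target_series using the fitted line."""
    slope, intercept = fit_memory_vs_series(samples)
    return intercept + slope * target_series


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not real measurements.
    batches = [(1000, 10_000), (2000, 18_000), (3000, 26_000)]
    slope, baseline = fit_memory_vs_series(batches)
    print(f"approx. {slope:.1f} bytes per series, baseline {baseline:.0f} bytes")
    print(f"projected at 5000 series: {projected_memory_bytes(batches, 5000):.0f} bytes")
```

If the real measurements fall close to a straight line, the projection for the full set of pods gives a rough idea of how much memory the agent would need. If memory keeps climbing while the series count stays flat, that suggests something other than scrape volume is responsible.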

We’re willing to go above 18 GB of memory if necessary, but right now we’re not convinced that the agent won’t simply consume however much memory we give it.