Constantly increasing CPU usage on the Prometheus server

Hi everyone,

I need help with steadily increasing CPU usage on our Prometheus server (v2.50.1), which I’ve been struggling with recently. In our organization we run Prometheus for several tenants, deployed via the Prometheus Operator in a Kubernetes cluster; each tenant gets a pair of Prometheus servers. One tenant’s instance keeps consuming more and more resources, and we can’t seem to stop the growth.

Prometheus setup:

  • Prometheus resources: request 8 CPU + 12Gi MEM; limits 32 CPU + 20Gi MEM
  • Enabled features:
    • auto-gomemlimit
    • auto-gomaxprocs
    • new-service-discovery-manager
  • Setup:
    • Evaluation interval: 30s
    • Retention: 6h
    • Scrape interval: 30s
    • Args:
      • scrape.discovery-reload-interval=15s

What does Prometheus monitor?
Prometheus scrapes endpoints based on PodMonitors, of which there are around 300 in the cluster (101 PodMonitors x 3 namespaces). Each PodMonitor typically monitors 1-2 endpoints, but some monitor up to 6. On the prometheus:9090/service-discovery page I see entries like "(2/1037 active targets)" for many scrape pools; more than 100 pools look like this, with most discovered targets ending up dropped. These are mostly common application metrics, latencies, etc. Besides PodMonitors, we also use ScrapeConfigs, which have about 60 node-exporter endpoints in their configuration.
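To see which scrape pools carry the most discovered targets (active plus dropped), and therefore cost the most on every SD sync, something along these lines against Prometheus's self-scraped metrics may help. This is a sketch; prometheus_sd_discovered_targets counts all targets a discovery config produces before relabeling drops them, so it captures the dropped targets the service-discovery page shows:

```promql
# Scrape pools with the most discovered targets (before relabeling drops them).
topk(20, prometheus_sd_discovered_targets{name="scrape"})

# Active targets per scrape pool, for comparison.
topk(20, prometheus_target_scrape_pool_targets)
```

Pools where the first number dwarfs the second are the ones feeding SD the most churn.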

How does Prometheus behave in operation?

  • CPU usage, according to the process_cpu_seconds_total metric, grows steadily, drops to a low value on each restart, and starts climbing significantly again within about 12 hours. PromQL: max(increase(process_cpu_seconds_total{container="prometheus"}[5m])) by (cluster, pod); the value reaches about 450-500K
  • The prometheus_target_sync_length_seconds metric shows target syncing gradually slowing down over time. PromQL: max(max_over_time(prometheus_target_sync_length_seconds{quantile="0.5"}[5m])) by (cluster, prometheus)
  • The Kubernetes SD workqueue metric prometheus_sd_kubernetes_workqueue_work_duration_seconds_sum shows processing times increasing over time. PromQL: rate(prometheus_sd_kubernetes_workqueue_work_duration_seconds_sum{queue_name="pod"}[5m])
  • Profiling shows some slowdown compared to other Prometheus servers we’ve subjected to similar tests, but the root cause remains unclear.
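One more angle that may be worth checking alongside the above is whether the Kubernetes SD layer simply receives more and more watch events over time, as opposed to each item getting slower to process. A small sketch using Prometheus's own SD metrics (prometheus_sd_kubernetes_events_total and prometheus_sd_kubernetes_workqueue_depth; worth verifying the exact label values on your build):

```promql
# Rate of Kubernetes SD events per role and event type; a steadily
# rising "pod" rate would point at churn from Jobs/CronJobs.
sum by (role, event) (rate(prometheus_sd_kubernetes_events_total[5m]))

# Depth of the SD workqueue; sustained growth means SD falls behind.
max by (queue_name) (prometheus_sd_kubernetes_workqueue_depth)
```

If the event rate stays flat while work duration keeps climbing, the cost is per-item processing rather than event volume.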

What have we tried?

  1. We set keep_dropped_targets=5 globally to limit how many dropped targets Prometheus keeps track of, hoping to reduce the amount of discovery information it retains, but this had no impact on performance.
  2. We started reducing the number of PodMonitors, though we’re unsure if this will be effective. So far, a small reduction (about 30 PodMonitors) had no performance impact.
  3. We asked the tenant to clean up non-functional pods and jobs from cronjobs in the namespaces to ease service discovery. We removed about 200-300 dead pods, but this had no effect on Prometheus, even after a restart.
  4. We increased scrape.discovery-reload-interval from its default, guided by issue #12481, but it didn’t help either.
  5. We compared this Prometheus with instances that have similar timeseries counts, but the number of PodMonitors varies per tenant, making it hard to find a comparable instance and pinpoint the problem.
  6. We investigated performance-related issues #8392 and #8014, but they didn’t provide applicable solutions. The situations seemed somewhat different from ours.
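Regarding point 5, normalizing CPU by active target count can make cross-tenant comparison fairer than raw timeseries counts. A rough sketch, assuming each Prometheus is scraped with a pod label attached to its own metrics:

```promql
# CPU seconds per second, per active target, per Prometheus pod.
sum by (pod) (rate(process_cpu_seconds_total{container="prometheus"}[5m]))
/
sum by (pod) (prometheus_target_scrape_pool_targets)
```

Two instances with similar per-target CPU but different absolute usage would point at target count rather than a leak in one of them.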

At this point, we are unsure how to stabilize the situation and prevent Prometheus from endlessly growing. Any insights or suggestions would be greatly appreciated.