I have a large number of VMs (10k to 20k) that need to be monitored via Prometheus. We have already installed the agents on them.
Additionally, we have grouped our VMs so that multiple Prometheus instances are available for the different BUs.
Also, we are using Mimir as long-term storage for the metrics.
It would be great if I could get a clear picture on the points below.
- How do I figure out or judge the capacity of a single Prometheus server? What are the parameters for calculating the scraping limit of a single Prometheus server? E.g. if a BU's datacenter has 1k VMs, how do I determine whether all of those servers can be scraped by a single Prometheus server?
I am aware of the capacity formula below (which estimates disk space rather than memory), but does it also hold good for judging scrape capacity?
needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample
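To make the formula concrete, here is a rough sizing sketch for the 1k-VM example. All the input numbers are assumptions for illustration (series per VM and scrape interval depend entirely on the exporters and config; ~1-2 bytes per sample is the average Prometheus's docs cite for compressed TSDB samples):

```python
# Hypothetical sizing example for one BU's Prometheus server.
# All inputs below are assumptions, not measured values.
vms = 1_000
series_per_vm = 1_000          # assumption; depends on exporters enabled per VM
scrape_interval_s = 15         # assumption; a common default
bytes_per_sample = 1.5         # rough average with TSDB compression
retention_s = 15 * 24 * 3600   # 15 days of local retention

ingested_samples_per_second = vms * series_per_vm / scrape_interval_s
needed_disk_bytes = retention_s * ingested_samples_per_second * bytes_per_sample

print(f"active series:  {vms * series_per_vm:,}")
print(f"samples/second: {ingested_samples_per_second:,.0f}")   # roughly 67k
print(f"disk needed:    {needed_disk_bytes / 1e9:,.1f} GB")    # roughly 130 GB
```

Note this only bounds disk; scrape capacity in practice is usually limited first by memory (driven by the number of active series) and CPU, which this formula does not capture.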
- In the case of external remote write (Grafana Mimir): again, how do I determine the capacity of a single Prometheus server to scrape the VMs? → Any documentation or best practices?
- How does the WAL work when external remote write is in use, and how does it affect scrape capacity?
- How does the Prometheus cache work with remote storage?