I’d like to find a definitive method for calculating how much storage space is required for our prometheus data. We’ll be monitoring approximately 3K RHEL nodes, via node_exporter. We want to keep 24 months of data. We’ll also be using the Infiniband Exporter (runs on one node in the IB fabric) and the exporter for Slurm on approximately 10 HPC schedulers.
On the Prometheus storage page there’s a formula, though it isn’t fully explained:
needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample
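Just to check my understanding of the units, here is that formula with purely hypothetical numbers plugged in (none of these values are measured from our environment):

# Hypothetical figures only; not measurements from our systems.
retention_time_seconds = 24 * 30 * 24 * 3600        # roughly 24 months of retention
ingested_samples_per_second = 100_000                # assumed ingestion rate
bytes_per_sample = 2                                 # assumed; actual compression varies

needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample
print(f"{needed_disk_space / 1e12:.1f} TB")          # about 12.4 TB with these made-up numbers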
How does one determine the “ingested_samples_per_second”?
At present, I have only a small portion of our total 3K nodes in the prometheus.yml.
Will this formula accurately calculate our storage requirement?
I found this Stack Exchange article, but I do not know whether it is authoritative:
How to calculate disk space required by Prometheus v2.2? - DevOps Stack Exchange.
It references this prometheus.io link, but that’s now 404:
https://prometheus.io/docs/prometheus/2.2/storage/
The formula suggested at the end of the Stack Exchange article for calculating the bytes of storage required is:
retention_time_seconds *
rate(prometheus_tsdb_head_samples_appended_total[2h]) *
(rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[2h]) / rate(prometheus_tsdb_compaction_chunk_samples_sum[2h]))
Again, only a small portion of our total 3K nodes is in my prometheus.yml at present. Will this second formula accurately calculate our full storage requirement?
Thanks for any help.
The Prometheus.io documentation does not give a simple formula for calculating your storage requirements, and, in truth, it is not possible to say that Prometheus will consume “X” GB of disk for “Y” months of retention. I will share how I used Robust Perception’s methodology to arrive at an estimate of the disk I will need.
-
There are so many different combinations of hardware, operating systems, and exporters that there is no way to estimate your Prometheus data generation without first scraping at least a subset of your targets. First, add all, or at least some, of your targets to your prometheus.yml and begin scraping them with every exporter that is appropriate in your environment. The most accurate estimate comes from scraping ALL of your targets.
-
Alternatively, you may have only a representative sample of your targets in your prometheus.yml being scraped. In that case, you will need to extrapolate the results of the query below to calculate the full storage required for all of your targets.
-
The Prometheus query I derived from Brian Brazil’s article is:
(rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1h])
/
rate(prometheus_tsdb_compaction_chunk_samples_sum[1h]))
*
rate(prometheus_tsdb_head_samples_appended_total[1h])
-
This gives the rate, in bytes per second, at which your current scrape load is generating on-disk data. I found it helpful to visualize this in Grafana and let it run for several days to get a good idea of my data generation.
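If you would rather pull the number programmatically than eyeball it in Grafana, a minimal sketch along these lines works against the standard /api/v1/query endpoint (assuming Prometheus is reachable at localhost:9090; adjust the URL for your environment):

import requests

# Bytes written to the TSDB per second: (bytes per sample) * (samples appended per second).
QUERY = (
    "(rate(prometheus_tsdb_compaction_chunk_size_bytes_sum[1h])"
    " / rate(prometheus_tsdb_compaction_chunk_samples_sum[1h]))"
    " * rate(prometheus_tsdb_head_samples_appended_total[1h])"
)

resp = requests.get("http://localhost:9090/api/v1/query", params={"query": QUERY})
resp.raise_for_status()

# The instant-query result is a vector; each element's value is [timestamp, "number-as-string"].
for series in resp.json()["data"]["result"]:
    print(series["metric"], float(series["value"][1]), "bytes/sec")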
-
You will need to add approximately 20% to account for “straddling blocks” of data. See the blog article for a full explanation of this factor.
-
Extrapolate that bytes/sec figure to the retention period you need; this is simply a matter of converting seconds into the days/months/years of data you plan to retain.
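As a sketch of that arithmetic (the bytes/sec value below is a placeholder; substitute your own measurement, and note that the roughly 20% straddling-block overhead from the previous step is folded in here):

bytes_per_second = 50_000                   # placeholder: your measured ingestion rate
retention_seconds = 24 * 30 * 24 * 3600     # about 24 months, treating a month as 30 days
straddling_overhead = 1.20                  # the ~20% factor described above

estimated_bytes = bytes_per_second * retention_seconds * straddling_overhead
print(f"{estimated_bytes / 1e12:.2f} TB")   # about 3.73 TB with these placeholder numbers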
-
If you are already scraping all of your targets, your calculations are complete. On the other hand, if you will have more targets than you scrape at present, you will need to extrapolate your current data to the full number of targets you expect to have.
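The scaling itself is a simple proportion; for example (the node counts here are placeholders for whatever fraction of your fleet you are actually scraping):

measured_bytes_per_second = 50_000   # rate observed while scraping only part of the fleet
nodes_scraped_now = 150              # placeholder: targets currently in prometheus.yml
nodes_planned = 3000                 # full fleet to be monitored

fleet_bytes_per_second = measured_bytes_per_second * (nodes_planned / nodes_scraped_now)
print(f"{fleet_bytes_per_second:,.0f} bytes/sec for the full fleet")   # 1,000,000 here

Keep in mind that exporters which are not per-node (the Infiniband exporter on one fabric node, or the Slurm exporter on a handful of schedulers) should not be scaled by node count.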
-
I found it helpful to put all of these steps in a spreadsheet to calculate my final answer and to be sure I didn’t miss any steps. A spreadsheet is also handy for playing “what-ifs” with data collection and retention periods.
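A few lines of code can stand in for the spreadsheet if you prefer; a rough what-if helper might look like this (all parameter values are examples, not recommendations):

def estimate_tb(bytes_per_second, retention_days, scale_factor=1.0, overhead=1.20):
    """Rough disk estimate in TB: ingestion rate x retention x fleet scaling x block overhead."""
    return bytes_per_second * retention_days * 86400 * scale_factor * overhead / 1e12

# What-ifs with made-up numbers: 12 vs. 24 months of retention, scaled out from a partial fleet.
for months in (12, 24):
    print(months, "months:", round(estimate_tb(50_000, months * 30, scale_factor=20), 2), "TB")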
-
Different exporters generate different amounts of data. Be sure you have all exporters in use when you are visualizing in Grafana.