Feedback on time series backup strategy. How do you do it?

drio · June 22, 2022, 11:56am

Hi,

I wrote a prototype to back up Prometheus time series data. I was hoping to get some feedback from the community.

The tool/server exposes two endpoints /metrics and /snap. When you hit the snap endpoint the server starts an asynchronous pipeline that:

triggers a snapshot against the prometheus server
creates a snapshot tarball
pushes the tarball to s3
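
For readers who want something concrete, the three steps above could be sketched roughly as follows. This is not the prototype's actual code. It assumes Prometheus was started with `--web.enable-admin-api` (needed for the snapshot endpoint) and that the snapshot lands under `<data-dir>/snapshots/<name>`. The function names and the `upload` callable are hypothetical, standing in for whatever S3 client you use.

```python
import json
import tarfile
import urllib.request
from pathlib import Path


def parse_snapshot_name(body: bytes) -> str:
    """Extract the snapshot directory name from the admin API response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"snapshot failed: {payload}")
    return payload["data"]["name"]


def trigger_snapshot(prom_url: str) -> str:
    """Step 1: ask Prometheus to create a TSDB snapshot."""
    # Only works when Prometheus runs with --web.enable-admin-api.
    req = urllib.request.Request(
        f"{prom_url}/api/v1/admin/tsdb/snapshot", method="POST"
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return parse_snapshot_name(resp.read())


def make_tarball(snapshot_dir: Path, out_path: Path) -> int:
    """Step 2: pack the snapshot directory and return the tarball size in bytes."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(snapshot_dir, arcname=snapshot_dir.name)
    return out_path.stat().st_size


def run_pipeline(prom_url: str, data_dir: Path, work_dir: Path, upload) -> int:
    """Run all three steps; `upload` is a hypothetical callable taking a Path."""
    name = trigger_snapshot(prom_url)
    tarball = work_dir / f"{name}.tar.gz"
    size = make_tarball(data_dir / "snapshots" / name, tarball)
    upload(tarball)  # Step 3: e.g. a thin wrapper around your S3 client
    return size
```

Passing the upload step in as a callable keeps the sketch free of third-party dependencies; in practice it might wrap something like boto3's `upload_file`.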

You can then add a cronjob that hits the /snap endpoint to start snapshot backups when necessary. I also set my prometheus server to a very high retention value.

The server also works as an exporter. The value of the prom_backup_size gauge captures the size in bytes of the latest snapshot. A few moments (a configurable option) after the pipeline completes the process, we reset the gauge value to zero.
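
To make the gauge behaviour easier to discuss, here is a minimal sketch of the set-then-reset logic, written with the standard library only. The class name, method names, and the `reset_after_seconds` parameter are made up for illustration; only the metric name `prom_backup_size` comes from the description above.

```python
import threading


class BackupSizeGauge:
    """Holds the latest snapshot size and drops back to zero after a delay."""

    def __init__(self, reset_after_seconds: float):
        self._reset_after = reset_after_seconds
        self._value = 0
        self._generation = 0  # guards against a stale timer resetting a newer value
        self._timer = None
        self._lock = threading.Lock()

    def record(self, size_bytes: int) -> None:
        """Call when the pipeline completes; schedules the reset to zero."""
        with self._lock:
            self._value = size_bytes
            self._generation += 1
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(
                self._reset_after, self._reset, args=(self._generation,)
            )
            self._timer.daemon = True
            self._timer.start()

    def _reset(self, generation: int) -> None:
        with self._lock:
            if generation == self._generation:
                self._value = 0

    def render(self) -> str:
        """Return the gauge in Prometheus text exposition format for /metrics."""
        with self._lock:
            value = self._value
        return (
            "# HELP prom_backup_size Size in bytes of the latest snapshot.\n"
            "# TYPE prom_backup_size gauge\n"
            f"prom_backup_size {value}\n"
        )
```

One consequence of this design is that the non-zero value is only visible if a scrape happens inside the reset window, so the delay presumably needs to be longer than the scrape interval.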

You can use this metric to monitor and alert on the status of the snapshots. I am not sure this metric is the best way to expose the healthiness of the system. I’d love to hear your thoughts.

What do you use to back up your Prometheus servers? Is this a sensible approach?

Thank you!
-drd

drio · July 14, 2022, 8:32pm

I wanted to follow up on this.

I have a nice setup for a single prometheus server. Next, I wanted to incorporate high availability (HA) and fault tolerance (FT).

From what I have read, the best approach is to have multiple Prometheus servers scraping the same exporters, so that every server holds the same time series.

My question is, is anyone using this setup? How does it work for you? Do you have other suggestions to achieve HA and FT for prometheus?

Thank you,
-drd

drio · July 15, 2022, 4:59pm

Another question I had with this approach: what happens if scraping fails for a period of time? Is prometheus going to request the missing data? Or is the data lost for that failure interval?

stuart · July 15, 2022, 5:30pm

If scrapes fail that data is lost - a scrape just returns data about “now”. However the idea of having a pair is that the likelihood of both failing to scrape is hopefully low. You generally then put some type of proxy in front of the two servers that deduplicates, or use a central store (e.g. Thanos) so that when you query you “fill in the gaps” and get continuous data.
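
As a toy illustration of the "fill in the gaps" idea (this is not how Thanos or any particular proxy implements deduplication), the sketch below takes the samples of one series from two replicas and keeps a secondary sample only where the primary has nothing nearby. The function name, the `(timestamp, value)` tuple layout, and the rule of treating anything within one scrape interval as a duplicate are all assumptions made for the example.

```python
from bisect import bisect_left


def fill_gaps(primary, secondary, scrape_interval):
    """Merge two replicas' samples for one series.

    Both inputs are lists of (timestamp, value) sorted by timestamp.
    A secondary sample is kept only if no primary sample lies within
    one scrape interval of it, i.e. only where the primary has a gap.
    """
    primary_ts = [ts for ts, _ in primary]
    merged = list(primary)
    for ts, value in secondary:
        i = bisect_left(primary_ts, ts)
        # Nearest primary timestamps on either side of this sample.
        neighbours = primary_ts[max(i - 1, 0): i + 1]
        if all(abs(ts - p) >= scrape_interval for p in neighbours):
            merged.append((ts, value))
    return sorted(merged)


# The primary missed the scrapes around t=45 and t=60.
primary = [(0, 1.0), (15, 1.0), (30, 1.0), (75, 1.0)]
secondary = [(5, 2.0), (20, 2.0), (35, 2.0), (50, 2.0), (65, 2.0), (80, 2.0)]
print(fill_gaps(primary, secondary, scrape_interval=15))
```

The data is still lost on the replica that failed to scrape; the merge only hides the gap at query time.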

Sergei_Shablovsky · December 24, 2023, 1:02am

Thank you for the comment.

Could you please explain this strategy in detail?
