Feedback on time series backup strategy. How do you do it?

Hi,

I wrote a prototype to back up Prometheus time series data, and I am hoping to get some feedback from the community.

The tool/server exposes two endpoints, /metrics and /snap. When you hit the /snap endpoint, the server starts an asynchronous pipeline that:

  1. triggers a snapshot against the Prometheus server
  2. creates a snapshot tarball
  3. pushes the tarball to S3
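
For anyone who wants a concrete picture of those three steps, here is a minimal sketch in Python. It is not the prototype's actual code. It assumes Prometheus runs with --web.enable-admin-api (required for the TSDB snapshot endpoint), that the process can read the Prometheus data directory, and that boto3 is installed for the upload. The function names, the data_dir and bucket parameters, and the object key layout are all made up for illustration.

```python
import json
import os
import tarfile
import tempfile
import urllib.request


def parse_snapshot_name(body):
    """Extract the snapshot directory name from the admin API response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError("snapshot failed: %r" % (payload,))
    return payload["data"]["name"]


def trigger_snapshot(prom_url):
    """Step 1: ask Prometheus to take a TSDB snapshot.

    Uses POST /api/v1/admin/tsdb/snapshot, which is only served when
    Prometheus is started with --web.enable-admin-api.
    """
    req = urllib.request.Request(
        prom_url.rstrip("/") + "/api/v1/admin/tsdb/snapshot", method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return parse_snapshot_name(resp.read())


def make_tarball(snapshot_dir, out_path):
    """Step 2: pack the snapshot directory into a gzip tarball.

    Returns the tarball size in bytes.
    """
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(snapshot_dir, arcname=os.path.basename(snapshot_dir))
    return os.path.getsize(out_path)


def upload_to_s3(path, bucket, key):
    """Step 3: push the tarball to S3 (boto3 is a third-party dependency)."""
    import boto3  # imported lazily so the other steps work without it

    boto3.client("s3").upload_file(path, bucket, key)


def run_backup(prom_url, data_dir, bucket, upload=upload_to_s3):
    """Run the three steps in order and return the tarball size in bytes."""
    name = trigger_snapshot(prom_url)
    # Prometheus writes snapshots under <data-dir>/snapshots/<name>.
    snapshot_dir = os.path.join(data_dir, "snapshots", name)
    with tempfile.TemporaryDirectory() as tmp:
        tarball = os.path.join(tmp, name + ".tar.gz")
        size = make_tarball(snapshot_dir, tarball)
        upload(tarball, bucket, name + ".tar.gz")
    return size
```

The upload step is passed in as a parameter so it can be swapped for another object store or stubbed out in tests. In the real tool, run_backup would run in a background worker so that the /snap handler can return immediately.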

You can then add a cron job that hits the /snap endpoint to start snapshot backups on whatever schedule you need. I also set my Prometheus server to a very high retention value.

The server also works as an exporter. The prom_backup_size gauge reports the size in bytes of the latest snapshot. After a short, configurable delay once the pipeline completes, the gauge is reset to zero.
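
To make the gauge behaviour concrete, here is a rough, standard-library-only sketch of a gauge that is set when a snapshot finishes and drops back to zero after a delay. The class name and the reset_after_seconds parameter are hypothetical. A real exporter would more likely use a Prometheus client library than render the text format by hand.

```python
import threading


class BackupSizeGauge:
    """Stand-in for the prom_backup_size gauge with a delayed reset to zero."""

    def __init__(self, reset_after_seconds):
        self._reset_after = reset_after_seconds
        self._value = 0.0
        self._generation = 0  # identifies the most recent snapshot
        self._lock = threading.Lock()

    def record_snapshot(self, size_bytes):
        """Set the gauge to the snapshot size and schedule the reset."""
        with self._lock:
            self._value = float(size_bytes)
            self._generation += 1
            generation = self._generation
        timer = threading.Timer(self._reset_after, self._reset, args=(generation,))
        timer.daemon = True
        timer.start()

    def _reset(self, generation):
        # Only reset if no newer snapshot was recorded in the meantime.
        with self._lock:
            if generation == self._generation:
                self._value = 0.0

    def render(self):
        """Return the gauge in the Prometheus text exposition format."""
        with self._lock:
            value = self._value
        return (
            "# HELP prom_backup_size Size in bytes of the latest snapshot.\n"
            "# TYPE prom_backup_size gauge\n"
            "prom_backup_size %s\n" % value
        )
```

One consequence of this design is that the non-zero value is only visible if Prometheus scrapes the exporter inside the reset window, so the delay needs to be comfortably longer than the scrape interval.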

You can use this metric to monitor and alert on the status of the snapshots, but I am not sure it is the best way to expose the health of the system. I’d love to hear your thoughts.

What do you use to back up your Prometheus servers? Is this a sensible approach?

Thank you!
-drd

I wanted to follow up on this.

I have a nice setup for a single Prometheus server. Next, I want to add high availability (HA) and fault tolerance (FT).

From what I have read, the best approach is to have multiple Prometheus servers scraping the same exporters, so that every server holds the same time series.

Is anyone using this setup? How does it work for you? Do you have other suggestions for achieving HA and FT with Prometheus?

Thank you,
-drd

Another question about this approach: what happens if scraping fails for a period of time? Will Prometheus request the missing data later, or is the data for that interval lost?

If scrapes fail, that data is lost: a scrape just returns data about “now”. However, the idea of having a pair is that the likelihood of both servers failing to scrape at the same time is hopefully low. You generally then put some type of proxy in front of the two servers that deduplicates, or use a central store (e.g. Thanos), so that when you query you “fill in the gaps” and get continuous data.
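
As a toy illustration of what “fill in the gaps” means, here is a small Python sketch that merges the samples two replicas returned for the same series. It is not how Thanos or any particular proxy implements deduplication. It assumes the samples are (timestamp, value) pairs already aligned to a common step, as they would be in range query results evaluated at the same steps, and merge_replicas is a made-up name.

```python
def merge_replicas(series_a, series_b):
    """Merge samples from two replicas of the same series.

    Each series is a list of (timestamp, value) tuples. A timestamp missing
    from one replica is filled from the other. When both replicas have a
    sample for the same timestamp, replica A's value is kept.
    """
    merged = dict(series_b)
    merged.update(dict(series_a))  # replica A wins on overlap
    return sorted(merged.items())


# Replica A missed the scrape at t=30, replica B missed the one at t=15.
replica_a = [(0, 1.0), (15, 2.0), (45, 4.0)]
replica_b = [(0, 1.0), (30, 3.0), (45, 4.0)]

print(merge_replicas(replica_a, replica_b))
# [(0, 1.0), (15, 2.0), (30, 3.0), (45, 4.0)]
```

Each replica has a hole on its own, but the merged result is continuous. If both replicas miss the same scrape, the gap remains.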


Thank you for the comment.

Could you please explain this strategy in detail?