Feedback on time series backup strategy. How do you do it?


I wrote a prototype to back up Prometheus time series data, and I'm hoping to get some feedback from the community.

The tool is a small server that exposes two endpoints, /metrics and /snap. When you hit the /snap endpoint, the server starts an asynchronous pipeline that:

  1. triggers a snapshot against the Prometheus server
  2. creates a tarball of the snapshot
  3. pushes the tarball to S3
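
To make the three steps concrete, here is a minimal Python sketch of such a pipeline. It is not the prototype's actual code. It assumes the snapshot is triggered through Prometheus's TSDB admin API (POST /api/v1/admin/tsdb/snapshot, which needs the server to run with --web.enable-admin-api) and that the process can read the Prometheus data directory. The function names and the `upload` callable are hypothetical; `upload` stands in for whatever pushes the file to S3.

```python
import json
import tarfile
import urllib.request
from pathlib import Path


def trigger_snapshot(prom_url, urlopen=urllib.request.urlopen):
    """Step 1: ask Prometheus for a TSDB snapshot and return its name."""
    req = urllib.request.Request(
        f"{prom_url}/api/v1/admin/tsdb/snapshot", data=b"", method="POST"
    )
    with urlopen(req) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"snapshot failed: {body}")
    return body["data"]["name"]


def make_tarball(snapshot_dir, out_path):
    """Step 2: pack the snapshot directory; return the tarball size in bytes."""
    snapshot_dir = Path(snapshot_dir)
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(snapshot_dir, arcname=snapshot_dir.name)
    return Path(out_path).stat().st_size


def run_backup(prom_url, data_dir, work_dir, upload,
               urlopen=urllib.request.urlopen):
    """Run steps 1-3. `upload` is a hypothetical callable taking a Path."""
    name = trigger_snapshot(prom_url, urlopen)
    tarball = Path(work_dir) / f"{name}.tar.gz"
    # Prometheus writes snapshots under <data-dir>/snapshots/<name>.
    size = make_tarball(Path(data_dir) / "snapshots" / name, tarball)
    upload(tarball)  # Step 3: e.g. a thin wrapper around an S3 client
    return size
```

Passing `upload` and `urlopen` as parameters keeps the sketch free of third-party dependencies and lets you exercise the pipeline without a live Prometheus server or an S3 bucket.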

You can then add a cron job that hits the /snap endpoint to start snapshot backups on whatever schedule you need. I also set my Prometheus server to a very high retention value.

The server also works as an exporter. The prom_backup_size gauge reports the size in bytes of the latest snapshot. After the pipeline completes, the server waits for a short, configurable delay and then resets the gauge to zero.
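
The set-then-reset behaviour of the gauge can be sketched roughly as follows. This is an illustration using only the Python standard library, not the prototype's implementation; the class name, `record`, and `render` are hypothetical, and a real exporter would more likely use a Prometheus client library instead of formatting the exposition text by hand.

```python
import threading


class BackupSizeGauge:
    """Holds the latest snapshot size and resets it to zero after a delay."""

    def __init__(self, reset_after_seconds):
        self._reset_after = reset_after_seconds
        self._lock = threading.Lock()
        self._value = 0
        self._generation = 0  # guards against a stale timer resetting a newer value

    def record(self, size_bytes):
        """Call when a backup completes; schedules the reset to zero."""
        with self._lock:
            self._value = size_bytes
            self._generation += 1
            generation = self._generation
        timer = threading.Timer(self._reset_after, self._reset, args=(generation,))
        timer.daemon = True
        timer.start()

    def _reset(self, generation):
        with self._lock:
            # Only reset if no newer backup has been recorded in the meantime.
            if generation == self._generation:
                self._value = 0

    def render(self):
        """Return the gauge in the Prometheus text exposition format."""
        with self._lock:
            value = self._value
        return f"# TYPE prom_backup_size gauge\nprom_backup_size {value}\n"
```

With this shape, the reset delay has to be longer than the scrape interval, otherwise Prometheus may never observe the non-zero value.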

You can use this metric to monitor and alert on the status of the snapshots. I am not sure this metric is the best way to expose the health of the system, so I’d love to hear your thoughts.

What do you use to back up your Prometheus servers? Is this a sensible approach?

Thank you!