Prometheus replication - Is federation the way to go?


I am trying to build a complete replica of our primary Prometheus instance so that other applications, which are not under our complete control, can access the data without posing any risk to the primary node, which we use internally and which is production-critical.

To facilitate this data replication, I used Prometheus’ built-in federation. The replica’s complete configuration is as follows:

global:
  scrape_interval: 4m
  scrape_timeout: 3m

scrape_configs:
    - job_name: 'federate'
      tls_config:
        insecure_skip_verify: true

      honor_labels: true
      metrics_path: '/federate'
      scheme: 'https'

      params:
        'match[]':
          - '{__name__=~".+"}'

      static_configs:
        - targets:
          - '*master-hostname*'

As you can see, I had to set quite large scrape interval and timeout values, because I’m asking the primary node for every series it has; such an HTTP request takes between 2 and 3 minutes to complete. That is problematic: the replica ends up with fewer data points than the primary, so the graphs we plot from it show less detail.

Is there a better way to replicate data from one Prometheus node to another than federation, even through a third-party solution?

You could look at using remote write, or just scrape the underlying
targets directly from the second instance.
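
As a rough illustration of the second option, the second instance could carry its own copy of the primary's scrape jobs instead of a federation job. This is only a sketch: the job name, interval, and target address are hypothetical placeholders, not values from this thread.

```yaml
# prometheus.yml on the second instance (sketch, not a complete config).
# Job name, interval, and target below are hypothetical placeholders;
# in practice they would mirror the primary's own scrape_configs.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'app-host-1.example.internal:9100'
```

With this approach each instance scrapes on its own schedule, so sample timestamps will not match exactly between the two, and every target is scraped twice.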

Federation is not meant for replication. Like @stuart said, enable remote write; the receiving instance needs to be started with --enable-feature=remote-write-receiver
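
A minimal sketch of the sending side, assuming the replica was started with the flag mentioned above and therefore accepts writes on `/api/v1/write`. The hostname and port are hypothetical placeholders.

```yaml
# prometheus.yml on the primary (the sender); sketch only.
# The replica address is a hypothetical placeholder.
remote_write:
  - url: 'https://replica.example.internal:9090/api/v1/write'
```

Remote write forwards samples as they are ingested, so data recorded before it was enabled is not copied to the receiver.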

Okay, thank you both for the suggestions. I am sorry it took me a while to reply. However, even after setting up remote write from one of our testing instances to my Prometheus instance, the two still do not hold identical datasets.

I tried increasing max_samples_per_send to something ridiculous (like 25k) and capacity to 10x that.
Attached is a comparison of the two instances: the left one is configured to remote-write samples to the right one.

The current setting in use is:

remote_write:
    - url:
      queue_config:
        max_samples_per_send: 25000
        capacity: 250000

The URL points to an stunnel4 tunnel leading to the replica (for security's sake).

I am at a bit of a loss as to how to achieve what I want: a 1:1 replica of the dataset. The two nodes are connected through a gigabit link, so bandwidth shouldn’t be much of a problem. Even if network issues were to arise, I don’t mind dropped samples (i.e., I don’t need an HA solution), but I need roughly the same dataset on both nodes.