Prometheus replication - Is federation the way to go?

Aldar · July 9, 2021, 1:29pm

Hello,

I am trying to build a complete replica of our primary Prometheus instance so that other applications, not under our complete control, can access the data without posing any risk to the primary node, which we use internally and which is production-critical.

To facilitate this data replication, I used Prometheus’ built-in federation. The replica’s complete configuration is as follows:

global:
  scrape_interval: 4m
  scrape_timeout: 3m

scrape_configs:
    - job_name: 'federate'
      tls_config:
        insecure_skip_verify: true

      honor_labels: true
      metrics_path: '/federate'
      scheme: 'https'

      params:
        'match[]':
          - '{__name__=~".+"}'

      static_configs:
        - targets:
          - '*master-hostname*'

As you can see, I had to set quite large scrape interval and timeout values, because I’m asking the primary node for all the labels and values it has, and such an HTTP request takes anywhere between 2 and 3 minutes to complete. That is problematic: the replica ends up with fewer datapoints than the primary, resulting in less detail in the graphs we plot from them.

Is there a better way to replicate data from one Prometheus node to another than federation? Even if through a 3rd party solution.

stuart · July 9, 2021, 2:32pm

You could look at using remote write, or just scrape the underlying
targets directly from the second instance.
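
As a rough illustration of the second option, the replica could carry its own copy of the primary's scrape jobs. This is only a sketch: the job name, target hostnames, port, and interval below are placeholders, not values from this thread, and it assumes the replica can reach the targets over the network.

```yaml
# prometheus.yml on the second (replica) instance: hypothetical sketch
global:
  scrape_interval: 15s   # assumed value; match whatever the primary uses

scrape_configs:
  - job_name: 'node'     # placeholder job, copied from the primary's config
    static_configs:
      - targets:
          - 'app-host-1.example.internal:9100'   # placeholder target
          - 'app-host-2.example.internal:9100'   # placeholder target
```

Because each instance scrapes on its own schedule, the two datasets would have the same resolution but not identical timestamps or values.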

SuperQ · July 9, 2021, 3:13pm

Federation is not meant for replication. Like @stuart said, use remote write, and enable the receiver on the replica with --enable-feature=remote-write-receiver
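
A minimal sketch of the sending side, assuming the replica is started with the flag above and is reachable at a placeholder hostname (the hostname and port are not from this thread):

```yaml
# prometheus.yml on the primary (sending) instance: minimal sketch.
# Assumes the receiving instance runs with
#   --enable-feature=remote-write-receiver
# so that it accepts samples on /api/v1/write.
remote_write:
  - url: 'https://replica.example.internal:9090/api/v1/write'   # placeholder host
```

Remote write forwards samples as they are ingested, so data the primary collected before it was enabled would not appear on the replica.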

Aldar · July 15, 2021, 4:07pm

Okay, thank you both for the suggestions, and sorry it took me a while to reply. Even after enabling and setting up remote write from one of our testing instances to my Prometheus instance, the two still do not hold identical datasets.

I tried increasing max_samples_per_send to something ridiculous (like 25k) and capacity to 10x that.
Attached is a comparison of the two instances - the left one is configured to remote-write samples to the right one.

The current setting in use is:

remote_write:
    - url: http://127.0.0.1:19090/api/v1/write
      # Queue tuning options are nested under queue_config
      queue_config:
        max_samples_per_send: 25000
        capacity: 250000

The URL points to an stunnel4 tunnel leading to the replica (for security's sake).

I am at a bit of a loss as to how to do what I want - a 1:1 dataset replica. The two nodes are connected through a gigabit link, so bandwidth shouldn’t be much of a problem. And even if network issues were to arise, I don’t mind dropped samples (i.e. I don’t need an HA solution), but I need roughly the same datasets on the two nodes.
