Clustering Architecture

glennda37 · October 20, 2025, 7:39pm

Hi,

New around these parts. I am fairly new to Prometheus. I have the basics sorted and am currently monitoring some HAProxy load balancers with a single Prometheus instance and Grafana (a single-VM job). I am now looking at ways to scale this into a resilient platform, as we want to roll it out further across the estate; however, I am confused.

From what I am reading, I think there is no such thing as a “cluster” of Prometheus servers in the way that some applications cluster around a shared database.

How does one go about ensuring that reporting is accurate? For example, if Prometheus servers 1 and 2 are both monitoring endpoint A, but server 1 goes offline for, say, 10 minutes (outage, updates, reboot, etc.), then when it comes back up it will be missing the data for that 10-minute window, which will still be available on server 2. If reporting/graphing is reading from server 1, it will show a 10-minute gap.

Is the idea that you federate both Prometheus instances to a third server and then run the reporting/graphing from that third, federated server? That is, if server 1 is missing the 10 minutes of data it doesn’t matter, because server 2 will federate that data to the third instance?

Thanks in advance

dataexception · October 22, 2025, 2:28pm

You are correct. There is no clustering in the traditional sense. The TSDB is a very durable database designed to be used by a single Prometheus, and cannot effectively be shared among multiple instances. Federation is the native “clustering” option available, and a federated server works well as your query node/data source in Grafana. It does have its challenges and appropriate use cases, though.

If, as your post indicates, you’re ultimately looking to implement a resilient observability platform that is capable of scaling, and you would like to avoid the complexity of scaling with native Prometheus alone and/or need to store the data for a longer term, you may want to look into incorporating something like Grafana Mimir to save yourself a lot of toil. There are other options, like Thanos, VictoriaMetrics, and Cortex, but we found Mimir to be very quick and simple to implement.

Hope this helps!

dataexception · October 22, 2025, 3:25pm

I just reread your last paragraph, and realized I didn’t actually answer your question.

Yes, the federation server would be a third standalone Prometheus whose scrape config consists of scraping the /federate endpoints of the first two. You could then use server 3 as your Prometheus data source in Grafana to ensure the data displayed in your example would not show the 10-minute gap where server 1 was down.
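
For reference, here is a minimal sketch of what the scrape config on server 3 might look like. The hostnames, the job name `federate`, the 15s interval, and the `{job="haproxy"}` matcher are all assumptions for illustration; substitute whatever your first two servers actually use. How the two copies of each series are told apart or deduplicated depends on your labels (for example, per-server `external_labels`), so that part is worth verifying in your own setup.

```yaml
# prometheus.yml on server 3 (the federation server) -- illustrative only
scrape_configs:
  - job_name: 'federate'          # assumed job name
    scrape_interval: 15s          # assumed interval
    honor_labels: true            # keep the labels from the source servers
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="haproxy"}'       # assumed job name on servers 1 and 2
    static_configs:
      - targets:
          - 'prom-1.example.internal:9090'   # hypothetical server 1
          - 'prom-2.example.internal:9090'   # hypothetical server 2
```

The `match[]` parameter selects which series server 3 pulls from each source, so it is usually best to keep it as narrow as your dashboards allow.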
