Hi,
New around these parts. I am fairly new to Prometheus. I have the basics sorted and am currently monitoring some HAProxy load balancers with a single Prometheus instance and Grafana (a single VM job). I am now looking at ways to scale this into a resilient platform, as we want to roll it out further across the estate. However, I am confused.
From what I am reading, I think there is no such thing as a “cluster” of Prometheus servers in the way that some applications cluster around a shared database.
How does one go about ensuring that reporting is accurate? For example, if Prometheus servers 1 and 2 are both monitoring endpoint A, but server 1 goes offline for, say, 10 minutes (outage, updates, reboot, etc.), then when it comes back up it will be missing the data for that 10-minute window, which will still be available on server 2. But if reporting/graphing is reading from server 1, it will show a 10-minute gap.
Is the idea that you federate both Prometheus instances to a third server and then run the reporting/graphing from that third, federated server? I.e., if server 1 is missing the 10 minutes of data, it doesn’t matter because server 2 will federate that data to the third instance?
Thanks in advance
You are correct. There is no clustering in the traditional sense. The TSDB is a very durable database designed to be used by a single Prometheus, and it cannot effectively be shared among multiple instances. Federation is the native “clustering” option available, and a federated server works well as your query node/data source in Grafana. It does have its challenges and appropriate use cases, though.
If, as your post indicates, you’re ultimately looking to implement a resilient observability platform that is capable of scaling, and you would like to avoid the complexity of scaling with native Prometheus alone and/or need to store the data for a longer term, you may want to look into incorporating something like Grafana Mimir to save yourself a lot of toil. There are other options, like Thanos, VictoriaMetrics, and Cortex, but we found Mimir to be a very quick and simple implementation.
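For a rough idea of what the Mimir route looks like on the Prometheus side, here is a minimal sketch of the relevant part of prometheus.yml on each server. The hostname, port, and label values are placeholders I made up, not anything from your environment. The `cluster` and `__replica__` label names are, as far as I know, the defaults Mimir’s HA deduplication looks for, and that feature has to be enabled on the Mimir side, so check the Mimir docs for your version before relying on it.

```yaml
# prometheus.yml on server 1 (server 2 is identical except for __replica__)
global:
  external_labels:
    cluster: haproxy-monitoring   # placeholder: same value on both servers
    __replica__: prometheus-1     # placeholder: unique per server, e.g. prometheus-2 on the other

remote_write:
  # Placeholder host and port; /api/v1/push is Mimir's remote-write push path
  - url: http://mimir.example.internal:8080/api/v1/push
```

With both servers writing the same series, Grafana would then query Mimir rather than either Prometheus directly, so a reboot of one server should not leave a hole in the graphs.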
Hope this helps!
I just reread your last paragraph, and realized I didn’t actually answer your question.
Yes, the federation server would be a third, standalone Prometheus whose scrape config consists of scraping the first two servers’ /federate endpoints. You could then use server 3 as your Prometheus data source in Grafana, so that the data displayed in your example would not show the 10-minute gap where server 1 was down.
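As a rough sketch, the scrape config on server 3 could look something like the following. The hostnames and the `job="haproxy"` selector are assumptions on my part; swap in whatever your two servers and scrape jobs are actually called.

```yaml
# prometheus.yml on server 3 (the federation server)
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true          # keep the labels as exposed by servers 1 and 2
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="haproxy"}'     # assumed job name; selects which series to pull
    static_configs:
      - targets:
          - 'prometheus-1.example.internal:9090'   # placeholder hostname
          - 'prometheus-2.example.internal:9090'   # placeholder hostname
```

One thing to be aware of: server 3 gets an `instance` label per federated server, so the same metric arrives as two series (one via each source), and your Grafana queries may need to aggregate or pick one to avoid doubled lines.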