I’m considering using Prometheus to monitor a Kubernetes/microservices architecture where the number of nodes/pods varies, and Prometheus service discovery seems to be a very good fit.
External users of my Kubernetes cluster want access to the metrics it exposes, and I’d like to offer a single endpoint that gives metrics “about the whole cluster”, meaning anything inside the cluster that Prometheus autodiscovers.
In addition, I’m trying not to make any assumptions about retention time or what to keep/drop, ideally without becoming responsible for keeping a large amount of data (if somebody wants to keep 4 weeks of metrics, that’s fine, but I don’t want to be responsible for managing that myself).
All in all, I’m considering something like this:
External Prometheus (outside kubernetes) -> Internal Prometheus (inside kubernetes) -> microservices
The external Prometheus would use /federate endpoint on the internal instance and would be owned/managed by somebody external to the cluster. They would have full flexibility as to what to keep and how long to keep it for, as well as full responsibility for the storage space/replication implications of their choices.
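To make the "pull everything" part concrete, here is a minimal sketch of the URL the external Prometheus would end up requesting. The hostname and the helper name `federate_url` are made up for illustration; the only parts taken from Prometheus itself are the `/federate` path and the `match[]` parameter, of which at least one is required. The selector `{job=~".+"}` is one common way to select every series that has a `job` label.

```python
from typing import List
from urllib.parse import urlencode


def federate_url(base_url: str, selectors: List[str]) -> str:
    """Build a /federate URL with one match[] parameter per selector.

    Hypothetical helper for illustration; not part of any Prometheus client.
    """
    if not selectors:
        # The federation endpoint needs at least one match[] selector.
        raise ValueError("at least one match[] selector is required")
    query = urlencode([("match[]", s) for s in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Assumed internal address; replace with whatever the cluster actually exposes.
print(federate_url("http://prometheus.example.internal:9090", ['{job=~".+"}']))
```

In practice the external owner would put the same path and `match[]` selectors into their own scrape configuration rather than build the URL by hand, and could narrow the selectors later if "everything" turns out to be too much.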
The internal Prometheus would use autodiscovery and a reduced retention (e.g. 1 day?). This way, from a space point of view I shouldn’t have to worry too much, and from an access point of view I don’t need to open up my N microservices’ metrics outside the cluster. I would only have 1 endpoint (Prometheus) where everything can be read and which I can fully control.
I read (Federation, what is it good for? – Robust Perception | Prometheus Monitoring Experts) that using federation and “pulling everything” is not a good idea, but I’d like to understand whether it’s also a bad idea in the scenario described here.
In terms of load or amount of data being scraped, I don’t think the number of metrics is big. We’re talking about fewer than 5 nodes and somewhere between 2 and 50 microservices.
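As a back-of-the-envelope check on that, the sketch below applies the sizing rule from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample). The series count per target and the scrape interval are pure assumptions on my part, as is counting one scrape target per node and per microservice; replicas would multiply the target count.

```python
def estimated_disk_bytes(
    targets: int,
    series_per_target: int,
    scrape_interval_s: float,
    retention_s: float,
    bytes_per_sample: float = 2.0,  # upper end of the documented 1-2 bytes average
) -> float:
    """Rough disk estimate: retention * samples per second * bytes per sample."""
    samples_per_second = targets * series_per_target / scrape_interval_s
    return retention_s * samples_per_second * bytes_per_sample


# Upper bound from the numbers above: 5 nodes + 50 microservices.
# 1000 series per target and a 15 s interval are hypothetical values.
size = estimated_disk_bytes(
    targets=5 + 50,
    series_per_target=1000,
    scrape_interval_s=15,
    retention_s=24 * 60 * 60,  # 1 day of retention on the internal instance
)
print(f"{size / 1e6:.1f} MB")
```

Under those assumptions the internal instance stays well under a gigabyte for one day of retention, so the open question is more about whether federating everything is sound than about raw volume.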
My main goals in trying federation are:
- expose 1 endpoint instead of N (N is different in various environments and I’d like to stick to just 1)
- always expose the same number of endpoints (1), instead of N changing when applications are added to or removed from the Kubernetes environment
- limit the amount of information I need to provide to “outside users” of the metrics. If I tell them to scrape only 1 endpoint, I don’t need to tell them to scrape N today and M tomorrow whenever the deployed microservices change.
Hopefully the question makes sense; sorry for the long post.
If anybody has opinions or more material to read, I’d appreciate it.