Pushing Prometheus metrics through several network layers

Assume a dataflow as in the picture. Data is transported from a location to the cloud via a data gateway and bridged MQTT brokers to do Stuff™. The firewalls allow you to initiate communication from a lower network layer to a higher one, but not vice versa. You also cannot skip network layers.

The data gateway, OS, and MQTT brokers expose /metrics endpoints that serve metrics in the Prometheus format. I would like to push these metrics to the cloud so that I can monitor the data flow and servers from there.

I read about federation and the Pushgateway. Using one of these would at least solve pushing metrics through one network layer. However, the documentation recommends against using the Pushgateway for services that are always running. It would also mean implementing the push model in all services and opening more firewall ports.

I could create a set of services that replicate the metrics from one network layer to the next through the existing MQTT dataflow. But that would be a custom solution.

Is there a standard way of setting up Prometheus to push metrics through several network layers? Or existing tooling?

Prometheus is a pull-based system rather than a push-based one. However, you can use remote write to push metrics from one Prometheus server (or agent) to another, or to a different remote write store.

So the general design would be to co-locate a Prometheus server (or servers) with your brokers to do the local scraping. Then use remote write to push that data to a central Prometheus server, or something like Thanos.
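As a rough illustration of that design, a minimal `prometheus.yml` for the Prometheus instance inside one network layer might look like the sketch below. The hostnames, ports, job names, and scrape interval are placeholders that I made up, not values from the question. Only the `scrape_configs` and `remote_write` sections and the `/api/v1/write` path are standard Prometheus.

```yaml
# prometheus.yml for a Prometheus instance co-located with the brokers
# in one network layer. All hostnames, ports, and job names are
# hypothetical placeholders.
global:
  scrape_interval: 30s

scrape_configs:
  # Pull metrics from the local /metrics endpoints inside this layer.
  - job_name: mqtt-broker
    static_configs:
      - targets: ["mqtt-broker.layer1.example:9234"]
  - job_name: node
    static_configs:
      - targets: ["mqtt-broker.layer1.example:9100"]

remote_write:
  # Outbound connection from this layer to the next layer up, which is
  # the direction the firewalls permit.
  - url: "https://prometheus.layer2.example:9090/api/v1/write"
```

If the receiving side is itself a Prometheus server, it has to be started with `--web.enable-remote-write-receiver` to accept data on `/api/v1/write`. Repeating the same `remote_write` section on each intermediate instance should, in principle, relay the samples one layer at a time, though that chaining is an assumption worth testing in your own setup.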

Thank you. If I understand you and the documentation correctly, Prometheus cannot remote write to Prometheus. But you can use any of the remote write destinations, like Thanos or Timescale, and use their functionality to solve this.

A Prometheus server can be used as a remote write destination, provided it is started with the --web.enable-remote-write-receiver flag.

If you only need those Prometheus servers to scrape and then deliver the metrics to Timescale (rather than provide local storage or querying capabilities), then you can use “agent mode”, which uses fewer resources.