Serverless Computing, Prometheus & Push

No doubt bringing up the push vs pull debate here would be to beat a dead horse… however… with the increase in popularity of “serverless” compute platforms in recent years, such as AWS Lambda, Google Cloud Functions, Cloud Run and App Engine, I wonder if this part of the problem space doesn’t need to be reconsidered?

One thing these platforms have in common is that individual instances aren’t generally discoverable or scrapeable, yet they are not used exclusively for batch or ephemeral jobs. Autoscaling of instances is the norm, as is a short-to-medium instance lifespan.

Developers trying to build an entirely “serverless” solution would generally prefer to use Prometheus because of its great client libraries, exposition format, query language and performance, but they find themselves unable to do so.

We have the Pushgateway, but this itself needs to be a separate long-running and discoverable job, and is a potential bottleneck for large serverless workloads.
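To make the current option concrete, here is a minimal sketch of what a single function instance would do to push one counter to a Pushgateway. It uses only the Python standard library rather than the official client library, and the gateway address, job name, instance name and metric name are all placeholders I made up. The `/metrics/job/<job>/instance/<instance>` path is the Pushgateway grouping-key URL; the sketch assumes neither value contains a `/`.

```python
import urllib.request
from urllib.parse import quote


def escape_label_value(value):
    """Escape backslash, double quote and newline for the text exposition format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, value, labels=None):
    """Render a single counter sample in the Prometheus text exposition format."""
    label_str = ""
    if labels:
        pairs = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} counter\n"
        f"{name}{label_str} {value}\n"
    )


def build_push_request(gateway_url, job, instance, body):
    """Build (but do not send) a PUT request for one job/instance group."""
    if "/" in job or "/" in instance:
        # Assumption of this sketch: values with slashes are not handled here.
        raise ValueError("job and instance must not contain '/'")
    url = (
        f"{gateway_url.rstrip('/')}/metrics"
        f"/job/{quote(job, safe='')}/instance/{quote(instance, safe='')}"
    )
    return urllib.request.Request(url, data=body.encode("utf-8"), method="PUT")


def send(request, timeout=2.0):
    """Send the request; every function instance must be able to reach the gateway."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical metric and endpoint, for illustration only.
    payload = render_counter(
        "orders_processed_total", "Orders processed.", 3, {"region": "eu"}
    )
    req = build_push_request(
        "http://pushgateway.example:9091", "checkout", "inst-1", payload
    )
    print(req.get_method(), req.full_url)
```

Every autoscaled instance ends up as its own group on one long-running gateway, which is where the bottleneck concern comes from.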

Is there any appetite to reconsider first-class push support in the Prometheus server, or greater investment in making the Pushgateway a more scalable solution?

My ideal outcome would be a SaaS provider offering the Push API in such a way that it can consume all the metric streams of a large organization and store them in Prometheus. Then developers could monitor applications with Prometheus without having to deploy long-running, daemon-style infrastructure, with all its associated cost and complexity.

Thanks for considering!
-dgc

dgc (https://discuss.prometheus.io/u/dgc) wrote on March 25:

No doubt bringing up the push vs pull debate here would be to beat a
dead horse… however… with the increase in popularity of “serverless”
compute platforms in recent years, such as AWS Lambda, Google Cloud
Functions, Cloud Run and App Engine, I wonder if this part of the
problem space doesn’t need to be reconsidered?

One thing these platforms have in common is that individual instances
aren’t generally discoverable or scrapeable, yet they are not used
exclusively for batch or ephemeral jobs. Autoscaling of instances is the
norm, as is a short-to-medium instance lifespan.

Developers trying to build an entirely “serverless” solution would
generally prefer to use Prometheus because of its great client
libraries, exposition format, query language and performance, but they
find themselves unable to do so.

We have the Pushgateway, but this itself needs to be a separate
long-running and discoverable job, and is a potential bottleneck for
large serverless workloads.

Is there any appetite to reconsider first-class push support in the
Prometheus server, or greater investment in making the Pushgateway a more
scalable solution?

As you are probably aware, the Pushgateway is really designed for
individual short-lived jobs (e.g. cron) rather than dynamic serverless
applications. In general, for serverless/managed systems from the major
cloud providers, the only option is the vendor’s logging & metrics
solution. For example, if you want metrics for a (managed) AWS load
balancer you need to look at CloudWatch, or if you want Lambda logs, look
at CloudWatch Logs. While you can overlay another metrics/logging
system to some extent (for example, you could send Lambda logs to
Elasticsearch & metrics to StatsD), it would only cover certain
parts: it would be based on your code, so it wouldn’t include anything
outside of that (for example, no total function duration, network
traffic, cost, logs if it failed to start, etc.).

As a result, you are basically forced to use the built-in metrics/logging
solutions from the cloud provider. In an ideal world they would expose
this information as a scrapeable Prometheus-format endpoint (AWS MSK
actually has an option for this), but I wouldn’t hold my breath.

“Push” doesn’t significantly help here: you would still only be able to
use it for a tiny selection of metrics from your code (e.g. a custom
business metric), meaning that you are still dependent on the vendor’s
metrics system.

Instead, I think you should be looking at things like the CloudWatch
Exporter to pull metrics from the vendor system into the Prometheus
ecosystem. It is the only way to get many useful metrics (serverless
networking information, etc.).
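To illustrate the exporter idea, here is a rough sketch of the translation step only: vendor datapoints in, Prometheus text format out, ready to be served on a scrapeable endpoint. The datapoint dictionaries are a made-up shape, not the CloudWatch API, and the naming scheme is just an illustration rather than a claim about what the real CloudWatch Exporter emits.

```python
import re


def metric_name(namespace, metric, statistic):
    """Build a snake_case Prometheus metric name from vendor identifiers."""
    raw = f"{namespace}_{metric}_{statistic}"
    # Split CamelCase words, then replace anything that is not [a-zA-Z0-9_].
    snake = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", raw)
    return re.sub(r"[^a-zA-Z0-9_]+", "_", snake).lower()


def render_gauges(datapoints):
    """Render hypothetical vendor datapoints as gauges in the text exposition format.

    Each datapoint is assumed to be a dict with the keys
    namespace, metric, statistic, dimensions and value.
    """
    grouped = {}
    for point in datapoints:
        name = metric_name(point["namespace"], point["metric"], point["statistic"])
        grouped.setdefault(name, []).append(point)

    lines = []
    # Samples of one metric must be listed together, so emit metric by metric.
    for name in sorted(grouped):
        lines.append(f"# TYPE {name} gauge")
        for point in grouped[name]:
            labels = ",".join(
                f'{key}="{val}"' for key, val in sorted(point["dimensions"].items())
            )
            lines.append(f"{name}{{{labels}}} {point['value']}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical datapoints, standing in for a vendor API response.
    sample = [
        {
            "namespace": "AWS/Lambda",
            "metric": "Duration",
            "statistic": "Average",
            "dimensions": {"FunctionName": "checkout"},
            "value": 12.5,
        },
    ]
    print(render_gauges(sample), end="")
```

The point is that the platform-level numbers (duration, concurrency and so on) come from the vendor side, so nothing in the function code has to push anything.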

My ideal outcome would be a SaaS provider offering the Push API in
such a way that it can consume all the metric streams of a large
organization and store them in Prometheus. Then developers could
monitor applications with Prometheus without having to deploy
long-running, daemon-style infrastructure, with all its associated cost
and complexity.

My ideal would be the vendor offering an endpoint from which you can
pull Prometheus metrics. Push can be easier for some things, but it also
has its own problems. For example, it might only allow for a single
destination (meaning you couldn’t have HA servers or a distributed
system design, or would have to run a distribution application) or only
a single interval (even with multiple destinations).

There is an aggregation gateway that is explicitly designed for this push use case.
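As a rough illustration of the idea (not the actual gateway's code), the sketch below assumes the gateway sums counter increments that arrive with identical names and labels, so that short-lived instances collapse into one scrapeable series. The class and metric names are hypothetical.

```python
from collections import defaultdict


class CounterAggregator:
    """Sum pushed counter increments per (metric name, label set)."""

    def __init__(self):
        self._totals = defaultdict(float)

    def push(self, name, labels, increment):
        """Add one increment pushed by some short-lived instance."""
        if increment < 0:
            raise ValueError("counters can only increase")
        key = (name, tuple(sorted(labels.items())))
        self._totals[key] += increment

    def render(self):
        """Render the aggregated totals in the text exposition format."""
        lines = []
        for (name, labels), total in sorted(self._totals.items()):
            label_str = ",".join(f'{key}="{val}"' for key, val in labels)
            if label_str:
                lines.append(f"{name}{{{label_str}}} {total}")
            else:
                lines.append(f"{name} {total}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    gateway = CounterAggregator()
    # Two different function instances report the same route.
    gateway.push("requests_total", {"route": "/pay"}, 2)
    gateway.push("requests_total", {"route": "/pay"}, 3)
    gateway.push("requests_total", {"route": "/cart"}, 1)
    print(gateway.render(), end="")
```

Because no per-instance label is kept, cardinality stays flat however many instances the platform spins up, at the price of losing per-instance detail.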

But the big thing I would love to have is native Prometheus/OpenMetrics support from these platforms. That is difficult, though, because it would take side-effect sales away from the platforms’ (AWS Lambda, GCP Cloud Functions, etc.) own monitoring solutions (CloudWatch, Stackdriver, etc.).