Exporting AWS Service Metrics into Prometheus

Hi everyone!

I work in a place where we heavily leverage AWS infrastructure: DynamoDB, SQS, EC2, etc. For monitoring our services we are using a Prometheus-based solution.

We would like a one-stop shop for monitoring, and the CloudWatch-specific metrics have been a thorn in our side. EC2 and container-level metrics were fairly straightforward to cover with node_exporter and cAdvisor, but the metrics that are only published through CloudWatch have been a pain.
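For context, the node_exporter side is just standard EC2 service discovery; something like the sketch below (region, port and the relabeling rule are illustrative, not our exact config):

```yaml
# prometheus.yml sketch: discover EC2 instances and scrape node_exporter on them.
scrape_configs:
  - job_name: "ec2-node"
    ec2_sd_configs:
      - region: eu-west-1   # placeholder region
        port: 9100          # node_exporter default port
    relabel_configs:
      # Use the EC2 Name tag as the instance label (tag name is illustrative).
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
```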

We are using YACE to ingest these metrics, and even though we can get them ingested and available, their usefulness is limited.
There are several shortcomings for which no workaround seems feasible:

  • It uses the CloudWatch API, and AWS adds latency to the metrics exposed through it, on top of the refresh interval configured in YACE. The larger the underlying metric footprint, the worse it gets. If I’m not mistaken, all in all we see something like 15-25 min of latency for some metrics. Trying to reduce this can lead to gaps in the metrics (see the config sketch after this list).
  • If an AWS resource’s tags change in a way that means its metrics should now be exported, YACE needs to be restarted to pick it up.
  • The way metrics are translated from “CloudWatch format” to Prometheus format makes them really hard to work with. Everything is exposed as a gauge, which limits their utility: looking at them over different time intervals, getting rates out of them, etc.

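For reference, our YACE setup looks roughly like this; treat it as a sketch, since the exact field names vary between YACE versions and the namespace, metric, tag filter and intervals here are illustrative rather than our actual config. The period/length/delay values are the knobs we have been juggling to trade freshness against gaps:

```yaml
# YACE config sketch (field names may differ between YACE versions).
apiVersion: v1alpha1
discovery:
  jobs:
    - type: AWS/SQS          # illustrative namespace
      regions:
        - eu-west-1
      period: 300            # CloudWatch datapoint granularity, in seconds
      length: 600            # how far back to query; padded to avoid gaps
      delay: 300             # allow for CloudWatch's own ingestion lag
      searchTags:
        - key: monitoring    # illustrative tag filter; tag changes require a restart
          value: enabled
      metrics:
        - name: NumberOfMessagesSent
          statistics: [Sum]  # still ends up as a Prometheus gauge
```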
I assume this is not an isolated case, so I wanted to ask the community: how is everyone else dealing with AWS CloudWatch metrics in Prometheus? Are there better exporters, or do people just default to using two different systems for monitoring?

Thanks in advance!

As you say, where possible use other exporters (such as the MySQL or node exporters) to get native Prometheus-format metrics. However, some metrics are only available via CloudWatch. For those there is no choice other than to use YACE or the CloudWatch exporter to pull the metrics into Prometheus, as AWS doesn’t provide an endpoint exposing real-time Prometheus-format metrics to scrape. Many of the limitations you mention come from CloudWatch itself - my understanding is that it isn’t designed for real-time data ingestion, so delays of up to 15 minutes are to be expected. There isn’t anything Prometheus can do to work around that.
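If you do end up on the official CloudWatch exporter instead of YACE, the same trade-off is expressed per metric with period/delay settings; a rough sketch, with namespace, metric and values purely illustrative:

```yaml
# prometheus/cloudwatch_exporter config sketch (illustrative values).
region: eu-west-1
metrics:
  - aws_namespace: AWS/SQS
    aws_metric_name: NumberOfMessagesSent
    aws_dimensions: [QueueName]
    aws_statistics: [Sum]
    period_seconds: 300   # datapoint granularity
    delay_seconds: 600    # skip the most recent window CloudWatch hasn't finalised yet
```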

Ideally AWS would provide Prometheus-compatible endpoints we could scrape directly, which they are doing in a few places, but I’d imagine for some services that isn’t possible due to the underlying architecture (if the data is not handled in a way that is compatible with real-time exposure).