I work in a place where we heavily leverage AWS infrastructure: DynamoDB, SQS, EC2, etc. For monitoring our services we are using a Prometheus-based solution.
We would like to have a one-stop shop for monitoring, and the CloudWatch-specific metrics have been a thorn in our side. EC2 and container-level metrics were fairly straightforward to work around using
cAdvisor, but the ones that are published only through CloudWatch have been a pain.
We are using YACE to ingest these metrics, and even though we can get them ingested and available in Prometheus, they are of limited practical use.
There are several shortcomings where no workaround is feasible:
- It uses the CloudWatch API, and AWS adds its own latency to the metrics exposed through it, on top of the refresh latency configured in YACE. The larger the underlying metric footprint, the worse it gets. If I'm not mistaken, we end up with something like 15-25 minutes of latency for some metrics. Trying to reduce this can lead to gaps in the metrics.
- If an AWS entity's tags change in a way that means its metrics should now be exported, YACE needs to be restarted to pick up the change.
- The way metrics are translated from "CloudWatch format" to Prometheus format makes them really hard to work with. All metrics are exposed as gauges, which makes it difficult to look at them over different time intervals, get rates out of them, etc.
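To illustrate the last point, here is a minimal sketch of the arithmetic involved, assuming a CloudWatch "Sum" statistic (the total for one period) that ends up in Prometheus as a gauge. The `Sample` type, the function names, and the 300-second period are made up for the example and are not part of YACE or any AWS SDK. Since each gauge value is a per-period total rather than a running count, a per-second rate has to be derived by hand from the period length, and a counter-like series has to be rebuilt by accumulating the samples.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    # Hypothetical shape of one exported data point (not a YACE type).
    timestamp: int  # Unix seconds marking the end of the CloudWatch period
    value: float    # "Sum" statistic: total events within that one period


def per_second_rate(sample: Sample, period_seconds: int) -> float:
    """Convert a per-period total into an average per-second rate."""
    return sample.value / period_seconds


def to_running_total(samples: List[Sample]) -> List[Sample]:
    """Accumulate per-period totals into a monotonically increasing series,
    which is the shape a Prometheus counter would normally have."""
    total = 0.0
    accumulated = []
    for s in sorted(samples, key=lambda s: s.timestamp):
        total += s.value
        accumulated.append(Sample(s.timestamp, total))
    return accumulated


# Assumed example data: three consecutive 300-second periods.
samples = [Sample(300, 600.0), Sample(600, 900.0), Sample(900, 300.0)]
rates = [per_second_rate(s, 300) for s in samples]
totals = [s.value for s in to_running_total(samples)]
```

A missing period (for example, a gap caused by querying too close to "now") silently drops events from the accumulated total, which is part of why this kind of manual reconstruction is fragile.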
I assume this is not an isolated case, so I wanted to ask the community: how is everyone else dealing with AWS CloudWatch metrics in Prometheus? Are there any better exporters, or do people just default to using two different systems for monitoring?
Thanks in advance!