Hello, community!
I’m building a Grafana dashboard to monitor the latency of requests made to a specific API. The metrics are collected via the Google Cloud Managed Service for Prometheus and queried from Grafana with PromQL, using Google Cloud Monitoring as the data source.
The queries work fine in certain situations, such as when calculating the average. However, when I try to calculate the percentile, I encounter performance issues, with the query taking too long and hitting the limit on the amount of data that can be returned. Below is the error I get when running the query:
Error:
resource_exhausted: expanding series: generic::resource_exhausted: Maximum response size has been reached. Got 200087438 bytes, limit is 200000000. Consider querying less data by increasing the step or interval, using more filters and aggregations, or limiting the time duration.
My query is too heavy, and I’m unable to calculate the percentile values. This is the query I am running:
quantile_over_time(0.99, coleta_online_request_latency_milliseconds{product_id=~'$product_id', base_name=~'$base_name'}[${__interval}])
This query works well in test environments, but in the production environment, with large data volumes, the issue occurs more frequently. I would like to know if there is a way to optimize this query while still returning percentile data.
I contacted Google Cloud support, and they informed me that the issue arises due to the large amount of data being queried. They suggested that a possible solution would be to optimize the query. However, I have already tried several approaches and haven’t been able to optimize the query effectively.
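One direction that might reduce the query-time load is precomputing the percentile with a recording rule, so the dashboard reads a single pre-aggregated series instead of scanning every raw series at render time. A minimal sketch, assuming Managed Service for Prometheus rule evaluation is available in the setup (the rule group, record name, and label grouping below are illustrative, not taken from the actual configuration):

```yaml
# Hypothetical recording rule: evaluate the 99th percentile across all
# series on each evaluation cycle, keeping product_id/base_name so the
# dashboard can still filter. Only the recorded series is queried later.
groups:
  - name: latency-percentiles
    interval: 60s
    rules:
      - record: coleta:online_request_latency_milliseconds:p99
        expr: |
          quantile by (product_id, base_name) (
            0.99,
            coleta_online_request_latency_milliseconds
          )
```

Note that this computes the percentile across series at each evaluation timestamp rather than over a time window, so it is an approximation of the original query's intent, traded for a much cheaper dashboard read.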
Limiting the query with filters works, but my goal is to obtain an overall view of latency percentiles without narrowing the scope of the data. My metrics are exported with the following structure:
# HELP coleta_online_request_latency_milliseconds gauge coleta online request latency milliseconds
# TYPE coleta_online_request_latency_milliseconds gauge
coleta_online_request_latency_milliseconds{product_id="509", base_name="Mock 1", request_status="success"} 272
coleta_online_request_latency_milliseconds{product_id="510", base_name="Mock 2", request_status="success"} 443
coleta_online_request_latency_milliseconds{product_id="521", base_name="Mock 1", request_status="error"} 588
coleta_online_request_latency_milliseconds{product_id="521", base_name="Mock 1", request_status="success"} 793
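For context on why this structure is expensive: because each request's latency is exported as a gauge sample, any percentile query must pull every raw series. If the exporter were changed to emit a Prometheus histogram type instead (a change to the instrumentation, not something the current metric supports), the bucket series could be summed server-side and the percentile derived from far less data. A sketch, assuming a hypothetical `_bucket` series that does not exist in the current setup:

```promql
# Hypothetical query if the metric were a histogram: aggregate bucket
# counters first, then estimate the 99th percentile from the buckets.
histogram_quantile(
  0.99,
  sum by (le, product_id, base_name) (
    rate(coleta_online_request_latency_milliseconds_bucket[5m])
  )
)
```

The accuracy of the estimate would then depend on the bucket boundaries chosen at instrumentation time, but the query cost stays roughly constant regardless of how many raw requests were observed.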
Does anyone in the community have suggestions for solving this issue? Is there a way to better optimize these queries to calculate latency percentiles?
Any suggestions or ideas will be greatly appreciated. I’ve tried several approaches for hours without getting a satisfactory result.
Thank you in advance for your help and support!
Best regards,
Reinan