Prometheus histogram_quantile() is producing inaccurate estimates of request duration


I am trying to wrap my head around Prometheus histograms, histogram_quantile(), and estimation errors.

We’ve set up request duration metrics for a service, but the durations on our Grafana graph keep disagreeing with our logs (which print the duration). In other words, I see one value passed to the Prometheus collector and a larger value after querying it in Grafana. For example, the logs say a request took 1.2 s, but Grafana shows somewhere around 2.3 s for the 99th percentile.

  • Here is the query (the same one is used for .95 and .50):
histogram_quantile(0.99,sum(rate(service_request_duration_seconds_bucket{path="some path"}[$__rate_interval])) by (le))
  • Here are the buckets:
var durationTimeBucketsInSeconds = []float64{.01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}

I did some research and found the "Histograms and summaries" page in the Prometheus documentation.

Is there any way to get an estimate closer to the real value, for example by adjusting the buckets?
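
As a rough illustration of where a gap like this can come from: for classic histograms, histogram_quantile() only knows how many observations fell into each bucket, and it assumes they are spread evenly inside the bucket that contains the requested rank (linear interpolation). The Go sketch below is a simplified re-implementation, not Prometheus's actual code; the function name estimateQuantile and the counts are made up for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// estimateQuantile is a simplified, hypothetical stand-in for what
// histogram_quantile() does with classic buckets.
// upperBounds holds the finite "le" values in ascending order,
// cumulative[i] is the number of observations <= upperBounds[i],
// and total is the count in the +Inf bucket.
func estimateQuantile(q float64, upperBounds, cumulative []float64, total float64) float64 {
	if total == 0 || len(upperBounds) == 0 || len(upperBounds) != len(cumulative) {
		return math.NaN()
	}
	rank := q * total
	for i, ub := range upperBounds {
		if cumulative[i] < rank {
			continue
		}
		// The lowest bucket is assumed to start at 0.
		lower, below := 0.0, 0.0
		if i > 0 {
			lower, below = upperBounds[i-1], cumulative[i-1]
		}
		inBucket := cumulative[i] - below
		if inBucket == 0 {
			return lower
		}
		// Linear interpolation between the bucket's bounds.
		return lower + (ub-lower)*(rank-below)/inBucket
	}
	// The rank falls into the +Inf bucket: report the highest finite bound.
	return upperBounds[len(upperBounds)-1]
}

func main() {
	bounds := []float64{.01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
	// Made-up cumulative counts: 185 of 200 requests finish within 1s,
	// the other 15 land in the (1, 2.5] bucket (say, all around 1.2s).
	cumulative := []float64{0, 0, 0, 20, 80, 150, 185, 200, 200, 200}
	fmt.Printf("p99 estimate: %.2fs\n", estimateQuantile(0.99, bounds, cumulative, 200))
	// prints "p99 estimate: 2.30s"
}
```

With these invented counts, every slow request could really have taken 1.2 s and the estimate would still be 2.3 s, because the only thing the histogram records is that they fell somewhere in the (1, 2.5] bucket.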

What does the distribution of request durations look like? Are they spread throughout the various buckets or are they mostly between 1 and 2.5 seconds?

We don’t really know, hence the implementation of the metrics. But if they were mostly between 1 and 2.5 seconds, would getting a better estimate mean putting more buckets in that range?
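
One way to think about this: the estimate for a quantile can land anywhere inside the bucket that contains it, so the bucket width is roughly the worst-case error. The Go sketch below compares the buckets from the question with a finer layout; the finer boundaries and the helper name enclosingBucket are assumptions chosen for illustration, not a recommendation for this particular service.

```go
package main

import (
	"fmt"
	"sort"
)

// enclosingBucket returns the (lower, upper] bucket that an observation v
// falls into, given ascending finite upper bounds. ok is false when v only
// fits the implicit +Inf bucket. The lowest bucket is taken to start at 0.
func enclosingBucket(v float64, upperBounds []float64) (lower, upper float64, ok bool) {
	// First index whose upper bound is >= v.
	i := sort.SearchFloat64s(upperBounds, v)
	if i == len(upperBounds) {
		return 0, 0, false
	}
	if i > 0 {
		lower = upperBounds[i-1]
	}
	return lower, upperBounds[i], true
}

func main() {
	// Buckets from the question.
	coarse := []float64{.01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
	// Hypothetical alternative with extra boundaries between 1 and 2.5.
	finer := []float64{.01, .025, .05, .1, .25, .5, 1, 1.25, 1.5, 2, 2.5, 5, 10}

	for _, b := range [][]float64{coarse, finer} {
		lo, hi, _ := enclosingBucket(1.2, b)
		fmt.Printf("%d buckets: 1.2s falls in (%g, %g]\n", len(b), lo, hi)
	}
	// prints:
	// 10 buckets: 1.2s falls in (1, 2.5]
	// 13 buckets: 1.2s falls in (1, 1.25]
}
```

With the finer layout, a quantile that lands near 1.2 s can be off by at most 0.25 s instead of 1.5 s. The trade-off is that every extra bucket is an extra time series per label combination, so it usually makes sense to add boundaries only in the range where the interesting quantiles fall.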