I am trying to wrap my head around Prometheus histograms, histogram_quantile(), and its estimation errors.
We've set up request duration metrics for a service, but the durations on our Grafana graph don't match our logs (which print the duration of each request). In other words, I see one value passed to the Prometheus collector and another, bigger value when I query it in Grafana. By bigger, I mean the logs say a request took 1.2 s, but Grafana shows somewhere around 2.3 s (for the 99th percentile).
- Here is the query (the same one is used for 0.95 and 0.50):
histogram_quantile(0.99,sum(rate(service_request_duration_seconds_bucket{path="some path"}[$__rate_interval])) by (le))
- Here are the buckets:
var durationTimeBucketsInSeconds = []float64{.01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
I did some research and found the "Histograms and summaries" page in the Prometheus documentation.
Is there any way to get an estimate closer to the real value, for example by adjusting the buckets?