I’m using Feign/Hystrix, Micrometer, and Prometheus.
The following query works, but I need to modify it and I don’t know how:
sum by(group, key) (increase(hystrix_execution_total{event="exception_thrown"}[1m])) / sum by(group, key) (increase(hystrix_execution_terminal_total[1m])) * 100 >= 5
Basically, exception_thrown is one event emitted by Hystrix, but Hystrix also emits another event, bad_request, which signifies a client error (4…). To get an accurate measure of upstream server errors, I need to subtract two metrics that have the same labels but different values, and then get a rate per minute to alert on.
The following doesn’t work; it returns "No datapoints found":
hystrix_execution_total{job="auth",key="authenticate",event="exception_thrown"} - hystrix_execution_total{job="auth",key="authenticate",event="bad_request"}
Any ideas whether it is possible to achieve this?
Here is a graph to help explain the problem. It shows Hystrix calls related to authentication. Most errors are users submitting wrong credentials (bad request), and I need to distinguish those from our authentication server's 5… errors.
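For reference, here is a sketch of the kind of query I am after. It is untested and rests on two assumptions: that the plain subtraction returns nothing because Prometheus binary operators only pair series whose label sets are identical (here the event label differs on each side), and that every bad_request is also counted under exception_thrown. Aggregating both sides down to group and key removes the event label before subtracting:

```promql
# Server-side error percentage per (group, key) over the last minute.
# Assumption: bad_request events are also counted in exception_thrown.
(
  sum by (group, key) (increase(hystrix_execution_total{event="exception_thrown"}[1m]))
  -
  sum by (group, key) (increase(hystrix_execution_total{event="bad_request"}[1m]))
)
/
sum by (group, key) (increase(hystrix_execution_terminal_total[1m]))
* 100 >= 5
```

For the single-series test, the same idea can be written with a vector matching modifier, as in `hystrix_execution_total{job="auth",key="authenticate",event="exception_thrown"} - ignoring(event) hystrix_execution_total{job="auth",key="authenticate",event="bad_request"}`. In both forms, a group/key pair with no bad_request series at all would drop out of the result, so that case may need separate handling.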