Hello!
I am monitoring the status (up or down) of my services using the default up metric. I understand the value of this metric is 1 if the last scrape was successful, else it is 0. My metrics are available at my-domain.com/service-name/prometheus, which is only down when the whole service is down.
I have the following queries:
up{instance="my-dev-domain.com", service_name=~".*service"}
up{instance="my-prod-domain.com", service_name=~".*service"}
The names of the services all end in service.
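To rule out the matcher itself, here is a minimal Python sketch of how I understand the regex to behave. It assumes the intended pattern is .*service (as far as I know, a bare leading * is not valid regex syntax) and that PromQL regex matchers are fully anchored, which I approximate with re.fullmatch. The helper name matches_service is hypothetical, and the third name in the loop is made up for contrast.

```python
import re

# Assumed intended pattern. PromQL anchors regex matchers at both ends,
# which is approximated here with re.fullmatch.
PATTERN = re.compile(r".*service")


def matches_service(name: str) -> bool:
    """Return True if the whole label value matches the pattern."""
    return PATTERN.fullmatch(name) is not None


# The first two names come from my config; the third is a made-up counterexample.
for name in ["backendservice", "frontendservice", "servicegateway"]:
    print(name, matches_service(name))
```

If this approximation holds, both of my service names match and the matcher is unlikely to be what links the two queries.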
Problem
When I bring down any service in the dev environment, the second query also returns 0 for that service. This is an issue because it triggers alerting and a change in my Grafana visuals, even though the service was never down in prod to begin with (I would have noticed that). I don’t understand why Prometheus evaluates the second query to be 0, when the instance label is clearly different.
What I expected to happen
The second query should be independent from the first one and not return 0 for a given service when the first one does.
What I tried
I did a sanity check of the DNS records of the two domains and they are set up correctly. So Prometheus should be scraping different data for the two environments. I also tried adding a new label, environment=dev or environment=prod, to each target, hoping that would create less ambiguity between queries (even though instance was already there), but it didn’t make a difference.
It seems to me that Prometheus is somehow mistaking the dev and prod targets for each other.
Configuration
I deploy Prometheus via Helm. Here’s a snippet from my config:
prometheus.yml:
  scrape_configs:
    - job_name: my-services
      scrape_interval: 15s
      basic_auth:
        username_file: /etc/prometheus/secrets/basicauth/username
        password_file: /etc/prometheus/secrets/basicauth/password
      static_configs:
        - targets:
            - my-dev-domain.com
            - my-prod-domain.com
          labels:
            service_name: backendservice
        - targets:
            - my-dev-domain.com
            - my-prod-domain.com
          labels:
            service_name: frontendservice
      relabel_configs:
        - source_labels: [ service_name ]
          target_label: __metrics_path__
          replacement: /service/$1/prometheus
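To spell out what I expect this config to produce, here is a rough Python sketch of the relabel rule as I understand it. It assumes the rule uses the default replace action with the default regex (.*), that the scheme is http because my snippet does not set one, and that the instance label ends up equal to the target address. The function names relabel_metrics_path and scrape_targets are hypothetical and only illustrate my mental model, not Prometheus's actual implementation.

```python
import re
from itertools import product

# Values copied from the config above.
TARGETS = ["my-dev-domain.com", "my-prod-domain.com"]
SERVICE_NAMES = ["backendservice", "frontendservice"]
REPLACEMENT = "/service/$1/prometheus"


def relabel_metrics_path(service_name: str) -> str:
    """Mimic a 'replace' relabel rule with the assumed default regex (.*)."""
    match = re.fullmatch(r"(.*)", service_name)
    # Substitute the first capture group for $1 in the replacement string.
    return REPLACEMENT.replace("$1", match.group(1))


def scrape_targets(scheme: str = "http"):
    """List (instance, service_name, url) for every target/label pair."""
    result = []
    for service_name, target in product(SERVICE_NAMES, TARGETS):
        path = relabel_metrics_path(service_name)
        result.append((target, service_name, f"{scheme}://{target}{path}"))
    return result


for instance, service_name, url in scrape_targets():
    print(instance, service_name, url)
```

Under these assumptions I would expect four independent targets, one per combination of domain and service, each with its own up series.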
I am running Prometheus 3.5.0.
Thanks and let me know if something’s unclear. I put this post in the Prometheus server category as I feel the problem is perhaps not PromQL related.