Hello! For some reason, Prometheus seems to be dropping cAdvisor metrics for about half of our ~250 nodes. Prometheus appears to be scraping these nodes, but I cannot see cAdvisor metrics for many of them. What could be causing this? I would expect Prometheus to either keep metrics from all nodes or drop metrics from all nodes.

For an affected node:

    scrape_samples_scraped{job="kube-cadvisor", kubernetes_io_hostname="node-missing-metrics"} = 0
    scrape_samples_post_metric_relabeling{job="kube-cadvisor", kubernetes_io_hostname="node-missing-metrics"} = ~1500
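To make the two counters easier to read side by side, here is a minimal sketch of how I understand them, assuming the documented meanings: `scrape_samples_scraped` is the number of samples the target exposed, and `scrape_samples_post_metric_relabeling` is the number left after `metric_relabel_configs` ran. The function name `classify_scrape` is hypothetical and only for illustration.

```python
def classify_scrape(scraped: int, post_relabel: int) -> str:
    """Say where samples were lost for one target, given the two counters."""
    if post_relabel > scraped:
        # Metric relabeling can only drop or rewrite samples, never add them.
        return "inconsistent: post-relabel count exceeds scraped count"
    if scraped == 0:
        return "target exposed no samples (or the scrape failed)"
    if post_relabel == 0:
        return "all samples dropped by metric_relabel_configs"
    if post_relabel < scraped:
        return "some samples dropped by metric_relabel_configs"
    return "all samples kept"


print(classify_scrape(1500, 0))
print(classify_scrape(0, 1500))
```

With 0 scraped and ~1500 after relabeling this lands in the "inconsistent" branch, so it may be worth double-checking which counter carries which value for the affected nodes.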
**Environment**
- Kubernetes: 1.16.8
- Prometheus: 2.26.0
**Kube-cadvisor job from prometheus.yml**

Note: I checked, and our "samples_scraped" value is under our `sample_limit`. I also don't see any nodes labeled `prometheus_io_scrape=false` or anything similar.

    job_name: 'kube-cadvisor'
    sample_limit: 5000000
    scheme: https
    scrape_interval: 15s
    metrics_path: /metrics/cadvisor
    kubernetes_sd_configs:
      - role: node
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecure_skip_verify: true
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__address__]
        action: replace
        regex: (.+):(?:\d+)
        replacement: ${1}:10250
        target_label: __address__
      - source_labels: [__address__]
        regex: (.+):(?:\d+)
        target_label: instance
        replacement: ${1}
      - source_labels: [__meta_kubernetes_node_label_prometheus_io_scrape]
        regex: 'false'
        action: drop
    metric_relabel_configs:
      # cAdvisor
      - source_labels: [pod_name]
        regex: .+
        action: keep
      - source_labels: [__name__]
        regex: container_tasks_state
        action: drop
      - source_labels: [__name__, interface]
        regex: container_network.+;(?:tun.+|dummy.+|kube-.+|docker.+)
        action: drop
      - source_labels: [__name__, scope]
        regex: container_memory_.+;hierarchy
        action: drop
      - source_labels: [__name__]
        regex: container_spec_memory_(limit_bytes|swap_limit_bytes)
        action: drop
      - source_labels: [__name__]
        regex: container_network_(tcp|udp)_usage_total
        action: drop
      - source_labels: [__name__]
        regex: container_cpu_(user_seconds_total|system_seconds_total|load_average_10s|cfs_.+)
        action: drop
      - source_labels: [__name__]
        regex: container_fs_(inodes_.+|limit_bytes|reads_merged_total|read_seconds_total|writes_merged_total|write_seconds_total|io_.+|sector_.+|usage_bytes|writes_total|reads_total)
        action: drop
      - source_labels: [__name__]
        regex: container_memory_(mapped_file)
        action: drop
      - regex: 'beta_kubernetes_io_arch'
        action: labeldrop
      - regex: 'beta_kubernetes_io_os'
        action: labeldrop
      - regex: 'beta_kubernetes_io_instance_type'
        action: labeldrop
      - regex: 'failure_domain_beta_kubernetes_io_region'
        action: labeldrop
      - regex: 'failure_domain_beta_kubernetes_io_zone'
        action: labeldrop
      - source_labels: [container_name]
        target_label: container
      - source_labels: [pod_name]
        target_label: pod
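
To spell out what the first `metric_relabel_configs` rule does, here is a minimal Python sketch of a `keep` rule with a single source label. It assumes the documented relabeling behaviour (the regex is anchored at both ends, and an absent label is read as an empty string). The helper `keep_by_label` and the three sample series are hypothetical, not taken from a real scrape, and whether any of our nodes actually exposes series without `pod_name` is something I have not confirmed.

```python
import re


def keep_by_label(labels: dict, source_label: str, regex: str) -> bool:
    """Mimic a relabel `keep` rule that reads one source label.

    An absent label is treated as "", and the regex must match the
    whole value (re.fullmatch stands in for Prometheus' anchoring).
    """
    value = labels.get(source_label, "")
    return re.fullmatch(regex, value) is not None


# Hypothetical series for illustration only.
with_pod_name = {"__name__": "container_cpu_usage_seconds_total", "pod_name": "web-0"}
with_pod_only = {"__name__": "container_cpu_usage_seconds_total", "pod": "web-0"}
node_level = {"__name__": "container_cpu_usage_seconds_total", "id": "/"}

for series in (with_pod_name, with_pod_only, node_level):
    # Same rule as in the config: source_labels [pod_name], regex .+, action keep
    print(keep_by_label(series, "pod_name", ".+"))
# prints True, False, False
```

In this sketch, any series that arrives without a `pod_name` label is dropped by the very first rule, before the later `drop`, `labeldrop`, and rename rules are evaluated.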