Context deadline exceeded when scraping node-exporter and cadvisor


I have a Docker swarm consisting of two servers, deploys4 and deploys5. Prometheus is running on deploys4 (the manager) and should scrape metrics from both servers (node-exporter and cadvisor are deployed globally). However, it is unable to scrape any metrics from containers on deploys5.
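
For reference, the per-target error can be pulled from the Prometheus targets API with the busybox wget that ships in the image (the same shell used for the nc and traceroute output below); every deploys5 target reports the context deadline exceeded from the title:

/prometheus # wget -qO- http://localhost:9090/api/v1/targets | grep -o '"lastError":"[^"]*"'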

I am able to ping all of the containers above from within the Prometheus container, and nc -vz reports the ports as open:

/prometheus # nc -vz 10.0.117.6 9100
10.0.117.6 (10.0.117.6:9100) open
/prometheus # nc -vz 10.0.117.7 9100
10.0.117.7 (10.0.117.7:9100) open
/prometheus # traceroute 10.0.117.7
traceroute to 10.0.117.7 (10.0.117.7), 30 hops max, 46 byte packets
 1  monitoring_node-exporter.qtl1cg0qvc025bknkjtbehzxs.lwqkllxuj34yis8sx39sg3fp5.monitoring_swarm-monitoring (10.0.117.7)  0.183 ms  0.267 ms  0.296 ms
/prometheus # traceroute 10.0.117.6
traceroute to 10.0.117.6 (10.0.117.6), 30 hops max, 46 byte packets
 1  monitoring_node-exporter.erebx15hpi0ex5wofna3r4r4s.vuy4zbna0pkllb585kq7ea5vf.monitoring_swarm-monitoring (10.0.117.6)  0.009 ms  0.005 ms  0.003 ms
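
Since an open port does not prove that the HTTP response actually comes back, the metrics path itself can be fetched the same way with the busybox wget in the image; the -T 10 read timeout roughly mirrors Prometheus's default 10s scrape timeout:

/prometheus # wget -qO- -T 10 http://10.0.117.6:9100/metrics | head -n 3
/prometheus # wget -qO- -T 10 http://10.0.117.7:9100/metrics | head -n 3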

For the sake of testing, I tried turning off UFW, but it didn't change anything; I was still unable to scrape deploys5.
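
For reference, if the firewall goes back on, Docker's overlay networking needs these ports open between the two nodes in addition to whatever the services expose (this is the standard swarm requirement, not something specific to this stack):

ufw allow 2377/tcp   # swarm cluster management (manager node)
ufw allow 7946/tcp   # node-to-node gossip
ufw allow 7946/udp
ufw allow 4789/udp   # overlay network (VXLAN) data path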

This is my prometheus.yml:

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'cadvisor'
    dns_sd_configs:
    - names:
      - 'tasks.monitoring_cadvisor'
      type: 'A'
      port: 8080

  - job_name: 'node-exporter'
    dns_sd_configs:
    - names:
      - 'tasks.monitoring_node-exporter'
      type: 'A'
      port: 9100
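
One thing worth checking with this config is whether the tasks.* names actually resolve to both task IPs from inside the Prometheus container; the busybox nslookup applet is enough for that:

/prometheus # nslookup tasks.monitoring_node-exporter
/prometheus # nslookup tasks.monitoring_cadvisor

If both 10.0.117.x addresses come back (the traceroutes above suggest they do), the DNS service discovery is fine and the failure is in the scrape itself.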

And this is my stack file:

version: '3.3'
services:
  prometheus:
    image: prom/prometheus:latest
    configs:
      - source: prometheus.yml
        target: /etc/prometheus/prometheus.yml
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role == manager
      labels:
        - traefik.enable=true
        - traefik.http.routers.prometheus.rule=Host(`url`)
        - traefik.http.routers.prometheus.entrypoints=https
        - traefik.http.routers.prometheus.tls=true
        - traefik.http.services.prometheus.loadbalancer.server.port=9090
        - traefik.http.routers.prometheus.middlewares=prometheus-auth
        - traefik.http.middlewares.prometheus-auth.basicauth.users=user:pass
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    networks:
      - swarm-monitoring
      - traefik-public

  node-exporter:
    image: prom/node-exporter:latest
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    deploy:
      mode: global  # Deploy on all nodes in the swarm
    networks:
      - swarm-monitoring
      
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /sys/fs/cgroup:/sys/fs/cgroup:ro
    deploy:
      mode: global  # Deploy on all nodes in the swarm
    networks:
      - swarm-monitoring

configs:
  prometheus.yml:
    external: true
    
networks:
  traefik-public:
    external: true
  swarm-monitoring:
    driver: overlay
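
Since everything rides on the swarm-monitoring overlay, it is also worth confirming on the manager that both nodes show up as peers of that network (the deployed name carries the stack prefix, so it is monitoring_swarm-monitoring here):

docker network inspect monitoring_swarm-monitoring --format '{{json .Peers}}'

Both deploys4 and deploys5 should be listed; if deploys5 is missing, the overlay data path between the nodes never came up.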

It is also worth noting that this exact stack file and prometheus.yml were able to scrape properly on another, experimental Docker swarm.

I have this same problem, and I think it's a limitation of the DNS service discovery. In my case, 23 targets in my SRV record are enough to break the scrapes. Even though my BIND servers are serving all 23 records (1232 bytes), Prometheus seems to be choking on the DNS answer. Once I add the 23rd SRV record, all the scrapes fail and Prometheus will successfully scrape only 3-4 random targets at a time.
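
If anyone wants to rule out plain UDP truncation on their side, comparing the answer over UDP and TCP is a quick check (the record name below is only a placeholder for my real one):

dig +noall +answer +stats _exporter._tcp.example.internal SRV
dig +tcp +noall +answer +stats _exporter._tcp.example.internal SRV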

After investigating further, I found in the Prometheus DNS SD code that the limit on DNS responses is 64 KiB, which is more than enough to serve 1000+ records. So I dug into this particular exporter, which is a custom-built probe-based exporter, similar to snmp-exporter. That web service had a connection limit. Once I increased the connection limit on the exporter service, everything cleared up!
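
My exporter is custom-built, so the exact setting will not map onto anything in the stack above, but as a point of comparison node_exporter has a similar cap on parallel scrape requests that can be raised on its command line (whether that matters in the original poster's setup is only a guess on my part):

node_exporter --web.max-requests=100   # raise the cap on parallel scrape requests (0 disables the limit)

In the stack file above that would just be one more entry under the node-exporter service's command list.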

Hello, I have the exact same issue but I really don't understand your solution. Could you tell me what steps I need to take to increase the connection limit? Thanks.