Context deadline exceeded when scraping node-exporter and cadvisor


I have a Docker swarm consisting of two servers, deploys4 and deploys5. Prometheus is running on deploys4 (the manager) and should scrape metrics from both servers (node-exporter and cadvisor are deployed globally). However, it is unable to scrape any metrics from containers on deploys5.
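
For reference, the per-target error can be pulled from the Prometheus targets API with the busybox wget that ships in the image (the same shell used for the nc and traceroute output below); every deploys5 target reports the context deadline exceeded from the title:

/prometheus # wget -qO- http://localhost:9090/api/v1/targets | grep -o '"lastError":"[^"]*"'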

I am able to ping all of the containers above from within the Prometheus container, and nc -vz reports the ports as open:

/prometheus # nc -vz 10.0.117.6 9100
10.0.117.6 (10.0.117.6:9100) open
/prometheus # nc -vz 10.0.117.7 9100
10.0.117.7 (10.0.117.7:9100) open
/prometheus # traceroute 10.0.117.7
traceroute to 10.0.117.7 (10.0.117.7), 30 hops max, 46 byte packets
 1  monitoring_node-exporter.qtl1cg0qvc025bknkjtbehzxs.lwqkllxuj34yis8sx39sg3fp5.monitoring_swarm-monitoring (10.0.117.7)  0.183 ms  0.267 ms  0.296 ms
/prometheus # traceroute 10.0.117.6
traceroute to 10.0.117.6 (10.0.117.6), 30 hops max, 46 byte packets
 1  monitoring_node-exporter.erebx15hpi0ex5wofna3r4r4s.vuy4zbna0pkllb585kq7ea5vf.monitoring_swarm-monitoring (10.0.117.6)  0.009 ms  0.005 ms  0.003 ms
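
Since an open port does not prove that the HTTP response actually comes back, the metrics path itself can be fetched the same way with the busybox wget in the image; the -T 10 read timeout roughly mirrors Prometheus's default 10s scrape timeout:

/prometheus # wget -qO- -T 10 http://10.0.117.6:9100/metrics | head -n 3
/prometheus # wget -qO- -T 10 http://10.0.117.7:9100/metrics | head -n 3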

For the sake of testing, I tried turning off UFW, but it didn't change anything; I was still unable to scrape deploys5.
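
For reference, if the firewall goes back on, Docker's overlay networking needs these ports open between the two nodes in addition to whatever the services expose (this is the standard swarm requirement, not something specific to this stack):

ufw allow 2377/tcp   # swarm cluster management (manager node)
ufw allow 7946/tcp   # node-to-node gossip
ufw allow 7946/udp
ufw allow 4789/udp   # overlay network (VXLAN) data path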

This is my prometheus.yml:

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'cadvisor'
    dns_sd_configs:
    - names:
      - 'tasks.monitoring_cadvisor'
      type: 'A'
      port: 8080

  - job_name: 'node-exporter'
    dns_sd_configs:
    - names:
      - 'tasks.monitoring_node-exporter'
      type: 'A'
      port: 9100
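
One thing worth checking with this config is whether the tasks.* names actually resolve to both task IPs from inside the Prometheus container; the busybox nslookup applet is enough for that:

/prometheus # nslookup tasks.monitoring_node-exporter
/prometheus # nslookup tasks.monitoring_cadvisor

If both 10.0.117.x addresses come back (the traceroutes above suggest they do), the DNS service discovery is fine and the failure is in the scrape itself.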

And this is my stack file:

version: '3.3'
services:
  prometheus:
    image: prom/prometheus:latest
    configs:
      - source: prometheus.yml
        target: /etc/prometheus/prometheus.yml
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role == manager
      labels:
        - traefik.enable=true
        - traefik.http.routers.prometheus.rule=Host(`url`)
        - traefik.http.routers.prometheus.entrypoints=https
        - traefik.http.routers.prometheus.tls=true
        - traefik.http.services.prometheus.loadbalancer.server.port=9090
        - traefik.http.routers.prometheus.middlewares=prometheus-auth
        - traefik.http.middlewares.prometheus-auth.basicauth.users=user:pass
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    networks:
      - swarm-monitoring
      - traefik-public

  node-exporter:
    image: prom/node-exporter:latest
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    deploy:
      mode: global  # Deploy on all nodes in the swarm
    networks:
      - swarm-monitoring
      
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /sys/fs/cgroup:/sys/fs/cgroup:ro
    deploy:
      mode: global  # Deploy on all nodes in the swarm
    networks:
      - swarm-monitoring

configs:
  prometheus.yml:
    external: true
    
networks:
  traefik-public:
    external: true
  swarm-monitoring:
    driver: overlay
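
Since everything rides on the swarm-monitoring overlay, it is also worth confirming on the manager that both nodes show up as peers of that network (the deployed name carries the stack prefix, so it is monitoring_swarm-monitoring here):

docker network inspect monitoring_swarm-monitoring --format '{{json .Peers}}'

Both deploys4 and deploys5 should be listed; if deploys5 is missing, the overlay data path between the nodes never came up.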

It is also worth noting that this exact stack file and prometheus.yml were able to scrape properly on another, experimental Docker swarm.

I have this same problem, and I think it's a limitation of the DNS service discovery. In my case, 23 targets in my SRV record are enough to break the scrapes. Even though my BIND servers are serving all 23 records (1232 bytes), Prometheus seems to be choking on the DNS answer. Once I add the 23rd SRV record, all the scrapes fail and Prometheus will successfully scrape only 3-4 random targets at a time.
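
If anyone wants to rule out plain UDP truncation on their side, comparing the answer over UDP and TCP is a quick check (the record name below is only a placeholder for my real one):

dig +noall +answer +stats _exporter._tcp.example.internal SRV
dig +tcp +noall +answer +stats _exporter._tcp.example.internal SRV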

After investigating further, I found in the Prometheus DNS SD code that the limit on DNS responses is 64 KiB, which is more than enough to serve 1000+ records. So I dug into this particular exporter, which is a custom-built probe-based exporter, similar to snmp-exporter. That web service had a connection limit. Once I increased the connection limit on the exporter service, everything cleared up!
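
My exporter is custom-built, so the exact setting will not map onto anything in the stack above, but as a point of comparison node_exporter has a similar cap on parallel scrape requests that can be raised on its command line (whether that matters in the original poster's setup is only a guess on my part):

node_exporter --web.max-requests=100   # raise the cap on parallel scrape requests (0 disables the limit)

In the stack file above that would just be one more entry under the node-exporter service's command list.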

Hello, I have the exact same issue but I really don't understand your solution. Could you tell me what steps I need to take to increase the connection limit? Thanks.