Metrics missing from all servers for ~1h after a single server restart

Hi everyone,

I’m running into a strange issue with my Prometheus setup and would appreciate some help.

Environment

  • OS: Debian on all servers

  • Setup:

    • Each server runs its own Prometheus instance (local scrape jobs)

    • Each local Prometheus sends data via remote write (with basic auth) to a central Prometheus “master” server (a rough sketch of the receiving side follows after this list)

  • Time config: All servers use UTC timezone and are time-synced via NTP

  • Prometheus: Runs as a systemd service (latest version, not in Docker)

  • Scrape interval: 60s
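
For completeness, here is a minimal sketch of the receiving side, assuming the central “master” runs with --web.enable-remote-write-receiver and handles the basic auth itself via a web config file passed with --web.config.file (it could just as well sit behind a reverse proxy); the username and hash are placeholders, not my real credentials:

# web-config.yml on the central "master" server (passed via --web.config.file)
# Placeholder user and bcrypt hash only
basic_auth_users:
  XXX: "$2y$10$<bcrypt-hash-of-the-remote-write-password>"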

The problem

Whenever a single server is restarted, metrics for all servers disappear in Grafana/Prometheus for almost exactly one hour.
After that, the metrics reappear automatically without any manual intervention.

The problem also occurs occasionally without any restart or other obvious trigger; it happens at irregular intervals and I can't see any pattern or logical connection.

Example log

On the client servers, I always see this message when the issue occurs:

Aug 25 09:28:31 hosting prometheus[505]: time=2025-08-25T09:28:31.071Z level=WARN source=scrape.go:1402 msg="Appending scrape report failed" component="scrape manager" scrape_pool="***" err="out of bounds"
Aug 25 09:28:37 hosting prometheus[505]: time=2025-08-25T09:28:37.065Z level=WARN source=scrape.go:1887 msg="Error on ingesting samples that are too old or are too far into the future" component="scrape manager" scrape_pool=*** num_dropped=636

Example node config

# my global config
global:
  scrape_interval: 60s # Scrape every 60 seconds. The default is every 1 minute.
  evaluation_interval: 60s # Evaluate rules every 60 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).


# A scrape configuration containing exactly one endpoint to scrape:
# Here it's the local node_exporter.
scrape_configs:
  - job_name: 'hosting.node_exporter'
    static_configs:
      - targets: ['localhost:9100']
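    # Prefix every metric name scraped by this job with "hosting_"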
    metric_relabel_configs:
      - source_labels: [__name__]
        target_label: __name__
        regex: '.*'
        replacement: 'hosting_$0'

remote_write:
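  # Forward all locally scraped samples to the central "master" Prometheus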
  - url: "https://example.de:9090/api/v1/write"
    basic_auth:
      username: "XXX"
      password: "XXX"


Question:
Has anyone seen this kind of behavior before, where restarting a single node causes missing data across all nodes for about an hour?

Thanks in advance!