Metrics missing from all servers for ~1h after a single server restart

Hi everyone,

I’m running into a strange issue with my Prometheus setup and would appreciate some help.

Environment

  • OS: Debian on all servers

  • Setup:

    • Each server runs its own Prometheus instance (local scrape jobs)

    • Each local Prometheus sends data via remote write (with basic auth) to a central Prometheus “master” server (a rough sketch of the receiving side follows after this list)

  • Time config: All servers use UTC timezone and are time-synced via NTP

  • Prometheus: Runs as a systemd service (latest version, not in Docker)

  • Scrape interval: 60s
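
For completeness, here is a minimal sketch of the receiving side, assuming the central “master” runs with --web.enable-remote-write-receiver and handles the basic auth itself via a web config file passed with --web.config.file (it could just as well sit behind a reverse proxy); the username and hash are placeholders, not my real credentials:

# web-config.yml on the central "master" server (passed via --web.config.file)
# Placeholder user and bcrypt hash only
basic_auth_users:
  XXX: "$2y$10$<bcrypt-hash-of-the-remote-write-password>"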

The problem

Whenever a single server is restarted, metrics for all servers disappear in Grafana/Prometheus for almost exactly one hour.
After that, the metrics reappear automatically without any manual intervention.

The problem also occurs occasionally without any restart or other obvious trigger; it happens at irregular intervals and I can't see any pattern or logical connection.

Example log

On the client servers, I always see this message when the issue occurs:

Aug 25 09:28:31 hosting prometheus[505]: time=2025-08-25T09:28:31.071Z level=WARN source=scrape.go:1402 msg="Appending scrape report failed" component="scrape manager" scrape_pool="***" err="out of bounds"
Aug 25 09:28:37 hosting prometheus[505]: time=2025-08-25T09:28:37.065Z level=WARN source=scrape.go:1887 msg="Error on ingesting samples that are too old or are too far into the future" component="scrape manager" scrape_pool=*** num_dropped=636

Example node config

# my global config
global:
  scrape_interval: 60s # Scrape every 60 seconds. The default is every 1 minute.
  evaluation_interval: 60s # Evaluate rules every 60 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).


# A scrape configuration containing exactly one endpoint to scrape:
# Here it's the local node_exporter.
scrape_configs:
  - job_name: 'hosting.node_exporter'
    static_configs:
      - targets: ['localhost:9100']
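    # Prefix every metric name scraped by this job with "hosting_"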
    metric_relabel_configs:
      - source_labels: [__name__]
        target_label: __name__
        regex: '.*'
        replacement: 'hosting_$0'

remote_write:
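  # Forward all locally scraped samples to the central "master" Prometheus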
  - url: "https://example.de:9090/api/v1/write"
    basic_auth:
      username: "XXX"
      password: "XXX"


Question:
Has anyone seen this kind of behavior before, where restarting a single node causes missing data across all nodes for about an hour?

Thanks in advance!