Prometheus Remote-Write - Backward Uploads

Hi everyone,

I would like to ask for help with the following situation. Our team had an incident while pushing metrics via remote write to the Mimir distributor. During the incident, the distributor repeatedly answered Prometheus with HTTP 500. The incident lasted more than 2 hours.

ts=2022-06-09T09:48:50.696Z caller=dedupe.go:112 component=remote level=warn remote_name=bbcb71 url=https://distributor.mimir/api/v1/push msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: too many inflight push requests in distributor"

ts=2022-06-09T09:52:00.400Z caller=dedupe.go:112 component=remote level=warn remote_name=bbcb71 url=https://distributor.mimir/api/v1/push msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: DoBatch: InstancesCount <= 0"

ts=2022-06-09T09:52:18.451Z caller=dedupe.go:112 component=remote level=warn remote_name=18ef4e url=https://distributor.mimir/api/v1/push msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded"

After the connection recovered, Prometheus sent the metrics from the previous hours to the Mimir distributor.
The problem is that I cannot reproduce this behavior.

Simulation environment

  • Prometheus-0 (a statefulset sending data via remote-write)
  • Prometheus-RW (a deployment with the remote-write-receiver feature enabled)
  • Nginx (a deployment answering HTTP 500 on the Prometheus remote-write endpoint)

Our deployments are in k8s.

  1. I deployed prometheus-rw to receive metrics.
  2. I deployed Prometheus to send data, with the same configuration as during the incident:
    remote_write:
    - url: http://prometheus-rw.svc:9090/api/v1/write
      queue_config:
        min_shards: 2
        retry_on_http_429: true
  3. I deployed a separate fake nginx with this configuration:
    http {
      server {
           listen 9090;

           location /api/v1/write {
                return 500;
           }
      }
    }
  4. I checked that prometheus-rw receives metrics.
  5. Next, I edited the Kubernetes service prometheus-rw and changed its selector to point to nginx, to simulate HTTP 500 responses for 2 hours.
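
For anyone who wants to reproduce this without nginx, the failing endpoint can be approximated with a few lines of Python. This is only a minimal sketch of the stub described above, not what I actually deployed: the names `AlwaysFail` and `make_server` are made up for illustration, and it only handles POST requests, whereas the nginx config returns 500 for any method.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class AlwaysFail(BaseHTTPRequestHandler):
    """Answer every remote-write push with HTTP 500, like the nginx stub."""

    def do_POST(self):
        # Drain the request body so the client can finish sending its batch.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        # Prefix match, mirroring `location /api/v1/write` in the nginx config.
        if self.path.startswith("/api/v1/write"):
            self.send_error(500, "simulated receiver failure")
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        # Keep the log quiet; Prometheus retries failed batches frequently.
        pass


def make_server(port=9090):
    """Build a server listening on all interfaces (port 0 picks a free port)."""
    return ThreadingHTTPServer(("0.0.0.0", port), AlwaysFail)


# To run the stub: make_server().serve_forever()
```

Pointing the prometheus-rw service selector at a pod running this stub should produce the same "Failed to send batch, retrying" warnings on the sending Prometheus.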

Result & My question

After 2 hours I restored the connection. Only the metrics from the most recent hour were delivered; the other hour of the outage was lost. How was it possible to get metrics covering more than 2 hours during our distributor incident, and why can't I reproduce the same situation?

From the Prometheus documentation, "Remote write tuning":

"When one shard backs up and fills its queue, Prometheus will block reading from the WAL into any shards. Failures will be retried without loss of data unless the remote endpoint remains down for more than 2 hours. After 2 hours, the WAL will be compacted and data that has not been sent will be lost."
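
To make my question concrete, here is a toy model of how I read that paragraph. It rests on my own guess, which the documentation does not state: that a WAL truncation at time T discards every unsent sample older than T. The function name `unsent_window` and all timestamps are hypothetical. Under that assumption, how much of a 2-hour outage survives depends on when the truncation happens to land, which might be why I saw only one hour in the simulation.

```python
from datetime import datetime, timedelta


def unsent_window(outage_start, outage_end, truncations):
    """Return (lost, kept) durations of unsent data under the toy model.

    Toy assumption: a truncation at time T discards every unsent sample
    older than T. Real Prometheus truncation may behave differently.
    """
    # Only truncations that happen while the endpoint is down can drop data.
    during = [t for t in truncations if outage_start < t <= outage_end]
    cutoff = max(during, default=outage_start)
    return cutoff - outage_start, outage_end - cutoff


start = datetime(2022, 6, 9, 9, 30)  # hypothetical outage start
end = start + timedelta(hours=2)     # endpoint recovers after 2 hours

# Hypothetical truncation one hour into the outage.
lost, kept = unsent_window(start, end, [datetime(2022, 6, 9, 10, 30)])
print("lost:", lost, "kept:", kept)

# No truncation during the outage: nothing is dropped in this model.
lost_none, kept_none = unsent_window(start, end, [])
print("lost:", lost_none, "kept:", kept_none)
```

If this model is roughly right, an outage that happens to fall between two truncations could be replayed in full, but I do not know whether that is how Prometheus actually behaves.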

Is it really possible for my Prometheus to buffer metrics for remote write for more than 2 hours? Or was the incident with my Mimir distributor just an anomaly?

Thank you for your time… 🙂