I am hosting Prometheus v2.37.0 on AKS and receiving Windows server metrics sent from Grafana Agent via remote write.
Recently, I deleted Prometheus v2.37.0 and redeployed Prometheus v2.45.0, removing all existing metrics data.
The deployment completed around 11 Mar 17:00 (GMT+8), and metrics were received properly.
However, I found that from 12 Mar 01:00 to 09:00 no metrics were received. I tried wiping everything, but the result was the same.
After researching on Google/GitHub, I tried adding the following lines to the ConfigMap:
storage:
  tsdb:
    out_of_order_time_window: 8h
With the lines above added, I also raised the memory limit from the original 10G to 16G, but the Prometheus pod keeps getting OOMKilled whenever this setting is enabled.
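For context, here is a minimal sketch of where that block would sit inside prometheus.yml. The placement is an assumption on my side: top level, as a sibling of `global`, which is how I understand the configuration reference. The `8h` value is simply the one I tried, not a recommendation.

```yaml
# prometheus.yml (fragment only; the remote_write section is omitted here)
global:
  scrape_interval: 60s

# Assumed placement: top-level key, next to `global`.
storage:
  tsdb:
    # How old an out-of-order sample may be, relative to the newest
    # data in the TSDB, and still be accepted.
    out_of_order_time_window: 8h
```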
Any idea how to fix this and keep Prometheus stable?
Logging
ts=2024-03-13T01:13:27.675Z caller=dedupe.go:112 component=remote level=error remote_name=7e4a4d url=https://prom.xxx.net:9390/api/v1/write msg="non-recoverable error" count=2309 exemplarCount=0 err="server returned HTTP status 400 Bad Request: out of order sample"
ts=2024-03-13T01:13:27.920Z caller=write_handler.go:109 level=error component=web msg="Out of order sample from remote write" err="out of order sample" series="{__name__="windows_logical_disk_read_bytes_total", agent_hostname="VM01", instance="VM01", job="integrations/windows_exporter", volume="HarddiskVolume3"}" timestamp=1710292337012
ts=2024-03-13T01:13:54.446Z caller=write_handler.go:109 level=error component=web msg="Out of order sample from remote write" err="out of order sample" series="{__name__="up", agent_hostname="VM02", instance="Vm02", job="integrations/windows_exporter"}" timestamp=1710294110255
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus2-conf
  namespace: grafana
data:
  prometheus.yml: |-
    global:
      scrape_interval: 60s
      scrape_timeout: 60s
      evaluation_interval: 60s
    remote_write:
      - url: "https://prom.xxx.net:9390/api/v1/write"
        remote_timeout: 1m
        basic_auth:
          username: promadmin
          password: *********
        queue_config:
          capacity: 25000
          max_shards: 1000
          min_shards: 4
          max_samples_per_send: 3000
          max_backoff: 30s
        metadata_config:
          max_samples_per_send: 3000
        tls_config:
          insecure_skip_verify: true
          cert_file: /etc/prometheus/certs/prometheus.crt
          key_file: /etc/prometheus/certs/prometheus.key
          ca_file: /etc/prometheus/certs/prometheus-ca.crt
  prometheus-https.yml: |-
    tls_server_config:
      cert_file: /etc/prometheus/certs/prometheus.crt
      key_file: /etc/prometheus/certs/prometheus.key
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus2
  namespace: grafana
spec:
  selector:
    matchLabels:
      app: prometheus2
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus2
    spec:
      nodeSelector:
        nodeR: sec
      containers:
        - name: prometheus2
          image: prom/prometheus:v2.45.0
          resources:
            limits:
              memory: "16Gi"
            requests:
              memory: "8Gi"
              cpu: "4"
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"
            - "--web.config.file=/etc/prometheus/prometheus-https.yml"
            - "--storage.tsdb.retention.time=90d"
            - "--web.max-connections=4096"
            - "--web.enable-remote-write-receiver"
            - "--log.level=info"
            - "--web.listen-address=0.0.0.0:9390"
          ports:
            - name: prometheus2-ui
              containerPort: 9390
              protocol: TCP
          volumeMounts:
            - name: prometheus2-db-path
              mountPath: /prometheus
            - name: prometheus2-conf
              mountPath: /etc/prometheus/prometheus.yml
              subPath: prometheus.yml
            - name: prometheus2-https
              mountPath: /etc/prometheus/prometheus-https.yml
              subPath: prometheus-https.yml
            - name: prom-certs
              mountPath: /etc/prometheus/certs
              readOnly: true
      volumes:
        - name: prometheus2-db-path
          persistentVolumeClaim:
            claimName: prometheus2-db-pvc
        - name: prometheus2-conf
          configMap:
            name: prometheus2-conf
        - name: prometheus2-https
          configMap:
            name: prometheus2-conf
        - name: prom-certs
          secret:
            secretName: prom-certs