I have inherited a bug in my team’s Python prometheus_client setup, which uses the client in multiprocess mode.
Our error logs contain quite a few occurrences of the following error:
utf-8,b'["abcdefg_worker_request_data_transfer_time_s", "abcdefg_worker_request_data_transfer_time_s_bucket"\xfd\x00\x00\x00["abcdefg_worker_request_data_transfer_time_s", "abcdefg_worker_request_data_transfer_time_s_sum", {"app_name": "tyom", "client_id": "", "deploy_name": "", "name": "',100,101,invalid start byte
with this stacktrace:
File "/tmp/tmp.bQfMmCk0ub/venv/lib/python3.10/site-packages/prometheus_client/asgi.py", line 24, in prometheus_app
status, headers, output = _bake_output(registry, accept_header, accept_encoding_header, params, disable_compression)
File "/tmp/tmp.bQfMmCk0ub/venv/lib/python3.10/site-packages/prometheus_client/exposition.py", line 104, in _bake_output
output = encoder(registry)
File "/tmp/tmp.bQfMmCk0ub/venv/lib/python3.10/site-packages/prometheus_client/openmetrics/exposition.py", line 21, in generate_latest
for metric in registry.collect():
File "/tmp/tmp.bQfMmCk0ub/venv/lib/python3.10/site-packages/prometheus_client/registry.py", line 97, in collect
yield from collector.collect()
File "/tmp/tmp.bQfMmCk0ub/venv/lib/python3.10/site-packages/prometheus_client/multiprocess.py", line 158, in collect
return self.merge(files, accumulate=True)
File "/tmp/tmp.bQfMmCk0ub/venv/lib/python3.10/site-packages/prometheus_client/multiprocess.py", line 43, in merge
metrics = MultiProcessCollector._read_metrics(files)
File "/tmp/tmp.bQfMmCk0ub/venv/lib/python3.10/site-packages/prometheus_client/multiprocess.py", line 71, in _read_metrics
for key, value, timestamp, _ in file_values:
File "/tmp/tmp.bQfMmCk0ub/venv/lib/python3.10/site-packages/prometheus_client/mmap_dict.py", line 46, in _read_all_values
yield encoded_key.decode('utf-8'), value, timestamp, pos
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfd in position 100: invalid start byte
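The failing decode itself is easy to reproduce in isolation. Note that the recurring b'\xfd\x00\x00\x00' in the logged bytes looks like a 4-byte little-endian integer (253), i.e. plausibly the length prefix of the *next* entry in the mmap file; the snippet below is only an illustration of that reading, not the actual mmap_dict on-disk format:

```python
# Minimal illustration (hypothetical layout, not the exact mmap_dict
# format): if a key is read with a stale or oversized length, the bytes
# pulled out include the next record's little-endian length prefix,
# and decoding them as UTF-8 fails just like in the traceback above.
import struct

key = b'["abcdefg_worker_request_data_transfer_time_s", "..."]'
next_len_prefix = struct.pack('<I', 253)  # b'\xfd\x00\x00\x00'
overrun = key + next_len_prefix           # bytes read past the real key

try:
    overrun.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc.reason)  # -> invalid start byte
```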
To me, this looks like overlapping writes to the mmap’ed metrics files, which should not be possible: the files have pid-based filenames, and the client serializes writes with a global lock.
I am trying to find out whether we are sharing metrics paths between Docker containers or something similar. However, since we only ever see the same byte sequence for a given metric, that hypothesis seems unlikely.
Can this happen when the metrics directory is not wiped on startup? What else could cause this behaviour? It seems we’re “holding it wrong”, but I cannot put my finger on where that’s happening.
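For reference, wiping on startup would look something like the sketch below in our setup; `collector_path` is the same pathlib.Path we use when building the collector app:

```python
# Sketch of wiping the multiprocess directory once, before any worker
# process starts writing metrics. Assumes `collector_path` is a
# pathlib.Path pointing at the shared metrics directory.
from pathlib import Path

def wipe_multiproc_dir(collector_path: Path) -> None:
    collector_path.mkdir(parents=True, exist_ok=True)
    for db_file in collector_path.glob("*.db"):
        db_file.unlink()  # stale mmap files from a previous run
```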
Our collector app is created like this:
from prometheus_client import REGISTRY, CollectorRegistry, make_asgi_app
from prometheus_client.multiprocess import MultiProcessCollector

collector_path.mkdir(parents=True, exist_ok=True)   # pathlib.Path to the multiprocess dir
collector_registry = CollectorRegistry()
collector_registry.register(REGISTRY)               # also expose this process's default registry
MultiProcessCollector(collector_registry, path=str(collector_path))
app = make_asgi_app(registry=collector_registry)
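To test the stale-files hypothesis, I have been using a small diagnostic along these lines (a hypothetical helper, assuming the `<type>_<pid>.db` naming the client uses for its mmap files):

```python
# Hypothetical diagnostic: list mmap files whose embedded pid no longer
# corresponds to a live process, i.e. likely leftovers from earlier runs
# or from other containers sharing the same volume.
import os
import re
from pathlib import Path
from typing import List

def stale_metric_files(collector_path: Path) -> List[Path]:
    stale = []
    for db_file in collector_path.glob("*.db"):
        match = re.search(r"_(\d+)\.db$", db_file.name)
        if not match:
            continue
        pid = int(match.group(1))
        try:
            os.kill(pid, 0)  # signal 0: existence check, sends nothing
        except ProcessLookupError:
            stale.append(db_file)  # no such pid -> stale file
        except PermissionError:
            pass  # pid exists but belongs to another user
    return stale
```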