UnicodeDecodeError in multiprocess collector

I have inherited a bug in my team’s Python prometheus client setup, where we use the client in multiprocess mode.

In our error logs, we have quite a few occurrences of the following error:

utf-8,b'["abcdefg_worker_request_data_transfer_time_s", "abcdefg_worker_request_data_transfer_time_s_bucket"\xfd\x00\x00\x00["abcdefg_worker_request_data_transfer_time_s", "abcdefg_worker_request_data_transfer_time_s_sum", {"app_name": "tyom", "client_id": "", "deploy_name": "", "name": "',100,101,invalid start byte

with this stacktrace:

  File "/tmp/tmp.bQfMmCk0ub/venv/lib/python3.10/site-packages/prometheus_client/asgi.py", line 24, in prometheus_app
    status, headers, output = _bake_output(registry, accept_header, accept_encoding_header, params, disable_compression)
  File "/tmp/tmp.bQfMmCk0ub/venv/lib/python3.10/site-packages/prometheus_client/exposition.py", line 104, in _bake_output
    output = encoder(registry)
  File "/tmp/tmp.bQfMmCk0ub/venv/lib/python3.10/site-packages/prometheus_client/openmetrics/exposition.py", line 21, in generate_latest
    for metric in registry.collect():
  File "/tmp/tmp.bQfMmCk0ub/venv/lib/python3.10/site-packages/prometheus_client/registry.py", line 97, in collect
    yield from collector.collect()
  File "/tmp/tmp.bQfMmCk0ub/venv/lib/python3.10/site-packages/prometheus_client/multiprocess.py", line 158, in collect
    return self.merge(files, accumulate=True)
  File "/tmp/tmp.bQfMmCk0ub/venv/lib/python3.10/site-packages/prometheus_client/multiprocess.py", line 43, in merge
    metrics = MultiProcessCollector._read_metrics(files)
  File "/tmp/tmp.bQfMmCk0ub/venv/lib/python3.10/site-packages/prometheus_client/multiprocess.py", line 71, in _read_metrics
    for key, value, timestamp, _ in file_values:
  File "/tmp/tmp.bQfMmCk0ub/venv/lib/python3.10/site-packages/prometheus_client/mmap_dict.py", line 46, in _read_all_values
    yield encoded_key.decode('utf-8'), value, timestamp, pos
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfd in position 100: invalid start byte

To me, this looks like overlapping writes to the mmap’ed metrics files, which should not be possible: each process writes only to files carrying its own pid in the filename, and the client serialises writes with a global lock.
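
For what it’s worth, this is roughly how I have been checking that each worker really does get its own set of files (ad-hoc sketch; the path is a placeholder for our actual collector_path):

    # Ad-hoc sketch: group the multiprocess *.db files by the pid encoded in
    # their filenames, to confirm each worker only writes files with its own pid.
    import glob
    import os
    from collections import defaultdict

    METRICS_DIR = "/path/to/collector_path"  # placeholder

    files_by_pid = defaultdict(list)
    for path in sorted(glob.glob(os.path.join(METRICS_DIR, "*.db"))):
        name = os.path.basename(path)                 # e.g. counter_1234.db, histogram_1234.db
        pid = name.rsplit("_", 1)[-1][: -len(".db")]  # pid is the last underscore-separated part
        files_by_pid[pid].append(name)

    for pid, names in sorted(files_by_pid.items()):
        print(pid, names)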

I am trying to find out whether we are sharing the metrics path between Docker containers or something similar. However, we only ever see the same byte sequence for the same metric, which makes that hypothesis seem unlikely.
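
To narrow it down, I have been replaying the collector’s read path one file at a time with roughly the sketch below. `MmapedDict.read_all_values_from_file` is the internal helper the stack trace goes through, so this is purely a debugging aid and may break on other client versions:

    # Ad-hoc diagnostic: read each mmap file the same way the collector does,
    # so a UnicodeDecodeError can be attributed to one specific file.
    import glob
    import os

    from prometheus_client.mmap_dict import MmapedDict  # internal API, debugging only

    METRICS_DIR = "/path/to/collector_path"  # placeholder

    for path in sorted(glob.glob(os.path.join(METRICS_DIR, "*.db"))):
        try:
            samples = list(MmapedDict.read_all_values_from_file(path))
            print(f"ok      {path}: {len(samples)} samples")
        except UnicodeDecodeError as exc:
            print(f"corrupt {path}: {exc}")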

Can this happen when not wiping the metrics directory on startup? What else could cause this behaviour? It seems we’re “holding it wrong”, but I cannot put my finger on where that’s happening.
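
For context, if the answer is “yes”, the cleanup I would add looks roughly like this (the function names are mine; `mark_process_dead` is the client’s hook for cleaning up after an exited worker):

    # Sketch of the cleanup: wipe the directory once in the parent, before any
    # worker starts, and mark workers dead when they exit.
    import glob
    import os

    from prometheus_client import multiprocess


    def wipe_metrics_dir(path: str) -> None:
        """Remove leftover *.db files from a previous run (parent process only)."""
        os.makedirs(path, exist_ok=True)
        for f in glob.glob(os.path.join(path, "*.db")):
            os.unlink(f)


    def on_worker_exit(pid: int, path: str) -> None:
        """Call from the process manager's worker-exit hook (e.g. gunicorn's child_exit)."""
        multiprocess.mark_process_dead(pid, path)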

Our collector app is created like this:

    from prometheus_client import REGISTRY, CollectorRegistry, make_asgi_app
    from prometheus_client.multiprocess import MultiProcessCollector

    # collector_path is a pathlib.Path pointing at the shared metrics directory
    collector_path.mkdir(parents=True, exist_ok=True)
    collector_registry = CollectorRegistry()
    collector_registry.register(REGISTRY)  # also expose the default in-process registry here
    MultiProcessCollector(collector_registry, path=str(collector_path))
    app = make_asgi_app(registry=collector_registry)
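
For completeness, the writer side relies on the PROMETHEUS_MULTIPROC_DIR environment variable pointing at the same directory; a rough sketch of what a worker does (placeholder metric name; in our real setup the env var is exported by the container entrypoint):

    # Writer-side sketch: the env var must already be set when prometheus_client
    # is first imported in the worker, so the multiprocess mmap backend is used.
    import os

    os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", "/path/to/collector_path")  # placeholder

    from prometheus_client import Counter  # imported after the env var on purpose

    WORKER_REQUESTS = Counter("worker_requests_total", "Requests handled by this worker")
    WORKER_REQUESTS.inc()  # lands in <PROMETHEUS_MULTIPROC_DIR>/counter_<pid>.db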