How to design metrics/labels

Hey.

I’m writing[0] a (hopefully proper) exporter for (HPE) Smart Array RAID controllers that takes the output of (for now only) ssacli and exports Prometheus metrics.

The parsing is already mostly done and I’ll end up with some Python data structure like:

{
  "controllers": {
    "0": {
      "properties": {
        "slot": "0",
        "serial_number": "bla",
        "controller_status": "OK",
        "hardware_revision": "B",
        "firmware_version": "6.60",
        "rebuild_priority": "High",
        "cache_status": "OK",
        "battery_capacitor_status": "OK",
        "controller_temperature_celsius": 49.0,
        "cache_module_temperature_celsius": 38.0,
        "model": "Smart Array P840ar"
      },
      "sensors": {},
      "arrays": {
        "A": {
          "properties": {
            "unused_space_bytes": 4194304.0,
            "used_space_bytes": 239992035082.24,
            "status": "OK",
            "multidomain_status": "OK"
          },
          "logical_drives": {
            "1": {
              "properties": {
                "size_bytes": 120001386250.24,
                "raid_level": "1",
                "chunk_size_bytes": 262144,
                "stripe_size_bytes": 262144,
                "status": "OK",
                "unrecoverable_media_errors": "None",
                "multidomain_status": "OK",
                "caching": false,
                "device": "/dev/sdb",
                "logical_drive_label": "02FB5470PVYKH0BRH420GX53CC"
              }
            }
          },
          "physical_drives": {
            "2I:4:1": {
              "properties": {
                "port": "2I",
                "box": "4",
                "bay": "1",
                "status": "OK",
                "drive_role": "Data",
                "interface_type": "Solid State SATA",
                "size_bytes": 120000000000,
                "firmware_version": "4IWTHPG1",
                "serial_number": "bla",
                "model": "ATA VK0120GEYJP",
                "temperature_celsius": 32.0,
                "usage_remaining_percent": 95.7,
                "power_on_hours": 59236.0,
                "life_remaining_based_on_workload_to_date_days": 54931.0,
                "shingled_magnetic_recording_support": "None",
                "multi_actuator_drive": false
              }
            },
            "2I:4:2": {
              "properties": {
                "port": "2I",
                "box": "4",
                "bay": "2",
                "status": "OK",
                "drive_role": "Data",
                "interface_type": "Solid State SATA",
                "size_bytes": 120000000000,
                "firmware_version": "4IWTHPG1",
                "serial_number": "bla",
                "temperature_celsius": 33.0,
                "model": "ATA VK0120GEYJP",
                "usage_remaining_percent": 95.67,
                "power_on_hours": 59236.0,
                "life_remaining_based_on_workload_to_date_days": 54533.0,
                "shingled_magnetic_recording_support": "None",
                "multi_actuator_drive": false
              }
            }
          }
        },
        "B": {
          "properties": {
            "unused_space_bytes": 0.0,
            "used_space_bytes": 24002338834350.08,
            "status": "OK",
            "multidomain_status": "OK"
          },
          "logical_drives": {
            "2": {
              "properties": {
                "size_bytes": 15997894184140.8,
                "raid_level": "5",
                "chunk_size_bytes": 65536,
                "stripe_size_bytes": 131072,
                "status": "OK",
                "unrecoverable_media_errors": "None",
                "multidomain_status": "OK",
                "caching": true,
                "parity_initialization_status": "In Progress",
                "device": "/dev/sdc",
                "logical_drive_label": "055ACDA3PVYKH0BRH420GX48FD"
              }
            }
          },
          "physical_drives": {
            "2I:2:1": {
              "properties": {
                "port": "2I",
                "box": "2",
                "bay": "1",
                "status": "OK",
                "drive_role": "Data",
                "interface_type": "SATA",
                "size_bytes": 8000000000000,
                "rotational_speed_rpm": 7200,
                "firmware_version": "HPGB",
                "serial_number": "bla",
                "model": "ATA MB8000GEQUU",
                "temperature_celsius": 28.0,
                "shingled_magnetic_recording_support": "None",
                "multi_actuator_drive": false
              }
            },
            "2I:2:3": {
              "properties": {
                "port": "2I",
                "box": "2",
                "bay": "3",
                "status": "OK",
                "drive_role": "Data",
                "interface_type": "SATA",
                "size_bytes": 8000000000000,
                "rotational_speed_rpm": 7200,
                "firmware_version": "HPGB",
                "serial_number": "bla",
                "model": "ATA MB8000GEQUU",
                "temperature_celsius": 27.0,
                "shingled_magnetic_recording_support": "None",
                "multi_actuator_drive": false
              }
            }
          }
        }
      },
      "unassigned_physical_drives": {
        "1I:1:1": {
          "properties": {
            "port": "1I",
            "box": "1",
            "bay": "1",
            "status": "Predictive Failure",
            "drive_role": "Unassigned",
            "interface_type": "SATA",
            "size_bytes": 8000000000000,
            "rotational_speed_rpm": 7200,
            "firmware_version": "HPGB",
            "serial_number": "bla",
            "model": "ATA MB8000GEQUU",
            "temperature_celsius": 28.0,
            "shingled_magnetic_recording_support": "None",
            "multi_actuator_drive": false
          }
        },
        "1I:1:2": {
          "properties": {
            "port": "1I",
            "box": "1",
            "bay": "2",
            "status": "Predictive Failure",
            "drive_role": "Unassigned",
            "interface_type": "SATA",
            "size_bytes": 8000000000000,
            "rotational_speed_rpm": 7200,
            "firmware_version": "HPGB",
            "serial_number": "bla",
            "model": "ATA MB8000GEQUU",
            "temperature_celsius": 28.0,
            "shingled_magnetic_recording_support": "None",
            "multi_actuator_drive": false
          }
        },
        "1I:1:4": {
          "properties": {
            "port": "1I",
            "box": "1",
            "bay": "4",
            "status": "OK",
            "drive_role": "Unassigned",
            "interface_type": "SATA",
            "size_bytes": 8000000000000,
            "rotational_speed_rpm": 7200,
            "firmware_version": "HPGB",
            "serial_number": "bla",
            "model": "ATA MB8000GEQUU",
            "temperature_celsius": 29.0,
            "shingled_magnetic_recording_support": "None",
            "multi_actuator_drive": false
          }
        },
        "1I:1:6": {
          "properties": {
            "port": "1I",
            "box": "1",
            "bay": "6",
            "status": "Predictive Failure",
            "drive_role": "Unassigned",
            "interface_type": "SATA",
            "size_bytes": 8000000000000,
            "rotational_speed_rpm": 7200,
            "firmware_version": "HPGB",
            "serial_number": "bla",
            "model": "ATA MB8000GEQUU",
            "temperature_celsius": 28.0,
            "shingled_magnetic_recording_support": "None",
            "multi_actuator_drive": false
          }
        }
      }
    }
  }
}

Now for many things it’s probably rather obvious how to map them into metrics, e.g. I’d place stuff like the controller’s firmware_version, hardware_revision, model, rebuild_priority and serial_number into an info metric like so:

smartarray_controller_info{firmware_version="6.60",hardware_revision="B",model="Smart Array P840ar",rebuild_priority="High",serial_number="bla",slot="0"} 1.0

And I’d do a similar one for arrays as well as logical and physical drives.
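
To make that concrete, here’s a minimal sketch of how I’d turn the string-valued properties into such an info-style sample. It uses only the standard library and builds the text exposition line by hand; INFO_KEYS, escape_label_value and info_line are names I made up for this example, and the real exporter would presumably use a client library instead.

```python
# Hypothetical helper names; stdlib only, not tied to any client library.
INFO_KEYS = ("firmware_version", "hardware_revision", "model",
             "rebuild_priority", "serial_number", "slot")


def escape_label_value(value):
    """Escape backslash, double quote and newline for the text format."""
    return (str(value).replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n"))


def info_line(metric_name, properties, keys=INFO_KEYS):
    """Render one info-style sample with a constant value of 1."""
    labels = ",".join(
        f'{key}="{escape_label_value(properties[key])}"'
        for key in sorted(keys)
        if key in properties  # optional properties are simply left out
    )
    return f"{metric_name}{{{labels}}} 1.0"


# Trimmed copy of the controller properties from the structure above.
controller = {
    "slot": "0",
    "serial_number": "bla",
    "controller_status": "OK",
    "hardware_revision": "B",
    "firmware_version": "6.60",
    "rebuild_priority": "High",
    "model": "Smart Array P840ar",
}

print(info_line("smartarray_controller_info", controller))
```

The status is deliberately not in INFO_KEYS here, since where it should go is exactly one of my questions below.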

But for some others I wonder what the best way of mapping is.

  1. Temperatures (and analogously for things like used space, stripe size, etc. pp.)
    It’s clear that these must be gauges, but I wonder how I should do them.

    a) Like one smartarray_temperature_celsius for all kinds of temperatures, where labels give the type (like controller, controller_cache_module, controller_capacitor, which are all possible temperature kinds for controllers (they don’t all show up in the example above, and there are even more types), or pd_temperature), plus the slot and PD name necessary to identify the respective thing.

    b) Or rather one metric each for controllers and physical drives, like smartarray_controller_temperature_celsius and smartarray_pd_temperature_celsius, again with a type label that gives which kind of temperature is meant.

    c) Or even more granular, like smartarray_controller_temperature_celsius, smartarray_controller_cache_module_temperature_celsius, smartarray_controller_capacitor_temperature_celsius, etc. pp. and no type label.
    Also, any temperature value is optional… a controller may have it or not, so especially in that very granular mode, metrics may “disappear”.

    But having no type label at all will actually be difficult, because the controllers give additional (variable numbers of) sensors as shown below, which I do parse already (they’re just not in the JSON above):

    Sensor ID: 0
       Location: Inlet Ambient
       Current Value (C): 43
       Max Value Since Power On: 46
    Sensor ID: 1
       Location: ASIC
       Current Value (C): 55
       Max Value Since Power On: 59
    Sensor ID: 2
       Location: Top 
       Current Value (C): 41
       Max Value Since Power On: 44
    

    So for these sensors I definitely would need some type (which would be the value of Location).
    d) Could be like (b) or (c) … with an extra smartarray_controller_sensors_temperature_celsius with a type label.

    So no idea what’s best.
    I guess the type field is in principle bad, because queries would need to always match it.

  2. The various statuses.
    First, I don’t know all possible values (like OK) that they may have.

    a) One way would be to add them as labels to smartarray_status. Again, plus labels that identify the controller, array, LD, PD (some of these would of course be empty, depending on the respective status type).

    b) Like above, one could do (a) with further levels of granularity (i.e. more metrics than just one).

    c) Another way would be to simply add them as a label to smartarray_controller_info, smartarray_array_info, smartarray_ld_info, smartarray_pd_info.
    Users could then aggregate over these, match e.g. =="OK" or !="OK", etc. and count, alert, etc…

    d) Or one could do something like (a) or (b), but where the actual status (like OK) is not a label but the metric itself, like smartarray_pd_healthy, smartarray_pd_failed, smartarray_pd_rebuilding, smartarray_pd_predictive_failing, etc., and have as labels just the ones that identify the PD, controller, etc. pp.
    The problem with that is – as said – that I don’t know all possible status values and that at any time new ones might get added which would then be lost in the recorded data.

    Again, not really sure which design would be good, with respect to Prometheus efficiency (both storage and querying)… and what makes queries easier for the user (in the sense of writing them).
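
To make the options a bit more tangible, here’s a rough sketch of what temperature option (b) and status option (b) (status as a label, one metric per kind of object) could look like when walking the parsed structure. It’s stdlib only and just yields (name, labels, value) tuples; the function names, the label names type, pd and status, and the metric name smartarray_pd_status are assumptions of this example, not a decision.

```python
def iter_physical_drives(controller):
    """Yield (name, properties) for assigned and unassigned drives."""
    for array in controller.get("arrays", {}).values():
        for name, drive in array.get("physical_drives", {}).items():
            yield name, drive["properties"]
    unassigned = controller.get("unassigned_physical_drives", {})
    for name, drive in unassigned.items():
        yield name, drive["properties"]


def temperature_samples(slot, controller):
    """Temperature option (b): one metric per object kind, plus a type label."""
    suffix = "_temperature_celsius"
    for key, value in controller["properties"].items():
        if key.endswith(suffix):
            # "cache_module_temperature_celsius" -> type="cache_module"
            yield ("smartarray_controller_temperature_celsius",
                   {"slot": slot, "type": key[:-len(suffix)]}, float(value))
    for name, props in iter_physical_drives(controller):
        if "temperature_celsius" in props:  # temperatures are optional
            yield ("smartarray_pd_temperature_celsius",
                   {"slot": slot, "pd": name},
                   float(props["temperature_celsius"]))


def status_samples(slot, controller):
    """Status option (b): the raw status string as a label, value fixed at 1."""
    for name, props in iter_physical_drives(controller):
        yield ("smartarray_pd_status",
               {"slot": slot, "pd": name, "status": props["status"]}, 1.0)


# Heavily trimmed version of the structure above (one controller).
example = {
    "properties": {
        "controller_status": "OK",
        "controller_temperature_celsius": 49.0,
        "cache_module_temperature_celsius": 38.0,
    },
    "arrays": {
        "A": {"physical_drives": {
            "2I:4:1": {"properties": {"status": "OK",
                                      "temperature_celsius": 32.0}},
        }},
    },
    "unassigned_physical_drives": {
        "1I:1:1": {"properties": {"status": "Predictive Failure"}},
    },
}

for sample in temperature_samples("0", example):
    print(sample)
for sample in status_samples("0", example):
    print(sample)
```

With the status as a label, unknown status values would at least not get lost, but a drive changing status would show up as one series ending and another one starting.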

Thanks,
Chris.

[0] I know that there are already some, but they seem either unmaintained or not as powerful as I wanted them to be.

PS: I should perhaps add that it would be neat if the following were possible.

Consider an alert that fires (like a PD failed, or PD temperature too high)… in that case, the alert should ideally contain further information about the PD - at least information to identify it (like which controller, and then the PD name)… but perhaps even additional stuff like the serial number of the PD… or the role (data, parity, spare, unused) of the PD.

I haven’t looked at Alertmanager in much depth yet, so I’m not sure whether this would only be possible if these labels are also part of the metric that causes the alert. Or in other words… if I had e.g. the temperature in one metric and the serial number in some _info metric - could I still get the serial number into the “additional” information of an alert?
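
Expressed in plain Python, the enrichment I’m after would be something like the sketch below: take the samples that triggered the alert and copy over the labels of the matching _info sample. This only illustrates the desired result on plain dicts; enrich and the identifying labels slot and pd are made up for the example, and it says nothing about how (or whether) Prometheus or Alertmanager would do it.

```python
def enrich(samples, info_labelsets, on=("slot", "pd")):
    """Add the labels of the matching info sample to each (labels, value)."""
    info_by_key = {
        tuple(info[name] for name in on): info for info in info_labelsets
    }
    enriched = []
    for labels, value in samples:
        key = tuple(labels[name] for name in on)
        # Labels of the sample itself win if both sides carry the same name;
        # samples without a matching info entry are passed through unchanged.
        enriched.append(({**info_by_key.get(key, {}), **labels}, value))
    return enriched


# Hypothetical inputs: one temperature sample and the PD's info labels.
too_hot = [({"slot": "0", "pd": "2I:4:1"}, 32.0)]
pd_info = [{"slot": "0", "pd": "2I:4:1",
            "serial_number": "bla", "drive_role": "Data"}]

print(enrich(too_hot, pd_info))
```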

Another question would be:
What should one do with values that are numbers but typically don’t change?

For example the rotational speed of a drive (like 7200 RPM or 5400 RPM, etc.), which truly never changes unless the drive is replaced.
Or, in a RAID, the chunk size (like 64KiB, 256KiB, etc.) … which may change but not normally.

Should these things rather be label values or the value of a metric… or even both?
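
For illustration, the two variants for the rotational speed would look roughly like this (stdlib only; rpm_variants and the metric name smartarray_pd_rotational_speed_rpm are made up for the example). As a label the number ends up as a string, as a metric value it stays a float; “both” would simply mean emitting the two samples side by side.

```python
def rpm_variants(slot, pd_name, props):
    """Return the rotational speed once as a label and once as a value."""
    rpm = props.get("rotational_speed_rpm")
    if rpm is None:  # e.g. the SSDs above have no such property
        return {}
    ident = {"slot": slot, "pd": pd_name}
    return {
        # Variant 1: part of the info metric, label values are strings.
        "label": ("smartarray_pd_info",
                  {**ident, "rotational_speed_rpm": str(rpm)}, 1.0),
        # Variant 2: its own gauge, the number is the sample value.
        "value": ("smartarray_pd_rotational_speed_rpm", ident, float(rpm)),
    }


hdd = {"rotational_speed_rpm": 7200, "model": "ATA MB8000GEQUU"}
ssd = {"model": "ATA VK0120GEYJP"}

for name, sample in rpm_variants("0", "2I:2:1", hdd).items():
    print(name, sample)
print(rpm_variants("0", "2I:4:1", ssd))
```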