Hey.
I’m writing[0] a (hopefully proper) exporter for (HPE) Smart Array RAID controllers that takes the output of (for now only) ssacli and exports Prometheus metrics.
The parsing is already mostly done and I’ll end up with some Python data structure like:
{
"controllers": {
"0": {
"properties": {
"slot": "0",
"serial_number": "bla",
"controller_status": "OK",
"hardware_revision": "B",
"firmware_version": "6.60",
"rebuild_priority": "High",
"cache_status": "OK",
"battery_capacitor_status": "OK",
"controller_temperature_celsius": 49.0,
"cache_module_temperature_celsius": 38.0,
"model": "Smart Array P840ar"
},
"sensors": {},
"arrays": {
"A": {
"properties": {
"unused_space_bytes": 4194304.0,
"used_space_bytes": 239992035082.24,
"status": "OK",
"multidomain_status": "OK"
},
"logical_drives": {
"1": {
"properties": {
"size_bytes": 120001386250.24,
"raid_level": "1",
"chunk_size_bytes": 262144,
"stripe_size_bytes": 262144,
"status": "OK",
"unrecoverable_media_errors": "None",
"multidomain_status": "OK",
"caching": false,
"device": "/dev/sdb",
"logical_drive_label": "02FB5470PVYKH0BRH420GX53CC"
}
}
},
"physical_drives": {
"2I:4:1": {
"properties": {
"port": "2I",
"box": "4",
"bay": "1",
"status": "OK",
"drive_role": "Data",
"interface_type": "Solid State SATA",
"size_bytes": 120000000000,
"firmware_version": "4IWTHPG1",
"serial_number": "bla",
"model": "ATA VK0120GEYJP",
"temperature_celsius": 32.0,
"usage_remaining_percent": 95.7,
"power_on_hours": 59236.0,
"life_remaining_based_on_workload_to_date_days": 54931.0,
"shingled_magnetic_recording_support": "None",
"multi_actuator_drive": false
}
},
"2I:4:2": {
"properties": {
"port": "2I",
"box": "4",
"bay": "2",
"status": "OK",
"drive_role": "Data",
"interface_type": "Solid State SATA",
"size_bytes": 120000000000,
"firmware_version": "4IWTHPG1",
"serial_number": "bla",
"temperature_celsius": 33.0,
"model": "ATA VK0120GEYJP",
"usage_remaining_percent": 95.67,
"power_on_hours": 59236.0,
"life_remaining_based_on_workload_to_date_days": 54533.0,
"shingled_magnetic_recording_support": "None",
"multi_actuator_drive": false
}
}
}
},
"B": {
"properties": {
"unused_space_bytes": 0.0,
"used_space_bytes": 24002338834350.08,
"status": "OK",
"multidomain_status": "OK"
},
"logical_drives": {
"2": {
"properties": {
"size_bytes": 15997894184140.8,
"raid_level": "5",
"chunk_size_bytes": 65536,
"stripe_size_bytes": 131072,
"status": "OK",
"unrecoverable_media_errors": "None",
"multidomain_status": "OK",
"caching": true,
"parity_initialization_status": "In Progress",
"device": "/dev/sdc",
"logical_drive_label": "055ACDA3PVYKH0BRH420GX48FD"
}
}
},
"physical_drives": {
"2I:2:1": {
"properties": {
"port": "2I",
"box": "2",
"bay": "1",
"status": "OK",
"drive_role": "Data",
"interface_type": "SATA",
"size_bytes": 8000000000000,
"rotational_speed_rpm": 7200,
"firmware_version": "HPGB",
"serial_number": "bla",
"model": "ATA MB8000GEQUU",
"temperature_celsius": 28.0,
"shingled_magnetic_recording_support": "None",
"multi_actuator_drive": false
}
},
"2I:2:3": {
"properties": {
"port": "2I",
"box": "2",
"bay": "3",
"status": "OK",
"drive_role": "Data",
"interface_type": "SATA",
"size_bytes": 8000000000000,
"rotational_speed_rpm": 7200,
"firmware_version": "HPGB",
"serial_number": "bla",
"model": "ATA MB8000GEQUU",
"temperature_celsius": 27.0,
"shingled_magnetic_recording_support": "None",
"multi_actuator_drive": false
}
}
}
}
},
"unassigned_physical_drives": {
"1I:1:1": {
"properties": {
"port": "1I",
"box": "1",
"bay": "1",
"status": "Predictive Failure",
"drive_role": "Unassigned",
"interface_type": "SATA",
"size_bytes": 8000000000000,
"rotational_speed_rpm": 7200,
"firmware_version": "HPGB",
"serial_number": "bla",
"model": "ATA MB8000GEQUU",
"temperature_celsius": 28.0,
"shingled_magnetic_recording_support": "None",
"multi_actuator_drive": false
}
},
"1I:1:2": {
"properties": {
"port": "1I",
"box": "1",
"bay": "2",
"status": "Predictive Failure",
"drive_role": "Unassigned",
"interface_type": "SATA",
"size_bytes": 8000000000000,
"rotational_speed_rpm": 7200,
"firmware_version": "HPGB",
"serial_number": "bla",
"model": "ATA MB8000GEQUU",
"temperature_celsius": 28.0,
"shingled_magnetic_recording_support": "None",
"multi_actuator_drive": false
}
},
"1I:1:4": {
"properties": {
"port": "1I",
"box": "1",
"bay": "4",
"status": "OK",
"drive_role": "Unassigned",
"interface_type": "SATA",
"size_bytes": 8000000000000,
"rotational_speed_rpm": 7200,
"firmware_version": "HPGB",
"serial_number": "bla",
"model": "ATA MB8000GEQUU",
"temperature_celsius": 29.0,
"shingled_magnetic_recording_support": "None",
"multi_actuator_drive": false
}
},
"1I:1:6": {
"properties": {
"port": "1I",
"box": "1",
"bay": "6",
"status": "Predictive Failure",
"drive_role": "Unassigned",
"interface_type": "SATA",
"size_bytes": 8000000000000,
"rotational_speed_rpm": 7200,
"firmware_version": "HPGB",
"serial_number": "bla",
"model": "ATA MB8000GEQUU",
"temperature_celsius": 28.0,
"shingled_magnetic_recording_support": "None",
"multi_actuator_drive": false
}
}
}
}
}
}
Now for many things it’s probably rather obvious how to map them into metrics, e.g. I’d place stuff like the controller’s firmware_version, hardware_revision, model, rebuild_priority and serial_number into an info metric like so:
smartarray_controller_info{firmware_version="6.60",hardware_revision="B",model="Smart Array P840ar",rebuild_priority="High",serial_number="bla",slot="0"} 1.0
And I’d do a similar one for arrays as well as logical and physical drives.
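A minimal sketch of how such an info line could be built from the parsed properties. It uses only the standard library and renders the text exposition format by hand; the helper names (escape_label_value, render_sample, controller_info) and the INFO_KEYS selection are assumptions for illustration, not part of any existing exporter:

```python
# Properties that go into the info metric as labels (assumed selection,
# taken from the example line above).
INFO_KEYS = (
    "firmware_version",
    "hardware_revision",
    "model",
    "rebuild_priority",
    "serial_number",
    "slot",
)


def escape_label_value(value):
    """Escape backslash, double quote and newline for the text exposition format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Render one sample line with labels sorted by name."""
    body = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{body}}} {float(value)}"


def controller_info(properties):
    """Build the info line; properties missing from the input are simply left out."""
    labels = {key: properties[key] for key in INFO_KEYS if key in properties}
    return render_sample("smartarray_controller_info", labels, 1)
```

The same pattern would apply to arrays, LDs and PDs with a different key selection and metric name.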
But for some others I wonder what the best way of mapping is.
- Temperatures (and analogously for things like used space, stripe size, etc.)
It’s clear that these must be gauges, but I wonder how I should do them.
a) Like one smartarray_temperature_celsius for all kinds of temperatures, where labels give the type (like controller, controller_cache_module, controller_capacitor, which are all possible temperature kinds for controllers (not all of them show up in the example above, and there are even more types), or pd_temperature), plus the slot and PD name necessary to identify the respective thing.
b) Or rather one metric each for controllers and physical drives, like smartarray_controller_temperature_celsius and smartarray_pd_temperature_celsius, again with a type label that says which kind of temperature is meant.
c) Or even more granular, like smartarray_controller_temperature_celsius, smartarray_controller_cache_module_temperature_celsius, smartarray_controller_capacitor_temperature_celsius, etc., and no type label.
Also, any temperature value is optional… a controller may have it or not, so especially in that very granular mode, metrics may “disappear”.
But having no type label at all will actually be difficult, because the controllers report additional sensors (in variable numbers) as shown below, which I already parse (they’re just not in the JSON above):
Sensor ID: 0
Location: Inlet Ambient
Current Value (C): 43
Max Value Since Power On: 46
Sensor ID: 1
Location: ASIC
Current Value (C): 55
Max Value Since Power On: 59
Sensor ID: 2
Location: Top
Current Value (C): 41
Max Value Since Power On: 44
So for these sensors I would definitely need some type (which would be the value of Location).
d) Could be like (b) or (c) … with an extra smartarray_controller_sensors_temperature_celsius with a type label.
So no idea what’s best.
I guess the type label is in principle bad, because queries would always need to match it.
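To make the options more concrete, here is a rough sketch of what (b) combined with (d) might look like for one controller. It is only one of the variants discussed, not a recommendation. The shape of the parsed "sensors" entries (keys "location" and "current_value_celsius") is an assumption, since the example above only shows an empty "sensors" dict, and the function name is made up:

```python
SUFFIX = "_temperature_celsius"


def controller_temperature_samples(slot, controller):
    """Yield (metric_name, labels, value) tuples for one controller.

    Temperatures that are absent from the input yield no sample at all,
    so the corresponding series is simply missing for that controller.
    """
    # Built-in temperatures: the "type" label is the property name
    # without the unit suffix, e.g. "cache_module".
    for key, value in controller.get("properties", {}).items():
        if key.endswith(SUFFIX):
            labels = {"slot": slot, "type": key[: -len(SUFFIX)]}
            yield "smartarray_controller_temperature_celsius", labels, float(value)

    # Additional sensors. ASSUMED parsed shape (hypothetical):
    #   {"0": {"location": "Inlet Ambient", "current_value_celsius": 43}}
    for sensor in controller.get("sensors", {}).values():
        value = sensor.get("current_value_celsius")
        if value is not None:
            labels = {"slot": slot, "type": sensor["location"]}
            yield "smartarray_controller_sensors_temperature_celsius", labels, float(value)
```

With this layout a query for a single temperature has to select on type, while a query without it returns every temperature of that controller.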
- The various statuses.
First, I don’t know all possible values (like OK) that they may have.
a) One way would be to add them as labels to smartarray_status. Again, plus labels that identify the controller, array, LD, PD (some of these would of course be empty, depending on the respective status type).
b) Like above, one could do (a) with further levels of granularity (i.e. more metrics than just one).
c) Another way would be to simply add them as a label to smartarray_controller_info, smartarray_array_info, smartarray_ld_info, smartarray_pd_info.
Users could then aggregate over these, match e.g. status="OK" or status!="OK", etc. and count, alert, etc.
d) Or one could do something like (a) or (b), but where the actual status (like OK) is not a label but part of the metric name, like smartarray_pd_healthy, smartarray_pd_failed, smartarray_pd_rebuilding, smartarray_pd_predictive_failing, etc., with only the labels that identify the PD, controller, etc.
The problem with that is, as said, that I don’t know all possible status values and that new ones might get added at any time, which would then be lost in the recorded data.
Again, not really sure which design would be good with respect to Prometheus efficiency (both storage and querying), and what makes queries easier for the user to write.
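For comparison, a small sketch of variant (b) for physical drives, where the raw status string is kept as a label and the sample value is always 1, so status values that aren’t known in advance still end up in the data. The metric name smartarray_pd_status, the label names and both function names are assumptions for illustration:

```python
def iter_physical_drives(controller):
    """Yield (array_name, pd_name, properties) for all PDs of one controller.

    Unassigned drives are reported with an empty array name.
    """
    for array_name, array in controller.get("arrays", {}).items():
        for pd_name, pd in array.get("physical_drives", {}).items():
            yield array_name, pd_name, pd["properties"]
    for pd_name, pd in controller.get("unassigned_physical_drives", {}).items():
        yield "", pd_name, pd["properties"]


def pd_status_samples(slot, controller):
    """Yield (metric_name, labels, value) with the status text as a label."""
    for array_name, pd_name, props in iter_physical_drives(controller):
        status = props.get("status")
        if status is None:
            continue  # no status reported, so no sample
        labels = {"slot": slot, "array": array_name, "pd": pd_name, "status": status}
        yield "smartarray_pd_status", labels, 1.0
```

A query such as smartarray_pd_status{status!="OK"} would then list every drive that is not OK. The trade-off is that a status change ends one series and starts a new one instead of changing the value of an existing series.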
Thanks,
Chris.
[0] I know that there are already some, but they seem either unmaintained or not as powerful as I wanted them to be.