Hey.
I’m writing[0] a (hopefully proper) exporter for (HPE) Smart Array RAID controllers, that takes the output of (for now only) ssacli
and exports Prometheus metrics.
The parsing is already mostly done and I’ll end up with some Python data structure like:
{
"controllers": {
"0": {
"properties": {
"slot": "0",
"serial_number": "bla",
"controller_status": "OK",
"hardware_revision": "B",
"firmware_version": "6.60",
"rebuild_priority": "High",
"cache_status": "OK",
"battery_capacitor_status": "OK",
"controller_temperature_celsius": 49.0,
"cache_module_temperature_celsius": 38.0,
"model": "Smart Array P840ar"
},
"sensors": {},
"arrays": {
"A": {
"properties": {
"unused_space_bytes": 4194304.0,
"used_space_bytes": 239992035082.24,
"status": "OK",
"multidomain_status": "OK"
},
"logical_drives": {
"1": {
"properties": {
"size_bytes": 120001386250.24,
"raid_level": "1",
"chunk_size_bytes": 262144,
"stripe_size_bytes": 262144,
"status": "OK",
"unrecoverable_media_errors": "None",
"multidomain_status": "OK",
"caching": false,
"device": "/dev/sdb",
"logical_drive_label": "02FB5470PVYKH0BRH420GX53CC"
}
}
},
"physical_drives": {
"2I:4:1": {
"properties": {
"port": "2I",
"box": "4",
"bay": "1",
"status": "OK",
"drive_role": "Data",
"interface_type": "Solid State SATA",
"size_bytes": 120000000000,
"firmware_version": "4IWTHPG1",
"serial_number": "bla",
"model": "ATA VK0120GEYJP",
"temperature_celsius": 32.0,
"usage_remaining_percent": 95.7,
"power_on_hours": 59236.0,
"life_remaining_based_on_workload_to_date_days": 54931.0,
"shingled_magnetic_recording_support": "None",
"multi_actuator_drive": false
}
},
"2I:4:2": {
"properties": {
"port": "2I",
"box": "4",
"bay": "2",
"status": "OK",
"drive_role": "Data",
"interface_type": "Solid State SATA",
"size_bytes": 120000000000,
"firmware_version": "4IWTHPG1",
"serial_number": "bla",
"temperature_celsius": 33.0,
"model": "ATA VK0120GEYJP",
"usage_remaining_percent": 95.67,
"power_on_hours": 59236.0,
"life_remaining_based_on_workload_to_date_days": 54533.0,
"shingled_magnetic_recording_support": "None",
"multi_actuator_drive": false
}
}
}
},
"B": {
"properties": {
"unused_space_bytes": 0.0,
"used_space_bytes": 24002338834350.08,
"status": "OK",
"multidomain_status": "OK"
},
"logical_drives": {
"2": {
"properties": {
"size_bytes": 15997894184140.8,
"raid_level": "5",
"chunk_size_bytes": 65536,
"stripe_size_bytes": 131072,
"status": "OK",
"unrecoverable_media_errors": "None",
"multidomain_status": "OK",
"caching": true,
"parity_initialization_status": "In Progress",
"device": "/dev/sdc",
"logical_drive_label": "055ACDA3PVYKH0BRH420GX48FD"
}
}
},
"physical_drives": {
"2I:2:1": {
"properties": {
"port": "2I",
"box": "2",
"bay": "1",
"status": "OK",
"drive_role": "Data",
"interface_type": "SATA",
"size_bytes": 8000000000000,
"rotational_speed_rpm": 7200,
"firmware_version": "HPGB",
"serial_number": "bla",
"model": "ATA MB8000GEQUU",
"temperature_celsius": 28.0,
"shingled_magnetic_recording_support": "None",
"multi_actuator_drive": false
}
},
"2I:2:3": {
"properties": {
"port": "2I",
"box": "2",
"bay": "3",
"status": "OK",
"drive_role": "Data",
"interface_type": "SATA",
"size_bytes": 8000000000000,
"rotational_speed_rpm": 7200,
"firmware_version": "HPGB",
"serial_number": "bla",
"model": "ATA MB8000GEQUU",
"temperature_celsius": 27.0,
"shingled_magnetic_recording_support": "None",
"multi_actuator_drive": false
}
}
}
}
},
"unassigned_physical_drives": {
"1I:1:1": {
"properties": {
"port": "1I",
"box": "1",
"bay": "1",
"status": "Predictive Failure",
"drive_role": "Unassigned",
"interface_type": "SATA",
"size_bytes": 8000000000000,
"rotational_speed_rpm": 7200,
"firmware_version": "HPGB",
"serial_number": "bla",
"model": "ATA MB8000GEQUU",
"temperature_celsius": 28.0,
"shingled_magnetic_recording_support": "None",
"multi_actuator_drive": false
}
},
"1I:1:2": {
"properties": {
"port": "1I",
"box": "1",
"bay": "2",
"status": "Predictive Failure",
"drive_role": "Unassigned",
"interface_type": "SATA",
"size_bytes": 8000000000000,
"rotational_speed_rpm": 7200,
"firmware_version": "HPGB",
"serial_number": "bla",
"model": "ATA MB8000GEQUU",
"temperature_celsius": 28.0,
"shingled_magnetic_recording_support": "None",
"multi_actuator_drive": false
}
},
"1I:1:4": {
"properties": {
"port": "1I",
"box": "1",
"bay": "4",
"status": "OK",
"drive_role": "Unassigned",
"interface_type": "SATA",
"size_bytes": 8000000000000,
"rotational_speed_rpm": 7200,
"firmware_version": "HPGB",
"serial_number": "bla",
"model": "ATA MB8000GEQUU",
"temperature_celsius": 29.0,
"shingled_magnetic_recording_support": "None",
"multi_actuator_drive": false
}
},
"1I:1:6": {
"properties": {
"port": "1I",
"box": "1",
"bay": "6",
"status": "Predictive Failure",
"drive_role": "Unassigned",
"interface_type": "SATA",
"size_bytes": 8000000000000,
"rotational_speed_rpm": 7200,
"firmware_version": "HPGB",
"serial_number": "bla",
"model": "ATA MB8000GEQUU",
"temperature_celsius": 28.0,
"shingled_magnetic_recording_support": "None",
"multi_actuator_drive": false
}
}
}
}
}
}
Now for many things it’s probably rather obvious how to map them into metrics, e.g. I’d place stuff like the controller’s firmware_version, hardware_revision, model, rebuild_priority and serial_number into an info metric like so:
smartarray_controller_info{firmware_version="6.60",hardware_revision="B",model="Smart Array P840ar",rebuild_priority="High",serial_number="bla",slot="0"} 1.0
And I’d do a similar one for arrays as well as logical and physical drives.
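
A minimal sketch of how such an info line could be produced from a controller's properties dict. It renders the text exposition format by hand using only the standard library; a real exporter would more likely go through a client library. The helper names (INFO_KEYS, escape_label_value, controller_info_line) are hypothetical; only the metric name and the property keys come from the example above.

```python
# Properties that become labels of the info metric (choice taken from the post).
INFO_KEYS = ("firmware_version", "hardware_revision", "model",
             "rebuild_priority", "serial_number", "slot")


def escape_label_value(value):
    """Escape backslash, double quote and newline for a label value."""
    return (str(value).replace("\\", "\\\\")
            .replace('"', '\\"').replace("\n", "\\n"))


def controller_info_line(properties):
    """Render one smartarray_controller_info sample (hypothetical helper)."""
    labels = ",".join(
        f'{key}="{escape_label_value(properties[key])}"'
        for key in sorted(INFO_KEYS)
        if key in properties  # optional properties are simply left out
    )
    return f"smartarray_controller_info{{{labels}}} 1.0"


# Trimmed copy of the controller properties from the structure above.
controller = {
    "slot": "0", "serial_number": "bla", "controller_status": "OK",
    "hardware_revision": "B", "firmware_version": "6.60",
    "rebuild_priority": "High", "model": "Smart Array P840ar",
}
line = controller_info_line(controller)
print(line)
```

Properties outside INFO_KEYS (here controller_status) are ignored by this sketch, which keeps the question of where the statuses go separate from the info metric.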
But for some others I wonder what the best way of mapping is.
- Temperatures (and analogously for things like used space, stripe size, etc.)
It’s clear that these must be gauges, but I wonder how I should do them.
a) Like one smartarray_temperature_celsius for all kinds of temperatures, where labels give the type (like controller, controller_cache_module, controller_capacitor, which are all possible temperature kinds for controllers (they don’t all show up in the example above, and there are even more types), or pd_temperature), plus the slot and PD name necessary to identify the respective thing.
b) Or rather one metric each for controllers and physical drives, like smartarray_controller_temperature_celsius and smartarray_pd_temperature_celsius, again with a type label that gives which kind of temperature is meant.
c) Or even more granular, like smartarray_controller_temperature_celsius, smartarray_controller_cache_module_temperature_celsius, smartarray_controller_capacitor_temperature_celsius, etc., and no type label.
Also, any temperature value is optional… a controller may have it or not, so especially in that very granular mode, metrics may “disappear”.
But having no type label at all will actually be difficult, because the controllers expose additional (variable numbers of) sensors as shown below, which I already parse (they’re just not in the JSON above):
Sensor ID: 0
Location: Inlet Ambient
Current Value (C): 43
Max Value Since Power On: 46
Sensor ID: 1
Location: ASIC
Current Value (C): 55
Max Value Since Power On: 59
Sensor ID: 2
Location: Top
Current Value (C): 41
Max Value Since Power On: 44
So for these sensors I would definitely need some type (which would be the value of Location).
d) Could be like (b) or (c), with an extra smartarray_controller_sensors_temperature_celsius with a type label.
So no idea what’s best.
I guess the type field is in principle bad, because queries would always need to match it.
- The various statuses.
First, I don’t know all possible values (like OK) that they may have.
a) One way would be to add them as labels to smartarray_status, again plus labels that identify the controller, array, LD or PD (some of these would of course be empty, depending on the respective status type).
b) Like above, one could do (a) with further levels of granularity (i.e. more metrics than just one).
c) Another way would be to simply add them as labels to smartarray_controller_info, smartarray_array_info, smartarray_ld_info and smartarray_pd_info. Users could then aggregate over these, match e.g. =="OK" or !="OK", etc., and count, alert, and so on.
d) Or one could do something like (a) or (b), but where the actual status (like OK) is not a label but the metric itself, like smartarray_pd_healthy, smartarray_pd_failed, smartarray_pd_rebuilding, smartarray_pd_predictive_failing, etc., with only the labels that identify the PD, controller, etc.
The problem with that is, as said, that I don’t know all possible status values, and that new ones might be added at any time, which would then be lost in the recorded data.
Again, not really sure which design would be good with respect to Prometheus efficiency (both storage and querying), and which would make queries easier for users to write.
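
For comparison, a small sketch of the per-object variant of (a), i.e. (b): the status text becomes a label on a dedicated smartarray_pd_status metric with a constant value of 1, so a status value that was never seen before still ends up in the recorded data. The function name pd_status_lines is hypothetical, and the lines are again rendered by hand with the standard library only.

```python
def pd_status_lines(slot, physical_drives):
    """Render one smartarray_pd_status sample per drive (hypothetical helper)."""
    lines = []
    for name, drive in physical_drives.items():
        status = drive["properties"].get("status")
        if status is None:
            continue  # no status reported: emit nothing for this drive
        # The status is passed through as-is, so unknown values are kept.
        lines.append(
            f'smartarray_pd_status{{pd="{name}",slot="{slot}",'
            f'status="{status}"}} 1'
        )
    return lines


# Two of the unassigned drives from the structure in the post.
unassigned = {
    "1I:1:1": {"properties": {"status": "Predictive Failure"}},
    "1I:1:4": {"properties": {"status": "OK"}},
}
status_lines = pd_status_lines("0", unassigned)
print("\n".join(status_lines))
```

On the query side, smartarray_pd_status{status!="OK"} would select every drive that is not OK, and count by (status) (smartarray_pd_status) would give a per-status tally. One trade-off of this design is that a status change ends one series and starts another.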
Thanks,
Chris.
[0] I know that there are already some, but they seem either unmaintained or not as powerful as I wanted them to be.