Monitoring plugin check_smart 6.11 released: NVMe attributes with dots and order output by priority/criticality


A new version of check_smart, an open source monitoring plugin to monitor the health of hard drives, solid state drives and NVMe drives, is now available!

Release 6.11 adds two improvements to the plugin.

Handle dots in NVMe attribute names

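This problem was reported in issue 62 on GitHub. Certain NVMe drives show attributes with a dot in the name, as in the following output (see Warning__Comp._Temperature_Time and Critical_Comp._Temperature_Time in the performance data):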

Nvme_0 0 OK: Drive SAMSUNG MZVLB512HAJQ-00000 S/N XXX: no SMART errors detected. |Temperature=25 Available_Spare=100 Available_Spare_Threshold=10 Percentage_Used=22 Data_Units_Read=35774467 Data_Units_Written=280451586 Host_Read_Commands=637677302 Host_Write_Commands=2270597693 Controller_Busy_Time=5846 Power_Cycles=22 Power_On_Hours=1268 Unsafe_Shutdowns=7 Media_and_Data_Integrity_Errors=0 Error_Information_Log_Entries=10 Warning__Comp._Temperature_Time=0 Critical_Comp._Temperature_Time=0 Temperature_Sensor_1=25 Temperature_Sensor_2=34

This may cause problems when automatically creating graphs from the performance data.

Version 6.11 now internally removes the dots from the attribute names. Warning__Comp._Temperature_Time therefore becomes Warning__Comp_Temperature_Time.
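The effect of this change can be illustrated with a minimal Python sketch. It is not the plugin's actual code (check_smart is written in Perl); the function names sanitize_label and sanitize_perfdata are hypothetical, and the sketch assumes the performance data is a space-separated list of label=value pairs as shown above.

```python
def sanitize_label(name: str) -> str:
    """Remove dots from a performance data label (hypothetical helper)."""
    return name.replace(".", "")


def sanitize_perfdata(perfdata: str) -> str:
    """Strip dots from the labels of space-separated label=value pairs.

    Only the label (left of the first '=') is changed; the value is kept
    as-is, so a decimal value would not lose its dot.
    """
    cleaned = []
    for item in perfdata.split():
        label, sep, value = item.partition("=")
        cleaned.append(sanitize_label(label) + sep + value)
    return " ".join(cleaned)


sample = (
    "Warning__Comp._Temperature_Time=0 "
    "Critical_Comp._Temperature_Time=0 "
    "Temperature_Sensor_1=25"
)
print(sanitize_perfdata(sample))
```

Restricting the replacement to the label part is a design choice of this sketch, so that values remain untouched.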

Prioritize output by criticality

When running check_smart with the -g parameter (to check multiple drives at the same time), the plugin would simply return all drives with a "non-ok" state in the order they were parsed. This also means that the plugin did not differentiate between drives with a warning state and drives with a critical state.

As discussed in issue 70 with reporter Peter Newman, the best behaviour would be to show all the "critical drives" first, followed by the "warning drives" and finally the "ok drives".

Version 6.11 internally handles the drives differently now. Instead of using "non-ok drives" and "ok drives", the non-ok drives are now split into "critical drives" and "warning drives". This allows a different priority and different sorting of these drives.
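The resulting ordering can be sketched in a few lines of Python. This is only an illustration under assumptions, not the plugin's Perl implementation: the SEVERITY_ORDER mapping, the order_by_criticality function and the device names are all hypothetical.

```python
# Lower number = shown earlier in the output (assumed priority mapping).
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1, "OK": 2}


def order_by_criticality(results):
    """Sort (drive, state) pairs: critical first, then warning, then ok.

    sorted() is stable, so drives with the same state keep the order
    in which they were parsed.
    """
    return sorted(results, key=lambda result: SEVERITY_ORDER[result[1]])


# Hypothetical results in the order the drives were parsed.
parsed = [
    ("/dev/sda", "OK"),
    ("/dev/sdb", "WARNING"),
    ("/dev/sdc", "CRITICAL"),
    ("/dev/sdd", "WARNING"),
]

for drive, state in order_by_criticality(parsed):
    print(f"{state}: {drive}")
```

In this sketch, /dev/sdc is printed first, followed by the two warning drives in their parsed order, and /dev/sda last.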

Another change was made for attributes which are using a warning threshold (using the -w parameter). If the threshold is not yet reached, the affected attribute is now handled as "notice". An example can be seen in the following case.

Before version 6.11.0, attributes would show up in their lookup order, even when different thresholds were given:

root@server:~# ./check_smart.pl -d /dev/sg3 -i sat -w "Reallocated_Sector_Ct=500"
WARNING: Drive  WD2000FYYZ-23UL S/N XXX:  Reallocated_Sector_Ct is non-zero (372) (but less than threshold 500), Reallocated_Event_Count is non-zero (36), Current_Pending_Sector is non-zero (2), Offline_Uncorrectable is non-zero (2)|Raw_Read_Error_Rate=0 Spin_Up_Time=4500 Start_Stop_Count=79 Reallocated_Sector_Ct=372 Seek_Error_Rate=0 Power_On_Hours=47246 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=35 G-Sense_Error_Rate=1 Power-Off_Retract_Count=30 Load_Cycle_Count=48 Temperature_Celsius=36 Hardware_ECC_Recovered=0 Reallocated_Event_Count=36 Current_Pending_Sector=2 Offline_Uncorrectable=2 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=2

Note that the attribute Reallocated_Sector_Ct has a threshold of 500, which is not yet reached. Yet this attribute shows up at the beginning of the plugin's output, because the output follows the same order as the drive's list of attributes (use the --debug parameter to see that list).

Starting with 6.11.0, the output is now sorted. Reallocated_Sector_Ct now shows up last, as it is considered a "notice" only:

root@server:~# ./check_smart.pl -d /dev/sg3 -i sat -w "Reallocated_Sector_Ct=500"
WARNING: Drive  WD2000FYYZ-23UL S/N XXX:  Reallocated_Event_Count is non-zero (36), Current_Pending_Sector is non-zero (2), Offline_Uncorrectable is non-zero (2), Reallocated_Sector_Ct is non-zero (372) (but less than threshold 500)|Raw_Read_Error_Rate=0 Spin_Up_Time=4500 Start_Stop_Count=79 Reallocated_Sector_Ct=372 Seek_Error_Rate=0 Power_On_Hours=47247 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=35 G-Sense_Error_Rate=1 Power-Off_Retract_Count=30 Load_Cycle_Count=48 Temperature_Celsius=36 Hardware_ECC_Recovered=0 Reallocated_Event_Count=36 Current_Pending_Sector=2 Offline_Uncorrectable=2 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=2
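The reordering seen in this output can be approximated with the following Python sketch. How the plugin classifies and stores its messages internally is an assumption here; the order_messages function and the is_notice flag are hypothetical and only model the visible behaviour.

```python
def order_messages(messages):
    """Put regular findings first and "notice" findings last.

    `messages` is a list of (text, is_notice) pairs in attribute lookup
    order. The relative order inside each group is preserved.
    """
    findings = [text for text, is_notice in messages if not is_notice]
    notices = [text for text, is_notice in messages if is_notice]
    return findings + notices


# Messages in attribute lookup order, taken from the example output.
lookup_order = [
    ("Reallocated_Sector_Ct is non-zero (372) (but less than threshold 500)", True),
    ("Reallocated_Event_Count is non-zero (36)", False),
    ("Current_Pending_Sector is non-zero (2)", False),
    ("Offline_Uncorrectable is non-zero (2)", False),
]

print(", ".join(order_messages(lookup_order)))
```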

