Monitoring plugin check_smart 6.9.0 released: Allow PCI device paths, add Percent_Lifetime_Remain SSD check


Last updated on April 13th 2021 - Listed in Hardware Monitoring


A new version of check_smart, an open source monitoring plugin to monitor the health of physical hard drives, solid state drives and NVMe drives, is available!

The new release 6.9.0 brings the following changes.

Allowing PCI device paths

In previous versions, check_smart used a colon character (:) to separate multiple devices given by the -g parameter. Because of this internal handling of the device list, device paths which themselves contain a colon could not be monitored.

With pull request #64 by Even Felix (thanks!), the internal separator has been changed from a colon to a pipe character. Devices can therefore now be monitored using their PCI device path. For example:

$ ./check_smart.pl -d /dev/disk/by-path/pci-0000\:00\:1f.2-ata-1 -i ata
OK: Drive  Samsung SSD 850 EVO 500GB S/N XXX: no SMART errors detected. |Reallocated_Sector_Ct=0 Power_On_Hours=14511 Power_Cycle_Count=332 Wear_Leveling_Count=12 Used_Rsvd_Blk_Cnt_Tot=0 Program_Fail_Cnt_Total=0 Erase_Fail_Count_Total=0 Runtime_Bad_Block=0 Uncorrectable_Error_Cnt=0 Airflow_Temperature_Cel=32 ECC_Error_Rate=0 CRC_Error_Count=0 POR_Recovery_Count=5 Total_LBAs_Written=15197779706
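
Why the separator matters can be shown with a minimal sketch. This is an illustration in Python, not the plugin's actual Perl code; the function name split_devices and the way the list is joined are assumptions made for the example only.

```python
# Illustration only: not taken from check_smart itself.
PCI_PATH = "/dev/disk/by-path/pci-0000:00:1f.2-ata-1"


def split_devices(device_list, separator):
    """Split a joined device list back into individual device paths."""
    return device_list.split(separator)


devices = ["/dev/sda", PCI_PATH]

# Colon as separator: the PCI path itself contains colons and gets cut apart.
colon_result = split_devices(":".join(devices), ":")

# Pipe as separator: the PCI path survives in one piece.
pipe_result = split_devices("|".join(devices), "|")

print(colon_result)
print(pipe_result)
```

With the colon, the two devices come back as four fragments, none of which is the PCI path. With the pipe, the original two paths are returned unchanged.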

Introducing '-l' parameter for Percent_Lifetime_Remain attribute

Version 6.9.0 also introduces a new parameter: -l, with its long counterpart --ssd-lifetime, acts as a boolean switch. If used, check_smart adds the "Percent_Lifetime_Remain" attribute to the list of attributes to be checked.


When -l is used, a default warning threshold of "90" (to be understood as 90% of the lifetime used) is applied in the background:

# ./check_smart.pl -d /dev/sda -i ata -l
WARNING: Drive  CT1000MX500SSD1 S/N XXX:  Reallocated_Event_Count is non-zero (12), Percent_Lifetime_Remain is non-zero (99)|Raw_Read_Error_Rate=0 Reallocate_NAND_Blk_Cnt=12 Power_On_Hours=10977 Power_Cycle_Count=19 Program_Fail_Count=0 Erase_Fail_Count=12 Ave_Block-Erase_Count=1489 Unexpect_Power_Loss_Ct=16 Unused_Reserve_NAND_Blk=42 SATA_Interfac_Downshift=0 Error_Correction_Count=0 Reported_Uncorrect=0 Temperature_Celsius=42 Reallocated_Event_Count=12 Current_Pending_ECC_Cnt=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=5 Percent_Lifetime_Remain=99 Write_Error_Rate=0 Success_RAIN_Recov_Cnt=0 Total_LBAs_Written=29728798703 Host_Program_Page_Count=22560954702 FTL_Program_Page_Count=39992573131

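The threshold can be raised with the already existing -w / --warn parameter. In the following example the attribute is reported as below the new threshold of 100; the check remains in WARNING because Reallocated_Event_Count is still non-zero: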

# ./check_smart.pl -d /dev/sda -i ata -l -w "Percent_Lifetime_Remain=100"
WARNING: Drive  CT1000MX500SSD1 S/N XXX:  Reallocated_Event_Count is non-zero (12), Percent_Lifetime_Remain is non-zero (99) (but less than threshold 100)|Raw_Read_Error_Rate=0 Reallocate_NAND_Blk_Cnt=12 Power_On_Hours=10977 Power_Cycle_Count=19 Program_Fail_Count=0 Erase_Fail_Count=12 Ave_Block-Erase_Count=1489 Unexpect_Power_Loss_Ct=16 Unused_Reserve_NAND_Blk=42 SATA_Interfac_Downshift=0 Error_Correction_Count=0 Reported_Uncorrect=0 Temperature_Celsius=42 Reallocated_Event_Count=12 Current_Pending_ECC_Cnt=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=5 Percent_Lifetime_Remain=99 Write_Error_Rate=0 Success_RAIN_Recov_Cnt=0 Total_LBAs_Written=29728758086 Host_Program_Page_Count=22560921678 FTL_Program_Page_Count=39992573096
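
The threshold behaviour seen in the two outputs above can be sketched roughly as follows. This is a Python illustration, not the plugin's Perl implementation. The helper names are hypothetical, the sample output line is abbreviated, and the comparison (warning once the value reaches the threshold) is an assumption inferred from the "but less than threshold" message.

```python
# Default from the text above: 90 means 90% of the lifetime used.
DEFAULT_WARN = {"Percent_Lifetime_Remain": 90}


def parse_perfdata(plugin_output):
    """Return the label=value pairs after the first '|' as a dict of ints.

    The check_smart output shown above uses plain integer values.
    """
    _, _, perfdata = plugin_output.partition("|")
    values = {}
    for pair in perfdata.split():
        label, _, raw = pair.partition("=")
        values[label] = int(raw)
    return values


def parse_warn_option(option):
    """Parse a -w style string such as 'Percent_Lifetime_Remain=100'.

    Assumption: several thresholds may be given, separated by commas.
    """
    thresholds = {}
    for item in option.split(","):
        name, _, limit = item.partition("=")
        thresholds[name.strip()] = int(limit)
    return thresholds


def lifetime_state(value, thresholds):
    """Assumption: reaching the threshold already counts as a warning."""
    limit = thresholds["Percent_Lifetime_Remain"]
    return "WARNING" if value >= limit else "OK"


# Abbreviated version of the plugin output shown above.
sample = (
    "WARNING: Drive CT1000MX500SSD1 S/N XXX: "
    "Percent_Lifetime_Remain is non-zero (99)"
    "|Reallocated_Event_Count=12 Percent_Lifetime_Remain=99"
)

used = parse_perfdata(sample)["Percent_Lifetime_Remain"]
default_state = lifetime_state(used, DEFAULT_WARN)
custom = {**DEFAULT_WARN, **parse_warn_option("Percent_Lifetime_Remain=100")}
custom_state = lifetime_state(used, custom)

print(used, default_state, custom_state)
```

In this sketch a value of 99 triggers a warning against the default threshold of 90, but not against the user-supplied threshold of 100. The sketch only evaluates this one attribute; other attributes, such as Reallocated_Event_Count, still affect the overall state of the real check.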

The initial thought was to add this attribute to the default list of SMART attributes to be checked. However, this could be annoying for users whose drive already reports an "expired life usage", as they would then need to manually adjust the warning thresholds in their monitoring. The optional -l / --ssd-lifetime parameter is a quick change and lets the user decide whether or not to monitor this attribute.

Important note: This attribute does not exist on all SSDs. In my own experience I have only seen it on Crucial MX drives. Other SSD vendors might add this attribute to their drives, too. It's also important to understand that the value of "Percent_Lifetime_Remain" is calculated internally by the SSD firmware. Reaching a value of or close to 100 can be understood as a helpful indicator from the vendor that "we cannot guarantee a fully working drive from now on". The drive might still continue to work correctly after reaching 100. This attribute is therefore not an indicator that the drive has actually failed; it is rather a reminder that the drive could fail from now on.

Removed Command_Timeout from default raw list

In the previous release, 6.8.0, the SMART attribute "Command_Timeout" was added to the default raw list to be checked. See issue #61 for more details. However, after going through dozens of drives, a couple of which had a "Command_Timeout" value above zero, none of them showed any other signs of failure, such as re-allocated blocks. The check was more misleading than helpful and required manually adjusting the warning threshold for this attribute on a couple of drives.

If this attribute should be checked, it can be added back to the list of attributes to be checked using the -r / --raw parameter. See the documentation for more information.

