A new version of check_smart, an open source monitoring plugin for checking the health of physical hard disk, solid state and NVMe drives, is available!
The newest release, 6.9.0, brings the following enhancements.
In previous versions, check_smart used a colon character (:) to separate multiple devices given by the -g parameter. This internal handling of the device list meant that device paths which themselves contain a colon could not be monitored.
With the pull request #64 by Even Felix (thanks!) this internal handling has been changed from a colon to a pipe character. Therefore devices can now be monitored using the PCI device path. For example:
$ ./check_smart.pl -d /dev/disk/by-path/pci-0000\:00\:1f.2-ata-1 -i ata
OK: Drive Samsung SSD 850 EVO 500GB S/N XXX: no SMART errors detected. |Reallocated_Sector_Ct=0 Power_On_Hours=14511 Power_Cycle_Count=332 Wear_Leveling_Count=12 Used_Rsvd_Blk_Cnt_Tot=0 Program_Fail_Cnt_Total=0 Erase_Fail_Count_Total=0 Runtime_Bad_Block=0 Uncorrectable_Error_Cnt=0 Airflow_Temperature_Cel=32 ECC_Error_Rate=0 CRC_Error_Count=0 POR_Recovery_Count=5 Total_LBAs_Written=15197779706
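Why the separator mattered can be illustrated with a small sketch. It is written in Python rather than the plugin's Perl, and the function name split_devices is hypothetical. It only mimics the idea of joining several device paths into one string and splitting them again, and is not check_smart's actual code.

```python
def split_devices(joined, separator):
    """Split a joined device list back into individual device paths."""
    return [device for device in joined.split(separator) if device]


devices = ["/dev/sda", "/dev/disk/by-path/pci-0000:00:1f.2-ata-1"]

# Colon as separator: the PCI path is torn apart at its own colons.
print(split_devices(":".join(devices), ":"))

# Pipe as separator: both paths survive the round trip unchanged.
print(split_devices("|".join(devices), "|"))
```

With the colon, the by-path device ends up as three meaningless fragments. With the pipe, the list comes back exactly as it went in.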
Version 6.9.0 also introduces a new parameter: -l, with its long counterpart --ssd-lifetime, acts as a boolean switch. If used, check_smart adds the "Percent_Lifetime_Remain" attribute to the list of attributes to be checked.
A default warning threshold of "90" (which needs to be understood as 90% lifetime used) is added in the background.
# ./check_smart.pl -d /dev/sda -i ata -l
WARNING: Drive CT1000MX500SSD1 S/N XXX: Reallocated_Event_Count is non-zero (12), Percent_Lifetime_Remain is non-zero (99)|Raw_Read_Error_Rate=0 Reallocate_NAND_Blk_Cnt=12 Power_On_Hours=10977 Power_Cycle_Count=19 Program_Fail_Count=0 Erase_Fail_Count=12 Ave_Block-Erase_Count=1489 Unexpect_Power_Loss_Ct=16 Unused_Reserve_NAND_Blk=42 SATA_Interfac_Downshift=0 Error_Correction_Count=0 Reported_Uncorrect=0 Temperature_Celsius=42 Reallocated_Event_Count=12 Current_Pending_ECC_Cnt=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=5 Percent_Lifetime_Remain=99 Write_Error_Rate=0 Success_RAIN_Recov_Cnt=0 Total_LBAs_Written=29728798703 Host_Program_Page_Count=22560954702 FTL_Program_Page_Count=39992573131
By using the already existing -w / --warn parameter, the threshold can be increased:
# ./check_smart.pl -d /dev/sda -i ata -l -w "Percent_Lifetime_Remain=100"
WARNING: Drive CT1000MX500SSD1 S/N XXX: Reallocated_Event_Count is non-zero (12), Percent_Lifetime_Remain is non-zero (99) (but less than threshold 100)|Raw_Read_Error_Rate=0 Reallocate_NAND_Blk_Cnt=12 Power_On_Hours=10977 Power_Cycle_Count=19 Program_Fail_Count=0 Erase_Fail_Count=12 Ave_Block-Erase_Count=1489 Unexpect_Power_Loss_Ct=16 Unused_Reserve_NAND_Blk=42 SATA_Interfac_Downshift=0 Error_Correction_Count=0 Reported_Uncorrect=0 Temperature_Celsius=42 Reallocated_Event_Count=12 Current_Pending_ECC_Cnt=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=5 Percent_Lifetime_Remain=99 Write_Error_Rate=0 Success_RAIN_Recov_Cnt=0 Total_LBAs_Written=29728758086 Host_Program_Page_Count=22560921678 FTL_Program_Page_Count=39992573096
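The behaviour seen in the two outputs above can be sketched roughly as follows. This is an illustration in Python, not the plugin's Perl code. The function names parse_thresholds and lifetime_status are hypothetical, and the sketch assumes that thresholds are passed as comma-separated name=value pairs and that a warning is raised once the raw value reaches or exceeds the threshold.

```python
def parse_thresholds(spec):
    """Turn a string like 'Attr=10,Other=5' into {'Attr': 10, 'Other': 5}."""
    thresholds = {}
    for pair in spec.split(","):
        if not pair.strip():
            continue  # skip empty entries, e.g. when no spec was given
        name, _, value = pair.partition("=")
        thresholds[name.strip()] = int(value)
    return thresholds


def lifetime_status(raw_value, user_spec=""):
    """Evaluate Percent_Lifetime_Remain against the effective threshold."""
    # Assumed default of 90, as described for the -l switch.
    thresholds = {"Percent_Lifetime_Remain": 90}
    # A user-supplied threshold (as with -w) overrides the default.
    thresholds.update(parse_thresholds(user_spec))
    limit = thresholds["Percent_Lifetime_Remain"]
    return "WARNING" if raw_value >= limit else "OK"


print(lifetime_status(99))
print(lifetime_status(99, "Percent_Lifetime_Remain=100"))
```

Note that in the second real output the overall state is still WARNING, because Reallocated_Event_Count is non-zero. Only the lifetime attribute itself no longer triggers the warning.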
The initial thought was to add this attribute to the default list of SMART attributes to be checked. But this might annoy users whose drive already reports an "expired life usage", as they would then need to manually adjust the warning thresholds in their monitoring. Using the optional -l / --ssd-lifetime parameter is a quick change and lets the user decide whether or not to monitor this attribute.
Important note: This attribute does not exist on all SSDs. In my own experience I have only seen this attribute on Crucial MX drives. Other SSD vendors might add this attribute to their drives, too. It's also important to understand that the value of "Percent_Lifetime_Remain" is calculated internally by the SSD firmware. Reaching a value close to or equal to 100 can be understood as a helpful indicator from the vendor that "we cannot guarantee a fully working drive from now on". The drive might still continue to work correctly after reaching 100. This attribute is therefore not an indicator that the drive has actually failed; it is more a reminder that the drive could fail from now on.
In the previous release, 6.8.0, the SMART attribute "Command_Timeout" was added to the default raw list to be checked. See issue #61 for more details. However, after going through dozens of drives, a couple of which had a "Command_Timeout" value above zero, none of them showed any other signs of failure, such as reallocated blocks. The check was more misleading than helpful and required manually adjusting the warning thresholds for this attribute on a couple of drives. The attribute has therefore been removed from the default list again.
If this attribute should be checked, it can be added back to the list of attributes to be checked using the -r / --raw parameter. See the documentation for more information.