A couple of months ago I wrote about a Crucial MX500 SSD which died after 8650 hours. The Solid State Drive was used in a physical server, attached to a hardware raid controller. It was this controller which marked the SSD as failed and took the drive out of the raid array.
Thanks to statistics collected by the monitoring plugin check_smart and stored in an InfluxDB time series database, the drive's SMART attributes from the past can be analyzed. As mentioned, the drive was tagged as "failed" by the hardware raid controller and replaced around hour 8650 (Power_On_Hours) of the drive. Looking back over the two months prior to this 8650th hour reveals a few (12) reallocated sectors:
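For context, our check_smart runs feed their performance data into InfluxDB. As a rough illustration of that idea (not the actual plugin code), the following Python sketch reads the SMART attributes via smartctl's JSON output and prints them as InfluxDB line protocol; the device path and the "smart" measurement name are assumptions.

#!/usr/bin/env python3
# Minimal sketch (not the actual check_smart plugin): read SMART attributes
# via smartctl's JSON output and print them as InfluxDB line protocol.
# Assumptions: smartmontools >= 7.0 (-j flag), a measurement named "smart".
import json
import subprocess
import time

DEVICE = "/dev/sda"  # assumption: adjust to the drive you monitor

def read_smart_attributes(device):
    """Return a dict of SMART attribute name -> raw value."""
    result = subprocess.run(["smartctl", "-j", "-A", device],
                            capture_output=True, text=True)
    data = json.loads(result.stdout)
    table = data.get("ata_smart_attributes", {}).get("table", [])
    return {row["name"]: row["raw"]["value"] for row in table}

def to_line_protocol(device, attributes):
    """Build one InfluxDB line protocol record (e.g. for an 'influx write')."""
    fields = ",".join(f"{name}={value}i" for name, value in attributes.items())
    return f"smart,device={device} {fields} {int(time.time() * 1e9)}"

if __name__ == "__main__":
    print(to_line_protocol(DEVICE, read_smart_attributes(DEVICE)))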
Compared to hard drives (HDDs), that's not a large number of reallocated sectors, so the question came up whether this SSD was really dead or whether it had just reached an (internal, safe) threshold of the raid controller.
We decided to find out more and continued using this SSD (Crucial MX500 1TB, Model CT1000MX500SSD1) in a local test server running Debian 10 (Buster) with a software (mdadm) RAID-1. That was at the end of January 2021.
Interestingly, the Percent_Lifetime_Remain attribute showed a value of 93 at that time. It is important to understand that SMART attributes are (usually) counters that count upwards. The attribute's name falsely suggests that 93% of the lifetime is still remaining - it should really be read as "Percent_Lifetime_Used".
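Purely as an illustration of that reading, a tiny helper makes the naming less confusing (93 being the value we saw at the end of January):

def lifetime_used_percent(percent_lifetime_remain_raw):
    """The raw value of Percent_Lifetime_Remain counts upwards: it is the
    percentage of rated lifetime already used, not what is left."""
    used = percent_lifetime_remain_raw
    remaining = max(0, 100 - used)  # goes past 100 at end of life, clamp at 0
    return used, remaining

used, remaining = lifetime_used_percent(93)
print(f"{used}% of rated lifetime used, roughly {remaining}% remaining")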
Even when the value reached 100 on March 8th 2021, the SSD still seemed to run without any indication of failure.
At hour ~11730, the moment of truth arrived: the drive "disappeared" from the OS and not even check_smart was able to read any SMART attributes anymore.
In dmesg we found kernel errors showing the ATA link going down and mdadm marking the drive (sda) as failed:
[Sun Apr 25 15:22:26 2021] ata1.00: exception Emask 0x0 SAct 0x4000700 SErr 0x0 action 0x6 frozen
[Sun Apr 25 15:22:26 2021] ata1.00: failed command: WRITE FPDMA QUEUED
[Sun Apr 25 15:22:26 2021] ata1.00: cmd 61/08:40:98:8f:60/00:00:29:00:00/40 tag 8 ncq dma 4096 out
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[Sun Apr 25 15:22:26 2021] ata1.00: status: { DRDY }
[Sun Apr 25 15:22:26 2021] ata1.00: failed command: WRITE FPDMA QUEUED
[Sun Apr 25 15:22:26 2021] ata1.00: cmd 61/20:48:cf:06:1d/00:00:17:00:00/40 tag 9 ncq dma 16384 out
res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[Sun Apr 25 15:22:26 2021] ata1.00: status: { DRDY }
[Sun Apr 25 15:22:26 2021] ata1.00: failed command: WRITE FPDMA QUEUED
[Sun Apr 25 15:22:26 2021] ata1.00: cmd 61/08:50:e8:4d:1b/00:00:22:00:00/40 tag 10 ncq dma 4096 out
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[Sun Apr 25 15:22:26 2021] ata1.00: status: { DRDY }
[Sun Apr 25 15:22:26 2021] ata1.00: failed command: WRITE FPDMA QUEUED
[Sun Apr 25 15:22:26 2021] ata1.00: cmd 61/10:d0:08:30:bc/00:00:2b:00:00/40 tag 26 ncq dma 8192 out
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[Sun Apr 25 15:22:26 2021] ata1.00: status: { DRDY }
[Sun Apr 25 15:22:26 2021] ata1: hard resetting link
[Sun Apr 25 15:22:36 2021] ata1: softreset failed (device not ready)
[Sun Apr 25 15:22:36 2021] ata1: hard resetting link
[Sun Apr 25 15:22:46 2021] ata1: softreset failed (device not ready)
[Sun Apr 25 15:22:46 2021] ata1: hard resetting link
[Sun Apr 25 15:22:57 2021] ata1: link is slow to respond, please be patient (ready=0)
[Sun Apr 25 15:23:21 2021] ata1: softreset failed (device not ready)
[Sun Apr 25 15:23:21 2021] ata1: limiting SATA link speed to 3.0 Gbps
[Sun Apr 25 15:23:21 2021] ata1: hard resetting link
[Sun Apr 25 15:23:26 2021] ata1: softreset failed (device not ready)
[Sun Apr 25 15:23:26 2021] ata1: reset failed, giving up
[Sun Apr 25 15:23:26 2021] ata1.00: disabled
[Sun Apr 25 15:23:26 2021] ata1: EH complete
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#16 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#14 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#16 CDB: Read(10) 28 00 22 16 29 00 00 00 08 00
[Sun Apr 25 15:23:26 2021] print_req_error: I/O error, dev sda, sector 571877632
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#14 CDB: Read(10) 28 00 03 f9 f1 28 00 00 08 00
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#13 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#13 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[Sun Apr 25 15:23:26 2021] print_req_error: I/O error, dev sda, sector 97656848
[Sun Apr 25 15:23:26 2021] md: super_written gets error=10
[Sun Apr 25 15:23:26 2021] md/raid1:md2: Disk failure on sda3, disabling device.
md/raid1:md2: Operation continuing on 1 devices.
[Sun Apr 25 15:23:26 2021] md/raid1:md2: sda3: rescheduling sector 473958656
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#15 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#15 CDB: Read(10) 28 00 04 42 df 40 00 00 08 00
[Sun Apr 25 15:23:26 2021] print_req_error: I/O error, dev sda, sector 71491392
[Sun Apr 25 15:23:26 2021] md/raid1:md1: sda2: rescheduling sector 22630208
[Sun Apr 25 15:23:26 2021] print_req_error: I/O error, dev sda, sector 66711848
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Sun Apr 25 15:23:26 2021] md/raid1:md1: sda2: rescheduling sector 17850664
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#1 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[Sun Apr 25 15:23:26 2021] print_req_error: I/O error, dev sda, sector 48828424
[Sun Apr 25 15:23:26 2021] md: super_written gets error=10
[Sun Apr 25 15:23:26 2021] md/raid1:md1: Disk failure on sda2, disabling device.
md/raid1:md1: Operation continuing on 1 devices.
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#11 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#11 CDB: Read(10) 28 00 03 b8 e9 78 00 00 08 00
[Sun Apr 25 15:23:26 2021] print_req_error: I/O error, dev sda, sector 62450040
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#17 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Sun Apr 25 15:23:26 2021] md/raid1:md1: sda2: rescheduling sector 13588856
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#12 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#12 CDB: Read(10) 28 00 03 cc 5e 18 00 00 08 00
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#17 CDB: Read(10) 28 00 2c 76 45 50 00 00 08 00
[Sun Apr 25 15:23:26 2021] print_req_error: I/O error, dev sda, sector 63725080
[Sun Apr 25 15:23:26 2021] md/raid1:md2: sda3: rescheduling sector 648029520
[Sun Apr 25 15:23:26 2021] md/raid1:md1: sda2: rescheduling sector 14863896
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#2 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#2 CDB: Write(10) 2a 00 2b bc 30 08 00 00 10 00
[Sun Apr 25 15:23:26 2021] print_req_error: I/O error, dev sda, sector 733753352
[Sun Apr 25 15:23:26 2021] print_req_error: I/O error, dev sda, sector 456018040
[Sun Apr 25 15:23:26 2021] print_req_error: I/O error, dev sda, sector 456030136
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Sun Apr 25 15:23:26 2021] sd 0:0:0:0: [sda] tag#18 CDB: Read(10) 28 00 03 cd 86 00 00 00 08 00
[Sun Apr 25 15:23:26 2021] md/raid1:md0: sda1: rescheduling sector 25036800
[Sun Apr 25 15:23:26 2021] md/raid1:md1: sda2: rescheduling sector 14939648
[Sun Apr 25 15:23:26 2021] md/raid1:md1: sda2: rescheduling sector 7819648
[Sun Apr 25 15:23:26 2021] md/raid1:md0: redirecting sector 25036800 to other mirror: sdd1
[Sun Apr 25 15:23:26 2021] md/raid1:md0: sda1: rescheduling sector 25036944
[Sun Apr 25 15:23:26 2021] md: super_written gets error=10
[Sun Apr 25 15:23:26 2021] md/raid1:md0: Disk failure on sda1, disabling device.
md/raid1:md0: Operation continuing on 1 devices.
[Sun Apr 25 15:23:26 2021] md: super_written gets error=10
[Sun Apr 25 15:23:26 2021] md/raid1:md1: redirecting sector 22630208 to other mirror: sdd2
[Sun Apr 25 15:23:26 2021] md/raid1:md0: redirecting sector 25036944 to other mirror: sdd1
[Sun Apr 25 15:23:26 2021] md/raid1:md0: redirecting sector 3173568 to other mirror: sdd1
[Sun Apr 25 15:23:26 2021] md/raid1:md0: redirecting sector 3173592 to other mirror: sdd1
[Sun Apr 25 15:23:26 2021] md/raid1:md0: redirecting sector 3173608 to other mirror: sdd1
[Sun Apr 25 15:23:26 2021] md/raid1:md0: redirecting sector 3173656 to other mirror: sdd1
[Sun Apr 25 15:23:26 2021] md/raid1:md1: redirecting sector 17850664 to other mirror: sdd2
[Sun Apr 25 15:23:26 2021] md/raid1:md1: redirecting sector 13588856 to other mirror: sdd2
[Sun Apr 25 15:23:26 2021] md/raid1:md1: redirecting sector 14863896 to other mirror: sdd2
[Sun Apr 25 15:49:05 2021] scsi_io_completion_action: 36 callbacks suppressed
[Sun Apr 25 15:49:05 2021] sd 0:0:0:0: [sda] tag#27 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Sun Apr 25 15:49:05 2021] sd 0:0:0:0: [sda] tag#27 CDB: ATA command pass through(16) 85 06 2c 00 00 00 00 00 00 00 00 00 00 00 e5 00
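At this point both RAID-1 arrays kept running on the remaining mirror (sdd). The quickest way to confirm that is a look at /proc/mdstat; as a rough sketch (assuming the usual mdstat layout with a "[configured/active]" status block such as "[2/1]"), a few lines of Python can flag degraded arrays:

#!/usr/bin/env python3
# Sketch: flag degraded md arrays by parsing /proc/mdstat.
# Assumes the common mdstat format where each array stanza contains a
# "[configured/active]" block such as "[2/1]".
import re

def degraded_arrays(mdstat_path="/proc/mdstat"):
    with open(mdstat_path) as f:
        content = f.read()
    degraded = []
    for match in re.finditer(r"^(md\d+) :.*?\[(\d+)/(\d+)\]", content,
                             re.MULTILINE | re.DOTALL):
        name = match.group(1)
        configured, active = int(match.group(2)), int(match.group(3))
        if active < configured:
            degraded.append((name, active, configured))
    return degraded

if __name__ == "__main__":
    for name, active, configured in degraded_arrays():
        print(f"{name}: only {active} of {configured} devices active")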
This can also be seen in the graphical statistics: Power_On_Hours increased right up to the moment the drive failed:
This can be interpreted as good news - the SSD lived on for roughly another third of its previous lifetime!
Of course the question arises whether more defective sectors were detected and reallocated on this SSD. Interestingly, the answer is no: the Reallocated_Event_Count attribute remained at 12:
What about the lifetime? At what "Percent_Lifetime_Remain" value was the SSD when it finally died? The answer is 109!
So if the Percent_Lifetime_Remain value of your SSD (a Crucial MX, anyway) reaches 100, you definitely should not waste any time getting a replacement.
Another important attribute is "Total_LBAs_Written", which is the most relevant attribute for warranty calculations:
The counter stopped at 34.129 billion. Using the awesome SSD Total Bytes Written (TBW) Calculator from Florian Grehl, this results in roughly 15.89 TB written to this SSD:
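The conversion is easy to verify by hand as well - a quick back-of-the-envelope check assuming the usual 512-byte LBA size (which is what such calculators use):

# Back-of-the-envelope TBW check, assuming 512-byte LBAs.
total_lbas_written = 34_129_000_000      # raw Total_LBAs_Written (rounded)
bytes_written = total_lbas_written * 512

print(f"{bytes_written / 1000**4:.2f} TB  (decimal)")  # ~17.47 TB
print(f"{bytes_written / 1024**4:.2f} TiB (binary)")   # ~15.89 TiB

The 15.89 figure corresponds to binary TiB; in decimal terabytes it is roughly 17.5 TB - either way far below the drive's endurance rating.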
What does Crucial's warranty say about the Total Bytes Written (TBW) value? From the MX500 datasheet:
Endurance
250GB drive: 100TB Total Bytes Written (TBW), equal to 54GB per day for 5 years
500GB drive: 180TB Total Bytes Written (TBW), equal to 98GB per day for 5 years
1TB drive: 360TB Total Bytes Written (TBW), equal to 197GB per day for 5 years
2TB drive: 700TB Total Bytes Written (TBW), equal to 383GB per day for 5 years
Warranty
Limited five-year warranty
For this 1 TB drive with its 360 TB endurance rating, we're way below that figure. And with the SSD purchased in 2018, we're also still within the five years of usage. However, the Percent_Lifetime_Remain attribute clearly showed a value above 100% usage. This value is calculated internally by the SSD's firmware, and I'm not sure which other attributes it is based on.
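To put the numbers into relation (using the roughly 17.5 decimal TB calculated above and the 360 TB rating from the datasheet), the comparison works out like this:

# Rough comparison of actual writes vs. the datasheet endurance rating.
endurance_tbw = 360        # TB, datasheet value for the 1 TB MX500
written_tb = 17.47         # decimal TB written (see calculation above)
warranty_years = 5

print(f"Endurance used: {written_tb / endurance_tbw:.1%}")  # ~4.9%
print(f"Daily write budget: ~{endurance_tbw * 1000 / (warranty_years * 365):.0f} GB/day")  # ~197 GB/day

So only about 5% of the rated write endurance was used, and the datasheet's 197 GB/day figure for the 1 TB model checks out.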
Crucial MX SSDs expose the Percent_Lifetime_Remain SMART attribute (I haven't seen this attribute on other SSD models). It seems to be a helpful indicator of when to buy a replacement drive. Besides that, I'm still puzzled why the drive was marked as failed by the hardware raid controller at hour 8650.