Replacing physical drives connected to a hardware raid controller is usually the easiest thing to do. Just take out the failing or failed drive, insert the replacement drive and an intelligent/decent raid controller should automatically start the rebuild.
But not all hardware raid controllers are capable of doing this on-line using the "hot-swap" method. This article is about the head-shaking experience of a supposedly simple task turning into a time-intensive search for answers.
Monitoring (Icinga 2 using check_raid monitoring plugin) detected a problem in the raid configuration of an old IBM System x3650 M4 server:
Additional Info: CRITICAL: megacli:[Volumes(1): DISK0.0:Optimal,WriteCache:DISABLED; Devices(12): 11,08,01,03,09,10,04,06,12,07,02,05=Online]; sas2ircu:[ctrl #0: 1 Vols: Optimal: 0 Drives: SAS2IRCU Unknown exit::]
This alert alone does not indicate a failing drive. It does show, however, that the sas2ircu command did not return any valid information.
Manual investigation on this particular server showed that the kernel had logged a large number of I/O errors, pointing to a failing /dev/sdb drive.
Oct 21 06:42:41 linux kernel: [29525867.713971] sd 1:1:0:0: [sdb] tag#37 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Oct 21 06:42:41 linux kernel: [29525867.713987] sd 1:1:0:0: [sdb] tag#37 CDB: Write(10) 2a 00 14 43 70 00 00 08 00 00
Oct 21 06:42:41 linux kernel: [29525867.713991] blk_update_request: I/O error, dev sdb, sector 339963904
Oct 21 06:42:41 linux kernel: [29525867.715656] EXT4-fs warning (device dm-4): ext4_end_bio:330: I/O error -5 writing to inode 2884588 (offset 381681664 size 4194304 starting block 11247104)
Oct 21 06:42:41 linux kernel: [29525867.715908] sd 1:1:0:0: [sdb] tag#10 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Oct 21 06:42:41 linux kernel: [29525867.715911] sd 1:1:0:0: [sdb] tag#10 CDB: Write(10) 2a 00 14 43 78 00 00 08 00 00
Oct 21 06:42:41 linux kernel: [29525867.715914] blk_update_request: I/O error, dev sdb, sector 339965952
Oct 21 06:42:41 linux kernel: [29525867.717595] EXT4-fs warning (device dm-4): ext4_end_bio:330: I/O error -5 writing to inode 2884588 (offset 381681664 size 4194304 starting block 11247360)
Oct 21 06:43:25 linux kernel: [29525911.525681] sd 1:1:0:0: [sdb] tag#23 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Oct 21 06:43:25 linux kernel: [29525911.525700] sd 1:1:0:0: [sdb] tag#23 CDB: Write(10) 2a 00 14 43 b8 00 00 08 00 00
Oct 21 06:43:25 linux kernel: [29525911.525703] blk_update_request: I/O error, dev sdb, sector 339982336
Oct 21 06:43:25 linux kernel: [29525911.529165] Buffer I/O error on device dm-4, logical block 11249409
Oct 21 06:43:25 linux kernel: [29525911.532704] Buffer I/O error on device dm-4, logical block 11249411
Oct 21 06:43:25 linux kernel: [29525911.534323] Buffer I/O error on device dm-4, logical block 11249412
Oct 21 06:43:25 linux kernel: [29525911.543279] sd 1:1:0:0: [sdb] tag#25 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Oct 21 06:43:25 linux kernel: [29525911.543287] sd 1:1:0:0: [sdb] tag#25 CDB: Write(10) 2a 00 14 44 70 00 00 08 00 00
Oct 21 06:43:25 linux kernel: [29525911.548580] sd 1:1:0:0: [sdb] tag#21 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Oct 21 06:43:25 linux kernel: [29525911.548584] blk_update_request: I/O error, dev sdb, sector 339998720
Oct 21 06:43:25 linux kernel: [29525911.550341] sd 1:1:0:0: [sdb] tag#17 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Oct 21 06:43:25 linux kernel: [29525911.550345] blk_update_request: I/O error, dev sdb, sector 340033536
Oct 21 06:43:25 linux kernel: [29525911.929647] JBD2: Detected IO errors while flushing file data on dm-4-8
Oct 21 06:43:26 linux kernel: [29525913.126566] JBD2: Detected IO errors while flushing file data on dm-4-8
Oct 21 07:23:07 linux kernel: [29528293.304936] blk_update_request: I/O error, dev sdb, sector 100821248
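To get a quick overview of which block device is affected, the block layer errors from the log above can be counted per device. The following is only a minimal sketch: the function name count_io_errors is made up for this example, and the log path in the usage comment is an assumption that differs between distributions.

```shell
# Count "I/O error, dev X" lines per block device in a kernel log on stdin.
# Only block layer errors are counted; "Buffer I/O error" and EXT4 lines
# use a different wording and are ignored.
count_io_errors() {
    awk '
        /I\/O error, dev / {
            for (i = 1; i < NF; i++) {
                if ($i == "dev") {
                    dev = $(i + 1)
                    sub(/,$/, "", dev)   # strip the trailing comma ("sdb," -> "sdb")
                    count[dev]++
                }
            }
        }
        END {
            for (dev in count) print dev, count[dev]
        }
    '
}

# Usage (assumed log location):
#   count_io_errors < /var/log/kern.log
```

A device with a steadily growing count is a good candidate for a closer look with SMART.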
As /dev/sdb is a logical drive using a RAID-1 configuration on an LSI SAS2004 Fusion-MPT controller, this shouldn't be happening. The operating system shouldn't see I/O errors when one of the physical drives is failing, unless the raid controller itself has a problem.
By comparing the SMART attributes of both physical drives (using /dev/sg3 and /dev/sg4), the second drive sg4, a Seagate ST2000NM0033, was deemed failing:
Note: Graph created by Icinga monitoring with data from check_smart monitoring plugin, using InfluxDB as data storage and Grafana to display the graphs.
In the past couple of hours, the Command_Timeout attribute sharply increased - pointing to a problem of the physical disk. The beginning of the increasing Command_Timeout values matched the first logged errors in kern.log.
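Without a graph at hand, the same comparison can be done directly on the command line. This is a rough sketch, assuming the classic ATA attribute table printed by smartctl -A (attribute name in column 2, raw value starting in column 10); the helper name smart_raw_value is made up for this example.

```shell
# Print the raw value of one SMART attribute from `smartctl -A` output on stdin.
# Assumption: ATA attribute table layout, where column 2 is ATTRIBUTE_NAME
# and column 10 is the (first word of the) RAW_VALUE.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}

# Usage on the two RAID members from this article:
#   for dev in /dev/sg3 /dev/sg4; do
#       printf '%s ' "$dev"
#       smartctl -A "$dev" | smart_raw_value Command_Timeout
#   done
```

Note that some drives encode several counters in the raw value of Command_Timeout, so the trend over time says more than a single absolute number.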
Although the drive was clearly failing, the LSI SAS2004 Fusion-MPT controller still reported it as operational, as the sas2ircu output shows:
root@linux:~# sas2ircu 0 display
LSI Corporation SAS2 IR Configuration Utility.
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2009-2013 LSI Corporation. All rights reserved.
Read configuration has been initiated for controller 0
------------------------------------------------------------------------
Controller information
------------------------------------------------------------------------
Controller type : SAS2004
BIOS version : 7.39.00.00
Firmware version : 20.00.05.00
Channel description : 1 Serial Attached SCSI
Initiator ID : 0
Maximum physical devices : 125
Concurrent commands supported : 1760
Slot : 1
Segment : 0
Bus : 10
Device : 0
Function : 0
RAID Support : Yes
------------------------------------------------------------------------
IR Volume information
------------------------------------------------------------------------
IR volume 1
Volume ID : 180
Status of volume : Okay (OKY)
Volume wwid : 03d2c452682c3bea
RAID level : RAID1
Size (in MB) : 1906394
Physical hard disks :
PHY[0] Enclosure#/Slot# : 1:13
PHY[1] Enclosure#/Slot# : 1:12
------------------------------------------------------------------------
Physical device information
------------------------------------------------------------------------
Initiator at ID #0
Device is a Hard disk
Enclosure # : 1
Slot # : 12
SAS Address : 4433221-1-0300-0000
State : Optimal (OPT)
Size (in MB)/(in sectors) : 1907729/3907029167
Manufacturer : ATA
Model Number : ST2000NM0033
Firmware Revision : BB59
Serial No : redacted
GUID : redacted
Protocol : SATA
Drive Type : SATA_HDD
Device is a Hard disk
Enclosure # : 1
Slot # : 13
SAS Address : 4433221-1-0200-0000
State : Optimal (OPT)
Size (in MB)/(in sectors) : 1907729/3907029167
Manufacturer : ATA
Model Number : WD2000FYYZ-23UL
Firmware Revision : WD37
Serial No : redacted
GUID : redacted
Protocol : SATA
Drive Type : SATA_HDD
------------------------------------------------------------------------
Enclosure information
------------------------------------------------------------------------
Enclosure# : 1
Logical ID : 50050760:43851e74
Numslots : 4
StartSlot : 12
------------------------------------------------------------------------
SAS2IRCU: Command DISPLAY Completed Successfully.
SAS2IRCU: Utility Completed Successfully.
Obviously a replacement drive was ordered and it arrived 2 days later.
When the technician was ready to replace the physical drive, the next challenge awaited: Which drive is it? This particular server has a total of 14 physical drives. 2 drives are attached to this LSI MPT raid controller, the other 12 drives are attached to a MegaRAID controller. As if this wasn't a strange enough setup already, the first 12 physical drives are accessible from the front, while the last two "drive slots" are only accessible from the back of the server.
Even though this server has the IBM Integrated Management Module (IMM) enabled, the local storage list didn't really help to see in which bay the drive would be located.
However the output from sas2ircu above helps a lot by adding the slot number to the output. In this case we now have the failing drive (model ST2000NM0033) and the physical slot:
Device is a Hard disk
Enclosure # : 1
Slot # : 12
SAS Address : 4433221-1-0300-0000
State : Optimal (OPT)
Size (in MB)/(in sectors) : 1907729/3907029167
Manufacturer : ATA
Model Number : ST2000NM0033
Firmware Revision : BB59
Serial No : redacted
GUID : redacted
Protocol : SATA
Drive Type : SATA_HDD
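Picking the slot out of the full display output by eye works for two drives, but a small filter makes the mapping explicit. The following is a sketch only, assuming the field names shown in the output above ("Enclosure #", "Slot #", "State", "Model Number") and that State is printed before Model Number; the function name list_sas2ircu_disks is made up.

```shell
# Print "ENCLOSURE:SLOT MODEL STATE" for every hard disk found in
# `sas2ircu <controller> display` output read from stdin.
list_sas2ircu_disks() {
    awk -F' *: *' '
        # A new device block starts; only hard disks are of interest.
        /Device is a/ { in_disk = ($0 ~ /Hard disk/); next }
        # A dashed separator ends the physical device section.
        /^-+$/        { in_disk = 0 }
        !in_disk      { next }
        {
            key = $1; sub(/^[ \t]+/, "", key)   # drop indentation
            val = $2; sub(/[ \t]+$/, "", val)   # drop padding
        }
        key == "Enclosure #"  { encl = val }
        key == "Slot #"       { slot = val }
        key == "State"        { state = val }
        key == "Model Number" { print encl ":" slot, val, state }
    '
}

# Usage:
#   sas2ircu 0 display | list_sas2ircu_disks
```

For the output above this would list the ST2000NM0033 at 1:12 and the WD2000FYYZ-23UL at 1:13, which is exactly the Enclosure:Slot notation other sas2ircu commands expect.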
But then I got another call from the technician: There are no slot/bay numbers on this server. Back to square one, trying to identify the failing drive.
Luckily the sas2ircu command is able to "locate" a physical drive. This basically means to manually turn on the drive's LED:
root@linux:~# sas2ircu 0 locate 1:12 on
LSI Corporation SAS2 IR Configuration Utility.
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2009-2013 LSI Corporation. All rights reserved.
SAS2IRCU: LOCATE command completed successfully.
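The locate LED stays on until it is switched off again, so it makes sense to wrap both directions in a tiny helper. This is a sketch under the assumption that "off" is accepted in the same place as the "on" used above; the function name blink_drive and the SAS2IRCU override variable are made up for this example.

```shell
# Path to the utility; can be overridden (e.g. with a stub) for a dry run.
SAS2IRCU=${SAS2IRCU:-sas2ircu}

# blink_drive CONTROLLER ENCLOSURE:SLOT on|off
# Switches the locate LED of one drive on or off.
blink_drive() {
    case "$3" in
        on|off) ;;
        *) echo "usage: blink_drive CONTROLLER ENCL:SLOT on|off" >&2
           return 2 ;;
    esac
    "$SAS2IRCU" "$1" locate "$2" "$3"
}

# Usage:
#   blink_drive 0 1:12 on     # before the technician pulls the drive
#   blink_drive 0 1:12 off    # once the replacement is in place
```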
Shortly after that I received a picture from the technician: Thanks to the turned on LED he was able to identify the drive.
Finally, the drive got replaced!
Right after the drive was replaced, smartctl was used (once again) on /dev/sg4 to see the new drive. And yes, another model (ST2000NM0008) could be found:
root@bd-radoi01-p:~# smartctl -i /dev/sg4
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-142-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: ST2000NM0008 81Y9795 81Y3864LEN
Serial Number: redacted
LU WWN Device Id: 5 000c50 0c2b9687d
Firmware Version: LJ92
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Oct 26 16:27:54 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Great, the raid rebuild had probably already started. But a verification with sas2ircu turned into a disappointment: Although the physical drive was changed (and /dev/sg4 showed the new drive information), the raid controller still showed the old model (ST2000NM0033) as the physical drive in slot 12:
------------------------------------------------------------------------
Physical device information
------------------------------------------------------------------------
Initiator at ID #0
Device is a Hard disk
Enclosure # : 1
Slot # : 12
SAS Address : 4433221-1-0300-0000
State : Optimal (OPT)
Size (in MB)/(in sectors) : 1907729/3907029167
Manufacturer : ATA
Model Number : ST2000NM0033
Firmware Revision : BB59
Serial No : redacted
GUID : redacted
Protocol : SATA
Drive Type : SATA_HDD
Device is a Hard disk
Enclosure # : 1
Slot # : 13
SAS Address : 4433221-1-0200-0000
State : Optimal (OPT)
Size (in MB)/(in sectors) : 1907729/3907029167
Manufacturer : ATA
Model Number : WD2000FYYZ-23UL
Firmware Revision : WD37
Serial No : redacted
GUID : redacted
Protocol : SATA
Drive Type : SATA_HDD
Different attempts to force a device rescan on the scsi_host did not help:
root@linux:~# lspci | egrep -i "(SAS|SCSI|SATA)"
02:00.0 PCI bridge: Renesas Technology Corp. SH7757 PCIe Switch [PS]
03:00.0 PCI bridge: Renesas Technology Corp. SH7757 PCIe Switch [PS]
03:01.0 PCI bridge: Renesas Technology Corp. SH7757 PCIe Switch [PS]
04:00.0 PCI bridge: Renesas Technology Corp. SH7757 PCIe-PCI Bridge [PPB]
0a:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2004 PCI-Express Fusion-MPT SAS-2 [Spitfire] (rev 03)
14:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)
root@linux:~# ll /sys/class/scsi_device/
total 0
lrwxrwxrwx 1 root root 0 Oct 21 08:40 0:0:0:0 -> ../../devices/pci0000:00/0000:00:03.0/0000:14:00.0/host0/target0:0:0/0:0:0:0/scsi_device/0:0:0:0
lrwxrwxrwx 1 root root 0 Oct 21 08:40 0:2:0:0 -> ../../devices/pci0000:00/0000:00:03.0/0000:14:00.0/host0/target0:2:0/0:2:0:0/scsi_device/0:2:0:0
lrwxrwxrwx 1 root root 0 Oct 21 08:40 1:0:0:0 -> ../../devices/pci0000:00/0000:00:02.0/0000:0a:00.0/host1/port-1:0/end_device-1:0/target1:0:0/1:0:0:0/scsi_device/1:0:0:0
lrwxrwxrwx 1 root root 0 Oct 21 08:40 1:0:1:0 -> ../../devices/pci0000:00/0000:00:02.0/0000:0a:00.0/host1/port-1:1/end_device-1:1/target1:0:1/1:0:1:0/scsi_device/1:0:1:0
lrwxrwxrwx 1 root root 0 Oct 21 08:40 1:1:0:0 -> ../../devices/pci0000:00/0000:00:02.0/0000:0a:00.0/host1/target1:1:0/1:1:0:0/scsi_device/1:1:0:0
root@linux:~# echo "0 0 0" > /sys/class/scsi_host/host1/scan
root@linux:~# echo "- - -" > /sys/class/scsi_host/host1/scan
Even though the correct scsi_host could be identified using the PCI address of the controller (0a:00.0), sas2ircu's output still showed the old drive.
In the end it really needed a reboot before the MPT2SAS raid controller was able to detect the new physical drive. Not cool.
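The mapping from PCI address to scsi_host was done by eye from the symlink targets, but it can also be scripted. The following is a minimal sketch: hosts_for_pci is a made-up helper, and SYSFS_ROOT is a hypothetical override that only exists to try the function against a fake directory tree. As described in this article, the rescan itself did not make this controller pick up the new drive, so this only automates the lookup.

```shell
# Print the SCSI hosts (host0, host1, ...) located below a given PCI address.
# The address may be given with or without the PCI domain prefix
# ("0a:00.0" or "0000:0a:00.0"). SYSFS_ROOT defaults to /sys.
hosts_for_pci() {
    for h in "${SYSFS_ROOT:-/sys}"/class/scsi_host/host*; do
        [ -e "$h" ] || continue
        # Resolve the symlink and check whether the PCI address is part of the path.
        case "$(readlink -f "$h")" in
            *:"$1"/*|*/"$1"/*) basename "$h" ;;
        esac
    done
}

# Usage (wildcard rescan of all channels, targets and LUNs on each matching host):
#   for h in $(hosts_for_pci 0a:00.0); do
#       echo "- - -" > "/sys/class/scsi_host/$h/scan"
#   done
```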
After the reboot, I went into the BIOS -> System Settings -> Storage, selected LSI SAS2 MPT Controller SAS2004, then LSI SAS2 MPT Controller Version 7.27.04.00 and finally Physical Disk Management -> View Physical Disk Properties. Here the physical disk selection could be changed to 0:1:12 (Enclosure 1, Slot 12). The new drive model finally showed up, with its state set to Rebuilding.
Once booted back into the OS (Ubuntu 16.04), sas2ircu finally showed the new drive and the current rebuilding state, too.
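To keep an eye on the volume until the rebuild is done, the volume status line can be pulled out of the display output. This is again only a sketch, assuming the "Volume ID" and "Status of volume" field names shown in the display output earlier in this article; the function name volume_status is made up.

```shell
# Print "VOLUME_ID STATUS" for every IR volume found in
# `sas2ircu <controller> display` output read from stdin.
volume_status() {
    awk -F' *: *' '
        {
            key = $1; sub(/^[ \t]+/, "", key)   # drop indentation
            val = $2; sub(/[ \t]+$/, "", val)   # drop padding
        }
        key == "Volume ID"        { id = val }
        key == "Status of volume" { print id, val }
    '
}

# Usage:
#   sas2ircu 0 display | volume_status
```

Run from cron or a simple loop, this shows when the volume is back to the "Okay (OKY)" status seen before the drive failed.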