How to identify and replace a failing drive behind Fusion-MPT SAS-2 controller on IBM System X server


Replacing a physical drive connected to a hardware RAID controller is usually one of the easiest tasks: take out the failing or failed drive, insert the replacement, and any decent RAID controller should automatically start the rebuild.

But not all hardware RAID controllers can do this online using the "hot-swap" method. This article is about a head-shaking experience in which a supposedly simple task turned into a time-consuming search for answers.

Failing drive detected

Monitoring (Icinga 2 using check_raid monitoring plugin) detected a problem in the raid configuration of an old IBM System x3650 M4 server:

Additional Info: CRITICAL: megacli:[Volumes(1): DISK0.0:Optimal,WriteCache:DISABLED; Devices(12): 11,08,01,03,09,10,04,06,12,07,02,05=Online]; sas2ircu:[ctrl #0: 1 Vols: Optimal: 0 Drives: SAS2IRCU Unknown exit::]

This alert alone does not indicate a failing drive; it does show, however, that the sas2ircu command did not return any valid information.

Manual investigation on this server showed that the kernel had logged a lot of I/O errors pointing to a failing /dev/sdb drive.

Oct 21 06:42:41 linux kernel: [29525867.713971] sd 1:1:0:0: [sdb] tag#37 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Oct 21 06:42:41 linux kernel: [29525867.713987] sd 1:1:0:0: [sdb] tag#37 CDB: Write(10) 2a 00 14 43 70 00 00 08 00 00
Oct 21 06:42:41 linux kernel: [29525867.713991] blk_update_request: I/O error, dev sdb, sector 339963904
Oct 21 06:42:41 linux kernel: [29525867.715656] EXT4-fs warning (device dm-4): ext4_end_bio:330: I/O error -5 writing to inode 2884588 (offset 381681664 size 4194304 starting block 11247104)
Oct 21 06:42:41 linux kernel: [29525867.715908] sd 1:1:0:0: [sdb] tag#10 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Oct 21 06:42:41 linux kernel: [29525867.715911] sd 1:1:0:0: [sdb] tag#10 CDB: Write(10) 2a 00 14 43 78 00 00 08 00 00
Oct 21 06:42:41 linux kernel: [29525867.715914] blk_update_request: I/O error, dev sdb, sector 339965952
Oct 21 06:42:41 linux kernel: [29525867.717595] EXT4-fs warning (device dm-4): ext4_end_bio:330: I/O error -5 writing to inode 2884588 (offset 381681664 size 4194304 starting block 11247360)
Oct 21 06:43:25 linux kernel: [29525911.525681] sd 1:1:0:0: [sdb] tag#23 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Oct 21 06:43:25 linux kernel: [29525911.525700] sd 1:1:0:0: [sdb] tag#23 CDB: Write(10) 2a 00 14 43 b8 00 00 08 00 00
Oct 21 06:43:25 linux kernel: [29525911.525703] blk_update_request: I/O error, dev sdb, sector 339982336
Oct 21 06:43:25 linux kernel: [29525911.529165] Buffer I/O error on device dm-4, logical block 11249409
Oct 21 06:43:25 linux kernel: [29525911.532704] Buffer I/O error on device dm-4, logical block 11249411
Oct 21 06:43:25 linux kernel: [29525911.534323] Buffer I/O error on device dm-4, logical block 11249412
Oct 21 06:43:25 linux kernel: [29525911.543279] sd 1:1:0:0: [sdb] tag#25 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Oct 21 06:43:25 linux kernel: [29525911.543287] sd 1:1:0:0: [sdb] tag#25 CDB: Write(10) 2a 00 14 44 70 00 00 08 00 00
Oct 21 06:43:25 linux kernel: [29525911.548580] sd 1:1:0:0: [sdb] tag#21 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Oct 21 06:43:25 linux kernel: [29525911.548584] blk_update_request: I/O error, dev sdb, sector 339998720
Oct 21 06:43:25 linux kernel: [29525911.550341] sd 1:1:0:0: [sdb] tag#17 FAILED Result: hostbyte=DID_RESET driverbyte=DRIVER_OK
Oct 21 06:43:25 linux kernel: [29525911.550345] blk_update_request: I/O error, dev sdb, sector 340033536
Oct 21 06:43:25 linux kernel: [29525911.929647] JBD2: Detected IO errors while flushing file data on dm-4-8
Oct 21 06:43:26 linux kernel: [29525913.126566] JBD2: Detected IO errors while flushing file data on dm-4-8
Oct 21 07:23:07 linux kernel: [29528293.304936] blk_update_request: I/O error, dev sdb, sector 100821248
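
To get a quick overview of which block device is affected, the I/O errors in such a log can be counted per device. The following is only a minimal sketch and was not part of the original troubleshooting; the function name count_io_errors is made up for this example, and it only matches the "I/O error, dev ..." lines shown above:

```shell
# Hypothetical helper: count block-layer I/O errors per device.
# Reads kernel log lines on stdin and prints "<device> <count>" per device.
# Only lines of the form "... I/O error, dev sdb, sector 339963904" are counted;
# "Buffer I/O error on device ..." and EXT4 warnings are deliberately ignored.
count_io_errors() {
  awk '/I\/O error, dev / {
         for (i = 1; i < NF; i++)
           if ($i == "dev") {
             dev = $(i + 1)
             sub(/,$/, "", dev)   # strip the trailing comma after the device name
             n[dev]++
           }
       }
       END { for (d in n) print d, n[d] }' | sort
}
```

For example, "count_io_errors < /var/log/kern.log" would print one line per device, assuming the kernel log is written to that file.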

As /dev/sdb is a logical drive using a RAID-1 configuration on an LSI SAS2004 Fusion-MPT controller, this shouldn't be happening. The operating system shouldn't see I/O errors when one of the physical drives is failing - unless the RAID controller itself has a problem.

By comparing the SMART attributes of both physical drives (using /dev/sg3 and /dev/sg4), the second drive sg4, a Seagate ST2000NM0033, was deemed failing:

Note: Graph created by Icinga monitoring with data from check_smart monitoring plugin, using InfluxDB as data storage and Grafana to display the graphs.

In the past couple of hours, the Command_Timeout attribute sharply increased - pointing to a problem with the physical disk. The start of the increase in Command_Timeout matched the first errors logged in kern.log.

Although the drive was clearly failing, the LSI SAS2004 Fusion-MPT controller still reported it as operational, as the sas2ircu output shows:
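
Without a graph at hand, the same comparison can be done on the command line by pulling a single attribute out of the "smartctl -A" attribute table of each drive. This is a rough sketch under the assumption that the drives report the usual ATA attribute table (raw value in the tenth column); the function name smart_raw_value is made up for this example:

```shell
# Hypothetical helper: print the raw value of one SMART attribute.
# $1    : attribute name as shown by smartctl, e.g. Command_Timeout
# stdin : output of "smartctl -A <device>"
# The raw value starts in column 10 and may consist of several tokens,
# so everything from column 10 to the end of the line is printed.
smart_raw_value() {
  awk -v name="$1" '$2 == name {
    out = $10
    for (i = 11; i <= NF; i++) out = out " " $i
    print out
  }'
}
```

Running "smartctl -A /dev/sg3 | smart_raw_value Command_Timeout" and the same for /dev/sg4 would then show the two values side by side. Depending on the controller, smartctl may need an explicit device type option to reach the drives.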

root@linux:~# sas2ircu 0 display
LSI Corporation SAS2 IR Configuration Utility.
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2009-2013 LSI Corporation. All rights reserved.

Read configuration has been initiated for controller 0
------------------------------------------------------------------------
Controller information
------------------------------------------------------------------------
  Controller type                         : SAS2004
  BIOS version                            : 7.39.00.00
  Firmware version                        : 20.00.05.00
  Channel description                     : 1 Serial Attached SCSI
  Initiator ID                            : 0
  Maximum physical devices                : 125
  Concurrent commands supported           : 1760
  Slot                                    : 1
  Segment                                 : 0
  Bus                                     : 10
  Device                                  : 0
  Function                                : 0
  RAID Support                            : Yes
------------------------------------------------------------------------
IR Volume information
------------------------------------------------------------------------
IR volume 1
  Volume ID                               : 180
  Status of volume                        : Okay (OKY)
  Volume wwid                             : 03d2c452682c3bea
  RAID level                              : RAID1
  Size (in MB)                            : 1906394
  Physical hard disks                     :
  PHY[0] Enclosure#/Slot#                 : 1:13
  PHY[1] Enclosure#/Slot#                 : 1:12
------------------------------------------------------------------------
Physical device information
------------------------------------------------------------------------
Initiator at ID #0

Device is a Hard disk
  Enclosure #                             : 1
  Slot #                                  : 12
  SAS Address                             : 4433221-1-0300-0000
  State                                   : Optimal (OPT)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA     
  Model Number                            : ST2000NM0033
  Firmware Revision                       : BB59
  Serial No                               : redacted
  GUID                                    : redacted
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Device is a Hard disk
  Enclosure #                             : 1
  Slot #                                  : 13
  SAS Address                             : 4433221-1-0200-0000
  State                                   : Optimal (OPT)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA     
  Model Number                            : WD2000FYYZ-23UL
  Firmware Revision                       : WD37
  Serial No                               : redacted
  GUID                                    : redacted
  Protocol                                : SATA
  Drive Type                              : SATA_HDD
------------------------------------------------------------------------
Enclosure information
------------------------------------------------------------------------
  Enclosure#                              : 1
  Logical ID                              : 50050760:43851e74
  Numslots                                : 4
  StartSlot                               : 12
------------------------------------------------------------------------
SAS2IRCU: Command DISPLAY Completed Successfully.
SAS2IRCU: Utility Completed Successfully.

Obviously a replacement drive was ordered and it arrived 2 days later.

Identifying and physically replacing the drive

When the technician was ready to replace the physical drive, the next challenge awaited: which drive is it? This particular server has a total of 14 physical drives: 2 are attached to this LSI MPT RAID controller, the other 12 to a MegaRAID controller. As if this setup wasn't strange enough already, the first 12 physical drives are accessible from the front, while the last two "drive slots" are only accessible from the back of the server.

Even though this server has the IBM Integrated Management Module (IMM) enabled, its local storage list didn't really help to identify the bay in which the drive was located.

However the output from sas2ircu above helps a lot by adding the slot number to the output. In this case we now have the failing drive (model ST2000NM0033) and the physical slot:

Device is a Hard disk
  Enclosure #                             : 1
  Slot #                                  : 12
  SAS Address                             : 4433221-1-0300-0000
  State                                   : Optimal (OPT)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA     
  Model Number                            : ST2000NM0033
  Firmware Revision                       : BB59
  Serial No                               : redacted
  GUID                                    : redacted
  Protocol                                : SATA
  Drive Type                              : SATA_HDD
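
On a server with more drives behind the controller, picking the right block out of the full output by eye gets tedious. As a minimal sketch (not something used in the original incident), the "Device is a Hard disk" blocks can be condensed to one line per drive; the function name list_sas2ircu_disks is made up, and the parsing assumes the field order shown in the output above:

```shell
# Hypothetical helper: condense "sas2ircu <ctrl> display" output.
# Reads the display output on stdin and prints one line per hard disk:
#   <enclosure>:<slot> <model> <state>
# Assumes "Model Number" follows Enclosure, Slot and State within each block,
# as in the output shown above.
list_sas2ircu_disks() {
  awk -F' *: *' '
    # Return the value part of the current line without trailing blanks.
    function val(   v) { v = $2; sub(/ +$/, "", v); return v }
    /^Device is a Hard disk/       { in_disk = 1; next }
    in_disk && /^ *Enclosure #/    { encl = val() }
    in_disk && /^ *Slot #/         { slot = val() }
    in_disk && /^ *State/          { state = val() }
    in_disk && /^ *Model Number/   { print encl ":" slot, val(), state; in_disk = 0 }
  '
}
```

With the output above, "sas2ircu 0 display | list_sas2ircu_disks" should list the ST2000NM0033 at 1:12 and the WD2000FYYZ-23UL at 1:13.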

But then I got another call from the technician: There are no slot/bay numbers on this server. Back to square one, trying to identify the failing drive.

Luckily the sas2ircu command is able to "locate" a physical drive. This basically means manually turning on the drive's LED:

root@linux:~# sas2ircu 0 locate 1:12 on
LSI Corporation SAS2 IR Configuration Utility.
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2009-2013 LSI Corporation. All rights reserved.

SAS2IRCU: LOCATE command completed successfully.

Shortly after that I received a picture from the technician: Thanks to the turned on LED he was able to identify the drive.
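
The locate LED stays on until it is switched off again, which uses the same command with "off" instead of "on". To avoid typos in the Enclosure:Slot target while a technician is waiting on the phone, the call can be wrapped in a small function. This is just a sketch: the function name locate_drive and the SAS2IRCU variable (used to swap the binary for a dry run) are assumptions of this example, not features of sas2ircu:

```shell
# Hypothetical wrapper around "sas2ircu <ctrl> locate <Encl:Slot> <on|off>".
# $1: controller number, $2: Enclosure:Slot (e.g. 1:12), $3: on or off
# SAS2IRCU (assumption of this sketch) overrides the binary, e.g. SAS2IRCU=echo
# prints the arguments instead of touching the controller.
locate_drive() {
  ctrl=$1 target=$2 action=$3
  # Loose sanity check: the target has to look like <digit...>:<digit...>
  case $target in
    [0-9]*:[0-9]*) ;;
    *) echo "target must be Enclosure:Slot, e.g. 1:12" >&2; return 2 ;;
  esac
  case $action in
    on|off) ;;
    *) echo "action must be on or off" >&2; return 2 ;;
  esac
  "${SAS2IRCU:-sas2ircu}" "$ctrl" locate "$target" "$action"
}
```

Once the drive has been found, "locate_drive 0 1:12 off" turns the LED off again.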

Finally, the drive got replaced!

MPT2SAS RAID controller still sees the old drive

Right after the drive was replaced, smartctl was used (once again) on /dev/sg4 to see the new drive. And yes, another model (ST2000NM0008) could be found:

root@bd-radoi01-p:~# smartctl -i /dev/sg4  
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-142-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST2000NM0008         81Y9795 81Y3864LEN
Serial Number:    redacted
LU WWN Device Id: 5 000c50 0c2b9687d
Firmware Version: LJ92
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Oct 26 16:27:54 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Great, the RAID rebuild had probably already started. But a verification with sas2ircu ended in disappointment: although the physical drive was changed (and /dev/sg4 showed the new drive information), the RAID controller still listed the old model (ST2000NM0033) as the physical drive in slot 12:

------------------------------------------------------------------------
Physical device information
------------------------------------------------------------------------
Initiator at ID #0

Device is a Hard disk
  Enclosure #                             : 1
  Slot #                                  : 12
  SAS Address                             : 4433221-1-0300-0000
  State                                   : Optimal (OPT)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA     
  Model Number                            : ST2000NM0033    
  Firmware Revision                       : BB59
  Serial No                               : redacted
  GUID                                    : redacted
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Device is a Hard disk
  Enclosure #                             : 1
  Slot #                                  : 13
  SAS Address                             : 4433221-1-0200-0000
  State                                   : Optimal (OPT)
  Size (in MB)/(in sectors)               : 1907729/3907029167
  Manufacturer                            : ATA     
  Model Number                            : WD2000FYYZ-23UL
  Firmware Revision                       : WD37
  Serial No                               : redacted
  GUID                                    : redacted
  Protocol                                : SATA
  Drive Type                              : SATA_HDD

Different attempts to force a device rescan on the scsi_host did not help:

root@linux:~# lspci | egrep -i "(SAS|SCSI|SATA)"
02:00.0 PCI bridge: Renesas Technology Corp. SH7757 PCIe Switch [PS]
03:00.0 PCI bridge: Renesas Technology Corp. SH7757 PCIe Switch [PS]
03:01.0 PCI bridge: Renesas Technology Corp. SH7757 PCIe Switch [PS]
04:00.0 PCI bridge: Renesas Technology Corp. SH7757 PCIe-PCI Bridge [PPB]
0a:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2004 PCI-Express Fusion-MPT SAS-2 [Spitfire] (rev 03)
14:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)

root@linux:~# ll /sys/class/scsi_device/
total 0
lrwxrwxrwx 1 root root 0 Oct 21 08:40 0:0:0:0 -> ../../devices/pci0000:00/0000:00:03.0/0000:14:00.0/host0/target0:0:0/0:0:0:0/scsi_device/0:0:0:0
lrwxrwxrwx 1 root root 0 Oct 21 08:40 0:2:0:0 -> ../../devices/pci0000:00/0000:00:03.0/0000:14:00.0/host0/target0:2:0/0:2:0:0/scsi_device/0:2:0:0
lrwxrwxrwx 1 root root 0 Oct 21 08:40 1:0:0:0 -> ../../devices/pci0000:00/0000:00:02.0/0000:0a:00.0/host1/port-1:0/end_device-1:0/target1:0:0/1:0:0:0/scsi_device/1:0:0:0
lrwxrwxrwx 1 root root 0 Oct 21 08:40 1:0:1:0 -> ../../devices/pci0000:00/0000:00:02.0/0000:0a:00.0/host1/port-1:1/end_device-1:1/target1:0:1/1:0:1:0/scsi_device/1:0:1:0
lrwxrwxrwx 1 root root 0 Oct 21 08:40 1:1:0:0 -> ../../devices/pci0000:00/0000:00:02.0/0000:0a:00.0/host1/target1:1:0/1:1:0:0/scsi_device/1:1:0:0

root@linux:~# echo "0 0 0" > /sys/class/scsi_host/host1/scan
root@linux:~# echo "- - -" > /sys/class/scsi_host/host1/scan

Even though the correct scsi_host (host1, which sits below PCI address 0a:00.0 of the SAS2004 controller) could be identified, sas2ircu's output still showed the old drive.
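
Matching the PCI address from lspci against the sysfs symlinks can also be scripted. The following sketch assumes PCI domain 0000, as in the paths above, and uses a made-up function name (hosts_for_pci_addr); the optional second argument exists only so the lookup can be pointed at a different directory:

```shell
# Hypothetical helper: find the SCSI host(s) that belong to a PCI device.
# $1: PCI address as printed by lspci, e.g. 0a:00.0 (PCI domain 0000 is assumed)
# $2: directory holding the hostN symlinks (default: /sys/class/scsi_host)
hosts_for_pci_addr() {
  dir=${2:-/sys/class/scsi_host}
  for h in "$dir"/host*; do
    [ -e "$h" ] || continue
    # The resolved path contains the PCI address of the owning controller.
    case $(readlink -f "$h") in
      */0000:"$1"/*) basename "$h" ;;
    esac
  done
}
```

On this server, "hosts_for_pci_addr 0a:00.0" should print host1, which is the host the scan commands above were written to.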

Reboot (this hurts every Linux engineer)

In the end, it really took a reboot before the MPT2SAS RAID controller was able to detect the new physical drive. Not cool.

After booting into the BIOS, I went to System Settings -> Storage, selected LSI SAS2 MPT Controller SAS2004, then LSI SAS2 MPT Controller Version 7.27.04.00 and finally Physical Disk Management -> View Physical Disk Properties. Here the displayed physical disk could be switched to 0:1:12 (Enclosure 1, Slot 12); the new drive model finally showed up, with its state set to Rebuilding:

IBM System X3650 M4 BIOS showing physical drive behind MPT raid controller

Once booted back into the OS (Ubuntu 16.04), sas2ircu finally showed the new drive and the current rebuilding state, too.

