Monitor dual storage (raid) controller on IBM x3650 M4

Written by - 0 comments

Published on - Listed in Hardware Linux Monitoring


I wanted to monitor the current RAID status on an IBM x3650 M4 server, simply by using check_raid. I've been using this plugin for years and it supports most software and hardware raid controllers. I've never had any problems with it (once I installed the required cli tools for each hardware controller) - until today.

Due to a very strange hardware setup, inherited from an ex-colleague, the server turns out to have two different RAID controllers active. 12 physical drives are attached to one controller, 2 physical drives to another.

Once I installed the megacli command (from http://hwraid.le-vert.net/), the plugin correctly identified the physical drives behind /dev/sda:

# /usr/lib/nagios/plugins/check_raid -l
megacli
1 active plugins

# /usr/lib/nagios/plugins/check_raid
WARNING: megacli:[Volumes(1): DISK0.0:Optimal,WriteCache:DISABLED; Devices(12): 11,08,01,03,09,10,04,06,12,07,02,05=Online]

To disable the warning on the disabled WriteCache:

# /usr/lib/nagios/plugins/check_raid --cache-fail=OK
OK: megacli:[Volumes(1): DISK0.0:Optimal,WriteCache:DISABLED; Devices(12): 11,08,01,03,09,10,04,06,12,07,02,05=Online]

But where are the other two physical drives? From my experience with hardware raid controllers I was pretty sure that megacli is able to detect multiple controllers and is able to retrieve the drive information from all controllers.
A manual verification using megacli still only returned 12 drives:

# megacli -CfgDsply -aall |grep Physical
Physical Disk Information:
Physical Disk: 0
Physical Sector Size:  512
Physical Disk: 1
Physical Sector Size:  512
Physical Disk: 2
Physical Sector Size:  512
Physical Disk: 3
Physical Sector Size:  512
Physical Disk: 4
Physical Sector Size:  512
Physical Disk: 5
Physical Sector Size:  512
Physical Disk Information:
Physical Disk: 0
Physical Sector Size:  512
Physical Disk: 1
Physical Sector Size:  512
Physical Disk: 2
Physical Sector Size:  512
Physical Disk: 3
Physical Sector Size:  512
Physical Disk: 4
Physical Sector Size:  512
Physical Disk: 5
Physical Sector Size:  512

Thankfully a colleague, who recently was working on that particular server, made a screenshot of the storage controller menu during the boot process:

As it turns out, there are two different storage controllers built into that server. One is a MegaRaid controller (ServeRAID M5210) and one is a MPT controller:

# lspci | grep -i LSI
0a:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2004 PCI-Express Fusion-MPT SAS-2 [Spitfire] (rev 03)
14:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)

No wonder megacli wasn't able to find the drives!

I tried again with "mpt-status" (http://hwraid.le-vert.net/wiki/LSIFusionMPT), but this didn't show any config:

# apt-get install mpt-status

/usr/sbin/mpt-status -p
Checking for SCSI ID:0
ioctl: No such device

I removed mpt-status again and went on to try the command "sas2ircu" for newer MPT cards. Finally I got some output:

# apt-get install sas2ircu

# sas2ircu LIST
LSI Corporation SAS2 IR Configuration Utility.
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2009-2013 LSI Corporation. All rights reserved.


         Adapter      Vendor  Device                       SubSys  SubSys
 Index    Type          ID      ID    Pci Address          Ven ID  Dev ID
 -----  ------------  ------  ------  -----------------    ------  ------
   0     SAS2004     1000h    70h   00h:0ah:00h:00h      1014h   040eh
SAS2IRCU: Utility Completed Successfully.

And, hurray, check_raid was now able to read the infos from both controllers:

# /usr/lib/nagios/plugins/check_raid -l
megacli
sas2ircu
2 active plugins

# /usr/lib/nagios/plugins/check_raid --cache-fail=OK
OK: megacli:[Volumes(1): DISK0.0:Optimal,WriteCache:DISABLED; Devices(12): 11,08,01,03,09,10,04,06,12,07,02,05=Online]; sas2ircu:[ctrl #0: 1 Vols: Optimal: 2 Drives: Optimal (OPT)::]

Update November 15th 2018:

This article helped me again today when the check_raid plugin alarmed of a failed drive:

# sas2ircu-status
-- Controller informations --
-- ID | Model
c0 | SAS2004

-- Arrays informations --
-- ID | Type | Size | Status
c0u0 | RAID1 | 1906G | Degraded (DGD)

-- Disks informations
-- ID | Model | Status
c0u0p0 | WD2000FYYZ-23UL (WDWMC1P051XXXX) | Failed (FLD)
c0u0p1 | WD2000FYYZ-23UL (WDWMC1P0D3XXXX) | Optimal (OPT)

There is at least one disk/array in a NOT OPTIMAL state.


Add a comment

Show form to leave a comment

Comments (newest first)

No comments yet.

RSS feed

Blog Tags:

  AWS   Android   Ansible   Apache   Apple   Atlassian   BSD   Backup   Bash   Bluecoat   CMS   Chef   Cloud   Coding   Consul   Containers   CouchDB   DB   DNS   Database   Databases   Docker   ELK   Elasticsearch   Filebeat   FreeBSD   Galera   Git   GlusterFS   Grafana   Graphics   HAProxy   HTML   Hacks   Hardware   Icinga   Influx   Internet   Java   KVM   Kibana   Kodi   Kubernetes   LVM   LXC   Linux   Logstash   Mac   Macintosh   Mail   MariaDB   Minio   MongoDB   Monitoring   Multimedia   MySQL   NFS   Nagios   Network   Nginx   OSSEC   OTRS   Office   OpenSearch   PGSQL   PHP   Perl   Personal   PostgreSQL   Postgres   PowerDNS   Proxmox   Proxy   Python   Rancher   Rant   Redis   Roundcube   SSL   Samba   Seafile   Security   Shell   SmartOS   Solaris   Surveillance   Systemd   TLS   Tomcat   Ubuntu   Unix   VMWare   VMware   Varnish   Virtualization   Windows   Wireless   Wordpress   Wyse   ZFS   Zoneminder