I constantly monitor the SMART status of my servers' hard disks. As error rates increase, so does the likelihood that a disk will fail soon. I prefer to replace defective hardware before it actually fails, and with hard disks this is possible.
The following steps explain how to replace an HDD in a software raid under Linux. Of course, these steps also apply to solid state drives (SSD).
Update February 28th 2013: Added commands for GPT disks.
1. Determine the defective or failing HDD -> in my case I already got that information from my monitoring using SMART data: SDB.
If the disk has already failed completely, this is also visible in the output of cat /proc/mdstat.
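Without monitoring in place, the SMART values can be read directly with smartctl from the smartmontools package. The following is only a minimal sketch: the function name smart_raw_values is made up for this example, and the three attributes it picks (reallocated, pending and uncorrectable sectors) are an assumption about which counters are worth watching. It also assumes the ATA attribute table that smartctl -A prints, so NVMe drives are not covered.

```shell
# Print the raw values of a few SMART attributes that often rise on a dying disk.
# Reads the attribute table of "smartctl -A" on stdin.
# Column 2 is the attribute name, column 10 the raw value.
smart_raw_values() {
    awk '$2 == "Reallocated_Sector_Ct" ||
         $2 == "Current_Pending_Sector" ||
         $2 == "Offline_Uncorrectable" { print $2 "=" $10 }'
}

# Usage on a real system (as root, requires smartmontools):
#   smartctl -A /dev/sdb | smart_raw_values
```

Raw values that keep growing between two checks are a stronger warning sign than a single non-zero value.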
2. Get the current Raid layout:
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md4 : active raid1 sda6[0] sdb6[1]
688009088 blocks [2/2] [UU]
md3 : active raid1 sda5[0] sdb5[1]
20971392 blocks [2/2] [UU]
md2 : active raid1 sda3[0] sdb3[1]
20971456 blocks [2/2] [UU]
md1 : active raid1 sda2[0] sdb2[1]
524224 blocks [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[1]
2096064 blocks [2/2] [UU]
unused devices:
As you can see, disk SDB is still shown as active in all Raid Arrays.
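On servers with many arrays it can help to extract the affected partitions from /proc/mdstat instead of reading them off by eye. This is a rough sketch under the assumption that the disk uses classic names such as sdb1 (names like nvme0n1p1 are not handled); md_members_of_disk is a hypothetical helper name.

```shell
# List "array partition" pairs for every raid member located on the given disk.
# Reads /proc/mdstat-formatted text on stdin; $1 is the disk name without /dev/ (e.g. sdb).
md_members_of_disk() {
    awk -v disk="$1" '
        $2 == ":" && $1 ~ /^md/ {
            for (i = 3; i <= NF; i++) {
                part = $i
                sub(/\[.*/, "", part)              # strip "[1]" and flags such as "(F)"
                if (part ~ ("^" disk "[0-9]+$"))   # keep only partitions of the given disk
                    print $1, part
            }
        }'
}

# Usage on a real system:
#   md_members_of_disk sdb < /proc/mdstat
```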
3. (optional in case the failing disk is still working in the software raid)
Set the failing disk (SDB) as "fail" in the software raid:
# mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
# mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1
# mdadm --manage /dev/md2 --fail /dev/sdb3
mdadm: set /dev/sdb3 faulty in /dev/md2
# mdadm --manage /dev/md3 --fail /dev/sdb5
mdadm: set /dev/sdb5 faulty in /dev/md3
# mdadm --manage /dev/md4 --fail /dev/sdb6
mdadm: set /dev/sdb6 faulty in /dev/md4
Now the raid status looks like the following (as if SDB failed):
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md4 : active raid1 sda6[0] sdb6[2](F)
688009088 blocks [2/1] [U_]
md3 : active raid1 sda5[0] sdb5[2](F)
20971392 blocks [2/1] [U_]
md2 : active raid1 sda3[0] sdb3[2](F)
20971456 blocks [2/1] [U_]
md1 : active raid1 sda2[0] sdb2[2](F)
524224 blocks [2/1] [U_]
md0 : active raid1 sda1[0] sdb1[2](F)
2096064 blocks [2/1] [U_]
unused devices:
4. Remove all SDB partitions from each Raid Array:
# mdadm /dev/md0 -r /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md0
# mdadm /dev/md1 -r /dev/sdb2
mdadm: hot removed /dev/sdb2 from /dev/md1
# mdadm /dev/md2 -r /dev/sdb3
mdadm: hot removed /dev/sdb3 from /dev/md2
# mdadm /dev/md3 -r /dev/sdb5
mdadm: hot removed /dev/sdb5 from /dev/md3
# mdadm /dev/md4 -r /dev/sdb6
mdadm: hot removed /dev/sdb6 from /dev/md4
Verify the status of the software Raid again. All SDB entries are now removed:
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md4 : active raid1 sda6[0]
688009088 blocks [2/1] [U_]
md3 : active raid1 sda5[0]
20971392 blocks [2/1] [U_]
md2 : active raid1 sda3[0]
20971456 blocks [2/1] [U_]
md1 : active raid1 sda2[0]
524224 blocks [2/1] [U_]
md0 : active raid1 sda1[0]
2096064 blocks [2/1] [U_]
unused devices:
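Typing the fail and remove commands of steps 3 and 4 for every array is error-prone, so they can be generated first and reviewed before running them. The sketch below only prints the commands and executes nothing. The helper name print_fail_remove and the "array:partition" argument format are made up for this example.

```shell
# Dry run: print (do not execute) the mdadm calls that mark each given member
# as failed and then remove it. Arguments are "array:partition" pairs.
print_fail_remove() {
    for pair in "$@"; do
        array=${pair%%:*}      # text before the colon, e.g. md0
        part=${pair#*:}        # text after the colon, e.g. sdb1
        echo "mdadm --manage /dev/$array --fail /dev/$part"
        echo "mdadm --manage /dev/$array --remove /dev/$part"
    done
}

# Example:
#   print_fail_remove md0:sdb1 md1:sdb2 md2:sdb3 md3:sdb5 md4:sdb6
```

Once the printed list looks right, it can be pasted into a root shell line by line.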
5. (optional) Check that a boot loader is installed on the remaining disk:
# dd if=/dev/sda bs=1024 count=1 2>&1 | strings | egrep -i "lilo|grub"
GRUB
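The same check can be wrapped in a small function so that it returns an exit code instead of printing text, which is handy in scripts. This is a sketch with a hypothetical function name (has_boot_loader). It uses grep -a instead of strings and, like the command above, assumes a BIOS-style setup where the boot code sits in the first sectors of the disk.

```shell
# Return success if the first 1024 bytes of the given device or file
# contain the string GRUB or LILO (case-insensitive).
has_boot_loader() {
    dd if="$1" bs=1024 count=1 2>/dev/null | LC_ALL=C grep -aEiq 'lilo|grub'
}

# Usage on a real system (as root):
#   has_boot_loader /dev/sda && echo "boot loader found on sda"
```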
6. Shut down the server (if necessary) and replace the drive. Then start the server, which should boot from SDA.
7. Copy SDA's partition table to the new SDB HDD (SDA: good/old disk, SDB: new empty disk, direction SDA -> SDB).
Note: If you are going to replace the drive with a larger drive and your goal is to extend the size of the raid array, do not copy the partition table. Instead check out this article: Replace hard or solid state drive with a bigger one and grow software (mdadm) raid.
For disks with an MBR (Master Boot Record) partition table:
# sfdisk -d /dev/sda | sfdisk /dev/sdb
For drives with a GPT partition table (typically all drives larger than 2TB):
# sgdisk -R /dev/sdb /dev/sda
# sgdisk -G /dev/sdb
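Be careful with the argument order of sgdisk: -R takes the target disk as its option argument, so the new disk (SDB) comes first and the source (SDA) last. Swapping them would overwrite the partition table of the good disk. The following -G then gives the copy new random GUIDs so that both disks do not share the same identifiers. To avoid picking the wrong variant, the commands can be printed based on the partition table type. This is a sketch only: print_table_copy is a hypothetical helper, and reading the type with blkid as shown in the comment is an assumption that should hold on current util-linux versions.

```shell
# Print (do not execute) the commands that copy the partition table
# from disk $2 to disk $3, chosen by the table type in $1
# ("dos" for MBR, "gpt" for GPT).
print_table_copy() {
    case "$1" in
        dos) echo "sfdisk -d /dev/$2 | sfdisk /dev/$3" ;;
        gpt) echo "sgdisk -R /dev/$3 /dev/$2"   # target first, source last
             echo "sgdisk -G /dev/$3" ;;        # randomize GUIDs on the copy
        *)   echo "unknown partition table type: $1" >&2; return 1 ;;
    esac
}

# On a real system the type could be read from the good disk, for example:
#   print_table_copy "$(blkid -o value -s PTTYPE /dev/sda)" sda sdb
```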
8. Add the partitions of the new SDB to the Raid Arrays:
# mdadm /dev/md0 -a /dev/sdb1
# mdadm /dev/md1 -a /dev/sdb2
# mdadm /dev/md2 -a /dev/sdb3
# mdadm /dev/md3 -a /dev/sdb5
# mdadm /dev/md4 -a /dev/sdb6
9. Check Synchronisation:
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md4 : active raid1 sdb6[2] sda6[0]
688009088 blocks [2/1] [U_]
resync=DELAYED
md3 : active raid1 sdb5[2] sda5[0]
20971392 blocks [2/1] [U_]
[>....................] recovery = 1.2% (271936/20971392) finish=5.0min speed=67984K/sec
md2 : active raid1 sdb3[2] sda3[0]
20971456 blocks [2/1] [U_]
resync=DELAYED
md1 : active raid1 sdb2[2] sda2[0]
524224 blocks [2/1] [U_]
resync=DELAYED
md0 : active raid1 sdb1[1] sda1[0]
2096064 blocks [2/2] [UU]
unused devices:
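The arrays are synchronised one after another, so on large disks this can take a while. Instead of calling cat /proc/mdstat repeatedly, a small check can tell whether any array is still degraded or rebuilding. This is a minimal sketch with a made-up function name (md_sync_pending). It only looks for an underscore in the [UU] status field and for recovery or resync lines.

```shell
# Return success while any array in the mdstat text on stdin is still degraded
# or rebuilding: a "_" inside the [UU] status field, or a recovery/resync line.
md_sync_pending() {
    awk '
        /\[[U_]*_[U_]*\]/  { pending = 1 }   # e.g. [U_]
        /recovery|resync/  { pending = 1 }   # progress bar or resync=DELAYED
        END { exit pending ? 0 : 1 }'
}

# Usage on a real system: poll once a minute until all arrays show [UU].
#   while md_sync_pending < /proc/mdstat; do sleep 60; done
```

For a single array, mdadm --wait /dev/md3 blocks until its recovery has finished and may be the simpler choice.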
10. Install the boot loader on the new disk (SDB) and verify it:
# grub-install /dev/sdb
Installation finished. No error reported.
# dd if=/dev/sdb bs=1024 count=1 2>&1 | strings | egrep -i "lilo|grub"
GRUB