rm: cannot remove due to Input/output error (EXT4-fs error)

On some servers I use rsnapshot as a backup method. It's fast, easy to manage and reliable. But a few days ago a backup run returned the following errors in the daily backup log:

echo 5190 > /var/run/rsnapshot.pid
/bin/rm -rf /backup/rsnapshot/daily.9/
/bin/rm: cannot remove `/backup/rsnapshot/daily.9/localhost/home/mail/web98p1/Maildir/cur/1359872706,S=12695,W=12855:2,S': Input/output error
/bin/rm: cannot remove `/backup/rsnapshot/daily.9/localhost/home/mail/web98p8/Maildir/cur/1360095843,S=4225,W=4321:2,S': Input/output error
/bin/rm: cannot remove `/backup/rsnapshot/daily.9/localhost/var/www/web136/files/g2data/cache/entity/3/2': Input/output error
----------------------------------------------------------------------------
rsnapshot encountered an error! The program was invoked with these options:
/usr/bin/rsnapshot daily
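
As an aside: if you want to see exactly which shell commands such a daily run would execute, without actually rotating or deleting anything, rsnapshot has a test mode. A minimal sketch, assuming the default configuration is in place:

# Print the commands a "daily" run would execute, without running them
rsnapshot -t daily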

[Flashback] This particular server had suffered a disk defect a few days earlier, and the RAID recovery had trouble resynchronizing. I removed the defective disk from the RAID array and told the guys in the data center to replace it. And here comes the problem: they made the big mistake of running a short SMART test, which reported no errors, decided the disk was fine, and rebooted the server. As the server booted, the kernel recognized two disks, each with a RAID configuration on it. But guess what? mdraid told the kernel that these disks do not belong to the same array (because I had previously removed the defective disk, remember?).
So by accident the kernel took the defective disk, created a weird new RAID array (called md127) and mounted the filesystems on it, while the good disk just sat there as a read-only block device. The server ran like this for a day or so until I realized what had happened. I then hoped I could at least resynchronize the defective disk with the good disk, so that the new data would be synchronized. But due to too many I/O errors from the bad disk (because the disk is bad, told ya!!) the RAID recovery failed at around 20%. So some data was resynchronized, some was not, and some probably even contains invalid data, as the disk was about to die. After a lot of nightly effort, and thanks to the help of a very good friend, I managed to restore the server to a more or less stable system. But one question remained open: what did the failed RAID recovery do to the good disk? [/Flashback]
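
For reference, kicking a failing disk out of an md array usually looks something like the commands below. This is just a sketch with hypothetical device names (/dev/md0 and /dev/sdb1), not a transcript of what I did back then:

# Mark the failing member as faulty, then remove it from the array
mdadm /dev/md0 --fail /dev/sdb1
mdadm /dev/md0 --remove /dev/sdb1

# Verify the array state afterwards
cat /proc/mdstat
mdadm --detail /dev/md0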

Back to the present. My guess is that the input/output errors on that file system originate from the failed RAID rebuild, where some corrupt data might have come from the defective disk. I tried to manually remove the daily.9 folder to see what would happen:

rm -r daily.9
rm: cannot remove `daily.9/localhost/home/mail/web98p1/Maildir/cur/1359872706,S=12695,W=12855:2,S': Input/output error
rm: cannot remove `daily.9/localhost/home/mail/web98p8/Maildir/cur/1360095843,S=4225,W=4321:2,S': Input/output error
rm: cannot remove `daily.9/localhost/var/www/web136/files/g2data/cache/entity/3/2': Input/output error

Same issue. Let's see if dmesg tells us something useful:

dmesg
[116836.420063] EXT4-fs error (device dm-5): ext4_lookup: deleted inode referenced: 24274320
[116842.656399] EXT4-fs error (device dm-5): ext4_lookup: deleted inode referenced: 24400033
[116842.574064] EXT4-fs error (device dm-5): ext4_lookup: deleted inode referenced: 24273584
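
The kernel messages only give inode numbers. If you want to know which paths reference such an inode before running a full fsck, debugfs can do a reverse lookup (read-only, ideally on the unmounted device). A sketch using one of the inode numbers from above:

# Map inode 24274320 back to the directory entries referencing it
debugfs -R 'ncheck 24274320' /dev/mapper/vg0-backup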

The "fs error" tells it all. There's some issues on that file system. Next step: File system check.
As this is the backup file system, separately mounted, I can unmount it and make the file system check while the server continues to do its job.

umount /backup
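
Should the umount fail with a "target is busy" message, something still has files open below /backup. Assuming lsof and fuser are installed, the culprits can be found like this:

# List processes that still hold files open under /backup
lsof +D /backup
fuser -vm /backup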

fsck.ext4 /dev/mapper/vg0-backup
e2fsck 1.41.12 (17-May-2010)
/dev/mapper/vg0-backup contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure

Entry '1360095843,S=4225,W=4321:2,S' in /rsnapshot/daily.9/localhost/home/mail/web98p8/Maildir/cur (8674221) has deleted/unused inode 24274320.  Clear? yes

Entry '2' in /rsnapshot/daily.9/localhost/var/www/web136/files/g2data/cache/entity/3 (24400027) has deleted/unused inode 24400033.  Clear? yes

Entry '1359872706,S=12695,W=12855:2,S' in /rsnapshot/daily.9/localhost/home/mail/web98p1/Maildir/cur (8671256) has deleted/unused inode 24273584.  Clear? yes

Heyyy.. these entries look familiar, don't they? So these directory entries point to inodes which no longer exist. Luckily this is an older backup, so I can permit myself to clear these entries from the file system. Well... what choice would I have anyway? The fsck continues:

Pass 3: Checking directory connectivity

Pass 4: Checking reference counts
Inode 939825 ref count is 10, should be 9.  Fix? yes
Inode 939826 ref count is 10, should be 9.  Fix? yes
Inode 939827 ref count is 10, should be 9.  Fix? yes
Inode 939843 ref count is 10, should be 9.  Fix? yes

Pass 4 came up with a lot of inode ref count fix suggestions. An inode's ref count is nothing more than the number of hard links pointing to that file. rsnapshot users know that rsnapshot works with hard links to save disk space across the daily backups.
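
You can see this hard linking in action on any rsnapshot destination. A quick illustration (the path is just an example of a file that exists in several snapshots):

# %h = number of hard links, %i = inode number, %n = file name
stat -c '%h %i %n' /backup/rsnapshot/daily.*/localhost/etc/hostname

Unchanged files show the same inode number in every daily.N directory, and the link count matches the number of snapshots referencing them.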

Pass 5: Checking group summary information
Block bitmap differences:  -(97030728--97030731) -(97033838--97033839) -97529271
Fix? yes

Free blocks count wrong for group #2961 (3626, counted=3632).
Fix? yes

Free blocks count wrong for group #2976 (8930, counted=8931).
Fix? yes

Free blocks count wrong (50341877, counted=50341884).
Fix? yes

Inode bitmap differences:  -24273584 -24274320 -24400033
Fix? yes

Free inodes count wrong for group #2963 (2017, counted=2019).
Fix? yes

Free inodes count wrong for group #2978 (3021, counted=3022).
Fix? yes

Directories count wrong for group #2978 (4051, counted=4050).
Fix? yes

Free inodes count wrong (26902779, counted=26902782).
Fix? yes

/dev/mapper/vg0-backup: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/vg0-backup: 5865218/32768000 files (0.1% non-contiguous), 80730116/131072000 blocks

Pass 5 fixed the counts of free inodes and free blocks on the file system. This must also be an artifact of the failed RAID recovery. That was the end of the fsck.
Let's run another check to see if everything is alright now:

fsck.ext4 /dev/mapper/vg0-backup
e2fsck 1.41.12 (17-May-2010)
/dev/mapper/vg0-backup: clean, 5865218/32768000 files, 80730116/131072000 blocks
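
Note that this second run only reports "clean" because the file system is now flagged as clean; it does not re-scan everything. To force a full re-check despite the clean flag, e2fsck accepts the -f option:

# Force a full check even though the file system is marked clean
fsck.ext4 -f /dev/mapper/vg0-backup

After that, the file system can be mounted again (mount /backup) so rsnapshot can use it for the next run.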

So the file system should now be clean. The falsely referenced inodes are gone and rsnapshot should do its job correctly again. Let me check the rsnapshot log:

echo 9385 > /var/run/rsnapshot.pid
/bin/rm -rf /backup/rsnapshot/daily.9/
mv /backup/rsnapshot/daily.8/ /backup/rsnapshot/daily.9/
[...]

Yep, rsnapshot continues its fine job once more. Over and out.


Comments (newest first)

Jake wrote on Nov 27th, 2015:

Thank you for this well written walk through!
