For a couple of months I had been wondering about the following error messages, which appeared every five minutes on my NAS, an HP Proliant N40L Microserver running Debian 7 Wheezy:
[Hardware Error]: CPU:0 (10:6:3) MC2_STATUS[-|CE|-|-|AddrV|CECC]: 0x940040000000018a
[Hardware Error]: MC2_ADDR: 0x00000000d3b42540
[Hardware Error]: MC2 Error: : SNP error during data copyback.
[Hardware Error]: cache level: L2, tx: GEN, mem-tx: SNP
[Hardware Error]: Corrected error, no action required.
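The bracketed flag list in the first line is not extra information; it appears to be derived from the raw status value printed right after it. The following is a minimal bash sketch of that decoding, not the kernel's actual code. The function names are hypothetical, and the bit positions are an assumption based on the AMD machine-check status register layout; the log above only confirms the CE, AddrV and CECC positions.

```bash
#!/usr/bin/env bash
# Sketch: rebuild the bracketed flag list from a raw MCi_STATUS value.
# ASSUMPTION: bit positions follow the AMD machine-check status layout
# (62 Over, 61 UC, 59 MiscV, 58 AddrV, 57 PCC, 46 CECC, 45 UECC).

# Print $3 if bit $2 of value $1 is set, otherwise print $4.
flag() {
    if (( ($1 >> $2) & 1 )); then
        printf '%s\n' "$3"
    else
        printf '%s\n' "$4"
    fi
}

# Print the flag list for one status value, e.g. 0x940040000000018a.
mc_status_flags() {
    local s=$(( $1 ))   # bash arithmetic is signed 64-bit; the bit pattern is preserved
    local out
    out="$(flag "$s" 62 Over -)"
    out+="|$(flag "$s" 61 UE CE)"        # uncorrected vs. corrected error
    out+="|$(flag "$s" 59 MiscV -)"
    out+="|$(flag "$s" 57 PCC -)"
    out+="|$(flag "$s" 58 AddrV -)"      # MCi_ADDR holds a valid address
    out+="$(flag "$s" 46 '|CECC' '')"    # corrected by ECC
    out+="$(flag "$s" 45 '|UECC' '')"    # ECC could not correct it
    printf '[%s]\n' "$out"
}
```

Calling `mc_status_flags 0x940040000000018a` prints `[-|CE|-|-|AddrV|CECC]`, the same flags as in the log line: a corrected error with a valid address, fixed by ECC.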
I came across some articles, like the following:
But none offered a real solution to the problem. Some even said these logged error messages could simply be ignored...
A couple of days ago, I upgraded the NAS server from Debian Wheezy to Jessie (as an intermediate step on the way to Stretch). After the successful OS upgrade I realized that the log entries now appeared ALL THE TIME. I couldn't even use the terminal anymore because it was flooded with these messages:
[ 1026.904428] [Hardware Error]: CPU:0 (10:6:3) MC2_STATUS[-|CE|-|-|AddrV|CECC]: 0x940040000000018a
[ 1026.910229] [Hardware Error]: MC2_ADDR: 0x00000000d3b42540
[ 1026.915945] [Hardware Error]: MC2 Error: : SNP error during data copyback.
[ 1026.921690] [Hardware Error]: cache level: L2, tx: GEN, mem-tx: SNP
[ 1027.182836] [Hardware Error]: Corrected error, no action required.
[ 1027.188553] [Hardware Error]: CPU:0 (10:6:3) MC2_STATUS[-|CE|-|-|AddrV|CECC]: 0x940040000000018a
[ 1027.194345] [Hardware Error]: MC2_ADDR: 0x0000000001af2540
[ 1027.200132] [Hardware Error]: MC2 Error: : SNP error during data copyback.
[ 1027.205915] [Hardware Error]: cache level: L2, tx: GEN, mem-tx: SNP
[ 1027.338890] [Hardware Error]: Corrected error, no action required.
[ 1027.344632] [Hardware Error]: CPU:0 (10:6:3) MC1_STATUS[-|CE|-|-|AddrV]: 0x9400000000000151
[ 1027.350428] [Hardware Error]: MC1_ADDR: 0x0000ffff81012550
[ 1027.356222] [Hardware Error]: MC1 Error: Parity error during data load.
[ 1027.361997] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
[ 1027.430924] [Hardware Error]: Corrected error, no action required.
[ 1027.436645] [Hardware Error]: CPU:0 (10:6:3) MC1_STATUS[-|CE|-|-|AddrV]: 0x9400000000000151
[ 1027.442419] [Hardware Error]: MC1_ADDR: 0x0000ffff810b2550
[ 1027.448216] [Hardware Error]: MC1 Error: Parity error during data load.
[ 1027.453960] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
[ 1027.939102] [Hardware Error]: Corrected error, no action required.
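The "cache level / tx / mem-tx" line of each event can be read straight out of the low 16 bits of the status value. Here is a small bash sketch of that split, assuming the usual x86 cache error code pattern `0000 0001 RRRR TTLL`. The function name is hypothetical and the name tables are an assumption; only the entries that show up in the log above (L1, L2, INSN, GEN, IRD, SNP) are confirmed by it.

```bash
#!/usr/bin/env bash
# Sketch: split the low 16 bits of MCi_STATUS into the three fields
# printed in the "cache level: ..., tx: ..., mem-tx: ..." log line.
# ASSUMPTION: the code has the form 0000 0001 RRRR TTLL and the name
# tables below match the kernel's wording.
mc_cache_error() {
    local code=$(( $1 & 0xffff ))          # keep only the error code
    local ll_names=(RESV L1 L2 L3/GEN)     # cache level
    local tt_names=(INSN DATA GEN RESV)    # transaction type
    local rrrr_names=(GEN RD WR DRD DWR IRD PRF EV SNP)  # memory transaction

    if (( (code >> 8) != 1 )); then
        echo "not a cache error code"
        return 1
    fi

    local ll=$(( code & 0x3 ))
    local tt=$(( (code >> 2) & 0x3 ))
    local rrrr=$(( (code >> 4) & 0xf ))
    echo "cache level: ${ll_names[ll]}, tx: ${tt_names[tt]}, mem-tx: ${rrrr_names[rrrr]:-RESV}"
}
```

With the two status values from the flood above, `mc_cache_error 0x940040000000018a` gives `cache level: L2, tx: GEN, mem-tx: SNP` and `mc_cache_error 0x9400000000000151` gives `cache level: L1, tx: INSN, mem-tx: IRD`. So both the L1 instruction cache and the L2 cache were reporting corrected errors.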
Damn. It's time to dig into that problem again. This time I got luckier and came across this forum thread:
The most interesting posted text there was:
"It is most likely a CPU fan dust bunny. That's the signal from the kernel to clean those out."
As easy as this sounds, it made sense. The microserver has been running day and night since it became my NAS server in December 2012 (see article Building a home file server with HP Proliant N40L). That's more than 5 years of total run time. As you might be aware, the motherboard of this Microserver sits under the drive cage and is not easily accessible. And therefore it is not easy to clean either.
I gave it a shot, shut down the server, removed the cables from the motherboard and pulled it out.
There it is. A thick layer of dust sitting on the CPU's heat sink.
I cleaned the motherboard (vacuumed the dust off), re-attached the cables and pushed the motherboard back into position. Moment of truth. I booted the server.
Checking syslog, you can easily see when I turned the server off (15:28) and booted it again (15:42):
May 24 15:28:04 nas kernel: [77872.129490] [Hardware Error]: CPU:0 (10:6:3) MC1_STATUS[-|CE|-|-|AddrV]: 0x9400000000000151
May 24 15:28:04 nas kernel: [77872.135237] [Hardware Error]: MC1_ADDR: 0x0000ffff810b2550
May 24 15:28:04 nas kernel: [77872.140955] [Hardware Error]: MC1 Error: Parity error during data load.
May 24 15:28:04 nas kernel: [77872.146656] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
May 24 15:28:04 nas kernel: [77872.263866] [Hardware Error]: Corrected error, no action required.
May 24 15:28:04 nas kernel: [77872.269509] [Hardware Error]: CPU:0 (10:6:3) MC2_STATUS[-|CE|-|-|AddrV|CECC]: 0x940040000000018a
May 24 15:28:04 nas kernel: [77872.275283] [Hardware Error]: MC2_ADDR: 0x0000000001af2540
May 24 15:28:04 nas kernel: [77872.280990] [Hardware Error]: MC2 Error: : SNP error during data copyback.
May 24 15:28:04 nas kernel: [77872.286694] [Hardware Error]: cache level: L2, tx: GEN, mem-tx: SNP
May 24 15:28:04 nas kernel: [77872.323890] [Hardware Error]: Corrected error, no action required.
May 24 15:28:04 nas kernel: [77872.329552] [Hardware Error]: CPU:0 (10:6:3) MC1_STATUS[-|CE|-|-|AddrV]: 0x9400000000000151
May 24 15:28:04 nas kernel: [77872.335294] [Hardware Error]: MC1_ADDR: 0x0000ffff810b2550
May 24 15:28:04 nas kernel: [77872.341013] [Hardware Error]: MC1 Error: Parity error during data load.
May 24 15:28:04 nas kernel: [77872.346716] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
May 24 15:28:04 nas kernel: [77872.371793] [Hardware Error]: Corrected error, no action required.
May 24 15:28:04 nas kernel: [77872.377085] [Hardware Error]: CPU:0 (10:6:3) MC1_STATUS[-|CE|-|-|AddrV]: 0x9400000000000151
May 24 15:28:04 nas kernel: [77872.382397] [Hardware Error]: MC1_ADDR: 0x0000ffff810b2540
May 24 15:28:04 nas kernel: [77872.387718] [Hardware Error]: MC1 Error: Parity error during data load.
May 24 15:28:04 nas kernel: [77872.393030] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
May 24 15:42:13 nas kernel: [ 0.000000] Initializing cgroup subsys cpuset
May 24 15:42:13 nas kernel: [ 0.000000] Initializing cgroup subsys cpu
May 24 15:42:13 nas kernel: [ 0.000000] Initializing cgroup subsys cpuacct
May 24 15:42:13 nas kernel: [ 0.000000] Linux version 3.16.0-6-amd64 (debian-kernel@lists.debian.org) (gcc version 4.9.2 (Debian 4.9.2-10+deb8u1) ) #1 SMP Debian 3.16.56-1+deb8u1 (2018-05-08)
May 24 15:42:13 nas kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.16.0-6-amd64 root=UUID=e00b8ddf-5247-4b9f-834c-d557df90f575 ro quiet
May 24 15:42:13 nas kernel: [ 0.000000] e820: BIOS-provided physical RAM map:
Then I waited. From the logs above (the ones that flooded my terminal) you can see that the hardware errors were already appearing after 1026 seconds of uptime.
Now, after 1200 seconds of uptime, still no hardware errors:
root@nas:~# uptime
16:03:00 up 20 min, 1 user, load average: 0.04, 0.15, 0.09
root@nas:~# echo $((20 * 60 ))
1200
root@nas:~# dmesg | tail
[ 10.257700] RPC: Registered named UNIX socket transport module.
[ 10.257706] RPC: Registered udp transport module.
[ 10.257709] RPC: Registered tcp transport module.
[ 10.257711] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 10.272263] FS-Cache: Loaded
[ 10.321299] FS-Cache: Netfs 'nfs' registered for caching
[ 10.376030] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
[ 11.809469] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex
[ 11.809478] tg3 0000:02:00.0 eth0: Flow control is on for TX and on for RX
[ 11.809506] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Even now, after 41 minutes (= 2460 seconds) of uptime, there are still no errors:
root@nas:~# uptime && dmesg |tail
16:23:49 up 41 min, 1 user, load average: 0.02, 0.03, 0.01
[ 10.257700] RPC: Registered named UNIX socket transport module.
[ 10.257706] RPC: Registered udp transport module.
[ 10.257709] RPC: Registered tcp transport module.
[ 10.257711] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 10.272263] FS-Cache: Loaded
[ 10.321299] FS-Cache: Netfs 'nfs' registered for caching
[ 10.376030] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
[ 11.809469] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex
[ 11.809478] tg3 0000:02:00.0 eth0: Flow control is on for TX and on for RX
[ 11.809506] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
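Looking at `dmesg | tail` only shows the last few lines, so a quieter burst of errors further up could be missed. As a rough sketch, the check can be turned into a count of corrected events over the whole kernel log. The function name is hypothetical, and it assumes every event ends with exactly one "Corrected error, no action required." line, as in the logs above.

```bash
#!/usr/bin/env bash
# Sketch: count corrected machine-check events in kernel log text on stdin.
# ASSUMPTION: each event ends with exactly one
# "[Hardware Error]: Corrected error, no action required." line.
count_corrected_errors() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # that number is 0, so "|| true" keeps the function's exit status clean.
    grep -c 'Hardware Error.*Corrected error' || true
}
```

Run as `dmesg | count_corrected_errors`, this should print 0 on a healthy system, and anything above 0 means the errors are back.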
These error messages really turned out to be a warning from the OS to clean the server. Who'd have thought that from looking at these hardware error messages...
Alex from DE wrote on May 28th, 2018:
You can use the -T option with the dmesg command to make the timestamps more readable:
alex:~$ dmesg -T | head -10
[Mon May 21 11:19:55 2018] Initializing cgroup subsys cpuset
[Mon May 21 11:19:55 2018] Initializing cgroup subsys cpu
[Mon May 21 11:19:55 2018] Initializing cgroup subsys cpuacct
[Mon May 21 11:19:55 2018] Linux version 3.10.0-693.21.1.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP Wed Mar 7 19:03:37 UTC 2018
[Mon May 21 11:19:55 2018] Command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.21.1.el7.x86_64 root=/dev/mapper/vg.01-lv_root ro crashkernel=auto rd.lvm.lv=vg.01/lv_root rd.lvm.lv=vg.02/lv_swap noquiet ipv6.disable=1 net.ifnames=0 elevator=deadline user_namespace.enable=1 biosdevname=0 fsck.repair=yes LANG=en_US.UTF-8
[Mon May 21 11:19:55 2018] Disabled fast string operations
[Mon May 21 11:19:55 2018] e820: BIOS-provided physical RAM map:
[Mon May 21 11:19:55 2018] BIOS-e820: [mem 0x0000000000000000-0x000000000009ebff] usable
[Mon May 21 11:19:55 2018] BIOS-e820: [mem 0x000000000009ec00-0x000000000009ffff] reserved
[Mon May 21 11:19:55 2018] BIOS-e820: [mem 0x00000000000dc000-0x00000000000fffff] reserved