Ubuntu freeze due to bug in tg3 driver for Broadcom NIC (?)


Today I experienced a server freeze on an Ubuntu 12.04 LTS system running the quantal kernel (3.5.0-48-generic #72~precise1-Ubuntu).

The last entries on the console were the following:

tg3 0000:03:00.2: eth2: 0: Host status block [00000005:00000003:(0000:0000:0000):(0000:0000))]
tg3 0000:03:00.2: eth2: 0: NAPI info [00000003:00000003:(0000:0000:01ff):0000(02e6:0000:0000:0000)]
tg3 0000:03:00.2: eth2: 1: Host status block [00000001:000000c2:(0000:0000:0000):(0f22:0150))]
tg3 0000:03:00.2: eth2: 1: NAPI info [000000c2:000000c2:(00bf:0150:01ff):0f22:(0722:0722:0000:0000)]
tg3 0000:03:00.2: eth2: 2: Host status block [00000001:00000064:(0b3f:0000:0000):(0000:0049)]
tg3 0000:03:00.2: eth2: 2: NAPI info [00000064:00000064:(0049:0049:01ff):0b3f:(03ff:03ff:0000:0000)]
tg3 0000:03:00.2: eth2: 3: Host status block [00000001:00000024:(0000:0000:0000):(00000:012b)]
tg3 0000:03:00.2: eth2: 3: NAPI info [00000024:00000024:(012b:012b:01ff):0a8f:(028f:028f:0000:0000)]
tg3 0000:03:00.2: eth2: 4: Host status block [00000001:000000c7:(0000:0000:0d2e):(0000:010d)]
tg3 0000:03:00.2: eth2: 4: NAPI info [000000c7:000000c7:(010d:010d:01ff):0d2e:(052e:052e:0000:0000)]
tg3 0000:03:00.2: tg3_stop_block timed out, ofs=1400 enable_bit=2
tg3 0000:03:00.2: tg3_stop_block timed out, ofs=c00 enable_bit=2
tg3 0000:03:00.2: eth2: Link is down
tg3 0000:03:00.1: eth1: Link is down
tg3 0000:03:00.0: eth0: Link is down
br1: port 1(eth1) entered disabled state

After these entries, the system completely froze. Not even the console was working anymore.

Here is some additional information about the system:

ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: off

lspci  | grep 03:00
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.2 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.3 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

uname -a
Linux myserver.local 3.5.0-48-generic #72~precise1-Ubuntu SMP Tue Mar 11 20:09:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

I've tried to pinpoint the freeze to a certain bug, but couldn't really find a description which EXACTLY describes this issue. I did however find some clues/possibilities:

Deadlock bug in tg3 driver (tg3_change_mtu)?
It's possible that this freeze was triggered by a bug in the tg3 driver in the tg3_change_mtu function. A bug fix was released just recently on March 4th 2014 (see https://lkml.org/lkml/2014/3/4/568).
According to the Ubuntu changelog for the linux-lts-quantal package, Ubuntu (Canonical) added this kernel fix in 3.5.0-49~precise1, released on May 5th 2014 (one week ago).
I will definitely give it a try with the new kernel.
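To check whether a host is still running a kernel older than the fixed package, a small version comparison can help. This is only a sketch: the function name version_ge is made up for illustration, the threshold 3.5.0-49 is taken from the changelog entry mentioned above, and the comparison only makes sense within the linux-lts-quantal (3.5.0) series. It assumes a sort that supports -V (GNU coreutils does).

```shell
#!/bin/sh
# Version from the Ubuntu changelog cited above. Assumption: you are on the
# linux-lts-quantal (3.5.0-x) series; other series need their own threshold.
FIXED="3.5.0-49"

# Hypothetical helper: succeeds if version $1 is >= version $2.
# Relies on `sort -V` (version sort) putting the smaller version first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip the flavour suffix, e.g. 3.5.0-48-generic -> 3.5.0-48.
running=$(uname -r | sed 's/-[a-z].*$//')

if version_ge "$running" "$FIXED"; then
    echo "kernel $running: at or above $FIXED"
else
    echo "kernel $running: older than $FIXED"
fi
```

Note that the running kernel is what counts here: installing the new package has no effect until the server is rebooted into it.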

Broken TSO (TCP Segmentation Offload) handling in tg3 driver?
I found another bug report which shows very similar kernel output (see http://hotpotato.tistory.com/361). This bug report seems to be a copy of https://access.redhat.com/site/solutions/69382, but unfortunately the solution on the Red Hat site can only be seen with a valid subscription. ARGH. According to the first page, the root cause for the issue is:

Certain Broadcom devices, mostly the BCM5704 controllers, failed to work due to incorrect TSO (TCP Segmentation Offload) handling in the tg3 driver. The TSO handling code has been revised so that the devices now work as expected.

But as this bug has been known on the Red Hat site since August 30th 2013, I still lean towards the first possibility (the deadlock bug).
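If someone wants to rule out the TSO theory anyway, one option is to check the current TSO state and switch it off temporarily. The following is a sketch under assumptions: tso_state is a hypothetical helper that parses "ethtool -k" output in the format shown earlier in this post, and eth2 is simply the interface from the log above. Whether disabling TSO would actually prevent this freeze is not confirmed.

```shell
#!/bin/sh
# Hypothetical helper: read `ethtool -k` output on stdin and print the
# value of the tcp-segmentation-offload line (e.g. "on" or "off").
tso_state() {
    awk -F': *' '$1 == "tcp-segmentation-offload" { print $2; exit }'
}

# Possible usage on the affected host (changing the setting needs root):
#   ethtool -k eth2 | tso_state     # show the current TSO state
#   ethtool -K eth2 tso off         # disable TSO until the next reboot
```

A change made with "ethtool -K" is not persistent, so after a reboot the interface falls back to the driver default unless the setting is reapplied.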

General tg3 issue with Broadcom BCM5719?
According to the VMware Knowledge Base entry #2035701, last updated on December 11th 2013, there is a general issue in the tg3 driver specific to the BCM5719 and BCM5720 NIC controllers. The issue can be resolved by updating the Broadcom driver (tg3). As a workaround, the "NetQueue feature" can be disabled. As this is a VMware feature, it doesn't seem to be the cause of my freeze.

By the way, there is a video on YouTube (https://www.youtube.com/watch?v=6jRho13n-k4) from Önder Yilmaz, published on April 28th 2014, which seems to describe the same issue.

Update May 19th, 2014:
After an uptime of 5 days with the new kernel (3.5.0-49-generic), the entries have disappeared from /var/log/kern.log and dmesg.
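A quick way to repeat this check is to count the tg3 timeout lines in the kernel log. This is a minimal sketch: count_tg3_timeouts is a made-up helper name, and the pattern is based only on the "tg3_stop_block timed out" lines from the console output above.

```shell
#!/bin/sh
# Hypothetical helper: count tg3 "tg3_stop_block timed out" lines in a
# log read from stdin. grep -c exits non-zero when nothing matches, so
# "|| true" keeps the function's exit status at 0 in that case.
count_tg3_timeouts() {
    grep -c 'tg3 .*tg3_stop_block timed out' || true
}

# Possible usage:
#   count_tg3_timeouts < /var/log/kern.log
#   dmesg | count_tg3_timeouts
```

A count of 0 only says that the messages were not logged. Since the freeze took the whole system down, the last entries before a crash may not have made it to disk at all.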

