There isn't really a brief way to describe this problem, as it's highly involved. Apologies if any information is missing; I have tried to be as thorough as possible. We have hit a wall and are looking for guidance on how to continue troubleshooting. The driver seems to be resetting the adapter, though this is speculation without a deeper understanding :-)
The only (possibly) relevant thread I could find was http://comments.gmane.org/gmane.linux.drivers.e1000.devel/9934.
We have several physical machines containing both a dual-port 82576, which is onboard the motherboard, and a quad-port 82580-based expansion card. The exact cards may not be identical between machines, but the chipsets are. Several machines are using the I340-T4. The interfaces are configured as:
eth0: External bond
eth1: iSCSI over mpio to SAN
eth2: External bond
eth3: iSCSI over mpio to SAN
eth4: internal network
eth5: management interface
My personal opinion is that the following section is not directly relevant to the problem, but might help inform the scenario. The machines are running Xen and multiple VMs. Traffic to and from the VMs travels over the bonded pair of eth0 and eth2. The VMs access their disks via the iSCSI over mpio connections. Access to dom0 is over the management interface.
The symptoms are these:
- on the majority of machines, at frequent but unpredictable intervals, the physical machines (and therefore the VMs they host) become unresponsive on the network
--- on these machines modinfo igb shows:
- on one machine in particular, we have reproducible results (detail later)
--- this machine was running 3.0.6
--- it has been upgraded to:
- these periods of no response can be seen on *all* interfaces, making the igb driver and upwards the only common factors
- ifconfig shows overruns and dropped packets (detail later)
- /var/log/messages shows the following, repeatedly:
Sep 13 12:23:28 HV020 kernel: connection3:0: ping timeout of 10 secs expired, recv timeout 5, last rx 4312351768, last ping 4312353018, now 4312355518
Sep 13 12:23:28 HV020 kernel: connection3:0: detected conn error (1011)
Sep 13 12:23:29 HV020 iscsid: Kernel reported iSCSI connection 3:0 error (1011) state (3)
Sep 13 12:23:43 HV020 kernel: session3: session recovery timed out after 15 secs
Sep 13 12:23:56 HV020 iscsid: connection3:0 is operational after recovery (2 attempts)
Sep 13 12:31:59 HV020 kernel: bonding: bond1: link status definitely down for interface eth0, disabling it
Sep 13 12:32:02 HV020 kernel: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 13 12:32:02 HV020 kernel: bonding: bond1: link status definitely up for interface eth0.
Sep 13 19:22:35 HV020 kernel: connection2:0: ping timeout of 10 secs expired, recv timeout 5, last rx 4318638456, last ping 4318639706, now 4318642245
Sep 13 19:22:35 HV020 kernel: connection2:0: detected conn error (1011)
Sep 13 19:22:37 HV020 iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3)
Sep 13 19:23:11 HV020 iscsid: connection2:0 is operational after recovery (1 attempts)
Sep 13 19:27:13 HV020 kernel: NETDEV WATCHDOG: eth2: transmit timed out
Sep 13 19:27:16 HV020 kernel: bonding: bond1: link status definitely down for interface eth2, disabling it
Sep 13 19:27:19 HV020 kernel: igb: eth4 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Sep 13 19:27:19 HV020 kernel: igb: eth5 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Sep 13 19:27:20 HV020 kernel: igb: eth2 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Sep 13 19:27:20 HV020 kernel: bonding: bond1: link status definitely up for interface eth2.
Sep 13 19:27:26 HV020 kernel: igb: eth3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
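To line the iSCSI timeouts up against the watchdog and link events, the log can be reduced to a short timeline. The following is a minimal sketch, assuming the syslog layout shown above ("Mon DD HH:MM:SS host source: message"). The event labels and function names are hypothetical, chosen for this sketch only; the patterns are copied from the messages in the excerpt.

```python
import re
from collections import Counter

# Assumed syslog layout: "Mon DD HH:MM:SS host source: message"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+(.*)$")

# Event labels are hypothetical names for this sketch; the patterns
# mirror the kernel messages quoted in the log excerpt.
PATTERNS = [
    ("iscsi_ping_timeout", re.compile(r"(connection\d+:\d+): ping timeout")),
    ("tx_watchdog", re.compile(r"NETDEV WATCHDOG: (eth\d+): transmit timed out")),
    ("bond_link_down", re.compile(r"link status definitely down for interface (eth\d+)")),
    ("nic_link_up", re.compile(r"igb: (eth\d+) NIC Link is Up")),
]


def parse_events(lines):
    """Return (timestamp, event kind, interface or connection) tuples."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not in the assumed syslog layout
        stamp, message = match.groups()
        for kind, pattern in PATTERNS:
            hit = pattern.search(message)
            if hit:
                events.append((stamp, kind, hit.group(1)))
                break  # at most one event per log line
    return events


def count_by_kind(events):
    """Count how many events of each kind were seen."""
    return Counter(kind for _, kind, _ in events)


if __name__ == "__main__":
    with open("/var/log/messages", errors="replace") as log:
        for stamp, kind, subject in parse_events(log):
            print(stamp, kind, subject)
```

Printing the events in file order makes it easier to see whether a transmit watchdog timeout on one interface is followed by link-up messages on the others, as in the 19:27 sequence above.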
As mentioned, on one machine we can reproduce this reliably. The machine has just 4 VMs:
- 3 in a cluster, generating anywhere up to 30k connections, but only 600Kbps incoming and 1Mbps outgoing.
- 1 receiving an SFTP transfer at ~1Mbps
The load produced by the cluster of machines follows an inverse bell curve, peaking every 60 mins. Throughout any given hour, the following symptoms can be seen, but with higher frequency and longer duration around the peak:
- slow response times from the cluster, taking 30-90s to respond to an HTTP request (traffic is over eth0/eth2 bond)
- tcpdumps taken at multiple points in the network stack show the SFTP traffic slowing or pausing completely (traffic is over eth0/eth2 bond)
- timeout errors in /var/log/messages regarding iSCSI traffic (eth1 and eth3)
- long pauses and occasional disconnects on SSH sessions to dom0 (eth5)
The overruns and dropped packets coincide with periods of high load, when the SFTP traffic stops completely and SSH to dom0 becomes unresponsive. They often also coincide with an interface reset.
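To timestamp exactly when the counters move, they can be sampled from sysfs rather than read by hand from ifconfig. This is a minimal sketch, assuming the Linux per-interface counter files under /sys/class/net/<iface>/statistics/; the function names, the one-second interval, and the choice of counters are assumptions for illustration, and which of these counters feeds the ifconfig "overruns" figure has not been verified here.

```python
import os
import time

# Counter files assumed to exist under /sys/class/net/<iface>/statistics/.
COUNTERS = ("rx_dropped", "rx_fifo_errors", "rx_missed_errors", "rx_over_errors")


def read_counters(iface, base="/sys/class/net"):
    """Read the selected RX counters for one interface (0 if unreadable)."""
    stats_dir = os.path.join(base, iface, "statistics")
    values = {}
    for name in COUNTERS:
        try:
            with open(os.path.join(stats_dir, name)) as handle:
                values[name] = int(handle.read().strip())
        except (OSError, ValueError):
            values[name] = 0
    return values


def changed(previous, current):
    """Return only the counters whose value differs, as name -> delta."""
    return {
        name: value - previous.get(name, 0)
        for name, value in current.items()
        if value != previous.get(name, 0)
    }


if __name__ == "__main__":
    interfaces = ["eth0", "eth1", "eth2", "eth3", "eth4", "eth5"]
    last = {iface: read_counters(iface) for iface in interfaces}
    while True:
        time.sleep(1)  # sampling interval is an arbitrary choice
        for iface in interfaces:
            now = read_counters(iface)
            delta = changed(last[iface], now)
            if delta:
                print(time.strftime("%b %d %H:%M:%S"), iface, delta)
            last[iface] = now
```

The output uses the same timestamp style as /var/log/messages, so counter jumps can be compared directly against the iSCSI and watchdog entries.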
We can prevent the overruns (as I suppose is to be expected) by increasing the size of the RX ring, so that packets are fully buffered while the device is unresponsive.
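As a rough feel for how much stall time a larger ring can absorb, the arithmetic can be sketched as below. Everything here is an assumption for illustration: one RX descriptor per packet, a steady arrival rate, and example values for ring size, line rate, and packet size that are not measurements from these machines.

```python
def packets_per_second(bits_per_second, packet_bytes):
    """Packet arrival rate for a steady stream of equal-sized packets."""
    return bits_per_second / (packet_bytes * 8)


def buffer_seconds(ring_descriptors, bits_per_second, packet_bytes):
    """Seconds of incoming traffic a ring can hold before it overruns.

    Assumes one descriptor per packet and that nothing drains the ring
    while the device is unresponsive.
    """
    return ring_descriptors / packets_per_second(bits_per_second, packet_bytes)


if __name__ == "__main__":
    # Hypothetical ring sizes and traffic profile, for illustration only.
    for ring in (256, 4096):
        seconds = buffer_seconds(ring, 1_000_000, 1500)
        print(f"ring={ring} holds about {seconds:.1f}s at 1 Mbps, 1500-byte packets")
```

Small packets shorten this considerably, so a ring that hides the stall for the SFTP stream may still overrun under the cluster's connection-heavy traffic.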
I appreciate that there will need to be further diagnostic work done to ascertain the problem. Any guidance is appreciated.