Beginner
2,276 Views

i40e XL710 hang up - tx_timeout hung_queue - ubuntu

Hello,

We have set up a server running Ubuntu 14.04.3, with all updates installed, as a border router:

Linux hellnat 3.19.0-47-generic #53~14.04.1-Ubuntu SMP Mon Jan 18 16:09:14 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

CPU: 2*E5-2690v3 with hyperthreading enabled (so total 48 logical "cores" in OS)

NIC: Intel XL710 quad port; every "channel" (queue) of every p1p* interface is bound to its own core

The machine is a border router, so it runs BGP. We use p1p1 and p1p3 to connect to internal routers, and p1p2 and p1p3 to connect to uplinks.

Traffic suddenly stopped, and this was NOT during rush hour.

After rebooting (via IPMI) I found the following lines in the syslog file:

Jan 31 02:33:33 hellnat kernel: [220504.793680] ------------[ cut here ]------------

Jan 31 02:33:33 hellnat kernel: [220504.793701] WARNING: CPU: 45 PID: 0 at /build/linux-lts-vivid-Yt59dr/linux-lts-vivid-3.19.0/net/sched/sch_generic.c:303 dev_watchdog+0x24f/0x260()

Jan 31 02:33:33 hellnat kernel: [220504.793705] NETDEV WATCHDOG: p1p1 (i40e): transmit queue 8 timed out

Jan 31 02:33:33 hellnat kernel: [220504.793707] Modules linked in: nf_conntrack_netlink nfnetlink xt_tcpudp xt_multiport iptable_filter xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_mangle xt_CT iptable_raw ast ttm joydev intel_rapl iosf_mbi drm_kms_helper x86_pkg_temp_thermal intel_powerclamp drm syscopyarea sysfillrect sysimgblt coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul aesni_intel ipmi_ssif aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd lpc_ich mei_me sb_edac edac_core mei ipmi_si 8250_fintek ipmi_msghandler lp wmi acpi_pad parport ioatdma mac_hid shpchp nf_conntrack_ftp acpi_power_meter nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc tcp_htcp hid_generic i40e(OE) igb vxlan ip6_udp_tunnel i2c_algo_bit udp_tunnel usbhid dca uas configfs ahci ptp usb_storage hid megaraid_sas libahci pps_core

Jan 31 02:33:33 hellnat kernel: [220504.793817] CPU: 45 PID: 0 Comm: swapper/45 Tainted: G OE 3.19.0-47-generic #53~14.04.1-Ubuntu

Jan 31 02:33:33 hellnat kernel: [220504.793820] Hardware name: Supermicro SYS-6018R-WTR/X10DRW-i, BIOS 1.1 08/13/2015

Jan 31 02:33:33 hellnat kernel: [220504.793822] ffffffff81b3fcc0 ffff88105f4a3d58 ffffffff817afcd5 0000000000000000

Jan 31 02:33:33 hellnat kernel: [220504.793827] ffff88105f4a3da8 ffff88105f4a3d98 ffffffff81074dea 0000000000000286

Jan 31 02:33:33 hellnat kernel: [220504.793830] 0000000000000008 ffff88105b65a000 0000000000000040 ffff88105748cf40

Jan 31 02:33:33 hellnat kernel: [220504.793835] Call Trace:

Jan 31 02:33:33 hellnat kernel: [220504.793837] [] dump_stack+0x45/0x57

Jan 31 02:33:33 hellnat kernel: [220504.793857] [] warn_slowpath_common+0x8a/0xc0

Jan 31 02:33:33 hellnat kernel: [220504.793860] [] warn_slowpath_fmt+0x46/0x50

Jan 31 02:33:33 hellnat kernel: [220504.793869] [] dev_watchdog+0x24f/0x260

Jan 31 02:33:33 hellnat kernel: [220504.793874] [] ? dev_graft_qdisc+0x80/0x80

Jan 31 02:33:33 hellnat kernel: [220504.793879] [] call_timer_fn+0x39/0x110

Jan 31 02:33:33 hellnat kernel: [220504.793883] [] ? dev_graft_qdisc+0x80/0x80

Jan 31 02:33:33 hellnat kernel: [220504.793888] [] run_timer_softirq+0x220/0x320

Jan 31 02:33:33 hellnat kernel: [220504.793898] [] ? lapic_next_deadline+0x33/0x40

Jan 31 02:33:33 hellnat kernel: [220504.793905] [] __do_softirq+0xe4/0x270

Jan 31 02:33:33 hellnat kernel: [220504.793909] [] irq_exit+0x9d/0xb0

Jan 31 02:33:33 hellnat kernel: [220504.793916] [] smp_apic_timer_interrupt+0x4a/0x60

Jan 31 02:33:33 hellnat kernel: [220504.793924] [] apic_timer_interrupt+0x6d/0x80

Jan 31 02:33:33 hellnat kernel: [220504.793926] [] ? cpuidle_enter_state+0x70/0x170

Jan 31 02:33:33 hellnat kernel: [220504.793938] [] ? cpuidle_enter_state+0x5d/0x170

Jan 31 02:33:33 hellnat kernel: [220504.793943] [] cpuidle_enter+0x17/0x20

Jan 31 02:33:33 hellnat kernel: [220504.793949] [] cpu_startup_entry+0x334/0x3d0

Jan 31 02:33:33 hellnat kernel: [220504.793955] [] ? clockevents_register_device+0xe3/0x140

Jan 31 02:33:33 hellnat kernel: [220504.793960] [] start_secondary+0x197/0x1c0

Jan 31 02:33:33 hellnat kernel: [220504.793963] ---[ end trace 43e1a051ade0289e ]---

Jan 31 02:33:33 hellnat kernel: [220504.793973] i40e 0000:81:00.0 p1p1: tx_timeout: VSI_seid: 399, Q 8, NTC: 0xd36, HWB: 0xa1, NTU: 0xa1, TAIL: 0xa1, INT: 0x0

Jan 31 02:33:33 hellnat kernel: [220504.793976] i40e 0000:81:00.0 p1p1: tx_timeout recovery level 1, hung_queue 8

Jan 31 02:33:43 hellnat watchquagga[2972]: zebra state -> unresponsive : no response yet to ping sent 10 seconds ago

Jan 31 02:33:49 hellnat watchquagga[2972]: bgpd state -> unresponsive : no response yet to ping sent 10 seconds ago

Jan 31 02:33:50 hellnat kernel: [220521.908228] NMI watchdog: BUG: soft lockup - CPU#13 stuck for 23s! [kworker/13:1:536]

Jan 31 02:33:50 hellnat kernel: [220521.908306] Modules linked in: nf_conntrack_netlink nfnetlink xt_tcpudp xt_multiport iptable_filter xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_mangle xt_CT iptable_raw ast ttm joydev intel_rapl iosf_mbi drm_kms_helper x86_pkg_temp_thermal intel_powerclamp drm syscopyarea sysfillrect sysimgblt coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul aesni_intel ipmi_ssif aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd lpc_ich mei_me sb_edac edac_core mei ipmi_si 8250_fintek ipmi_msghandler lp wmi acpi_pad parport ioatdma mac_hid shpchp nf_conntrack_ftp acpi_power_meter nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc tcp_htcp hid_generic i40e(OE) igb vxlan ip6_udp_tunnel i2c_algo_bit udp_tunnel usbhid dca uas configfs ahci ptp usb_storage hid megaraid_sas libahci pps_core

Jan 31 02:33:50 hellnat kernel: [220521.908396] CPU: 13 PID: 536 Comm: kworker/13:1 Tainted: G W OE 3.19.0-47-generic #53~14.04.1-Ubuntu

Jan 31 02:33:50 hellnat kernel: [220521.908399] Hardware name: Supermicro SYS-6018R-WTR/X10DRW-i, BIOS 1.1 08/13/2015

Jan 31 02:33:50 hellnat kernel: [220521.908408] Workqueue: events inet_frag_worker

The main lines, I think, are:

Jan 31 02:33:33 hellnat kernel: [220504.793705] NETDEV WATCHDOG: p1p1 (i40e): transmit queue 8 timed out

Jan 31 02:33:33 hellnat kernel: [220504.793973] i40e 0000:81:00.0 p1p1: tx_timeout: VSI_seid: 399, Q 8, NTC: 0xd36, HWB: 0xa1, NTU: 0xa1, TAIL: 0xa1, INT: 0x0

Jan 31 02:33:33 hellnat kernel: [220504.793976] i40e 0000:81:00.0 p1p1: tx_timeout recovery level 1, hung_queue 8
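
For anyone comparing several of these events, here is a minimal Python sketch that pulls the numeric fields out of the tx_timeout line. The field meanings in the comments (NTC = next_to_clean, HWB = head write-back, NTU = next_to_use, TAIL = tail register) are my reading of the i40e driver and should be treated as an assumption; the function name parse_tx_timeout and the derived flags are made up for illustration.

```python
import re

# Matches the i40e "tx_timeout:" line in the form shown in the log above.
TX_TIMEOUT_RE = re.compile(
    r"(?P<iface>\S+): tx_timeout: VSI_seid: (?P<vsi_seid>\d+), Q (?P<queue>\d+), "
    r"NTC: (?P<ntc>0x[0-9a-fA-F]+), HWB: (?P<hwb>0x[0-9a-fA-F]+), "
    r"NTU: (?P<ntu>0x[0-9a-fA-F]+), TAIL: (?P<tail>0x[0-9a-fA-F]+), "
    r"INT: (?P<intr>0x[0-9a-fA-F]+)"
)

HEX_FIELDS = ("ntc", "hwb", "ntu", "tail", "intr")


def parse_tx_timeout(line):
    """Return the tx_timeout fields as a dict, or None if the line does not match."""
    match = TX_TIMEOUT_RE.search(line)
    if match is None:
        return None
    raw = match.groupdict()
    fields = {
        "iface": raw["iface"],
        "vsi_seid": int(raw["vsi_seid"]),
        "queue": int(raw["queue"]),
    }
    for name in HEX_FIELDS:
        fields[name] = int(raw[name], 16)
    # Assumed interpretation (not confirmed in this thread):
    # the hardware has consumed everything the driver posted ...
    fields["hw_caught_up"] = fields["hwb"] == fields["ntu"] == fields["tail"]
    # ... but the driver has not yet cleaned up to the hardware position.
    fields["clean_pending"] = fields["ntc"] != fields["hwb"]
    return fields
```

Fed the tx_timeout line above, this reports queue 8 with HWB, NTU and TAIL all equal to 0xa1 while NTC is 0xd36. If the assumed field meanings are right, that would point at the cleanup side not running on the host rather than at the card failing to transmit, but that is only a guess.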

We can see that TX queue 8 hung. Why can this happen? I think it is a problem with the network adapter or the driver. Can you explain it and tell me how to fix it? It is a big problem when it happens because all ...

0 Kudos
4 Replies
Valued Contributor I
121 Views

Hi Evgeny,

Thank you for contacting Intel Customer Support.

How often do you encounter the timeout? What do you do to reconnect?

Please provide the adapter's details:

1. Specific Card Model:

2. Serial number (if available)

3. Modules installed

Sincerely,

Sandy

0 Kudos
Beginner
121 Views

Hi Sandy,

That was the first time; yesterday was the second. Each time I rebooted the server via IPMI.

1.

Ethernet controller: Intel Corporation Ethernet 10G 2P X710 Adapter (rev 01)

Subsystem: Intel Corporation Ethernet Converged Network Adapter X710-4

2. Device Serial Number f0-24-30-ff-ff-ca-05-68

3. Moduletech SFP+ modules.

After yesterday's hang I looked through the syslog messages and found lines mentioning "events inet_frag_worker". I then found that newer kernels (4.2) include patches that fix some bugs in inet_fragment.c. Today I updated to the 4.2 kernel, and I will watch this machine for some days. If there are no problems for a week, I will conclude that the problem was in the kernel. So I'd like to ask for some time before this topic is closed.
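
In case it helps anyone checking a fleet of routers for the same exposure, here is a small Python sketch that compares a kernel release string against 4.2. The 4.2 threshold is only the version I moved to, not a confirmed fix boundary, and the function names are made up for illustration.

```python
import re


def kernel_version(release):
    """Return (major, minor) from a release string such as '3.19.0-47-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def is_at_least(release, minimum=(4, 2)):
    """True if the release is at or above the given (major, minor) version."""
    # Tuples compare element by element, so (3, 19) < (4, 2) as intended.
    return kernel_version(release) >= minimum
```

On a live machine the release string can be taken from platform.release() in the standard library, for example is_at_least(platform.release()).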

Regards,

Evgeny

0 Kudos
Valued Contributor I
121 Views

Hi Evgeny,

Thank you for your information.

The timeouts may be related to the SFP+ modules. We recommend using only supported modules. Please refer to the link below for more information:

http://www.intel.com/content/www/us/en/support/network-and-i-o/ethernet-products/000007045.html X710 Series—Compatible SFP+ Modules, SFP Modules, and Cables for...

Hope this is helpful.

Sincerely,

Sandy

0 Kudos
Beginner
121 Views

Hi Sandy,

This is not caused by the SFP+ modules. I understand that you advise using Intel SFP+ modules, but other vendors make compatible SFP+ modules too.

As for my problem, it was in the Linux 3.19 kernel, in the code that handles fragmented packets. Kernel 4.2 includes some changes in this area, and the machine has now been running stably for more than a week.

Thank you for your help.

Sincerely,

Evgeny.

0 Kudos