SH1
Novice
2,699 Views

Will X710 firmware update 4.53 to 5.05 address sporadic transmit queue timeout?

We have seen the "tx_timeout" / "hung_queue" error three times across two servers. Each time, packets stopped flowing for a number of seconds and then recovered:

Apr 10 02:04:14 node39 kernel: WARNING: at net/sched/sch_generic.c:297 dev_watchdog+0x276/0x280()

Apr 10 02:04:14 node39 kernel: NETDEV WATCHDOG: p2p1 (i40e): transmit queue 8 timed out

...

Apr 10 02:04:14 node39 kernel: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE ------------ 3.10.0-514.6.1.el7.x86_64 #1

Apr 10 02:04:14 node39 kernel: Hardware name: Dell Inc. PowerEdge R620/01W23F, BIOS 2.1.3 11/20/2013

...

Apr 10 02:04:14 node39 kernel: i40e 0000:42:00.0 p2p1: tx_timeout: VSI_seid: 390, Q 8, NTC: 0x113, HWB: 0x116, NTU: 0x116, TAIL: 0x116, INT: 0x1

Apr 10 02:04:14 node39 kernel: i40e 0000:42:00.0 p2p1: tx_timeout recovery level 1, hung_queue 8

Apr 10 02:04:14 node39 kernel: i40e 0000:42:00.0 p2p1: adding 3c:fd:fe:9f:b7:48 vid=0
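
When collecting these events from several nodes, it can help to pull the fields out of the "tx_timeout:" line programmatically. The following is a minimal sketch, assuming the line format is exactly as in the excerpt above; the function name `parse_tx_timeout` and the dictionary keys are illustrative, not part of any driver tooling. NTC and NTU appear to correspond to the driver's next-to-clean and next-to-use ring indices, so their difference gives a rough count of descriptors still pending when the watchdog fired.

```python
import re

# Matches the i40e "tx_timeout:" diagnostic line as it appears in the log excerpt.
# The field layout is assumed from that excerpt and may differ in other driver versions.
TX_TIMEOUT_RE = re.compile(
    r"i40e (?P<pci>\S+) (?P<iface>\S+): tx_timeout: "
    r"VSI_seid: (?P<vsi_seid>\d+), Q (?P<queue>\d+), "
    r"NTC: (?P<ntc>0x[0-9a-fA-F]+), HWB: (?P<hwb>0x[0-9a-fA-F]+), "
    r"NTU: (?P<ntu>0x[0-9a-fA-F]+), TAIL: (?P<tail>0x[0-9a-fA-F]+), "
    r"INT: (?P<intr>0x[0-9a-fA-F]+)"
)


def parse_tx_timeout(line):
    """Return the fields of a tx_timeout line as a dict, or None if it does not match."""
    match = TX_TIMEOUT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    result = {
        "pci": fields["pci"],
        "iface": fields["iface"],
        "vsi_seid": int(fields["vsi_seid"]),
        "queue": int(fields["queue"]),
    }
    # The remaining fields are printed in hexadecimal.
    for key in ("ntc", "hwb", "ntu", "tail", "intr"):
        result[key] = int(fields[key], 16)
    return result
```

Feeding each line of /var/log/messages through `parse_tx_timeout` and keeping the non-None results would give one record per hang, which makes it easier to see whether the same queue or interface is always involved.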

This is within the first 3 weeks of using Intel X710 duo adapters running firmware 4.53 (with supported Intel SFP+), recently installed in a cluster of two-year-old Dell R620s running CentOS 7.3:

node39:/# lspci -vv | grep -A 1 10GbE

pcilib: sysfs_read_vpd: read failed: Input/output error

05:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)

Subsystem: Intel Corporation Ethernet Converged Network Adapter X710-2

--

05:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)

Subsystem: Intel Corporation Ethernet Converged Network Adapter X710

--

42:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)

Subsystem: Intel Corporation Ethernet Converged Network Adapter X710-2

--

42:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)

Subsystem: Intel Corporation Ethernet Converged Network Adapter X710

node39:/usr/local/bin# ethtool -i p2p1

driver: i40e

version: 1.5.10-k

firmware-version: 4.53 0x8000206e 0.0.0

expansion-rom-version:

bus-info: 0000:42:00.0

supports-statistics: yes

supports-test: yes

supports-eeprom-access: yes

supports-register-dump: yes

supports-priv-flags: yes

We have used X710s without issue in a few other servers, but in those cases they are HP OEM, and running firmware 4.60:

node93:/# lspci -vv |grep -A 1 10GbE

04:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)

Subsystem: Hewlett-Packard Company HP Ethernet 10Gb 2-port 562FLR-SFP+ Adapter

--

04:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)

Subsystem: Hewlett-Packard Company Ethernet 10Gb 562SFP+ Adapter

--

05:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)

Subsystem: Hewlett-Packard Company HP Ethernet 10Gb 2-port 562SFP+ Adapter

--

05:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)

Subsystem: Hewlett-Packard Company Ethernet 10Gb 562SFP+ Adapter

node93:/# ethtool -i ens2f0

driver: i40e

version: 1.5.10-k

firmware-version: 4.60 0x80001f47 1.3072.0

expansion-rom-version:

bus-info: 0000:05:00.0

supports-statistics: yes

supports-test: yes

supports-eeprom-access: yes

supports-register-dump: yes

supports-priv-flags: yes
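
To compare driver and firmware versions across many nodes without reading each listing by eye, the `ethtool -i` output can be turned into a dictionary. This is a minimal sketch, assuming the "key: value" layout shown above; the helper names `parse_ethtool_info` and `firmware_version` are hypothetical.

```python
def parse_ethtool_info(text):
    """Parse the 'key: value' lines printed by `ethtool -i` into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as the PCI
        # address "0000:42:00.0" keep their own colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def firmware_version(info):
    """Return the leading version token of 'firmware-version', e.g. '4.53', or None."""
    firmware = info.get("firmware-version", "")
    return firmware.split()[0] if firmware else None
```

Running this over the output captured from each node would yield pairs such as interface and firmware version that can be tabulated against the list of machines where the hang occurred.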

I have downloaded nvmupdate64e and updated a spare Dell to firmware 5.05, so if this is the correct solution, I have confirmed the procedure. However, threads such as this one give me pause: a crash and reboot would certainly be worse than a 10-20 second transmit hang.

My questions are:

  1. Has anyone else experienced these tx_timeout / hung_queue issues?
  2. Is it a known issue? If so, is it an issue with the firmware, with the i40e driver, or with something else such as TSO/GSO (which are currently on, but I could turn them off)?
  3. If it is an issue with the firmware, has it been corrected between versions 4.53 and 4.60, and is it recommended to flash production machines to 5.05 or to some other version? I could not find a detailed change list.
  4. Is there a way (such as generating high data rates using iperf) to make the sporadic issue occur reproducibly, so that I can demonstrate whether any attempted solution has been successful?

Thanks in advance!

1 Solution
SH1
Novice
523 Views

The answer to the question posed in the title is NO.

We have now experienced this issue with both Firmware 4.53 and Firmware 5.05.

I've opened a direct support request and will mark this question as 'answered'.

11 Replies
SH1
Novice
523 Views

Replying to myself regarding question 4: I was unable to cause the issue to occur simply by running high data rates (1 to 9 Gbps) to/from multiple X710 interfaces using "nuttcp", on any of the firmware versions, 4.53, 4.60, or 5.05. The four times it has occurred in production were under moderate data rates (less than 100 Mbps) but highly variable traffic patterns that are hard to simulate.

We have had to reconfigure our production applications to avoid using these NICs for now.

Any suggestions would be quite welcome.
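
One way to approximate "moderate but highly variable" traffic, as opposed to a steady high-rate stream, is to send bursts of random size separated by random idle gaps. The sketch below is only an illustration of that idea: the function names, burst sizes, and gap lengths are arbitrary assumptions, and there is no evidence in this thread that such a pattern triggers the tx_timeout.

```python
import random
import socket
import time


def burst_schedule(n_bursts, seed=0, max_bytes=64 * 1024, max_idle_s=0.5):
    """Build a reproducible list of (payload_bytes, idle_seconds) pairs."""
    rng = random.Random(seed)  # seeded so a run can be repeated exactly
    return [
        (rng.randint(1, max_bytes), rng.uniform(0.0, max_idle_s))
        for _ in range(n_bursts)
    ]


def send_bursts(host, port, schedule):
    """Send each burst over one TCP connection, sleeping between bursts."""
    with socket.create_connection((host, port)) as sock:
        for size, idle in schedule:
            sock.sendall(b"\0" * size)
            time.sleep(idle)
```

Calling `send_bursts(host, port, burst_schedule(1000, seed=42))` against any TCP sink listening on the far side of the X710 link would replay the same pattern on each firmware version, so that results are at least comparable between runs.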

idata
Community Manager
523 Views

Hi seh4nc,

Thank you for posting in the Wired Ethernet Community.

We're still checking your questions and will update the thread as soon as possible.

regards,

Vince
idata
Community Manager
523 Views

Hi seh4nc,

While we are still checking your issue internally, I'd like to clarify the following:

You mentioned "The four times it has occurred in production" in your 2nd post. Does that include the HP OEM X710 with FW 4.60 and the retail version of the X710 with FW 5.05?

regards,

Vince
SH1
Novice
523 Views

Hi Vince, many thanks for continuing to investigate this.

We have not observed the issue at all on the HP OEM X710 with FW 4.60.

We have only observed the issue on the Intel-branded X710 with FW 4.53 that are installed in Dells.

In each of the five Dells there are two X710 duos, one in a PCIe slot on NUMA Node 0 and the other in a PCIe slot on NUMA Node 1. The issues have only occurred on NUMA Node 1. However, this could be due to the traffic patterns: NUMA Node 0 sees less traffic, while NUMA Node 1 sees continuous TCP message bus traffic.

FYI after adjusting some application-level timeouts to mitigate, we have re-enabled this message bus traffic on the troublesome NICs and we observed another hang, so now there have been 5 hangs in total on four different servers.

SH1
Novice
523 Views

Vince: to clarify, we have not deployed firmware 5.05 on any NICs involved in production traffic, just on two non-production machines (also Dell, same Intel-branded X710 that had FW 4.53 originally) so that we could perform some "burn-in" testing. No issues were observed with 5.05, but the same testing did not trigger the tx_timeout on 4.53 either. I was therefore reluctant to upgrade any other machines to 5.05 without an official recommendation from Intel, or at least some change logs that I could review (see my original Question #3).

idata
Community Manager
523 Views

Hi seh4nc, thanks for providing additional information. FW 5.05 addresses the security issue described at the link below. I'm still checking internally for the change logs from 4.53 to 5.05.

https://security-center.intel.com/advisory.aspx?intelid=INTEL-SA-00063&languageid=en-fr

regards,

Vince
idata
Community Manager
523 Views

Hi seh4nc, for the issues fixed by version 5.x, please refer to pages 13 and 14 of our spec update document at the link below.

http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xl710-10-40-contro...

regards,

Vince
idata
Community Manager
523 Views

Hi seh4nc, we'd like to check if you still need assistance regarding X710-DA2.

regards,

Vince
SH1
Novice
523 Views

Hi Vince, thanks very much for the link to the Spec Update document. It was very thorough and helpful, even though none of the issues it specifically addresses seems to match my situation.

I do not consider the issue to be resolved. I will be putting our production application back onto the X710 interfaces, some running firmware 4.53 and others running 5.05, and I will report back in the coming weeks if any tx_timeout issues recur, hopefully with some packet captures before/during any event.

idata
Community Manager
523 Views

Hi seh4nc, thanks for the feedback. Kindly post an update in this thread once available.

regards,

Vince