Krystian
Beginner
307 Views

Intel XL710 - restart ports under high load

Hello,

We have a problem with this card.
Every time network usage is high, the ports restart, as shown in the system log:

Mar 10 00:32:00 c6o001 kernel: [117451.620285] ------------[ cut here ]------------
Mar 10 00:32:00 c6o001 kernel: [117451.620289] NETDEV WATCHDOG: ens1f1 (i40e): transmit queue 12 timed out
Mar 10 00:32:00 c6o001 kernel: [117451.620309] WARNING: CPU: 32 PID: 206 at net/sched/sch_generic.c:448 dev_watchdog+0x264/0x270
Mar 10 00:32:00 c6o001 kernel: [117451.620310] Modules linked in: ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables sctp iptable_filter bpfilter softdog bonding nfnetlink_log nfnetlink ipmi_ssif intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper rapl intel_cstate pcspkr ast drm_vram_helper ttm drm_kms_helper joydev input_leds drm i2c_algo_bit mxm_wmi fb_sys_fops syscopyarea sysfillrect sysimgblt ioatdma mei_me mei dca ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mac_hid zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost tap sunrpc ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0
Mar 10 00:32:00 c6o001 kernel: [117451.620349]  multipath linear hid_generic usbkbd usbmouse usbhid hid ses enclosure i40e(O) mpt3sas raid_class scsi_transport_sas xhci_pci ahci lpc_ich ehci_pci xhci_hcd ehci_hcd libahci i2c_i801 wmi
Mar 10 00:32:00 c6o001 kernel: [117451.620360] CPU: 32 PID: 206 Comm: ksoftirqd/32 Tainted: P           O      5.4.101-1-pve #1
Mar 10 00:32:00 c6o001 kernel: [117451.620361] Hardware name: Supermicro SSG-6048R-E1CR90L/X10DSC-TP4S, BIOS 3.2 11/21/2019
Mar 10 00:32:00 c6o001 kernel: [117451.620363] RIP: 0010:dev_watchdog+0x264/0x270
Mar 10 00:32:00 c6o001 kernel: [117451.620365] Code: 48 85 c0 75 e6 eb a0 4c 89 ef c6 05 4f d8 ef 00 01 e8 e0 b6 fa ff 89 d9 4c 89 ee 48 c7 c7 90 55 43 be 48 89 c2 e8 a6 51 15 00 <0f> 0b eb 82 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
Mar 10 00:32:00 c6o001 kernel: [117451.620366] RSP: 0018:ffffb47a9919bd60 EFLAGS: 00010286
Mar 10 00:32:00 c6o001 kernel: [117451.620367] RAX: 0000000000000000 RBX: 000000000000000c RCX: 0000000000000006
Mar 10 00:32:00 c6o001 kernel: [117451.620368] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff9ba63fb178c0
Mar 10 00:32:00 c6o001 kernel: [117451.620368] RBP: ffffb47a9919bd90 R08: 0000000000000b8c R09: 0000000000000004
Mar 10 00:32:00 c6o001 kernel: [117451.620369] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000040
Mar 10 00:32:00 c6o001 kernel: [117451.620370] R13: ffff9ba629d09000 R14: ffff9ba629d09480 R15: ffff9ba62093cf40
Mar 10 00:32:00 c6o001 kernel: [117451.620371] FS:  0000000000000000(0000) GS:ffff9ba63fb00000(0000) knlGS:0000000000000000
Mar 10 00:32:00 c6o001 kernel: [117451.620371] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 10 00:32:00 c6o001 kernel: [117451.620372] CR2: 000055bb1cf59000 CR3: 000000488d80a001 CR4: 00000000003606e0
Mar 10 00:32:00 c6o001 kernel: [117451.620373] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 10 00:32:00 c6o001 kernel: [117451.620374] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 10 00:32:00 c6o001 kernel: [117451.620374] Call Trace:
Mar 10 00:32:00 c6o001 kernel: [117451.620378]  ? pfifo_fast_enqueue+0x160/0x160
Mar 10 00:32:00 c6o001 kernel: [117451.620383]  call_timer_fn+0x32/0x130
Mar 10 00:32:00 c6o001 kernel: [117451.620385]  run_timer_softirq+0x1a5/0x430
Mar 10 00:32:00 c6o001 kernel: [117451.620388]  ? __switch_to_asm+0x34/0x70
Mar 10 00:32:00 c6o001 kernel: [117451.620390]  ? __switch_to_asm+0x40/0x70
Mar 10 00:32:00 c6o001 kernel: [117451.620392]  ? __switch_to_asm+0x34/0x70
Mar 10 00:32:00 c6o001 kernel: [117451.620393]  ? __switch_to_asm+0x40/0x70
Mar 10 00:32:00 c6o001 kernel: [117451.620395]  ? __switch_to_asm+0x34/0x70
Mar 10 00:32:00 c6o001 kernel: [117451.620398]  ? __switch_to+0x85/0x480
Mar 10 00:32:00 c6o001 kernel: [117451.620399]  ? __switch_to_asm+0x40/0x70
Mar 10 00:32:00 c6o001 kernel: [117451.620401]  __do_softirq+0xdc/0x2d4
Mar 10 00:32:00 c6o001 kernel: [117451.620405]  ? tasklet_init+0x30/0x30
Mar 10 00:32:00 c6o001 kernel: [117451.620407]  run_ksoftirqd+0x2b/0x40
Mar 10 00:32:00 c6o001 kernel: [117451.620409]  smpboot_thread_fn+0xd0/0x170
Mar 10 00:32:00 c6o001 kernel: [117451.620412]  kthread+0x120/0x140
Mar 10 00:32:00 c6o001 kernel: [117451.620413]  ? sort_range+0x30/0x30
Mar 10 00:32:00 c6o001 kernel: [117451.620414]  ? kthread_park+0x90/0x90
Mar 10 00:32:00 c6o001 kernel: [117451.620416]  ret_from_fork+0x35/0x40
Mar 10 00:32:00 c6o001 kernel: [117451.620418] ---[ end trace 42d3f30e8736292a ]---

 

Mar 10 00:32:00 c6o001 kernel: [117451.620423] i40e 0000:01:00.1 ens1f1: tx_timeout: VSI_seid: 396, Q 12, NTC: 0x339, HWB: 0x6bc, NTU: 0x6bc, TAIL: 0x6bc, INT: 0x0
Mar 10 00:32:00 c6o001 kernel: [117451.620425] i40e 0000:01:00.1 ens1f1: tx_timeout recovery level 1, hung_queue 12
Mar 10 00:32:00 c6o001 kernel: [117451.672381] bond0: (slave ens1f1): link status definitely down, disabling slave
Mar 10 00:32:00 c6o001 kernel: [117451.876274] i40e 0000:81:00.3 ens6f3: tx_timeout: VSI_seid: 399, Q 12, NTC: 0xe4, HWB: 0x2d6, NTU: 0x2d6, TAIL: 0x2d6, INT: 0x0
Mar 10 00:32:00 c6o001 kernel: [117451.876279] i40e 0000:81:00.3 ens6f3: tx_timeout recovery level 1, hung_queue 12
Mar 10 00:32:02 c6o001 kernel: [117453.672256] i40e 0000:81:00.2 ens6f2: tx_timeout: VSI_seid: 398, Q 12, NTC: 0xbe5, HWB: 0xca4, NTU: 0xca4, TAIL: 0xca4, INT: 0x0
Mar 10 00:32:02 c6o001 kernel: [117453.672261] i40e 0000:81:00.2 ens6f2: tx_timeout recovery level 1, hung_queue 12
Mar 10 00:32:04 c6o001 kernel: [117455.720217] i40e 0000:01:00.0 ens1f0: tx_timeout: VSI_seid: 397, Q 12, NTC: 0xf13, HWB: 0xeb, NTU: 0xeb, TAIL: 0xeb, INT: 0x0
Mar 10 00:32:04 c6o001 kernel: [117455.720220] i40e 0000:01:00.0 ens1f0: tx_timeout recovery level 1, hung_queue 12
Mar 10 00:32:04 c6o001 kernel: [117455.792334] bond0: (slave ens1f0): link status definitely down, disabling slave
Mar 10 00:32:15 c6o001 kernel: [117466.560143] bond1: (slave ens6f2): link status definitely down, disabling slave
Mar 10 00:32:15 c6o001 kernel: [117466.560151] bond1: now running without any active interface!
Mar 10 00:32:15 c6o001 kernel: [117466.560192] bond1: (slave ens6f3): link status definitely down, disabling slave
Mar 10 00:32:16 c6o001 kernel: [117467.006770] i40e 0000:01:00.1: VF BW shares not restored
Mar 10 00:32:16 c6o001 kernel: [117467.294913] i40e 0000:81:00.3: VF BW shares not restored
Mar 10 00:32:16 c6o001 kernel: [117467.517220] i40e 0000:81:00.2: VF BW shares not restored
Mar 10 00:32:16 c6o001 kernel: [117467.852519] i40e 0000:01:00.0: VF BW shares not restored
Mar 10 00:32:16 c6o001 kernel: [117467.860206] bond0: (slave ens1f0): link status definitely up, 10000 Mbps full duplex
Mar 10 00:32:16 c6o001 kernel: [117467.860219] bond0: active interface up!
Mar 10 00:32:16 c6o001 kernel: [117467.860273] bond0: (slave ens1f1): link status definitely up, 10000 Mbps full duplex
Mar 10 00:32:16 c6o001 kernel: [117467.868160] bond1: (slave ens6f2): link status definitely up, 10000 Mbps full duplex
Mar 10 00:32:16 c6o001 kernel: [117467.868170] bond1: active interface up!
Mar 10 00:32:16 c6o001 kernel: [117467.868222] bond1: (slave ens6f3): link status definitely up, 10000 Mbps full duplex
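For anyone triaging a log like the one above, here is a minimal Python sketch that pulls the fields out of the i40e `tx_timeout:` lines and lists which queue hung on which interface. The regex is written against the line format seen in this log only, so treat it as an assumption that other driver versions print the same layout; the function and variable names are made up for illustration.

```python
import re
from collections import defaultdict

# Matches the i40e "tx_timeout:" lines exactly as they appear in the log above.
# Assumption: other i40e versions may format this line differently.
TX_TIMEOUT_RE = re.compile(
    r"i40e (?P<pci>[0-9a-f:.]+) (?P<iface>\S+): tx_timeout: "
    r"VSI_seid: (?P<vsi>\d+), Q (?P<queue>\d+), "
    r"NTC: (?P<ntc>0x[0-9a-f]+), HWB: (?P<hwb>0x[0-9a-f]+), "
    r"NTU: (?P<ntu>0x[0-9a-f]+), TAIL: (?P<tail>0x[0-9a-f]+), "
    r"INT: (?P<intr>0x[0-9a-f]+)"
)


def parse_tx_timeouts(lines):
    """Return one dict per tx_timeout line; other lines are ignored."""
    events = []
    for line in lines:
        m = TX_TIMEOUT_RE.search(line)
        if not m:
            continue
        events.append({
            "pci": m.group("pci"),
            "iface": m.group("iface"),
            "queue": int(m.group("queue")),
            # Ring indices are printed in hex, so convert with base 16.
            "ntc": int(m.group("ntc"), 16),
            "hwb": int(m.group("hwb"), 16),
            "ntu": int(m.group("ntu"), 16),
            "tail": int(m.group("tail"), 16),
        })
    return events


# Two lines copied from the log in this post, plus one non-matching line.
SAMPLE = [
    "Mar 10 00:32:00 c6o001 kernel: [117451.620423] i40e 0000:01:00.1 ens1f1: "
    "tx_timeout: VSI_seid: 396, Q 12, NTC: 0x339, HWB: 0x6bc, NTU: 0x6bc, "
    "TAIL: 0x6bc, INT: 0x0",
    "Mar 10 00:32:00 c6o001 kernel: [117451.620425] i40e 0000:01:00.1 ens1f1: "
    "tx_timeout recovery level 1, hung_queue 12",
    "Mar 10 00:32:00 c6o001 kernel: [117451.876274] i40e 0000:81:00.3 ens6f3: "
    "tx_timeout: VSI_seid: 399, Q 12, NTC: 0xe4, HWB: 0x2d6, NTU: 0x2d6, "
    "TAIL: 0x2d6, INT: 0x0",
]

events = parse_tx_timeouts(SAMPLE)

# Group hung queues by interface.
hung = defaultdict(list)
for ev in events:
    hung[ev["iface"]].append(ev["queue"])
hung = dict(hung)
print(hung)
```

In a real run you would feed it the lines of the kernel log instead of `SAMPLE`. In this log all four ports report queue 12, which is the kind of pattern a summary like this makes easy to spot.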
Distributor ID:	Debian
Description:	Debian GNU/Linux 10 (buster)
Release:	10
Codename:	buster

Linux 5.4.101-1-pve #1 SMP PVE 5.4.101-1 (Fri, 26 Feb 2021 13:13:09 +0100) x86_64 GNU/Linux

 

Any suggestions on how to get rid of this problem?

9 Replies
AlfredoS_Intel
Moderator
293 Views

Hi Krystian, 

Thank you for posting on our Intel® Ethernet Communities Page.

We are sorry to hear about the issue that you are experiencing with your network adapter. 

Please allow us to ask the following questions to get more context on the issue:

1. May we know the impact on the system when those port restarts happen?

2. May we know the Operating system that you are using?

3. To better assist you, we need to get some logs from your system. It will tell us the different driver versions and components installed on your system. We would like to ask your help to provide System Support Utility Logs (SSU) from your system. You may download the software from this page, https://downloadcenter.intel.com/product/91600/Intel-System-Support-Utility. Please download the software which applies to the Operating system that you are running on your system. Once you have downloaded it, kindly run it, and you will have the option to save the logs to a text file. Please attach the text file to your reply to this email.

We look forward to hearing from you. If we do not get your reply, we will follow up after three business days.



Best Regards,

Alfred S

Intel® Customer Support



Krystian
Beginner
280 Views

Hi,

 

1. May we know the impact on the system when those port restarts happen?

 

It is like disconnecting the cables from the network card and reconnecting them. The server loses network access for that second.
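To put a number on the outage, the bonding messages in the log from the first post can be used. Below is a rough Python sketch that measures the time between "now running without any active interface!" and the next "active interface up!" for the same bond, using the kernel timestamps in square brackets. It only covers bonds that logged both messages, and the helper name is made up for illustration.

```python
import re

# Kernel timestamp in brackets, then the bond name and its message.
BOND_RE = re.compile(r"\[\s*(?P<ts>\d+\.\d+)\] (?P<bond>bond\d+): (?P<msg>.*)$")


def bond_outages(lines):
    """Return (bond, seconds) for each window with no active interface."""
    down_since = {}
    outages = []
    for line in lines:
        m = BOND_RE.search(line)
        if not m:
            continue
        ts = float(m.group("ts"))
        bond = m.group("bond")
        msg = m.group("msg")
        if "now running without any active interface" in msg:
            down_since[bond] = ts
        elif "active interface up" in msg and bond in down_since:
            outages.append((bond, round(ts - down_since.pop(bond), 3)))
    return outages


# Lines copied from the log in the first post.
SAMPLE = [
    "Mar 10 00:32:15 c6o001 kernel: [117466.560143] bond1: (slave ens6f2): "
    "link status definitely down, disabling slave",
    "Mar 10 00:32:15 c6o001 kernel: [117466.560151] bond1: now running "
    "without any active interface!",
    "Mar 10 00:32:16 c6o001 kernel: [117467.868160] bond1: (slave ens6f2): "
    "link status definitely up, 10000 Mbps full duplex",
    "Mar 10 00:32:16 c6o001 kernel: [117467.868170] bond1: active interface up!",
]

outages = bond_outages(SAMPLE)
print(outages)
```

For the bond1 lines above this gives roughly 1.3 seconds, which matches the "about a second" impression.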

 

 

2. May we know the Operating system that you are using?

 

I included all of this at the end of my first post.

 

 

3. To better assist you, we need to get some logs from your system. It will tell us the different driver versions and components installed on your system. We would like to ask your help to provide System Support Utility Logs (SSU) from your system. 

 

Unfortunately, I can't use this tool, because when I run it, it shows me this information:

- This product is not supported on this operating system.
- The wodim package is recommended to retrieve wodim details. (I do not agree)
- The x11-xserver-utils package is recommended to retrieve xrandr details. (I do not agree)

 

Please tell me exactly what you need. I can't just install packages on a production server.
I gave the installed driver version in my first post, along with the full system log from the time of the problem.

 

 

AlfredoS_Intel
Moderator
229 Views

Hi Krystian,

Thank you for the quick response.

If you are unable to run the software then that is fine, kindly provide the following information that will be helpful in our investigation of the issue:

1. Are you able to run the command ethtool -i ethx, where ethx is the Ethernet port?

2. If it is okay with you, we need to determine if the card that you are using is an OEM distributed card or a retail card, as our approach for the issue will be different based on the type of card that you are using. Kindly provide us a picture of the markings on the card, particularly the PBA number. You may refer to this page, https://www.intel.com/content/www/us/en/support/articles/000007022/network-and-i-o/ethernet-products..., to locate the picture needed.

We look forward to hearing from you. If we do not get your reply, we will follow up after 3 business days.

Best Regards,

Alfred S

Intel® Customer Support


Krystian
Beginner
224 Views

Yes, sure.

root ~# ethtool -i ens1f0 ; ethtool -i ens1f1 ; ethtool -i ens6f2 ; ethtool -i ens6f3

driver: i40e
version: 2.14.13
firmware-version: 5.02 0x80002370 1.1373.0
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

driver: i40e
version: 2.14.13
firmware-version: 5.02 0x80002370 1.1373.0
expansion-rom-version:
bus-info: 0000:01:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

driver: i40e
version: 2.14.13
firmware-version: 7.00 0x80004ea4 1.2228.0
expansion-rom-version:
bus-info: 0000:81:00.2
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

driver: i40e
version: 2.14.13
firmware-version: 7.00 0x80004ea4 1.2228.0
expansion-rom-version:
bus-info: 0000:81:00.3
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
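Since the output above is four `ethtool -i` results back to back, a small Python sketch like the following can split it per port and group ports by firmware version. It assumes each port's block starts with a `driver:` line, as it does here; the function names are made up for illustration and the sample is trimmed to the relevant fields.

```python
from collections import defaultdict


def parse_ethtool_info(text):
    """Split concatenated `ethtool -i` output into one dict per port."""
    ports = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so PCI addresses stay intact.
        key, _, value = line.partition(":")
        key = key.strip()
        # Assumption: every port's block begins with a "driver:" line.
        if key == "driver" and current:
            ports.append(current)
            current = {}
        current[key] = value.strip()
    if current:
        ports.append(current)
    return ports


def group_by_firmware(ports):
    """Map firmware-version string to the list of bus-info addresses."""
    groups = defaultdict(list)
    for port in ports:
        groups[port.get("firmware-version", "")].append(port.get("bus-info", ""))
    return dict(groups)


# Trimmed copy of the output posted above.
SAMPLE = """
driver: i40e
version: 2.14.13
firmware-version: 5.02 0x80002370 1.1373.0
bus-info: 0000:01:00.0

driver: i40e
version: 2.14.13
firmware-version: 5.02 0x80002370 1.1373.0
bus-info: 0000:01:00.1

driver: i40e
version: 2.14.13
firmware-version: 7.00 0x80004ea4 1.2228.0
bus-info: 0000:81:00.2

driver: i40e
version: 2.14.13
firmware-version: 7.00 0x80004ea4 1.2228.0
bus-info: 0000:81:00.3
"""

ports = parse_ethtool_info(SAMPLE)
groups = group_by_firmware(ports)
for firmware, bus_ids in groups.items():
    print(firmware, "->", ", ".join(bus_ids))
```

This makes it visible at a glance that the two cards run different firmware (5.02 on 0000:01:00.x and 7.00 on 0000:81:00.x) with the same 2.14.13 driver.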

 
About the second question:
Unfortunately, the server is mounted in a data center and no one there has anything to take a picture of the card with.
One thing I can say, which may help, is that the servers are made by Supermicro:
https://www.supermicro.com/en/products/system/4U/6048/SSG-6048R-E1CR90L.cfm

Each server has two SIOM cards with 4 x SFP+.
The S/N of one of those cards is: HA18AS008236

AlfredoS_Intel
Moderator
216 Views

Hi Krystian,

Thank you for your response.

We appreciate the time that you took out of your busy schedule to provide us these details.

We made initial checks on the information that you have sent, and we would like to clarify:

Are the cards that you are reporting built into the Supermicro board, or are they discrete?

We look forward to your reply. Should we not get your reply, we will follow up after three business days.


Best Regards,

Alfred S

Intel® Customer Support


Krystian
Beginner
213 Views

These are two separate, exchangeable cards.
They look like this: https://www.supermicro.com/en/products/accessories/addon/AOC-MTG-i4S.php

AlfredoS_Intel
Moderator
195 Views

Hi

Thank you for sharing that information with us. 

To give you a background on why we asked that question:

We noticed that the firmware on the card was something that we didn't distribute, which led us to suspect that the card was third-party assembled, and the information you shared confirmed it.

We are looking at the firmware because our usual recommendation for these kinds of issues is to update the card's firmware. We have our own set of downloadable firmware; however, it doesn't work on third-party cards.

For this reason, we recommend that you check with Supermicro whether they have updated firmware for the card.

We will reach out to you after 



Best Regards,

Alfred S

Intel® Customer Support




AlfredoS_Intel
Moderator
173 Views

Hi Krystian,

We are just following up.

It looks like you need more time to check with the card's manufacturer since we have not heard from you. 

We will follow up again after 3 business days. Should we not hear from you, our system may automatically close the thread.



Best Regards,

Alfred S

Intel Customer Support


AlfredoS_Intel
Moderator
126 Views

Hi Krystian, 

We need to close this thread since we have not received a response from you, perhaps because you are busy at the moment. We know that it is important for you to get this resolved, and it is equally important for us to give you the right solution. As much as we would like to assist you, we need to close the thread to attend to other customers. We appreciate your consideration and understanding.


If you need any additional information, please submit a new question, as this thread will no longer be monitored.


Thank you for contacting Intel® and have a great week!




Best Regards,

Alfred S

Intel® Customer Support

