X710 PCIe errors and stability issues

MartinV

Hello,

My server has an onboard Intel X710 dual port NIC. The server is used for virtualization running Proxmox VE 8.2 with Linux kernel 6.8.12. The storage for the VMs is attached with iSCSI via the X710 card.

When the load on the network card is high, I get various PCIe errors, and sometimes the server reboots. The errors are detected both by the Linux kernel and by the BMC. I used to run both ports of the card in an LACP link aggregation, but I have since switched to a single-port setup (without bonding).

 

To troubleshoot this problem, the following steps have been taken so far:

- the motherboard was swapped

- memory was tested

- i40e driver was updated (2.25.11 and 2.26.8)

- Open vSwitch with DPDK was used instead of the i40e driver

- another X710 card was added to the server

 

Additional info:

- configurations with and without LACP link aggregation behave the same

- the onboard Intel i210 doesn't exhibit such problems

- flow control is disabled

 

In all of the above situations, the errors persist under high load. Currently, I'm using the PCIe card (not the onboard one), because with it the system at least doesn't restart.

 

The motherboard is an ASRock Rack BERGAMOD8 (SP5 Epyc).

 

The firmware/NVM versions on the affected cards differ:

fw 8.1.63299 api 1.12 nvm 8.10 0x800093ea 1.2829.0 (PCIe card)
fw 9.130.73618 api 1.15 nvm 9.30 0x8000e5d0 1.3429.0 (onboard card)

 

The errors from the kernel:

[ 261.039195] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
[ 261.039199] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 261.039201] {1}[Hardware Error]: event severity: corrected
[ 261.039202] {1}[Hardware Error]: Error 0, type: corrected
[ 261.039203] {1}[Hardware Error]: fru_text: PcieError
[ 261.039205] {1}[Hardware Error]: section_type: PCIe error
[ 261.039205] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 261.039207] {1}[Hardware Error]: version: 0.2
[ 261.039208] {1}[Hardware Error]: command: 0x0406, status: 0x0010
[ 261.039209] {1}[Hardware Error]: device_id: 0000:81:00.0
[ 261.039210] {1}[Hardware Error]: slot: 0
[ 261.039211] {1}[Hardware Error]: secondary_bus: 0x00
[ 261.039212] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x15ff
[ 261.039213] {1}[Hardware Error]: class_code: 020000
[ 261.039214] {1}[Hardware Error]: bridge: secondary_status: 0x2380, control: 0x0000
[ 261.039215] {1}[Hardware Error]: Error 1, type: corrected
[ 261.039216] {1}[Hardware Error]: section_type: PCIe error
[ 261.039217] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 261.039218] {1}[Hardware Error]: version: 0.2
[ 261.039219] {1}[Hardware Error]: command: 0x0406, status: 0x0010
[ 261.039220] {1}[Hardware Error]: device_id: 0000:81:00.1
[ 261.039221] {1}[Hardware Error]: slot: 0
[ 261.039222] {1}[Hardware Error]: secondary_bus: 0x00
[ 261.039223] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x15ff
[ 261.039224] {1}[Hardware Error]: class_code: 020000
[ 261.039225] {1}[Hardware Error]: bridge: secondary_status: 0x2380, control: 0x0000
[ 261.039590] i40e 0000:81:00.0: AER: aer_status: 0x00003000, aer_mask: 0x00000000
[ 261.039593] i40e 0000:81:00.0: [12] Timeout
[ 261.039595] i40e 0000:81:00.0: [13] NonFatalErr
[ 261.039597] i40e 0000:81:00.0: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
[ 261.039605] i40e 0000:81:00.1: AER: aer_status: 0x00003000, aer_mask: 0x00000000
[ 261.039606] i40e 0000:81:00.1: [12] Timeout
[ 261.039608] i40e 0000:81:00.1: [13] NonFatalErr
[ 261.039609] i40e 0000:81:00.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
[ 916.921040] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
[ 916.921048] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 916.921051] {2}[Hardware Error]: event severity: corrected
[ 916.921054] {2}[Hardware Error]: Error 0, type: corrected
[ 916.921057] {2}[Hardware Error]: section_type: PCIe error
[ 916.921059] {2}[Hardware Error]: port_type: 0, PCIe end point
[ 916.921062] {2}[Hardware Error]: version: 0.2
[ 916.921064] {2}[Hardware Error]: command: 0x0406, status: 0x0010
[ 916.921067] {2}[Hardware Error]: device_id: 0000:81:00.0
[ 916.921070] {2}[Hardware Error]: slot: 0
[ 916.921072] {2}[Hardware Error]: secondary_bus: 0x00
[ 916.921075] {2}[Hardware Error]: vendor_id: 0x8086, device_id: 0x15ff
[ 916.921077] {2}[Hardware Error]: class_code: 020000
[ 916.921080] {2}[Hardware Error]: bridge: secondary_status: 0x2380, control: 0x0000
[ 916.921082] {2}[Hardware Error]: Error 1, type: corrected
[ 916.921085] {2}[Hardware Error]: section_type: PCIe error
[ 916.921087] {2}[Hardware Error]: port_type: 0, PCIe end point
[ 916.921089] {2}[Hardware Error]: version: 0.2
[ 916.921091] {2}[Hardware Error]: command: 0x0406, status: 0x0010
[ 916.921094] {2}[Hardware Error]: device_id: 0000:81:00.1
[ 916.921097] {2}[Hardware Error]: slot: 0
[ 916.921099] {2}[Hardware Error]: secondary_bus: 0x00
[ 916.921101] {2}[Hardware Error]: vendor_id: 0x8086, device_id: 0x15ff
[ 916.921103] {2}[Hardware Error]: class_code: 020000
[ 916.921105] {2}[Hardware Error]: bridge: secondary_status: 0x2380, control: 0x0000
[ 916.921148] i40e 0000:81:00.0: AER: aer_status: 0x00001000, aer_mask: 0x00000000
[ 916.921154] i40e 0000:81:00.0: [12] Timeout
[ 916.921159] i40e 0000:81:00.0: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
[ 916.921178] i40e 0000:81:00.1: AER: aer_status: 0x00001000, aer_mask: 0x00000000
[ 916.921182] i40e 0000:81:00.1: [12] Timeout
[ 916.921186] i40e 0000:81:00.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
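
The `aer_status` values in the log above can be decoded bit by bit. The following is a minimal Python sketch, assuming the standard bit layout of the PCIe AER Correctable Error Status register and the short names that the Linux kernel prints for each bit. The function name `decode_aer_correctable` is hypothetical and only used for illustration.

```python
# Bit positions of the PCIe AER Correctable Error Status register,
# labeled with the short names printed by the Linux AER code (assumption).
AER_CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_aer_correctable(status):
    """Return a list of '[bit] name' strings for every bit set in status."""
    names = []
    for bit in range(32):
        if status & (1 << bit):
            label = AER_CORRECTABLE_BITS.get(bit, "Unknown")
            names.append(f"[{bit}] {label}")
    return names


# The two aer_status values that appear in the kernel log.
print(decode_aer_correctable(0x00003000))
print(decode_aer_correctable(0x00001000))
```

Bit 12 is the replay timer timeout, a Data Link Layer error, which is consistent with the `aer_layer=Data Link Layer` line in the log. Bit 13 is the advisory non-fatal error flag.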

 

The device_id points to the network card. If I use another network card, then the device_id changes accordingly.

 

Errors from the BMC:

 

ID | TimeStamp           | Sensor Name | Sensor Type        | Description
===|=====================|=============|====================|==========================================================
3  | 10/20/2024 14:01:18 | BIOS        | critical_interrupt | PCIe SEL Log - Asserted
   |                     |             |                    | Data1: PCI PERR
   |                     |             |                    | Data2: PCI bus number for failed device: 0x81
   |                     |             |                    | Data3: PCI device number: 0x00 PCI function number: 0x01
---|---------------------|-------------|--------------------|----------------------------------------------------------
2  | 10/20/2024 14:01:18 | BIOS        | critical_interrupt | PCIe SEL Log - Asserted
   |                     |             |                    | Data1: PCI PERR
   |                     |             |                    | Data2: PCI bus number for failed device: 0x81
   |                     |             |                    | Data3: PCI device number: 0x00 PCI function number: 0x00
---|---------------------|-------------|--------------------|----------------------------------------------------------
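
The BMC entries and the kernel messages identify the same device in different notations: the kernel prints a `domain:bus:device.function` address such as `0000:81:00.1`, while the BMC lists the bus, device and function numbers separately. A small Python sketch for cross-checking the two follows; the helper name `parse_bdf` is hypothetical, and it assumes the usual hexadecimal address format shown in the kernel log.

```python
def parse_bdf(address):
    """Split a PCI address like '0000:81:00.1' into its numeric parts.

    All fields are hexadecimal, as printed by the kernel and lspci.
    """
    domain, bus, devfn = address.split(":")
    device, function = devfn.split(".")
    return {
        "domain": int(domain, 16),
        "bus": int(bus, 16),
        "device": int(device, 16),
        "function": int(function, 16),
    }


# Address taken from the kernel log above.
parts = parse_bdf("0000:81:00.1")
print(f"bus 0x{parts['bus']:02x}, device 0x{parts['device']:02x}, "
      f"function 0x{parts['function']:02x}")
```

For `0000:81:00.1` this gives bus 0x81, device 0x00, function 0x01, which matches BMC entry 3.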

Driver kernel messages while probing the card (only the PCIe card):

 

[ 1.476600] i40e: loading out-of-tree module taints kernel.
[ 1.476606] i40e: module verification failed: signature and/or required key missing - tainting kernel
[ 1.491116] i40e: Intel(R) 40-10 Gigabit Ethernet Connection Network Driver - version 2.26.8
[ 1.491120] i40e: Copyright (C) 2013-2024 Intel Corporation
[ 1.507266] i40e 0000:81:00.0: fw 8.1.63299 api 1.12 nvm 8.10 0x800093ea 1.2829.0
[ 1.583150] i40e 0000:81:00.0: MAC source pruning enabled on all VFs
[ 1.583787] i40e 0000:81:00.0: MAC address: 1c:fd:08:78:3d:94
[ 1.584263] i40e 0000:81:00.0: FW LLDP is disabled
[ 1.584477] i40e 0000:81:00.0: FW LLDP is disabled, attempting SW DCB
[ 1.591232] i40e 0000:81:00.0: SW DCB initialization succeeded.
[ 1.605881] i40e 0000:81:00.0: MAC source pruning enabled on all VFs
[ 1.617199] i40e 0000:81:00.0 eth0: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None, EEE: Enabled
[ 1.619991] i40e 0000:81:00.0: PCI-Express: Speed 8.0GT/s Width x8
[ 1.622519] i40e 0000:81:00.0: Features: PF-id[0] VFs: 64 VSIs: 66 QP: 48 RSS FD_ATR FD_SB NTUPLE CloudF DCB VxLAN Geneve NVGRE PTP VEPA
[ 1.640315] i40e 0000:81:00.1: fw 8.1.63299 api 1.12 nvm 8.10 0x800093ea 1.2829.0
[ 1.715615] i40e 0000:81:00.1: MAC source pruning enabled on all VFs
[ 1.716256] i40e 0000:81:00.1: MAC address: 1c:fd:08:78:3d:95
[ 1.716731] i40e 0000:81:00.1: FW LLDP is disabled
[ 1.716951] i40e 0000:81:00.1: FW LLDP is disabled, attempting SW DCB
[ 1.723490] i40e 0000:81:00.1: SW DCB initialization succeeded.
[ 1.737849] i40e 0000:81:00.1: MAC source pruning enabled on all VFs
[ 1.747764] i40e 0000:81:00.1: PCI-Express: Speed 8.0GT/s Width x8
[ 1.750090] i40e 0000:81:00.1: Features: PF-id[1] VFs: 64 VSIs: 66 QP: 48 RSS FD_ATR FD_SB NTUPLE CloudF DCB VxLAN Geneve NVGRE PTP VEPA

[ 1.766465] i40e 0000:09:00.0: fw 9.130.73618 api 1.15 nvm 9.30 0x8000e5d0 1.3429.0
[ 1.846460] i40e 0000:09:00.0: MAC source pruning enabled on all VFs
[ 1.846902] i40e 0000:09:00.0: MAC address: 9c:6b:00:4b:45:ac
[ 1.847389] i40e 0000:09:00.0: FW LLDP is disabled
[ 1.847612] i40e 0000:09:00.0: FW LLDP is disabled, attempting SW DCB
[ 1.854173] i40e 0000:09:00.0: SW DCB initialization succeeded.
[ 1.869698] i40e 0000:09:00.0: MAC source pruning enabled on all VFs
[ 1.880154] i40e 0000:09:00.0: PCI-Express: Speed 8.0GT/s Width x4
[ 1.880158] i40e 0000:09:00.0: PCI-Express bandwidth available for this device may be insufficient for optimal performance.
[ 1.880160] i40e 0000:09:00.0: Please move the device to a different PCI-e link with more lanes and/or higher transfer rate.
[ 1.882716] i40e 0000:09:00.0: Features: PF-id[0] VFs: 64 VSIs: 66 QP: 48 RSS FD_ATR FD_SB NTUPLE CloudF DCB VxLAN Geneve NVGRE PTP VEPA
[ 1.899594] i40e 0000:09:00.1: fw 9.130.73618 api 1.15 nvm 9.30 0x8000e5d0 1.3429.0
[ 1.980548] i40e 0000:09:00.1: MAC source pruning enabled on all VFs
[ 1.981199] i40e 0000:09:00.1: MAC address: 9c:6b:00:4b:45:ad
[ 1.981689] i40e 0000:09:00.1: FW LLDP is disabled
[ 1.981911] i40e 0000:09:00.1: FW LLDP is disabled, attempting SW DCB
[ 1.988458] i40e 0000:09:00.1: SW DCB initialization succeeded.
[ 2.003590] i40e 0000:09:00.1: MAC source pruning enabled on all VFs
[ 2.017394] i40e 0000:09:00.1: PCI-Express: Speed 8.0GT/s Width x4
[ 2.017396] i40e 0000:09:00.1: PCI-Express bandwidth available for this device may be insufficient for optimal performance.
[ 2.017398] i40e 0000:09:00.1: Please move the device to a different PCI-e link with more lanes and/or higher transfer rate.
[ 2.019990] i40e 0000:09:00.1: Features: PF-id[1] VFs: 64 VSIs: 66 QP: 48 RSS FD_ATR FD_SB NTUPLE CloudF DCB VxLAN Geneve NVGRE PTP VEPA
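
As a rough sanity check on the x4 bandwidth warning for the onboard card, the raw link capacity can be estimated. This is only a back-of-the-envelope Python sketch: it assumes the 128b/130b line encoding used at 8.0 GT/s and ignores all packet and protocol overhead, so real throughput is lower. The function name `pcie_link_gbps` is hypothetical.

```python
def pcie_link_gbps(lanes, gt_per_s=8.0):
    """Approximate raw PCIe link capacity in Gbit/s per direction.

    Assumes 128b/130b encoding (used at 8.0 GT/s) and ignores
    TLP/DLLP overhead, so the usable figure is lower.
    """
    return gt_per_s * lanes * 128 / 130


# Link widths reported by the driver for the two cards.
for lanes in (4, 8):
    print(f"x{lanes}: about {pcie_link_gbps(lanes):.2f} Gbit/s")
```

Under these assumptions, an x4 link provides roughly 31.5 Gbit/s and an x8 link roughly 63 Gbit/s per direction, compared with 20 Gbit/s for two 10 Gbps ports.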

 

Thanks for any insights and tips!

MartinV

The cause was ASPM (PCIe Active State Power Management). Turning it off in the BIOS resolved the problem completely.

 

I have a second server (same build), which also had the problem, and the solution was identical.
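
For anyone who wants to check the ASPM state from the running system before changing BIOS settings: Linux reports its ASPM policy in `/sys/module/pcie_aspm/parameters/policy`, with the active policy shown in square brackets. The Python sketch below only reads that file; the helper name `active_aspm_policy` is hypothetical, and the file may be absent on kernels built without ASPM support.

```python
from pathlib import Path

# Sysfs file with the kernel's ASPM policy. The active entry is bracketed,
# e.g. "[default] performance powersave powersupersave".
ASPM_POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(policy_text):
    """Return the bracketed (active) policy name, or None if none is marked."""
    for token in policy_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


if __name__ == "__main__":
    if ASPM_POLICY_PATH.exists():
        print("Active ASPM policy:", active_aspm_policy(ASPM_POLICY_PATH.read_text()))
    else:
        print("ASPM policy file not found on this system")
```

The per-device link state can also be inspected with `lspci -vv`, which shows the ASPM setting in the `LnkCtl` line of each device. Note that this only shows the current state; in this thread the fix itself was made in the BIOS.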
