i40e driver crashes machine under high network pressure

CRI_EPITA
Beginner

Hi all,

While benchmarking a Ceph cluster, we noticed that one of our machines had a kernel panic and needed a reboot. After checking the kernel logs (included), it turns out the error came from somewhere inside the i40e driver, which is used for our X722-DA2 network cards. Browsing this forum, we found that someone had a similar problem, but on Windows and with their cards in a team. Ours were in an LACP bond, so we stress tested them outside of a bond and were able to reproduce the problem. It also happens when stress testing only one of the interfaces, and in 1Gb/s mode (the previous tests were all done in 10Gb/s), although it takes longer. The higher the network load, the sooner the machine freezes or panics.
After further investigation, it seems that the number of softirqs increases in irregular steps (graph included) up to the point where the machine freezes. The fact that this bug also happens in 1Gb/s mode leads us to believe that there is a memory leak in the i40e driver. During our tests, we observed that slab memory grows constantly while the machine does nothing other than network operations.
The problem only happens when receiving traffic, or at least we haven't been able to reproduce it when only sending traffic.
We now plan to further analyze the memory operations made by the i40e driver and will report back if we find anything. In the meantime, we are opening this thread in the hope that Intel or anyone else already knows of this issue and has a fix.
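For readers who want to watch the same two symptoms on their own machine, here is a minimal Python sketch that samples the NET_RX softirq total from /proc/softirqs and the Slab value from /proc/meminfo. It is not the tooling behind the included graph; the function names, the choice of the NET_RX counter, and the sampling interval are all assumptions made for illustration.

```python
import time


def parse_softirq_total(text, name="NET_RX"):
    """Sum the per-CPU counters of one softirq type from /proc/softirqs text."""
    for line in text.splitlines():
        label, _, counts = line.partition(":")
        if label.strip() == name:
            return sum(int(c) for c in counts.split())
    raise KeyError(name)


def parse_slab_kb(text):
    """Return the 'Slab:' value (in kB) from /proc/meminfo text."""
    for line in text.splitlines():
        if line.startswith("Slab:"):
            return int(line.split()[1])
    raise KeyError("Slab")


def read_sample():
    """Read the current NET_RX total and slab size from procfs (Linux only)."""
    with open("/proc/softirqs") as f:
        net_rx = parse_softirq_total(f.read())
    with open("/proc/meminfo") as f:
        slab_kb = parse_slab_kb(f.read())
    return net_rx, slab_kb


def monitor(interval_s=5.0, samples=12):
    """Print the change in NET_RX softirqs and slab size between samples."""
    prev_rx, prev_slab = read_sample()
    for _ in range(samples):
        time.sleep(interval_s)
        rx, slab = read_sample()
        print(f"NET_RX +{rx - prev_rx}  Slab {slab} kB ({slab - prev_slab:+d} kB)")
        prev_rx, prev_slab = rx, slab
```

Calling `monitor()` on the receiving node while the load is running prints one line per interval. A slab value that keeps climbing under receive-only traffic would match the behaviour described above, although it does not by itself show which allocator cache is growing.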

Regards,

-----

Some useful information:

We are running the latest version of the i40e driver downloaded from Intel's website, because we had to upgrade the firmware of our network cards. This upgrade was necessary because the cards would otherwise randomly disconnect, and the server had to be restarted or the cables unplugged and replugged for the card to work again (shutting the port off and back on at the switch did the trick too). We did not investigate this issue any further.

The tests were run as follows: 4 iperf3 servers were listening on one node (hereafter called node-2), and two nodes (hereafter called node-1 and node-3) were each running 2 iperf3 clients. As such, each interface of node-2 was hit by two iperf3 clients, one from node-1 and another from node-3. Note that we tried different combinations of this with the same results, so we can conclude that the problem isn't due to one particular network card.
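The layout described above can be sketched as a small Python helper that builds one iperf3 server per (interface, client node) pair on node-2 and the matching client command for each sending node. The exact iperf3 options used in the tests are not stated, so the addresses (taken from the 192.0.2.0/24 documentation range), the port numbers, the duration, and the function name are assumptions; only the standard `-s`, `-c`, `-B`, `-p` and `-t` options are used.

```python
from itertools import product


def plan_iperf3_commands(server_addrs, client_nodes, base_port=5201, duration_s=600):
    """Build iperf3 command lines for the described topology.

    server_addrs: one address per interface of the receiving node (node-2).
    client_nodes: names of the sending nodes (node-1, node-3).
    Returns (server_commands, {client_node: client_commands}).
    """
    servers = []
    clients = {node: [] for node in client_nodes}
    for offset, (addr, node) in enumerate(product(server_addrs, client_nodes)):
        port = base_port + offset  # one listening port per server instance
        servers.append(["iperf3", "-s", "-B", addr, "-p", str(port)])
        clients[node].append(
            ["iperf3", "-c", addr, "-p", str(port), "-t", str(duration_s)]
        )
    return servers, clients


# Hypothetical addresses for the two X722 ports of node-2.
servers, clients = plan_iperf3_commands(
    ["192.0.2.10", "192.0.2.11"], ["node-1", "node-3"]
)
for cmd in servers:
    print("node-2:", " ".join(cmd))
for node, cmds in clients.items():
    for cmd in cmds:
        print(f"{node}:", " ".join(cmd))
```

With two interfaces and two sending nodes this yields 4 servers on node-2 and 2 clients on each of node-1 and node-3, so every interface receives one stream from each sender. Passing a single address reproduces the one-interface variant of the test.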

The included kernel logs are from the test using only one interface. Search for the `[i40e]` pattern starting from the end of the file and you'll find the relevant stack traces (there are many).
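A minimal sketch of that backwards search, assuming the log has been saved as a plain text file; the function name is hypothetical and nothing here depends on the exact log format beyond the `[i40e]` marker.

```python
def find_marker_lines(lines, marker="[i40e]"):
    """Return (line_number, text) pairs containing marker, last match first."""
    hits = []
    # Walk from the last line to the first so the most recent traces come first.
    for number in range(len(lines), 0, -1):
        text = lines[number - 1]
        if marker in text:
            hits.append((number, text.rstrip("\n")))
    return hits
```

For example, `find_marker_lines(open("kern.log").readlines())[:20]` would list the 20 matches closest to the end of a file named `kern.log` (the file name is a placeholder).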

OS: Ubuntu 20.10
Kernel: 5.8.0-33-generic
i40e version: 2.13.10 - BB96E598E7BFA4F229F7E53
X722 firmware: 5.15 0x8000275d 1.2829.0
iperf: 3.7
TSO and GRO are deactivated
Server: Dell PowerEdge R6525
CPUs: 2x AMD EPYC 7352 24-Core Processor

Caguicla_Intel
Moderator

Hello CRI_EPITA,


Good day!


This is just a follow-up to check whether you have had a chance to test the driver from the link provided. We would highly appreciate it if you could share an update on this request.


Looking forward to your reply.


We'll make sure to reach out after 3 business days in case we don't hear from you.


Best regards,

Crisselle C

Intel® Customer Support


Caguicla_Intel
Moderator

Hello CRI_EPITA,


How are you doing?


Please be informed that we will now close this request, since we haven't received any response to our previous follow-ups. Feel free to post a new question if you have any other inquiries in the future, as this thread will no longer be monitored.


May you have a great day!


Best regards,

Crisselle C

Intel® Customer Support

