
i40e driver crashes machine under high network pressure

CRI_EPITA
Beginner
2,782 Views

Hi all,

During our benchmark of a Ceph cluster, we noticed that one of our machines had a kernel panic and needed a reboot. After checking the kernel logs (included), it turns out this error came from somewhere inside the i40e driver, which is used for our X722-DA2 network cards. After browsing this forum, we found that someone had a similar problem, but on Windows, with their cards in a team. Ours were in an LACP bond, so we decided to stress test them outside of a bond, and were able to reproduce the problem. It also happens when stress testing only one of the interfaces, and in 1Gb/s mode (the previous tests were all done at 10Gb/s), although it takes longer. The higher the network load, the sooner the machine freezes or panics.
After further investigation, it seems that the number of softirqs increases in irregular steps (graph included) until the machine freezes. The fact that this bug also happens in 1Gb/s mode leads us to believe that there is a memory leak in the i40e driver. During our tests, we observed that slab memory increases constantly, even though the machine is doing nothing other than network operations.
The problem only happens when receiving traffic, or at least we haven't been able to reproduce it when only sending traffic.
We now plan to further analyze the memory operations made by the i40e driver and will report back if we find anything. In the meantime, we are opening this thread hoping that Intel or anyone else might already know of this issue and have a fix.
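
A minimal sketch of how the slab growth and softirq counts described above could be sampled over time, assuming the standard Linux `/proc/meminfo` and `/proc/softirqs` layouts. The function names and the sampling interval are illustrative assumptions and not the tooling used for the original graphs.

```python
import time


def parse_slab_kb(meminfo_text):
    """Return the Slab, SReclaimable and SUnreclaim values (in kB) from /proc/meminfo text."""
    wanted = ("Slab", "SReclaimable", "SUnreclaim")
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            values[key] = int(rest.split()[0])  # the number is followed by the unit "kB"
    return values


def parse_net_rx_total(softirqs_text):
    """Sum the per-CPU NET_RX counters from /proc/softirqs text."""
    for line in softirqs_text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == "NET_RX":
            return sum(int(count) for count in rest.split())
    raise ValueError("NET_RX row not found")


def sample_forever(interval_s=10):
    """Print one sample per interval (hypothetical helper, stop it with Ctrl-C)."""
    while True:
        with open("/proc/meminfo") as f:
            slab = parse_slab_kb(f.read())
        with open("/proc/softirqs") as f:
            net_rx = parse_net_rx_total(f.read())
        print(int(time.time()), slab, net_rx, flush=True)
        time.sleep(interval_s)
```

Calling `sample_forever()` on the receiving node during a stress test and plotting the output should show whether `SUnreclaim` keeps growing while only network traffic is being handled.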

Regards,

-----

Some useful information:

We are running the latest version of the i40e driver downloaded from Intel's website because we had to upgrade the firmware of our network cards. This upgrade was necessary because the cards would otherwise randomly disconnect, and the server had to be restarted or the cables unplugged and replugged for the card to work again (disabling and re-enabling the port on the switch did the trick too). We did not investigate this issue any further.

The tests were run as follows: 4 iperf3 servers were listening on one node (hereafter called node-2), and two nodes (hereafter called node-1 and node-3) were each running 2 iperf3 clients. As such, each interface of node-2 was hit by two iperf3 clients, one from node-1 and another from node-3. Note that we tried different combinations of this with the same results, so we can conclude that the problem isn't due to one particular network card.
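
The layout above can be sketched as a small command generator. This is only an illustration: the port numbers, the test duration and the two node-2 interface addresses (taken from the 192.0.2.0/24 documentation range) are assumptions, since the exact iperf3 invocations are not given in this post.

```python
def server_commands(ports):
    """One iperf3 server command per listening port (to be run on node-2)."""
    return [["iperf3", "-s", "-p", str(port)] for port in ports]


def client_commands(targets, duration_s):
    """One iperf3 client command per (address, port) target (run on node-1 or node-3)."""
    return [
        ["iperf3", "-c", address, "-p", str(port), "-t", str(duration_s)]
        for address, port in targets
    ]


# Hypothetical addresses of the two node-2 interfaces.
IFACE_A = "192.0.2.10"
IFACE_B = "192.0.2.11"

# Four servers on node-2, two clients on each of node-1 and node-3,
# so that each node-2 interface receives one stream from each client node.
servers = server_commands([5201, 5202, 5203, 5204])
node1_clients = client_commands([(IFACE_A, 5201), (IFACE_B, 5202)], duration_s=3600)
node3_clients = client_commands([(IFACE_A, 5203), (IFACE_B, 5204)], duration_s=3600)

for command in servers + node1_clients + node3_clients:
    print(" ".join(command))
```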

The included kernel logs are from the test using only one interface. Look for the `[i40e]` pattern from the end of the file and you'll find the relevant stacktraces (there are many).

OS: Ubuntu 20.10
Kernel: 5.8.0-33-generic
i40e version: 2.13.10 - BB96E598E7BFA4F229F7E53
X722 firmware: 5.15 0x8000275d 1.2829.0
iperf: 3.7
TSO and GRO are deactivated
Server: Dell PowerEdge R6525
CPUs: 2x AMD EPYC 7352 24-Core Processor

22 Replies
CRI_EPITA
Beginner
2,426 Views

After testing with the upstream 5.10.2 kernel (https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.10.2/) and the in-tree i40e module, the issue doesn't seem to happen anymore. We are now trying to pinpoint which commit fixed it.

CRI_EPITA
Beginner
2,417 Views

Some more updates on this:

The in-tree module of the kernel works fine on both the v5.8 tag and v5.10. However, we still get this warning in the kernel logs:

i40e 0000:43:00.0: The driver for the device detected a newer version of the NVM image v1.11 than expected v1.9. Please install the most recent version of the network driver.
 

When using the Intel-provided module at version 2.12.6, downloaded from https://downloadcenter.intel.com/download/29945/Intel-Network-Adapter-Driver-for-PCIe-40-Gigabit-Eth...
we don't trigger the bug but get the same warning as above, except that the expected version is v1.10 instead of v1.9.
The bug is thus only triggered by the Intel-provided module at version 2.13.10. This happens on Linux 5.8, both with the upstream kernel and with the Ubuntu-patched one. If we have time, we'll try to reproduce it on upstream 5.10.
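
When comparing several driver builds as above, the detected and expected NVM versions can be pulled out of the warning automatically. The following is a minimal sketch that assumes the message wording stays exactly as quoted in this post; the function name is hypothetical.

```python
import re

# Pattern based on the warning text quoted in this thread.
NVM_WARNING = re.compile(
    r"detected a newer version of the NVM image v(?P<found>\d+\.\d+) "
    r"than expected v(?P<expected>\d+\.\d+)"
)


def nvm_versions(log_line):
    """Return (found, expected) NVM versions from an i40e warning line, or None."""
    match = NVM_WARNING.search(log_line)
    if match is None:
        return None
    return match.group("found"), match.group("expected")


line = (
    "i40e 0000:43:00.0: The driver for the device detected a newer version of "
    "the NVM image v1.11 than expected v1.9. Please install the most recent "
    "version of the network driver."
)
print(nvm_versions(line))
```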

CrisselleF_C_Intel
Moderator
2,407 Views

Hello CRI_EPITA,


Thank you for posting in Intel Ethernet Communities. 


We are sorry to hear that you are having this kind of issue. To proceed with the investigation, please share the PBA number of the adapter. You may refer to the link below to see where to find the PBA number. Photos of the adapter showing the markings (white sticker) on the physical card would be highly appreciated so that we can double-check it. The PBA is a 6-3 digit number located in the last part of the serial number.

https://www.intel.com/content/www/us/en/support/articles/000007022/network-and-i-o/ethernet-products...


We look forward to hearing from you.


We will reach out after 3 business days in case we don't receive a reply.


Best regards,

Crisselle C

Intel® Customer Support


CRI_EPITA
Beginner
2,399 Views

Hi,

Unfortunately, due to the current pandemic, no one from our team is in an easy position to access the servers, and thus retrieve the part number. However, we can provide you with the MAC addresses, if that can help.

Regards,

CRI_EPITA
Beginner
2,361 Views

Hi again,

Another update on this. After tracing the memory allocations and deallocations made by the kernel, we were able to confirm that the driver leaks memory. However, the leak doesn't happen when ntuple filtering is turned off.
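
For anyone who wants to check the same condition, here is a minimal sketch of reading and changing the ntuple setting with `ethtool`. It assumes the usual `ethtool -k` output format with an `ntuple-filters` row; the interface name passed in is hypothetical, and turning the feature off is only a way to reproduce the observation above, not a confirmed fix.

```python
def ntuple_enabled(ethtool_k_output):
    """Return True/False for the ntuple-filters row of `ethtool -k IFACE`, or None if absent."""
    for line in ethtool_k_output.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == "ntuple-filters":
            fields = value.split()  # e.g. ["on"] or ["off", "[fixed]"]
            return bool(fields) and fields[0] == "on"
    return None


def disable_ntuple_command(interface):
    """Build the ethtool command that turns ntuple filtering off (needs root to run)."""
    return ["ethtool", "-K", interface, "ntuple", "off"]


sample_output = "Features for eth0:\nrx-checksumming: on\nntuple-filters: on\n"
print(ntuple_enabled(sample_output))
print(" ".join(disable_ntuple_command("eth0")))
```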

Regards,

Michael_L_Intel2
Moderator
2,339 Views

Hello CRI_EPITA,


Thank you for the update. I am sorry, but the MAC addresses will not help us validate the cards that you are using. We need the markings from the actual cards, which is why we are asking for photos of both sides of the cards, focusing on the markings.


If you have questions, please let us know. In case we do not hear from you, we will follow up after 3 working days.

Thank you.


Best regards,

Michael L.

Intel® Customer Support


CRI_EPITA
Beginner
2,333 Views

Hi again,

Although this is clearly a software problem, here are the pictures you asked for.

Regards,

Michael_L_Intel2
Moderator
2,316 Views

Hello CRI_EPITA,


Thank you for sending the markings of the card. While we validate the markings, we would also like to collect the SSU output.


Please download the System Support Utility from the link below, and attach a log file for us to further investigate this issue. 


https://downloadcenter.intel.com/download/26735/Intel-System-Support-Utility-for-the-Linux-Operating...


If you have questions, please let us know. In case we do not hear from you, we will follow up after 3 working days.

Thank you.


Best regards,

Michael L.

Intel® Customer Support


CRI_EPITA
Beginner
2,312 Views

Hi,

Here is the output of SSU.

Regards,

Michael_L_Intel2
Moderator
2,302 Views

Hello CRI_EPITA,


While reviewing the SSU output, I noticed that the firmware of the card is not yet up to date. Please update the firmware and check whether you still have the same issue. The link below provides the latest firmware and the usage guide for updating it:


https://downloadcenter.intel.com/download/24769/Non-Volatile-Memory-NVM-Update-Utility-for-Intel-Eth...


If you have questions, please let us know. In case we do not hear from you, we will follow up after 3 working days.

Thank you.


Best regards,

Michael L.

Intel® Customer Support


CRI_EPITA
Beginner
2,256 Views

nvmupdate64e reports that the firmware is indeed up to date:

# ./nvmupdate64e 19:48 0

Intel(R) Ethernet NVM Update Tool
NVMUpdate version 1.35.42.7
Copyright (C) 2013 - 2020 Intel Corporation.


WARNING: To avoid damage to your device, do not stop the update or reboot or power off the system during this update.
Inventory in progress. Please wait [-.........]


Num Description                        Ver.(hex)    DevId S:B    Status
=== ================================== ============ ===== ====== ==============
01) Intel(R) Ethernet Network Adapter  5.21(5.15)   37D0  00:067 Up to date
    X722-2


Tool execution completed with the following status: All operations completed successfully.

If you confirm that 5.21 is not the latest version, can you provide us with the exact version numbers we should expect?
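
A note on the two version strings: the `Ver.(hex)` column appears to print the version in decimal followed by the same version in hexadecimal. Under that assumption, the 5.21 shown by the tool and the 5.15 reported elsewhere in this thread are the same firmware, since 0x15 = 21. A minimal sketch of the conversion (the function name is hypothetical):

```python
def hex_version_to_decimal(version):
    """Convert a 'major.minor' version written in hexadecimal to decimal notation."""
    major, minor = version.split(".")
    return f"{int(major, 16)}.{int(minor, 16)}"


print(hex_version_to_decimal("5.15"))
```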

Michael_L_Intel2
Moderator
2,228 Views

Hello CRI_EPITA,


Thank you for the update. We will check the issue and get back to you once we have an update or a recommendation.

Please give us 3 to 4 working days to investigate the issue further.


Thank you.


Best regards,

Michael L.

Intel® Customer Support


Michael_L_Intel2
Moderator
2,214 Views

Hello CRI_EPITA,


I hope you are having a good day. While we are checking the issue, let me also ask for the following details so that we can better understand your end goal.


  1. What do you mean by crashing? Can you explain specifically what is happening to the system? Is it hanging or rebooting?
  2. Does it hang the whole system, or does only the network disconnect during the stress test?
  3. Does it also occur during normal usage?
  4. Is there any particular reason why you need to test the connection under pressure or with a stress test?


If you have questions, please let us know. In case we do not hear from you, we will follow up after 3 working days.


Thank you.


Best regards,

Michael L.

Intel® Customer Support


CRI_EPITA
Beginner
2,212 Views

Hi,

  1. What do you mean by crashing? Can you explain specifically what is happening to the system? Is it hanging or rebooting?

The system freezes and needs a hardware reset to be able to reboot.

  2. Does it hang the whole system, or does only the network disconnect during the stress test?

The whole system. We're not able to do anything on the physical console at all.

  3. Does it also occur during normal usage?

Yes, but it takes much longer, which pointed us to the memory leak theory, which we then confirmed.

  4. Is there any particular reason why you need to test the connection under pressure or with a stress test?

The servers are destined to be in a Ceph cluster, which can sometimes be under pressure, as actually happened during our benchmarks. We decided to use the stress tests as an easy way to reproduce the problem.

Regards,

CrisselleF_C_Intel
Moderator
2,165 Views

Hello CRI_EPITA,


Appreciate the swift response.


Please allow us to continue investigating the issue with our engineers. We will get back to you as soon as there are any findings, and no later than 3 business days from now.


Best regards,

Crisselle C

Intel® Customer Support


CrisselleF_C_Intel
Moderator
2,105 Views

Hello CRI_EPITA,


Thank you for your patience on this matter.


Please see below our engineering team's findings for this request.


This is a known issue that was recently discovered. It should be fixed in the latest driver, v2.14.13, which was just released on SourceForge.

https://sourceforge.net/projects/e1000/


Kindly try it out and let us know of the results after testing.


Hoping to hear an update from you.


We will follow up after 3 business days in case we don't receive a reply.


Best regards,

Crisselle C

Intel® Customer Support


CrisselleF_C_Intel
Moderator
2,035 Views

Hello CRI_EPITA,


Good day!


This is just a follow-up to check whether you have read our previous post and were able to test the latest driver from the link provided. We would highly appreciate it if you could share an update regarding the status of this request.


We await your reply.


In case we don't hear from you, we will follow up after 3 business days.


Best regards,

Crisselle C

Intel® Customer Support


CrisselleF_C_Intel
Moderator
2,024 Views

Hello CRI_EPITA,


How are you doing?

We'd like to check whether you have read our previous message. If so, kindly let us know whether you were able to test the latest driver from the link provided, and share the results of your testing.

We hope to hear from you soon.

Should there be no response from you, I'll make sure to reach out after 3 business days.


Best regards,

Crisselle C

Intel® Customer Support

CRI_EPITA
Beginner
2,012 Views

Hi,

We are currently unable to test the new release. Once we are able, we will report our findings here.

Regards,

CrisselleF_C_Intel
Moderator
1,602 Views

Hello CRI_EPITA,


Thank you for the prompt reply.


We will then follow up again after a week to give you enough time for this. If you feel that testing the new release might take too long and wish to temporarily close this request, feel free to let us know.


We look forward to an update from you.


Best regards,

Crisselle C

Intel® Customer Support

