i40EN driver spewing Rx errors on VMware ESXi 7.0.2 hosts

Slesiak · ‎06-17-2021

ProLiant DL380 Gen10

VMware ESXi 7.0.2

vmnic10 0000:b0:00.0 i40en Up Up 10000 Full d4:f5:ef:19:28:90 1500 Intel(R) Ethernet Controller X710 for 10GbE SFP+

vmnic11 0000:b0:00.1 i40en Up Up 10000 Full d4:f5:ef:19:28:98 1500 Intel(R) Ethernet Controller X710 for 10GbE SFP+

vmnic7 0000:37:00.1 i40en Up Up 10000 Full d4:f5:ef:16:7d:68 9000 Intel(R) Ethernet Controller X710 for 10GbE SFP+

vmnic8 0000:13:00.0 i40en Up Up 10000 Full d4:f5:ef:18:bd:60 1500 Intel(R) Ethernet Controller X710 for 10GbE SFP+

vmnic9 0000:13:00.1 i40en Up Up 10000 Full d4:f5:ef:18:bd:68 1500 Intel(R) Ethernet Controller X710 for 10GbE SFP+

esxcli network nic get -n vmnic9

Advertised Auto Negotiation: true

Advertised Link Modes: Auto, 1000BaseSR/Full, 10000BaseSR/Full

Auto Negotiation: true

Cable Type: FIBRE

Current Message Level: 0

Driver Info:

Bus Info: 0000:13:00:1

Driver: i40en

Firmware Version: 10.51.5

Version: 1.10.9.0

Link Detected: true

Link Status: Up

Name: vmnic9

PHYAddress: 0

Pause Autonegotiate: false

Pause RX: false

Pause TX: false

Supported Ports: FIBRE

Supports Auto Negotiation: true

Supports Pause: true

Supports Wakeon: false

Transceiver:

Virtual Address: 00:50:56:51:ab:53

Wakeon: None

One of the hosts just suddenly went off the network and VMware is blaming it on these errors spewing out of the i40en driver:

vmkernel.log:2021-06-15T16:16:21.805Z cpu54:2097508)i40en: indrv_AllocMultiQueue:165: Failed to allocate Rx

I'm seeing hundreds, if not thousands, of these daily. Has anyone else seen them? I have 8 ProLiant servers, and all 8 of them are spewing the same error into the vmkernel log.
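
To put a number on "hundreds, if not thousands, daily", the matching log lines can be tallied per day. The following is a minimal Python sketch, not an official tool. The function name count_rx_failures_per_day, the regular expression, and the sample list are illustrative assumptions built only from the single log line quoted above; the second and third sample lines are made up for the example.

```python
import re
from collections import Counter

# Assumption: every relevant line carries an ISO timestamp followed by the
# i40en "Failed to allocate Rx" message, as in the line quoted in this post.
# An optional "vmkernel.log:" prefix (left by grep) is tolerated.
RX_FAIL = re.compile(r"(\d{4}-\d{2}-\d{2})T\S+ .*i40en: .*Failed to allocate Rx")


def count_rx_failures_per_day(lines):
    """Return a Counter mapping 'YYYY-MM-DD' to the number of matching lines."""
    counts = Counter()
    for line in lines:
        match = RX_FAIL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample: the first line is copied from the post, the others
# are invented for illustration only.
SAMPLE = [
    "vmkernel.log:2021-06-15T16:16:21.805Z cpu54:2097508)i40en: "
    "indrv_AllocMultiQueue:165: Failed to allocate Rx",
    "2021-06-15T16:16:22.001Z cpu54:2097508)i40en: "
    "indrv_AllocMultiQueue:165: Failed to allocate Rx",
    "2021-06-16T08:00:00.000Z cpu12:2097300)some unrelated message",
]

for day, total in sorted(count_rx_failures_per_day(SAMPLE).items()):
    print(day, total)
```

The same function could be fed the lines of an open log file instead of the inline sample, which makes it easy to compare counts across hosts or across days.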

Slesiak · ‎06-17-2021

HPE is stating that this is the fix:
https://kb.vmware.com/s/article/83243

However, all of our interfaces are 10Gb, not 25Gb as the article states. We're also on 7.0.2, and the KB states the issue was "fixed" in 7.0U1.

netPagePoolLimitPerGB -v 15360 (from the article) versus the current default of 5120
netPagePoolLimitCap -v 1375920 (from the article) versus the current default of 1048576

If I can schedule some time to do a test, I'll post the results.
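
For anyone weighing the same change, the size of the jump from the defaults can be worked out before touching a host. This is a rough Python sketch using only the four numbers quoted above. The helper names are hypothetical, and the command string it builds is an assumption inferred from the "-v" fragments in this post (the esxcli kernel-settings form); check the KB article for the exact syntax before running anything.

```python
# Values quoted in this thread: name -> (current default, value from the KB).
SETTINGS = {
    "netPagePoolLimitPerGB": (5120, 15360),
    "netPagePoolLimitCap": (1048576, 1375920),
}


def describe_changes(settings):
    """Return (name, default, proposed, ratio) tuples, ratio rounded to 2 places."""
    rows = []
    for name, (default, proposed) in settings.items():
        rows.append((name, default, proposed, round(proposed / default, 2)))
    return rows


def build_set_commands(settings):
    """Build command strings for review only; nothing is executed here.

    ASSUMPTION: the settings are applied as VMkernel kernel settings with
    '-s <name> -v <value>'. This is inferred from the post, not confirmed.
    """
    return [
        f"esxcli system settings kernel set -s {name} -v {proposed}"
        for name, (_default, proposed) in settings.items()
    ]


for name, default, proposed, ratio in describe_changes(SETTINGS):
    print(f"{name}: {default} -> {proposed} (x{ratio})")

for command in build_set_commands(SETTINGS):
    print(command)
```

Seen this way, the per-GB limit triples while the cap rises by roughly a third, which may help when explaining the scope of the change to management.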

Mike_Intel · ‎06-17-2021

Hello Slesiak,

Thank you for posting in Intel Ethernet Communities.

We understand that you have also found a fix for the issue. If you have questions, please let us know.

In case we do not hear from you, we will follow up after 3 working days.

Thank you.

Best regards,

Michael L.

Intel® Customer Support Technician

Slesiak · ‎06-18-2021

Thanks for the response, Michael. We're still working through this. The prerequisites noted for the problem do not match what we have in our environment, so we're not sure the provided answer will fix the issue.

Mike_Intel · ‎06-20-2021

Hello Slesiak,

Thank you for the update. While waiting for your reply, can you provide the following details for me to check the issue as well?

Can you share the link of your latest driver?
Are you using onboard/embedded X710?
What is the brand and model of your system?
Other troubleshooting steps that you tried so far?

In case we do not hear from you, we will follow up after 3 working days.

Thank you.

Best regards,

Michael L.

Intel® Customer Support Technician

Slesiak · ‎06-21-2021

Can you share the link of your latest driver? https://support.hpe.com/hpesc/public/swd/detail?swItemId=MTX_4d2addac81bf4876b47f54925d&#tab4 <-- latest OEM driver from HPE, it's the one we have installed.
Are you using onboard/embedded X710? Yes, there's both.
What is the brand and model of your system? HPE ProLiant DL380 Gen10
Other troubleshooting steps that you tried so far? Unfortunately, none. These errors were found while trying to track down another issue with flapping MACs on Cisco CSRv devices. Even with the CSRs moved to another device and then shut off, these errors persist. I am also hesitant to do too much, because the devices are in production and I cannot take them offline for any work or testing without a solid cause. Although the errors are there, the management team does not feel there is enough of an issue to warrant taking any of the ESXi hosts offline.

Mike_Intel · ‎06-21-2021

Hello Slesiak,

Thank you for the quick reply. After checking all of the drivers and updates that you provided, let me ask whether you have already raised this issue with HP. The network card is embedded on the board, and the system builder is the one who validates the OS.

In case we do not hear from you, we will follow up after 3 working days.

Thank you.

Best regards,

Michael L.

Intel® Customer Support Technician

Slesiak · ‎06-22-2021

As mentioned earlier, we did bring this up with HPE and they are still looking at the issue. They did give us a response, but the symptoms and circumstances do not match.

https://kb.vmware.com/s/article/83243

However, all of our interfaces are 10Gb, not 25Gb as the article states. We're also on 7.0.2, and the KB states the issue was "fixed" in 7.0U1.

netPagePoolLimitPerGB -v 15360 (from the article) versus the current default of 5120
netPagePoolLimitCap -v 1375920 (from the article) versus the current default of 1048576

Because our circumstances differ from those described in the article, our management is hesitant to make any changes.

Mike_Intel · ‎06-22-2021

Hello Slesiak,

Thank you for the quick response. We do understand your situation; however, the network card is embedded on the HP system, so HP has altered the network card. My suggestion is that HP also investigate this as a new or different issue.

In case we do not hear from you, we will follow up after 3 working days.

Thank you.

Best regards,

Michael L.

Intel® Customer Support Technician

Mike_Intel · ‎06-27-2021

Hello Slesiak,

I hope you enjoyed your weekend. I am just checking if you are now talking to HP for further assistance regarding this issue.

In case we do not hear from you, we will follow up after 3 working days.

Thank you.

Best regards,

Michael L.

Intel® Customer Support Technician

Slesiak · ‎06-28-2021

Hello,

We've made the changes as suggested in https://kb.vmware.com/s/article/83243. We're going to monitor the ESXi host for a few days to make sure the issue doesn't come back. My only concern is that management did not want to move all of the services back to the host in case an issue came up during monitoring, so the host isn't necessarily under the same load as it was before.

I'll continue to monitor and let you know if they decide this was enough to allow the fix to go onto all of the systems.

Mike_Intel · ‎06-28-2021

Hello Slesiak,

Thank you for the update. I hope everything will get better after trying the recommendations. By the way, since you are now talking to HP for further assistance, would you like us to keep this thread open?

In case we do not hear from you, we will follow up after 3 working days.

Thank you.

Best regards,

Michael L.

Intel® Customer Support Technician

Mike_Intel · ‎07-01-2021

Hello Slesiak,

I hope this message finds you well. I am just checking if you are now talking with HP regarding the issue and I hope that the system is working fine now.

In case we do not hear from you, we will follow up after 3 working days.

Thank you.

Best regards,

Michael L.

Intel® Customer Support Technician

Slesiak · ‎07-01-2021

It looks like the response from HPE is working as intended.

Mike_Intel · ‎07-01-2021

Hello Slesiak,

Thank you so much for the update. We are glad that the recommendation is working. Please continue to coordinate with HP. Since you are now working with them, we will close this inquiry.

If you need further assistance again, please post a new question.

Thank you and stay safe.

Best regards,

Michael L.

Intel® Customer Support Technician