
Issue with HP DL380 Gen10 server with Intel XXV710-2

KPoku
Beginner

We are using HP DL380 Gen10 servers in our data center, each with two Intel XXV710-2 NICs and the SR-IOV feature enabled.

 

OS on servers is Ubuntu:

VERSION="16.04.6 LTS (Xenial Xerus)"

ID=ubuntu

ID_LIKE=debian

PRETTY_NAME="Ubuntu 16.04.6 LTS"

VERSION_ID="16.04"

Linux ri-cgn-kvm1 4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

 

Intel i40e drivers are up to date:

 i40e: Intel(R) 40-10 Gigabit Ethernet Connection Network Driver - version 2.8.43

 

Firmware version on servers:

iLO 5   1.40 Feb 05 2019   

System ROM   U30 v2.04 (04/18/2019)  

Intelligent Platform Abstraction Data   8.9.0 Build 38  

System Programmable Logic Device   0x2E   

Power Management Controller Firmware   1.0.4  

Power Supply Firmware   1.00   Bay 1   

Power Supply Firmware   1.00   Bay 2   

Innovation Engine (IE) Firmware   0.2.0.11   

Server Platform Services (SPS) Firmware   4.1.4.251   

Redundant System ROM   U30 v2.00 (02/02/2019)  

Intelligent Provisioning   3.30.213   System Board   

Power Management Controller FW Bootloader   1.1  

HPE Smart Storage Battery 1 Firmware   0.70   Embedded Device   

HPE Ethernet 1Gb 4-port 331i Adapter - NIC   20.14.54   

HPE Smart Array P408i-a SR Gen10   1.98   Embedded RAID   

Intel Ethernet Network Adapter XXV710-2   1.2154.0   PCI-E Slot1

Intel Ethernet Network Adapter XXV710-2   1.2154.0   PCI-E Slot4   

Embedded Video Controller   2.5   Embedded Device

 

On average once per week, we get the same error in iLO on a different server in our data center:

1. PCI Bus   Uncorrectable PCI Express Error Detected. Slot 4 (Segment 0x0, Bus 0xAE, Device 0x0, Function 0x0). Uncorrectable Error Status: 0x100000   05/22/2019 04:53:12   1   Hardware

2. System Error   Unrecoverable I/O Error has occurred. System Firmware will log additional details in a separate IML message entry if possible.   05/22/2019 04:53:12   1   Hardware

3. CPU   Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000040, Bank 0x00000006, Status 0xBB800000'00000E0B, Address 0x00000000'00000000, Misc 0x00000000'AE000000).

 

In this case the server stops responding; even the iLO console doesn't work, and only a reboot helps.

The error is related to the PCI-E slot where one of the Intel cards is installed.
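
For readers who want to decode the raw values in the iLO entries, here is a minimal Python sketch. It assumes that the iLO "Uncorrectable Error Status" field mirrors the PCIe AER Uncorrectable Error Status register and that the MCE "Status" field is a raw IA32_MCi_STATUS value; neither is confirmed in this thread. The function names are illustrative only.

```python
# Bit names follow the PCIe AER Uncorrectable Error Status register layout.
# Assumption: the iLO "Uncorrectable Error Status" value mirrors that register.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
}

# Architectural flag bits of IA32_MCi_STATUS (bits 63 down to 57).
# Assumption: the "Status" field in the iLO entry is this register's raw value.
MCI_STATUS_FLAGS = [
    (63, "VAL"), (62, "OVER"), (61, "UC"), (60, "EN"),
    (59, "MISCV"), (58, "ADDRV"), (57, "PCC"),
]


def decode_aer_uncorrectable(status):
    """Return the names of all known bits set in an AER status value."""
    return [name for bit, name in sorted(AER_UNCORRECTABLE_BITS.items())
            if status & (1 << bit)]


def decode_mci_status(text):
    """Split a status string such as 0xBB800000'00000E0B into flags and MCA code."""
    value = int(text.replace("'", ""), 16)  # iLO prints a ' between 32-bit halves
    flags = [name for bit, name in MCI_STATUS_FLAGS if value & (1 << bit)]
    return {"flags": flags, "mca_error_code": value & 0xFFFF}


if __name__ == "__main__":
    print(decode_aer_uncorrectable(0x100000))
    print(decode_mci_status("0xBB800000'00000E0B"))
```

Under those assumptions, 0x100000 has only bit 20 set (Unsupported Request Error), and the machine-check status has VAL, UC, EN, MISCV and PCC set with ADDRV clear, which is consistent with the all-zero Address field. This narrows down what was reported, not why it happened.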

 

There are also issues with SR-IOV: a VM that is using a VF stops processing traffic, and we see this in the kern.log file:

Jul 14 06:28:47 ri-cgn-kvm4 kernel: [803323.350238] i40e 0000:af:00.0: TX driver issue detected on VF 1

Jul 14 06:28:47 ri-cgn-kvm4 kernel: [803323.350241] i40e 0000:af:00.0: Use PF Control I/F to re-enable the VF
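
To see how often each VF is hit, the kern.log lines can be tallied per host, PF PCI address and VF number. The following is a minimal Python sketch, assuming the syslog line format shown above; the function name and the regular expression are illustrative, not part of any Intel tooling.

```python
import re
from collections import Counter

# Matches lines like:
#   Jul 14 06:28:47 host kernel: [803323.350238] i40e 0000:af:00.0: TX driver issue detected on VF 1
VF_TX_PATTERN = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+kernel:.*?"
    r"i40e\s+(?P<pci>[0-9a-fA-F:.]+):\s+"
    r"TX driver issue detected on VF\s+(?P<vf>\d+)"
)


def count_vf_tx_issues(lines):
    """Count 'TX driver issue' events keyed by (host, PF PCI address, VF number)."""
    counts = Counter()
    for line in lines:
        match = VF_TX_PATTERN.search(line)
        if match:
            key = (match.group("host"), match.group("pci"), int(match.group("vf")))
            counts[key] += 1
    return counts


# Sample lines copied from this thread; replace with the contents of kern.log.
SAMPLE_LINES = [
    "Jul 14 06:28:47 ri-cgn-kvm4 kernel: [803323.350238] i40e 0000:af:00.0: TX driver issue detected on VF 1",
    "Jul 14 06:28:47 ri-cgn-kvm4 kernel: [803323.350241] i40e 0000:af:00.0: Use PF Control I/F to re-enable the VF",
]

if __name__ == "__main__":
    for (host, pci, vf), n in sorted(count_vf_tx_issues(SAMPLE_LINES).items()):
        print(f"{host} {pci} VF {vf}: {n} event(s)")
```

Passing an open file handle for /var/log/kern.log instead of SAMPLE_LINES works the same way, since the function only iterates over lines.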

 

Has anyone else had these issues?

I've tried to contact HP support, but it seems we are using a NIC that is officially unsupported in HP servers.

 

Regards,

Kresimir

35 Replies
Caguicla_Intel
Moderator

Hello Kresimir,

 

Thank you for posting in Intel Ethernet Communities. 

 

Please provide the following details for us to check on your query.

1.) When was the issue first encountered?

2.) Were there any software or hardware changes prior to the issue?

3.) Where exactly does the error message appear?

4.) How many servers with XXV710 NICs are affected by this issue?

5.) Can you share more details on your issue regarding SR-IOV?

6.) Please provide the System Support Utility log of your system. This will allow us to check your Adapter details and configuration. This would also help us identify if you are using an OEM or retail version of Intel Ethernet Adapter. Kindly refer to the steps below.

a- https://downloadcenter.intel.com/product/91600/Intel-System-Support-Utility

b- Open SSU.exe

c- Mark the box "Everything" and then click "Scan".

d- When finished scanning, click "Next".

e- Click on "Save" and attach the file to a post.

 

Looking forward to your reply.

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

KPoku
Beginner

Hi Crisselle,

 

1.) This issue was first encountered on 15th June, a couple of days after we put the servers into production.

2.) At the beginning there was no SW or HW change. After we experienced this issue a couple of times, on 1st July we updated the firmware and OS drivers of the Intel NICs to the latest version.

3.) There are two types of issues. The first is when the whole server goes down; then we see this error in the iLO log:

1. PCI Bus   Uncorrectable PCI Express Error Detected. Slot 4 (Segment 0x0, Bus 0xAE, Device 0x0, Function 0x0). Uncorrectable Error Status: 0x100000   05/22/2019 04:53:12   1   Hardware

2. System Error   Unrecoverable I/O Error has occurred. System Firmware will log additional details in a separate IML message entry if possible.   05/22/2019 04:53:12   1   Hardware

3. CPU   Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000040, Bank 0x00000006, Status 0xBB800000'00000E0B, Address 0x00000000'00000000, Misc 0x00000000'AE000000).

The second issue is when one Virtual Function stops working; then we see this in the Ubuntu /var/log/kern.log file:

Jul 16 01:28:30 ri-cgn-kvm4 kernel: [58525.774076] i40e 0000:12:00.0: TX driver issue detected on VF 1

Jul 16 01:28:30 ri-cgn-kvm4 kernel: [58525.774079] i40e 0000:12:00.0: Use PF Control I/F to re-enable the VF

 

4.) There are 5 servers, and the issues appear randomly on all of them.

5.) As mentioned before, the SR-IOV issue is detected when a VM running on top of a VF stops processing network traffic. Then we see the messages I quoted above in the /var/log/kern.log file. After a reboot of the VM, everything works OK.

 

6.) I've attached the ri-cgn-kvm4.txt file.

 

Regards,

Kresimir

Caguicla_Intel
Moderator

Hello Kresimir,

 

Thank you for the prompt reply. 

 

We will check on your query and give you an update within 2-3 business days.

 

Hoping for your patience. 

 

(We might post on this thread requesting additional information that would help us investigate the issue.)

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

Caguicla_Intel
Moderator

Hello Kresimir,

 

Thank you for the patience on this matter.

 

Please try the latest i40e driver, 2.9.21, and check whether it helps with the issue.

https://downloadcenter.intel.com/download/24411/Intel-Network-Adapter-Driver-for-PCIe-40-Gigabit-Ethernet-Network-Connections-Under-Linux-?product=95259

 

Kindly share when the issue shows up. Is it during heavy network traffic?

 

Looking forward to your reply.

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

KPoku
Beginner

Hi Crisselle,

 

Thank you for your feedback. We will try to install the latest driver.

The issue appears randomly, even during off-peak hours when network traffic is minimal.

 

I was able to find this online:

https://ixnfo.com/en/solution-tx-driver-issue-detected-pf-reset-issued.html

What is your opinion on turning off offloading?

 

Regards,

Kresimir

Caguicla_Intel
Moderator

Hello Kresimir,

 

Thank you for the reply.

 

We will wait for your update once you've tried the latest driver.

 

Please allow us to check the website that you provided regarding offloading.

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

Caguicla_Intel
Moderator

Hello Kresimir,

 

We'd like to check whether you were able to update the driver to its latest version.

 

Please be informed that we are still checking the website that you provided regarding offloading.

 

Looking forward to your reply.

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

Mike_Intel
Moderator

Hello Kresimir,

 

Thank you for patiently waiting. The link you provided is a third-party site, so Intel cannot comment on it; however, here is our comment regarding offloading.

 

Disabling the NIC's offloading features can be beneficial for workloads that require low latency or quick response. However, this will increase CPU utilization.

 

Looking forward to your reply.

 

Best regards,

Michael L.

Intel Customer Support

A Contingent Worker at Intel

Caguicla_Intel
Moderator

Hello Kresimir,

 

We'd like to check if you have any other concerns or additional questions on this matter. If you do, please let us know for us to further assist you.

 

Looking forward to hearing from you.

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

Caguicla_Intel
Moderator

Hello Kresimir,

 

Since we haven't received any response to our previous follow-up, we will now proceed with closing this inquiry. If you have any other concerns or additional questions, please do not hesitate to post a new question.

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

KPoku
Beginner

Hi Crisselle,

Please do not close this inquiry yet. I'm on vacation and will be returning next week.

Regards,

Krešimir Pokupec

Caguicla_Intel
Moderator

Hello Kresimir,

 

We appreciate your reply.

 

Should you have any other concerns or need further assistance in the future, please do not hesitate to post a new question.

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

KPoku
Beginner

Hi Crisselle,

 

Regarding the driver update, we need to discuss this with our end customer, as they are in a network freeze period.

Is there any other action that you recommend besides the driver update?

 

Regards,

Kresimir

Caguicla_Intel
Moderator

Hello Kresimir,

 

Good day!

 

Please allow us to look into this further and check whether there are any other recommendations aside from the driver update. We will get back to you within 1-3 business days.

 

Hoping for your patience. 

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

Caguicla_Intel
Moderator

Hello Kresimir,

 

Thank you for the patience on this matter.

 

You may try turning off the offloading engines of the adapter. You can use the command ethtool -k <interface> to show which offloading features are enabled.
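
As an illustration of this suggestion, here is a minimal Python sketch that reads the text printed by ethtool -k, picks out the common offloads that are currently on and not marked [fixed], and builds the matching ethtool -K command to turn them off. The interface name ens1f0, the sample output and the function names are assumptions for illustration; the sketch only builds the command and does not run it.

```python
import subprocess

# Long feature names as printed by "ethtool -k", mapped to the short names
# that "ethtool -K" accepts on its command line.
LONG_TO_SHORT = {
    "rx-checksumming": "rx",
    "tx-checksumming": "tx",
    "scatter-gather": "sg",
    "tcp-segmentation-offload": "tso",
    "generic-segmentation-offload": "gso",
    "generic-receive-offload": "gro",
}


def read_features_text(dev):
    """Return the raw 'ethtool -k <dev>' output (requires ethtool to be installed)."""
    done = subprocess.run(["ethtool", "-k", dev],
                          capture_output=True, text=True, check=True)
    return done.stdout


def enabled_offloads(features_text):
    """List short names of known offloads that are 'on' and not '[fixed]'."""
    enabled = []
    for line in features_text.splitlines():
        name, sep, value = line.strip().partition(":")
        if not sep or name not in LONG_TO_SHORT:
            continue
        fields = value.split()
        if fields[:1] == ["on"] and "[fixed]" not in fields:
            enabled.append(LONG_TO_SHORT[name])
    return enabled


def build_disable_command(dev, short_names):
    """Build an 'ethtool -K <dev> <name> off ...' argument list."""
    command = ["ethtool", "-K", dev]
    for name in short_names:
        command += [name, "off"]
    return command


# Hypothetical sample output, trimmed; real output lists many more features.
SAMPLE_OUTPUT = """Features for ens1f0:
rx-checksumming: on
tx-checksumming: on
    tx-checksum-ipv4: on
scatter-gather: on
tcp-segmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
"""

if __name__ == "__main__":
    names = enabled_offloads(SAMPLE_OUTPUT)
    print(" ".join(build_disable_command("ens1f0", names)))
```

Settings changed with ethtool -K do not persist across reboots by default, so a test like this needs to be reapplied after a restart.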

 

We are looking forward to hearing an update from you after you try our suggestion.

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

KPoku
Beginner

Hello Crisselle,

 

I would like to inform you that we updated the NIC driver to the latest version, 2.9.21.

We also turned off the offloading feature on one server, just to check whether the issue reappears with offloading disabled.

Can we keep this case open for the next two weeks in case these actions don't resolve the issue?

 

Regards,

Kresimir

Caguicla_Intel
Moderator

Hello Kresimir,

 

Thank you for keeping us posted.

 

Sure, no problem. We will wait for your update over the next two weeks and will follow up on September 9, 2019.

 

Have a lovely day!

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

KPoku
Beginner

Hello Crisselle,

 

It seems that the driver update didn't help. The issue appeared again after the update:

Aug 19 23:15:37 ri-cgn-kvm3 kernel: [45675.220108] i40e 0000:af:00.0: TX driver issue detected on VF 1

Aug 19 23:15:37 ri-cgn-kvm3 kernel: [45675.220109] i40e 0000:af:00.0: Use PF Control I/F to re-enable the VF

 

We'll try to turn off the offloading feature.

 

Regards,

Kresimir

Caguicla_Intel
Moderator

Hello Kresimir,

 

Thank you for sharing your observation. We'll wait for your update on the results after turning off the offloading feature.

 

Best regards,

Crisselle C

Intel Customer Support

A Contingent Worker at Intel

KPoku
Beginner

Hello Crisselle,

 

Unfortunately, even after the driver upgrade and turning off the offloading feature, both issues appeared during the weekend.

One server crashed; here are the logs from iLO (the Intel XXV710-2 is installed in slot 4):

1. PCI Bus  Uncorrectable PCI Express Error Detected. Slot 4 (Segment 0x0, Bus 0xAE, Device 0x0, Function 0x0). Uncorrectable Error Status: 0x100000  08/31/2019 04:53:12  1  Hardware

2. System Error  Unrecoverable I/O Error has occurred. System Firmware will log additional details in a separate IML message entry if possible.  08/31/2019 04:53:12  1  Hardware

3. CPU  Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000040, Bank 0x00000006, Status 0xBB800000'00000E0B, Address 0x00000000'00000000, Misc 0x00000000'AE000000).

 

On another server, one VM stopped processing traffic, with the same error as before in /var/log/kern.log:

Aug 31 18:31:47 ri-cgn-kvm3 kernel: [803323.350238] i40e 0000:af:00.0: TX driver issue detected on VF 1

Aug 31 18:31:47 ri-cgn-kvm3 kernel: [803323.350241] i40e 0000:af:00.0: Use PF Control I/F to re-enable the VF

 

Could you please advise whether there is anything else we can try to resolve these issues?

 

Regards,

Kresimir
