We are using HP DL380 Gen10 servers in our data center, each with two Intel XXV710-2 NICs and the SR-IOV feature enabled.
The OS on the servers is Ubuntu:
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
Linux ri-cgn-kvm1 4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
The Intel i40e drivers are up to date:
i40e: Intel(R) 40-10 Gigabit Ethernet Connection Network Driver - version 2.8.43
Firmware version on servers:
iLO 5 1.40 Feb 05 2019
System ROM U30 v2.04 (04/18/2019)
Intelligent Platform Abstraction Data 8.9.0 Build 38
System Programmable Logic Device 0x2E
Power Management Controller Firmware 1.0.4
Power Supply Firmware 1.00 Bay 1
Power Supply Firmware 1.00 Bay 2
Innovation Engine (IE) Firmware 0.2.0.11
Server Platform Services (SPS) Firmware 4.1.4.251
Redundant System ROM U30 v2.00 (02/02/2019)
Intelligent Provisioning 3.30.213 System Board
Power Management Controller FW Bootloader 1.1
HPE Smart Storage Battery 1 Firmware 0.70 Embedded Device
HPE Ethernet 1Gb 4-port 331i Adapter - NIC 20.14.54
HPE Smart Array P408i-a SR Gen10 1.98 Embedded RAID
Intel Ethernet Network Adapter XXV710-2 1.2154.0 PCI-E Slot1
Intel Ethernet Network Adapter XXV710-2 1.2154.0 PCI-E Slot4
Embedded Video Controller 2.5 Embedded Device
On average once per week, we get the same error in iLO on a different server in our data center:
1. PCI Bus Uncorrectable PCI Express Error Detected. Slot 4 (Segment 0x0, Bus 0xAE, Device 0x0, Function 0x0). Uncorrectable Error Status: 0x100000 05/22/2019 04:53:12 1 Hardware
2. System Error Unrecoverable I/O Error has occurred. System Firmware will log additional details in a separate IML message entry if possible. 05/22/2019 04:53:12 1 Hardware
3. CPU Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000040, Bank 0x00000006, Status 0xBB800000'00000E0B, Address 0x00000000'00000000, Misc 0x00000000'AE000000).
In this case the server stops responding; even the iLO console doesn't work, and only a reboot helps.
The error is related to the PCI-E slot where the Intel cards are installed.
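For anyone reading the first iLO entry, the status value can be decoded by hand. The sketch below assumes that iLO prints the raw PCI Express Advanced Error Reporting (AER) Uncorrectable Error Status register, which the log itself does not confirm. The bit names follow the PCIe specification; the function name is only illustrative.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register.
# Assumption: the iLO "Uncorrectable Error Status" field is this raw register.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_aer_uncorrectable(status):
    """Return the names of all error bits set in the status value."""
    return [
        name
        for bit, name in sorted(AER_UNCORRECTABLE_BITS.items())
        if status & (1 << bit)
    ]


# Value taken from the iLO entry above.
print(decode_aer_uncorrectable(0x100000))
```

Under that assumption, 0x100000 has only bit 20 set, which would correspond to an Unsupported Request error on the slot.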
There are also issues with SR-IOV: a VM that is using a VF stops processing traffic, and we see this in the kern.log file:
Jul 14 06:28:47 ri-cgn-kvm4 kernel: [803323.350238] i40e 0000:af:00.0: TX driver issue detected on VF 1
Jul 14 06:28:47 ri-cgn-kvm4 kernel: [803323.350241] i40e 0000:af:00.0: Use PF Control I/F to re-enable the VF
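To see how often each VF is affected, lines like these can be counted per PF address and VF number. This is a minimal sketch that assumes the message format stays exactly as shown in the log above; the function name and the sample list are illustrative only.

```python
import re
from collections import Counter

# Matches the i40e message format seen in the kern.log lines above.
VF_TX_ISSUE = re.compile(
    r"i40e (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"TX driver issue detected on VF (?P<vf>\d+)"
)


def count_vf_tx_issues(lines):
    """Count 'TX driver issue' events per (PF PCI address, VF number)."""
    counts = Counter()
    for line in lines:
        match = VF_TX_ISSUE.search(line)
        if match:
            counts[(match.group("pci"), int(match.group("vf")))] += 1
    return counts


sample = [
    "Jul 14 06:28:47 ri-cgn-kvm4 kernel: [803323.350238] i40e 0000:af:00.0: TX driver issue detected on VF 1",
    "Jul 14 06:28:47 ri-cgn-kvm4 kernel: [803323.350241] i40e 0000:af:00.0: Use PF Control I/F to re-enable the VF",
]
print(count_vf_tx_issues(sample))
```

On a real host, the same function could be fed an open file object for /var/log/kern.log instead of the sample list.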
Has anyone had these issues?
I've tried to contact HP support, but it seems we are using a NIC that is officially unsupported in HP servers.
Regards,
Kresimir
Hello Kresimir,
Thank you for posting in Intel Ethernet Communities.
Please provide the following details for us to check on your query.
1.) When was the issue first encountered?
2.) Were there any software or hardware changes prior to the issue?
3.) Where exactly does the error message appear?
4.) How many servers with XXV710 NICs are affected by this issue?
5.) Can you share more details on your issue regarding SR-IOV?
6.) Please provide the System Support Utility log of your system. This will allow us to check your Adapter details and configuration. This would also help us identify if you are using an OEM or retail version of Intel Ethernet Adapter. Kindly refer to the steps below.
a- https://downloadcenter.intel.com/product/91600/Intel-System-Support-Utility
b- Open SSU.exe
c- Mark the box "Everything" and then click "Scan".
d- When finished scanning, click "Next".
e- Click on "Save" and attach the file to a post.
Looking forward to your reply.
Best regards,
Crisselle C
Intel Customer Support
A Contingent Worker at Intel
Hi Crisselle,
1.) This issue was first encountered on 15th June, a couple of days after we put the servers into production.
2.) At the beginning there was no SW or HW change. After we experienced this issue a couple of times, on 1st July we updated the firmware and OS drivers of the Intel NICs to the latest version.
3.) There are two types of issues. In the first, the whole server goes down, and we see this error in the iLO log:
1. PCI Bus Uncorrectable PCI Express Error Detected. Slot 4 (Segment 0x0, Bus 0xAE, Device 0x0, Function 0x0). Uncorrectable Error Status: 0x100000 05/22/2019 04:53:12 1 Hardware
2. System Error Unrecoverable I/O Error has occurred. System Firmware will log additional details in a separate IML message entry if possible. 05/22/2019 04:53:12 1 Hardware
3. CPU Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000040, Bank 0x00000006, Status 0xBB800000'00000E0B, Address 0x00000000'00000000, Misc 0x00000000'AE000000).
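The Status value in entry 3 can be unpacked into its architectural flag bits. The sketch below assumes that iLO prints the raw 64-bit IA32_MCi_STATUS register of the reporting bank; the bit positions are the architectural ones from the Intel Software Developer's Manual, and the function name is illustrative.

```python
# Architectural flag bits of IA32_MCi_STATUS (bits 63..57).
# Assumption: the iLO "Status" field is the raw register value.
MCI_STATUS_FLAGS = [
    (63, "VAL"),    # register contains valid information
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # error was not corrected by hardware
    (60, "EN"),     # error reporting was enabled
    (59, "MISCV"),  # MISC register is valid
    (58, "ADDRV"),  # ADDR register is valid
    (57, "PCC"),    # processor context may be corrupted
]


def decode_mci_status(status):
    """Return (set flag names, 16-bit MCA error code) for a status value."""
    flags = [name for bit, name in MCI_STATUS_FLAGS if status & (1 << bit)]
    mca_error_code = status & 0xFFFF
    return flags, mca_error_code


# Value taken from the iLO entry above (the apostrophe is only a separator).
flags, code = decode_mci_status(0xBB80000000000E0B)
print(flags, hex(code))
```

Under that assumption, ADDRV is clear, which matches the all-zero Address field, while MISCV is set, which matches the populated Misc field.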
The second issue is when one Virtual Function stops working. Then we see this in the Ubuntu /var/log/kern.log file:
Jul 16 01:28:30 ri-cgn-kvm4 kernel: [58525.774076] i40e 0000:12:00.0: TX driver issue detected on VF 1
Jul 16 01:28:30 ri-cgn-kvm4 kernel: [58525.774079] i40e 0000:12:00.0: Use PF Control I/F to re-enable the VF
4.) There are 5 servers, and the issues appear randomly on all of them.
5.) As mentioned before, the SR-IOV issue is detected when a VM running on top of a VF stops processing network traffic. Then we see the messages quoted above in the /var/log/kern.log file. After a reboot of the VM, everything works OK.
6.) I've attached the ri-cgn-kvm4.txt file.
Regards,
Kresimir
Hello Kresimir,
Thank you for the prompt reply.
We will check on your query and give you an update within 2-3 business days.
Hoping for your patience.
(We might post on this thread requesting additional information that would help us investigate the issue.)
Best regards,
Crisselle C
Intel Customer Support
A Contingent Worker at Intel
Hello Kresimir,
Thank you for your patience on this matter.
Please try the latest i40e driver, 2.9.21, and check whether it helps with the issue.
https://downloadcenter.intel.com/download/24411/Intel-Network-Adapter-Driver-for-PCIe-40-Gigabit-Ethernet-Network-Connections-Under-Linux-?product=95259
Kindly share when the issue shows up. Is it during heavy network traffic?
Looking forward to your reply.
Best regards,
Crisselle C
Intel Customer Support
A Contingent Worker at Intel
Hi Crisselle,
Thank you for your feedback. We will try to install the latest driver.
The issue appears randomly, even during non-peak hours and when network traffic is minimal.
I was able to find this online:
https://ixnfo.com/en/solution-tx-driver-issue-detected-pf-reset-issued.html
What is your opinion on turning off offloading?
Regards,
Kresimir
Hello Kresimir,
Thank you for the reply.
We will wait for your update once you've tried the latest driver.
Please allow us to check the website that you provided regarding offloading.
Best regards,
Crisselle C
Intel Customer Support
A Contingent Worker at Intel
Hello Kresimir,
We'd like to check whether you were able to update the driver to its latest version.
Please be informed that we are still checking the website that you provided regarding offloading.
Looking forward to your reply.
Best regards,
Crisselle C
Intel Customer Support
A Contingent Worker at Intel
Hello Kresimir,
Thank you for patiently waiting. The link that you provided is a third-party link, so Intel cannot comment on it further. However, here is our comment regarding offloading.
Disabling the offloading features of the NIC can be beneficial for workloads that require low latency or quick response. However, this will increase CPU utilization.
Looking forward to your reply.
Best regards,
Michael L.
Intel Customer Support
A Contingent Worker at Intel
Hello Kresimir,
We'd like to check whether you have any other concerns or additional questions on this matter. If you do, please let us know so that we can assist you further.
Looking forward to hearing from you.
Best regards,
Crisselle C
Intel Customer Support
A Contingent Worker at Intel
Hello Kresimir,
Since we haven't received any response to our previous follow-up, we will now proceed with closing this inquiry. If you have any other concerns or additional questions, please do not hesitate to post a new question.
Best regards,
Crisselle C
Intel Customer Support
A Contingent Worker at Intel
Hello Kresimir,
We appreciate your reply.
Should you have any other concerns or need further assistance in the future, please do not hesitate to post a new question.
Best regards,
Crisselle C
Intel Customer Support
A Contingent Worker at Intel
Hi Crisselle,
Regarding the driver update, we need to discuss this with our end customer, as they are in a network freeze period.
Is there any other action that you recommend besides the driver update?
Regards,
Kresimir
Hello Kresimir,
Good day!
Please allow us to look into this further and check whether there are any other recommendations aside from the driver update. We will get back to you within 1-3 business days.
Hoping for your patience.
Best regards,
Crisselle C
Intel Customer Support
A Contingent Worker at Intel
Hello Kresimir,
Thank you for your patience on this matter.
You may try to turn off the offloading engines of the adapter. You can use the command ethtool -k <interface> to show the offloading features that are enabled.
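As a rough sketch of what turning the offloads off could look like when scripted: ethtool -k (lowercase) lists the features, and ethtool -K (uppercase) changes them. The feature list (tso, gso, gro) and the interface name eth0 below are assumptions for illustration, not a recommendation for this specific adapter; the helper names are hypothetical.

```python
import subprocess

# Offloads commonly toggled with "ethtool -K".
# Assumption: which of these matter for this issue is not established.
DEFAULT_OFFLOADS = ("tso", "gso", "gro")


def build_offload_command(interface, features=DEFAULT_OFFLOADS, state="off"):
    """Build the ethtool argument list that sets each feature to state."""
    if state not in ("on", "off"):
        raise ValueError("state must be 'on' or 'off'")
    command = ["ethtool", "-K", interface]
    for feature in features:
        command += [feature, state]
    return command


def apply_offload_settings(interface, features=DEFAULT_OFFLOADS, state="off"):
    """Run the command; needs root and a real interface name."""
    subprocess.run(build_offload_command(interface, features, state), check=True)


# "eth0" is a placeholder interface name; this only prints the command.
print(" ".join(build_offload_command("eth0")))
```

Settings changed this way do not persist across reboots, so they would need to be reapplied at boot if they turn out to help.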
We look forward to hearing an update from you after you try our suggestion.
Best regards,
Crisselle C
Intel Customer Support
A Contingent Worker at Intel
Hello Crisselle,
I would like to inform you that we updated the NIC driver to the latest version, 2.9.21.
We also turned off the offloading feature on one server, just to check whether the issue reappears with offloading disabled.
Can we keep this case open for the next two weeks in case these actions don't resolve the issue?
Regards,
Kresimir
Hello Kresimir,
Thank you for keeping us posted.
Sure, no problem. We will wait for your next update over the next two weeks, and we will follow up on September 9, 2019.
Have a lovely day!
Best regards,
Crisselle C
Intel Customer Support
A Contingent Worker at Intel
Hello Crisselle,
It seems that the driver update didn't help. The issue appeared again after the driver update:
Aug 19 23:15:37 ri-cgn-kvm3 kernel: [45675.220108] i40e 0000:af:00.0: TX driver issue detected on VF 1
Aug 19 23:15:37 ri-cgn-kvm3 kernel: [45675.220109] i40e 0000:af:00.0: Use PF Control I/F to re-enable the VF
We'll try to turn off the offloading feature.
Regards,
Kresimir
Hello Kresimir,
Thank you for sharing your observation. We'll wait for your update on the results after turning off the offloading feature.
Best regards,
Crisselle C
Intel Customer Support
A Contingent Worker at Intel
Hello Crisselle,
Unfortunately, even after the driver upgrade and turning off the offloading feature, both issues appeared during the weekend.
One server crashed. Here are the logs from iLO (the Intel XXV710-2 is installed in slot 4):
1. PCI Bus Uncorrectable PCI Express Error Detected. Slot 4 (Segment 0x0, Bus 0xAE, Device 0x0, Function 0x0). Uncorrectable Error Status: 0x100000 08/31/2019 04:53:12 1 Hardware
2. System Error Unrecoverable I/O Error has occurred. System Firmware will log additional details in a separate IML message entry if possible. 08/31/2019 04:53:12 1 Hardware
3. CPU Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000040, Bank 0x00000006, Status 0xBB800000'00000E0B, Address 0x00000000'00000000, Misc 0x00000000'AE000000).
On the other server, one VM stopped processing traffic, with the same error as before in /var/log/kern.log:
Aug 31 18:31:47 ri-cgn-kvm3 kernel: [803323.350238] i40e 0000:af:00.0: TX driver issue detected on VF 1
Aug 31 18:31:47 ri-cgn-kvm3 kernel: [803323.350241] i40e 0000:af:00.0: Use PF Control I/F to re-enable the VF
Could you please advise whether there is anything else that we can try to resolve these issues?
Regards,
Kresimir