Intel X710 woes - Page 2

DHend8 · ‎04-23-2018

We have 8 brand new HPE DL380 Gen10 servers. Each of these servers has two HPE Ethernet 10Gb 2-port 562SFP+ Adapters: one is embedded and one is on a PCI card. This adapter is based on the Intel X710 controller.

https://www.hpe.com/us/en/product-catalog/servers/server-adapters/pip.hpe-ethernet-10gb-2-port-562sfp-adapter.8245220.html HPE Ethernet 10Gb 2-port 562SFP+ Adapter OID8245220 | HPE™

We hooked up two DAC cables, one from each card, to our Juniper switch. The two ports on the Juniper switch are standard access ports. After installing the HPE customized version of ESXi 6.5 U1, we went into the console and added the two nics. We entered an IP address, gateway, mask, and DNS servers and then rebooted the host. During the reboot we had a continuous ping going to the IP address of the ESXi management interface. The ping started replying part way through the boot process, and once the host had fully booted the replies stopped.

If I remove one of the 10Gb nics from the management network, the ping returns. If I use two of the 1Gb interfaces on this server, which are based on a Broadcom chipset, the management interface works fine.

The server has had all of its firmware upgraded. We are running driver version 1.5.8 and firmware version 10.2.5. The firmware came from the link below; I think it corresponds to VMware's firmware version 6.00.

https://support.hpe.com/hpsc/swd/public/detail?sp4ts.oid=1008830010&swItemId=MTX_87c83853cb5a4bc5949e9b0dd5&swEnvOid=4184# tab1 Drivers & Software - HPE Support Center.

I have a ticket opened with HPE but wanted to find out if Intel might have a solution for this

DHend8 · ‎04-27-2018

Are you seeing the same behavior I am?

Two 1Gb connections (not the X710 card, other nics) in active/active (under portgroup and vswitch) with External Switch Tagging and management network is fine
One X710 card active for the management network with External Switch Tagging and management network is fine
Two X710 cards in active/active (under portgroup and vswitch) with External Switch Tagging and management network passes no traffic
Two X710 cards in active/active (under portgroup and vswitch) with Virtual Switch Tagging and management network is fine
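
For reference, the host side of the fourth configuration (Virtual Switch Tagging) amounts to putting a VLAN ID on the management port group; the switch ports have to be changed from access to trunk separately. A minimal sketch, assuming the default port group name "Management Network" and a placeholder VLAN ID of 100, neither of which is stated in this thread. The function name is hypothetical.

```shell
# Tag the management port group with a VLAN ID (Virtual Switch Tagging),
# then list the port groups so the new VLAN ID can be checked.
# "Management Network" is the ESXi default name; adjust if yours differs.
set_mgmt_vlan() {
    vlan_id="$1"
    esxcli network vswitch standard portgroup set \
        --portgroup-name="Management Network" --vlan-id="$vlan_id"
    esxcli network vswitch standard portgroup list
}

# Usage on the ESXi host (VLAN 100 is a placeholder, not a value from this thread):
#   set_mgmt_vlan 100
```

Setting the VLAN ID back to 0 returns the port group to untagged (External Switch Tagging) behavior.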

TShil · ‎04-27-2018

Yes, we see the same.

After a 4 hour support call with Juniper JTAC we confirmed it's an ARP issue with the X710. The X710's ARP entry is in the Juniper ARP table while the ICMP pings fail. If we clear the X710's ARP entry on the Juniper, the pings reply about 20 times and then stop again. We then placed a layer 3 interface on the access switch that feeds the X710 uplink so we could see a trace from both switches (access and core). With ICMP running, both switches see the packets bidirectionally. This shows that the packets are getting to the X710 but the X710 stops replying. We believe the issue is tied directly to the X710 and its ARP handling when the switch ports are in access mode. Next we followed your solution: we changed the interfaces to trunk, tagged vSwitch0 with the VLAN number, and everything works fine.

We agree that your findings are correct when the X710s are teamed, and we have the same question for Intel: which logs should show Malicious Driver Detection?
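
One place to start looking is the host's vmkernel log. The sketch below searches a log file for the phrase "Malicious Driver Detection"; the exact wording the i40en driver logs is an assumption here (it is the wording used by Intel's Linux i40e driver), and the function name is hypothetical. /var/log/vmkernel.log is the usual location on ESXi.

```shell
# Print log lines that mention a Malicious Driver Detection (MDD) event.
# Defaults to the ESXi vmkernel log; pass another file to search it instead.
find_mdd_events() {
    log_file="${1:-/var/log/vmkernel.log}"
    # -i: case-insensitive match. The search phrase is an assumption about
    # how the driver words the event, not something confirmed in this thread.
    grep -i "malicious driver detection" "$log_file"
}

# Usage on the ESXi host:
#   find_mdd_events
#   find_mdd_events /var/log/vmkernel.log | tail -n 20
```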

DHend8 · ‎04-27-2018

So are you thinking, like I am, that this is all related to the i40en driver? We are using the latest, 1.5.8, and are still having this issue.

My hope is that a new i40en driver will be released next week that fixes this. It sounds like a new driver release is imminent.

TShil · ‎04-27-2018

We are seeing mixed results. We are running other servers in the same environment with X710s on Firmware Version 18.0.17 and Driver Version 1.3.1 with no issues. The new server came with the same firmware/driver combo and we saw issues out of the box. We upgraded and downgraded firmware and drivers during our troubleshooting with no success.

Our theory is that a corrupt version of the firmware was applied during manufacturing or is out on vendor sites. Once applied, even re-flashing the firmware leaves something behind in the X710 that causes the issue.

DHend8 · ‎04-27-2018

I am not sure how much each vendor, Dell and HP, is involved in the firmware and drivers for these cards, but there is some level of involvement. The firmware on our X710 cards is version 10.2.5, which comes from HP's site:

https://support.hpe.com/hpsc/swd/public/detail?sp4ts.oid=1008830010&swItemId=MTX_87c83853cb5a4bc5949e9b0dd5&swEnvOid=4184# tab3 Drivers & Software - HPE Support Center.

The driver is version 1.5.6 out of the box which also comes from HP's site

https://support.hpe.com/hpsc/swd/public/detail?sp4ts.oid=1008830010&swItemId=MTX_03c21f88fa3447e78d78770331&swEnvOid=4184# tab4 Drivers & Software - HPE Support Center.

Of course when I go to VMware's site I see the newer 1.5.8 driver which we have tried along with firmware version 6.01

https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=io&productid=37994 VMware Compatibility Guide - I/O Device Search

Not sure if HP's firmware version 10.2.5 is the same as VMware's 6.01. There is also the i40e driver version 2.0.7 on the VMware page which goes along with firmware 5.05

The version of the Dell firmware and driver are completely different

Just waiting for the right combination of firmware and drivers that makes all of this work

TShil · ‎04-27-2018

We tried every combo of Dell firmware and driver on the HCL with no success. If we hadn't stumbled across your post we would still be guessing. There is no harm in trunking the ports and using Virtual Switch Tagging, so for now we will go with your solution.

Thanks again.

DHend8 · ‎04-27-2018

In our case these are brand new servers that we took delivery of a few weeks back. They are replacing all of the 5 year old servers in our data center and DR site. Each server has two of the HPE nics based on the Intel X710 chipset, with each nic having two 10Gb SFP+ ports. In our data center we are using the 1Gb Broadcom nics for the management network. In our DR site, though, we are using two of the 10Gb nics for the management network. Normally the management network is a low bandwidth application, but we use Veeam for backup and replication. During replication our Veeam proxy server reads data directly from our SAN but writes data through vCenter. This means the target replication traffic goes through the management network, hence the need for 10Gb nics.

We will be using two of the 10Gb nics to carry iSCSI, VM, and vMotion traffic. If we do use trunk ports for the management network and Virtual Switch Tagging, my worry is what is going to happen on the other 10Gb ports. In other words, if I put these servers into production the way they are, will I see problems with storage, VM, or vMotion traffic? I am waiting to hear when the new i40en driver will be released and hopefully it will fix this problem. If you Google "Intel X710 ESXi" or "Intel X710 Linux" it is easy to find many cases where this nic has had issues stretching back nearly two years.

As soon as the new i40en driver is released and we have tested it, I will report back

DHend8 · ‎04-29-2018

Maverick85,

Any chance you could pass along your JTAC case number? I would like to give the Juniper engineer I am working with as well as the 3rd level support engineer at HPE any information that might help them solve this issue

TShil · ‎04-30-2018

In our setup we are using two teamed 10gb nics for each: management/veeam | vSan (all flash) | vMotion | Virtual Machine Guests (Tagged)

If you are going to use the two 10Gb nics to carry iSCSI, VM, and vMotion traffic, I would assume you will need to trunk and tag if they are on separate subnets. Even if they are all on the same subnet, you will still need to trunk and tag to keep the X710 working properly.

We believe it should not matter, but we are confirming with VMware that there is no issue with tagging/trunking our vSAN and vMotion networks even though they are dedicated.

These are vSAN articles but they mention tagging on a shared network.

"VLAN ID – If you are using VLANs to separate vSAN traffic, enter the relevant VLAN ID."

https://kb.vmware.com/s/article/2058368

Look at section "Allocating Bandwidth for vSAN by Using Network I/O Control"

https://docs.vmware.com/en/VMware-vSphere/6.5/com.vmware.vsphere.virtualsan.doc/GUID-031F9637-EE29-4684-8644-7A93B9FD8D7B.html

nbala3 · ‎04-30-2018

Hi Everyone,

I am also facing a similar issue. I am working on an HP DL560 Gen10 server with an HPE Ethernet 10Gb 2-port 562SFP+ Adapter.

[root@:~] ethtool -i vmnic6
driver: i40e
version: 2.0.7
firmware-version: 6.00 0x8000366c 1.1825.0
bus-info: 0000:4e:00.0
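
When comparing several hosts or nics, it can help to condense this output to one line. A small sketch that reads `ethtool -i` output on stdin; the function name is hypothetical and the field names are taken from the output above.

```shell
# Condense `ethtool -i <nic>` output (read from stdin) into one summary line:
#   <driver> <version> (firmware <firmware-version>)
nic_driver_summary() {
    awk -F': ' '
        $1 == "driver"           { drv = $2 }
        $1 == "version"          { ver = $2 }
        $1 == "firmware-version" { fw  = $2 }
        END { printf "%s %s (firmware %s)\n", drv, ver, fw }
    '
}

# Usage on the ESXi host:
#   ethtool -i vmnic6 | nic_driver_summary
```

This makes it quick to see whether a nic is bound to the legacy i40e or the native i40en driver after a change.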

Above are the details of the driver and firmware versions currently in use. We are having intermittent performance issues while accessing remote desktops of VMs: virtual machines hang and application performance is slow.

As recommended by HPE, I updated the driver to ESXi 6.0 i40en 1.5.8.

I also ran the commands to disable the legacy driver and enable the native driver.
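
The post does not list the commands that were run. A typical sequence for switching from the legacy i40e module to the native i40en module is sketched below; treat it as an assumption about what was done, not a record of it. The function name is hypothetical.

```shell
# Disable the legacy i40e module and enable the native i40en module.
# The host must be rebooted afterwards for the change to take effect.
use_native_i40en() {
    esxcli system module set --enabled=false --module=i40e
    esxcli system module set --enabled=true --module=i40en
}

# Usage on the ESXi host:
#   use_native_i40en && reboot
# Swapping true/false in the two commands reverts to the legacy driver.
```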

After rebooting the server, I moved a few machines to it and all the VMs lost network connectivity.

It looks like the adapters are not able to pass any traffic.

How can I fix this? For now I have reverted to the i40e driver, which shows the performance problems.

Could someone please let me know a way out?

TShil · ‎04-30-2018

nidhinmds,

Our issue is that we lose all network connectivity, not intermittent performance.

Is your 10Gb network teamed and tagged?

What switches are you using?

DHend8 · ‎04-30-2018

Mavericks85

HPE support gave me this command to run

esxcli system module parameters set -m i40en -p LLDP=0,0

With two access ports and two nics devoted to the management network in active/active, I now have connectivity. It appears to be an issue with LLDP. Since we are a Juniper shop we use LLDP (standard) and not CDP (Cisco proprietary). I believe the command above disables LLDP, which is an issue for us. HPE is still working on the root cause of the LLDP issue in our setup. I still think a driver update will be needed to fix this, but we are getting closer to a root cause.
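
To check what value the module is configured with, the parameter can be read back. A minimal sketch, assuming the i40en module is installed; the function name is hypothetical. As far as is known, module parameter changes only take effect the next time the module loads, which in practice means after a reboot.

```shell
# Show the LLDP row from the i40en module's parameter table.
show_i40en_lldp() {
    esxcli system module parameters list --module=i40en | grep -i "lldp"
}

# Usage on the ESXi host, before and after the reboot:
#   show_i40en_lldp
```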

TShil · ‎04-30-2018

HendersonD

I will give that a try tomorrow and share my results.

Thanks

TShil · ‎05-01-2018

HendersonD,

The command "esxcli system module parameters set -m i40en -p LLDP=0,0" (with 8 zeros for us, since we have 8 nics) does not fix our issue.
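
Since the LLDP parameter takes one value per port, the string has to match the number of X710 ports in the host. A small sketch that builds it; the function name is hypothetical, and the port count is whatever applies to your host.

```shell
# Build the i40en LLDP parameter string with one value per port.
#   $1 = number of ports, $2 = value for each port (default 0 = disabled)
lldp_param() {
    count="$1"
    value="${2:-0}"
    list=""
    i=0
    while [ "$i" -lt "$count" ]; do
        # Append a comma separator only when the list is non-empty.
        list="${list:+$list,}$value"
        i=$((i + 1))
    done
    echo "LLDP=$list"
}

# Usage on the ESXi host (8 ports, as in the post above):
#   esxcli system module parameters set -m i40en -p "$(lldp_param 8)"
```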

But I have tested the trunk/tag with the drivers below and all work:

i40en 1.3.1-5vmw.650.1.26.5969303

i40en 1.5.8-1OEM.650.0.0.4598673

i40e 2.0.7-1OEM.600.0.0.2494585

TShil · ‎05-01-2018

As always with the X710, it comes down to finding the correct firmware/driver combo.

The only combo on which the LLDP=0 command works on my Dell x730xd system is Dell Firmware 18.3.6 with i40en Driver 1.5.8-1OEM.650.0.0.4598673.vib (VMW-ESX-6.5.0-i40en-1.5.8-7759470.zip).

idata · ‎08-28-2018

Hi ashleybanks,

We are working on the next SW release which will include i40en drivers for ESX 6.0 and 6.5 that address the MDD issue. We appreciate your patience.

This thread will be updated as soon as the SW release is available for download.

Regards,

Vince T.

Intel Customer Support

idata · ‎09-19-2018

Hi ashleybanks,

Please be informed that the latest i40en driver, version 1.7.11, is now available. This driver addresses the Malicious Driver Detection issue of Intel Ethernet 700 Series Network Adapters (X710, XL710, XXV710 and X722) on ESXi 6.0, ESXi 6.5 and ESXi 6.7.

Kindly visit the link below for additional information:

https://communities.intel.com/community/tech/wired/blog/2018/05/23/malicious-driver-detection-mdd-event-resolved

Best Regards,

Vince T.

Intel Customer Support

idata · ‎09-23-2018

Hi ashleybanks,

Please let us know if you were able to try out the latest i40en driver version 1.7.11.

Looking forward to your response.

Best Regards,

Vince T.

Intel Customer Support

idata · ‎09-27-2018

Hi ashleybanks,

We'd like to check if you still need assistance from Intel Wired Communities. Thanks.

Best Regards,

Vince T.

Intel Customer Support