idata
Community Manager
3,854 Views

Intel X710-T4 - Issues with SAN(s)

We added a new node to our cluster, and the server has two Intel X710-T4 cards. Both of our SANs are currently only 1Gb (a Tegile T3100 and a Lenovo px12-450r). Whenever I try to set up SAN connections using one of the ports on these NICs, it causes all sorts of issues: CSVs go offline and report being corrupted and unreadable, we've lost VM configs, and VMs stop booting from any volumes hosted by this node (2016 Hyper-V cluster). The odd part is that this same node was working fine in our 2012 cluster before we updated/migrated to 2016. I have the latest drivers for the adapter. Here are some things that have happened:

- When I initially brought the node into the cluster, I configured all 3 connections to the SANs using the X710-T4: 2 for the Tegile (one for each independent switch/controller) and 1 for the Lenovo, which is directly connected.

- Right off the bat, each time this node took ownership of either volume on the Lenovo, the VMs on that volume would no longer boot. They would give different weird boot errors, and eventually it would even hard-lock the Lenovo SAN itself and I'd have to hard boot it. I moved that connection down to the onboard Broadcom 1Gb NIC and that solved those issues.

- The Tegile was still connected via the X710-T4, and while it continued to operate, some odd things happened here as well. Sometimes the list of connected iSCSI devices would just be blank on that node even though it was still operating.

- In the latest case, the node took ownership of a LUN from the Tegile and immediately all the VMs on that LUN stopped working and the CSV reported as corrupt and unreadable. I moved the CSV to another node and after a while it finally started working again.

- The problem is that this node insists on taking ownership of storage nodes and there doesn't appear to be a way to stop it (I can't set node preferences on CSVs).

So right now I'm scared to unpause this node and am contemplating just moving the Tegile connections to the Broadcom as well to hopefully avoid all the hassle... but when we eventually do upgrade our SAN and go 10Gb, I don't want to run into this issue again.

I realize this is probably incredibly hard to decipher... at this point I'm just looking for suggestions. Is it one of the adapter properties? The adapters that connect to the SANs have all protocols except IPv4 unchecked, have jumbo frames set to 9014, and are set to not allow the OS to turn them off (the power-saving option). Aside from that they are basically at default settings. I think I could probably disable SR-IOV on these adapters, but is that causing my issue? (I doubt it.) Let me know what you think!
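One setting from the list above that may be worth ruling out is the jumbo frame size, since every hop between the node and the SAN has to accept the same frame size. The sketch below only shows the arithmetic for picking a test ping size. It assumes the driver's 9014 value counts the 14-byte Ethernet header (so the IP MTU is 9000) and that the frames are untagged IPv4 with no IP options. The function names are made up for illustration.

```python
# Header sizes in bytes. These assume untagged Ethernet and IPv4 without options.
ETHERNET_HEADER = 14
IPV4_HEADER = 20
ICMP_HEADER = 8


def mtu_from_jumbo_setting(jumbo_packet: int) -> int:
    """IP MTU implied by a driver jumbo value that includes the Ethernet header."""
    return jumbo_packet - ETHERNET_HEADER


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


mtu = mtu_from_jumbo_setting(9014)
payload = max_ping_payload(mtu)
print(f"MTU {mtu}, largest unfragmented ping payload {payload}")
```

On Windows, a don't-fragment ping of that payload size toward the SAN's iSCSI address (`ping -f -l <payload> <address>`) should succeed only if every device on the path accepts the jumbo frame. A failure there would not prove the cause of the corruption, but it would point at a frame-size mismatch worth fixing first.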

26 Replies
idata
Community Manager
73 Views

Hello KeithW19,

Thank you for posting in Intel Wired Ethernet Communities. Is this adapter installed on Windows Server 2016 running Hyper-V, or Hyper-V 2016, and is it a Core or Standard GUI version? What is the brand and model of the node server? Please let us know if these adapters are the retail version or an OEM version, and provide the product code as shown at the following link: https://www.intel.com/content/www/us/en/support/articles/000007060/network-and-i-o/ethernet-products... Was the server using the same adapter while running 2012? Please provide an SSU log (https://downloadcenter.intel.com/download/25293/Intel-System-Support-Utility-for-Windows-) from the server for further diagnosis. If you have any questions please do not hesitate to ask.

Best regards,

Daniel D
idata
Community Manager
73 Views

Hello KeithW19,

Do you still need assistance with this issue? Let us know if you have any problems getting the requested information. If you have any questions please do not hesitate to ask.

Best regards,

Daniel D
idata
Community Manager
73 Views

Hello KeithW19,

Please let us know if you have any questions or still need assistance with this issue.

Best regards,

Daniel D
idata
Community Manager
73 Views

Yes, sorry. I took a bit of a hiatus to just enjoy some problem-free time. I was also monitoring the affected server for any dropouts after changing some NIC options. All appeared steady, so I went ahead and tried to add that node back into the cluster. Unfortunately I was met with very similar results. The exact same volume became corrupt and unreadable again, only this time it didn't come back as quickly. I had to do some MacGyvering to get it working.

The server is a Lenovo System x3550 M5 - Type 8869

The cluster is a Hyper-V cluster. All nodes are running Windows Server 2016 Datacenter (GUI version). The other 3 nodes don't have any 10Gb adapters. This is our newest node, so we got 10Gb adapters to future-proof it. I can't recall if they came pre-installed when we ordered from the vendor. Is there a way to get the product code without opening up the box? The server was using the very same adapters in 2012 without issue. I have the SSU text file. Do you want me to just paste it in here?

idata
Community Manager
73 Views

Hello KeithW19,

Thank you for your reply. Please post the SSU log as an attachment by creating a reply and clicking attach in the right bottom corner. We will continue to investigate this issue. If you have any other questions please let us know.

Best regards,

Daniel D
idata
Community Manager
73 Views

Hello KeithW19,

Please post the SSU logs when you are able. This will help us investigate the issue further. If you have any other questions please do not hesitate to ask.

Best regards,

Daniel D
idata
Community Manager
73 Views

# SSU Scan Information

Scan Info:

Version:"2.5.0.12"

Date:"09-18-2018"

Time:"00:00:22.8220320"

# Scanned Hardware

Computer:

BaseBoard Manufacturer:"LENOVO"

BIOS Mode:"UEFI"

BIOS Version/Date:"LENOVO -[TBE136H-2.70]- , 06-13-2018 12:00 AM"

CD or DVD:"Not Available"

Embedded Controller Version:"4.40"

Platform Role:"Enterprise Server"

Processor:"Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz , GenuineIntel"

Processor:"Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz , GenuineIntel"

Secure Boot State:"Off"

SMBIOS Version:"3.0"

Sound Card:"Not Available"

System Manufacturer:"LENOVO"

System Model:"System x3550 M5: -[8869AC1]-"

System SKU:"(none)"

System Type:"x64-based PC"

- "Display"

Intel ® Graphics Driver Version:"Not Available"

- "Matrox G200eR (Renesas) WDDM 2.0"

Adapter Compatibility:"Matrox Graphics Inc."

Adapter DAC Type:"Integrated, 175 MHz"

Adapter RAM:"0.03 GB"

Availability:"Running or Full Power"

Bits Per Pixel:"32"

Caption:"Matrox G200eR (Renesas) WDDM 2.0"

CoInstallers:"oem41.inf,IN00,Integrated, 175 MHz,Matrox G200eR"

Color Table Entries:"4294967296"

Dedicated Video Memory:"Not Available"

Driver:"MxG2rDO64.sys"

Driver Date:"06-21-2016 07:00 PM"

Driver Path:"C:\Windows\system32\DRIVERS\MxG2rDO64.sys"

Driver Provider:"Matrox Graphics Inc."

Driver Version:"4.3.1.4"

INF:"oem41.inf"

INF Section:"IN00"

Install Date:"Not Available"

Installed Drivers:"Not Available"

Last Error Code:"Not Available"

Last Error Code Description:"Not Available"

Last Reset:"Not Available"

Location:"PCI bus 20, device 0, function 0"

Manufacturer:"Matrox Graphics Inc."

Microsoft DirectX* Version:"DirectX 12"

Monochrome:"No"

Number of Colors:"4294967296"

Number of Video Pages:"Not Available"

PNP Device ID:"PCI\VEN_102B&DEV_0534&SUBSYS_0A011D49&REV_01\7&204E1B7B&0&00000000E3"

Power Management Capabilities:"Not Available"

Power Management Supported:"Not Available"

Refresh Rate - Current:"60 Hz"

Refresh Rate - Maximum:"85 Hz"

Refresh Rate - Minimum:"60 Hz"

Resolution:"1280 X 1024"

Scan Mode:"Noninterlaced"

Service Name:"MxG2rDO64"

Status:"OK"

Video Architecture:"VGA"

Video Memory:"Unknown"

Video Processor:"Matrox G200eR"

- "Memory"

Physical Memory (Available):"311.04 GB"

Physical Memory (Installed):"320 GB"

Physical Memory (Total):"319.31 GB"

- "CPU 1"

Capacity:"16 GB"

Channel:"Dimm 1"

Configured Clock Speed:"2133 MHz"

Configured Voltage:"1200 millivolts"

Data Width:"64 bits"

Form Factor:"DIMM"

Interleave Position:"Not Available"

Manufacturer:"Hynix"

Maximum Voltage:"Not Available"

Memory Type:"Unknown"

Minimum Voltage:"Not Available"

Part Number:"HMA42GR7AFR4N-UH"

Serial Number:"2929FC96"

Status:"Not Available"

Type:"Synchronous"

- "CPU 1"

Capacity:"32 GB"

Channel:"Dimm 2"

Configured Clock Speed:"2133 MHz"

Configured Voltage:"1200 millivolts"

Data Width:"64 bits"

Form Factor:"DIMM"

Interleave Position:"Not Available"

Manufacturer:"Hynix"

Maximum Voltage:"Not Available"

Memory Type:"Unknown"

Minimum Voltage:"Not Available"

Part Number:"HMA84GR7AFR4N-UH"

Serial Number:"1156A295"

Status:"Not Available"

Type:"Synchronous"

- "CPU 1"

Capacity:"16 GB"

Channel:"Dimm 4"

Configured Clock Speed:"2133 MHz"

Configured Voltage:"1200 millivolts"

Data Width:"64 bits"

Form Factor:"DIMM"

Interleave Position:"Not Available"

Manufacturer:"Hynix"

Maximum Voltage:"Not Available"

Memory Type:"Unknown"

Minimum Voltage:"Not Available"

Part Number:"HMA42GR7AFR4N-UH"

Serial Number:"2A676227"

Status:"Not Available"

Type:"Synchronous"

- "CPU 1"

Capacity:"32 GB"

Channel:"Dimm 5"

Configured Clock Speed:"2133 MHz"

Configured Voltage:"1200 millivolts"

Data Width:"64 bits"

Form Factor:"D...
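The pasted log is long, so pulling specific fields out of it by script can be easier than reading it top to bottom. The following is a minimal sketch, assuming the log keeps the layout visible above: subsection markers of the form `- "Name"` and fields of the form `Key:"value"`. The `parse_ssu` function and the `SAMPLE` text are illustrative and not part of the SSU tool. The sample lines are copied from the log above.

```python
import re

# Matches field lines such as  Driver Version:"4.3.1.4"
FIELD = re.compile(r'^\s*(?P<key>[^:"]+):"(?P<value>.*)"\s*$')
# Matches subsection markers such as  - "Memory"
SECTION = re.compile(r'^\s*-\s*"(?P<name>.+)"\s*$')


def parse_ssu(text):
    """Return a list of (section_name, {key: value}) pairs in file order.

    Lines that match neither pattern (blank lines, headings, truncated
    fields with no closing quote) are skipped. A key repeated within one
    section keeps only its last value.
    """
    sections = [("(top)", {})]
    for line in text.splitlines():
        match = SECTION.match(line)
        if match:
            sections.append((match.group("name"), {}))
            continue
        match = FIELD.match(line)
        if match:
            sections[-1][1][match.group("key").strip()] = match.group("value")
    return sections


SAMPLE = '''\
- "Memory"
Physical Memory (Installed):"320 GB"
- "CPU 1"
Capacity:"16 GB"
Part Number:"HMA42GR7AFR4N-UH"
- "CPU 1"
Capacity:"32 GB"
Part Number:"HMA84GR7AFR4N-UH"
'''

for name, fields in parse_ssu(SAMPLE):
    if "Capacity" in fields:
        print(name, fields["Capacity"], fields["Part Number"])
```

Because section names repeat (every DIMM above is labelled "CPU 1"), the sketch returns an ordered list of sections and does not key a dictionary by section name.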

idata
Community Manager
73 Views

Hello KeithW19,

Thank you for the SSU logs. Please allow some time for us to investigate the logs. We will provide an update as soon as possible. If you have any questions please do not hesitate to ask.

Best regards,

Daniel D

Intel Customer Support
idata
Community Manager
73 Views

Hello KeithW19,

Thank you for your patience while we investigate this issue. Are you encountering any errors in Windows 2016 event viewer relating to the issue? We noticed VMQ is disabled on the Broadcom adapter, and enabled on the X710. Can you help us understand the usage of each port on the X710 and the Broadcom adapter so we can look at specific settings for each port? Let us know if you have any questions.

Best regards,

Daniel D

Intel Customer Support
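The VMQ mismatch noted in the reply above is the kind of difference that is easy to miss when comparing adapters by eye. As a rough sketch, two adapters' advanced settings can be diffed once they are collected into dictionaries. The dictionaries here are illustrative: only the VMQ states and the X710's 9014 jumbo value come from this thread, and the property names are assumed and may not match the labels the drivers actually show.

```python
def diff_settings(a: dict, b: dict) -> dict:
    """Return {setting: (value_in_a, value_in_b)} for settings that differ.

    A setting missing on one side is reported with None for that side.
    """
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Illustrative inputs, with assumed property names (see the note above).
broadcom = {"Virtual Machine Queues": "Disabled"}
x710 = {"Virtual Machine Queues": "Enabled", "Jumbo Packet": "9014"}

for setting, (on_broadcom, on_x710) in diff_settings(broadcom, x710).items():
    print(f"{setting}: Broadcom={on_broadcom}, X710={on_x710}")
```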
idata
Community Manager
73 Views

No event viewer errors that I have noticed, though admittedly when the issue happens it's a bit of a panic, so I'm less focused on finding the root cause than on just getting our clients up and going again.

At the time of that log, the Broadcom ports were used as follows: LAN, Hyper-V Adapter 1, Hyper-V Adapter 2, Hyper-V Adapter 3.

The Intel X710 NICs were used as the iSCSI adapters that connect to our SANs for our cluster shared volumes (for the SAN adapters, the only protocol left enabled is TCP/IPv4; all others are unchecked).

I also tried disabling VMQ on the Intel NICs, with no change.

As of right now I have reconfigured everything so that the SANs use the Broadcom adapter and the Intel NICs just handle the Hyper-V connections, though I still haven't worked up the courage to try adding this node back into the cluster. I should be doing it this Thursday.

idata
Community Manager
73 Views

Hello KeithW19,

Thank you for the information. Please let us know how it goes on Thursday. We will continue the investigation with the information provided. If you have any questions please let us know.

Best regards,

Daniel D

Intel Customer Support
idata
Community Manager
73 Views

Hello KeithW19,

Thank you for your patience. Have you tried using the adapter at a 1Gbps connection speed? Does the issue still occur? Let us know if you have any questions.

Best regards,

Daniel D

Intel Customer Support
idata
Community Manager
73 Views

Oh sorry, I maybe didn't mention that. We currently only have 1Gb switches, so the adapters are operating at 1Gb. I even tried manually setting their speed to see if that would help.

I will be attempting to add the node back into the cluster tonight (although with the SANs using the Broadcom adapters now).

idata
Community Manager
73 Views

Hello KeithW19,

Thank you for the response. Please set the adapters to 1Gbps, and let us know if anything changes. Let us know if you have any questions.

Best regards,

Daniel D

Intel Customer Support
idata
Community Manager
73 Views

I mentioned in the previous reply that I did try setting the adapters to 1Gbps, with no change. However, now that I've moved the SAN connections to the Broadcom, the node is back in the cluster and working!!! So it definitely appears to be something with the X710-T4s.

idata
Community Manager
73 Views

Hello KeithW19,

Thank you for the reply. We will continue to look into this issue, and update you soon. Let us know if you have any other questions.

Best regards,

Daniel D

Intel Customer Support
idata
Community Manager
73 Views

Hello KeithW19,

Thank you for your patience. Please update the firmware on the X710-T4 to 6.01 using the NVM Update Utility (https://downloadcenter.intel.com/download/24769/Non-Volatile-Memory-NVM-Update-Utility-for-Intel-Eth...). Let us know if this changes anything when connecting to the SAN. If you have any questions please do not hesitate to ask.

Best regards,

Daniel D

Intel Customer Support
idata
Community Manager
73 Views

Thank you, I will give that a try. It won't be anytime soon, unfortunately: now that we have the node back in the cluster, we need to catch up on some much-needed updates, and it will be a while before I'm able to move VMs off that node again. Hopefully that will do the trick!

idata
Community Manager
73 Views

Hello KeithW19,

Thank you for the reply. When you are able to update the NVM, please let us know how it goes. If you have any questions please do not hesitate to ask.

Best regards,

Daniel D

Intel Customer Support