Periodic Bluescreens After Creating New SET Team Switch with Intel X722-DA2 NICs

THigl · ‎05-06-2019

Periodic Bluescreens After Creating New SET Team Switch with Intel X722-DA2 NICs. Microsoft diagnosed the bugcheck as a problem with i40eb68.sys. I am running the latest driver release from Intel (23.5.2). iWARP Provider was also installed with the 23.5.2 installer (key use case for us) and all default NIC settings left as-is except that we enabled Jumbo Frames. The BSOD was happening on four identically configured cluster nodes. We tried disabling Jumbo Frames, but the BSODs continued even with the default NIC settings configured by the 23.5.2 installer.

We have tested an alternative NIC from Chelsio Networking with the exact same SET Team switching setup on the same servers, without these issues (which should rule out anything related to switching / server models / etc.).

I reverted the Intel driver from the one installed by the 23.5.2 package to the driver natively included with Windows Server 2019 (1.8.103.2). With this change, the BSODs *no longer occurred*. However, with this driver (1.8.103.2) we saw heavy hardware resets (1000+ per hour).

One of the hosts is using an X722-DA2 for iSCSI traffic (no SET Team Switch configured). We have not seen any issues with this configuration on either driver version.

It seems that when an Intel X722 is added to a Microsoft SET Team switch, there are problems at the driver level (as diagnosed by Microsoft under a Support Case). Under the older driver (1.8.103.2), this manifests as near-constant hardware resets. Under the latest complete driver package installed by 23.5.2, everything seemingly works perfectly (Production Traffic, Throughput Tests, Verified iWARP/RDMA working, etc.) *EXCEPT* that Windows Server 2019 has a BSOD approximately every 60-120 minutes.

Mike_Intel · ‎05-07-2019

Hello THigl, Thank you for posting in Intel Ethernet Communities. We will check on this issue and get back to you as soon as we have an update. If you have questions, please let us know. Best regards, Michael L. Intel Customer Support Under Contract to Intel Corporation

Mike_Intel · ‎05-08-2019

Hello THigl, While we are checking the issue, we would also like to request the markings on the NIC. You may send photos of both sides of the NIC so that we can gather all the markings. If you have questions, please let us know. Best regards, Michael L. Intel Customer Support Under Contract to Intel Corporation

THigl · ‎05-09-2019

MikeL_Intel,

Unfortunately, these NICs are still in production servers, so I will not be able to take photos of the NIC markings until at least May 18th. I am enclosing photos of the boxes, which carry significant information about the cards (Serial / Version / Date / Batch Numbers / etc.), in the cases related to this issue (04181222 and 04182853).

Mike_Intel · ‎05-10-2019

Hello THigl,

Thank you for the update. To clarify, how many cards are affected?

As for the markings, we were hoping that you could take them from the physical NICs. Since you cannot provide the markings of the NIC, kindly generate an SSU report instead:

1- Download from https://downloadcenter.intel.com/download/25293/Intel-System-Support-Utility-for-Windows-
2- Open SSU.exe
3- Mark the box "Everything" and then click "Scan".
4- When finished scanning, click "Next".
5- Click on "Save" and attach the file to a post.

If you have questions, please let us know.

Best regards,

Michael L.

Intel Customer Support

Under Contract to Intel Corporation

THigl · ‎05-10-2019

I have a total of 5 cards. 4 of the cards were placed in this setup with SET teaming. All four then exhibited this BSOD behavior. The fifth is only used for iSCSI in one host and doesn’t seem to have an issue. As such, all of this points to driver issues (per Microsoft as well).

Will running the SSU cause any disruption to my production workloads?

Mike_Intel · ‎05-13-2019

Hello THigl,

SSU should not cause any disruption since it will just gather information about the system.

If you have questions, please let us know.

Best regards,

Michael L.

Intel Customer Support

Under Contract to Intel Corporation

THigl1 · ‎05-14-2019

For whatever reason, my original login will not allow me to get past the "Welcome to Support" page (no matter what I click, I end up back on that page despite being logged in). I've entered a new case for that, but am sharing to explain why this is a new account.

Here is the SSU run against one of the four host nodes. Please note that, as shared earlier, the current state is running on the older driver set (1.8.103.2) while the ultimate goal here is to make sure the latest drivers (23.5.2+) do not cause BSODs.

Mike_Intel · ‎05-15-2019

Hello THigl, Thank you for sending the SSU. Let me check further what I can find regarding the driver and NIC information. I will get back to you as soon as I find something. If you have questions, please let us know. Best regards, Michael L. Intel Customer Support Under Contract to Intel Corporation

Mike_Intel · ‎05-20-2019

Hello THigl, Upon further checking, you have an existing service ticket and are currently being assisted by our colleague. Please continue to coordinate with them so they can further help you with this issue. If you have questions, please let us know. Best regards, Michael L. Intel Customer Support Under Contract to Intel Corporation

pave · ‎06-20-2019

Hello. I have a similar problem with the Intel(R) Ethernet Connection X722 for 10GbE backplane on a Lenovo ThinkSystem SN550. Did you find any solution to this problem?

THigl · ‎06-20-2019

I have submitted several large groups of Crash Dumps, Event Logs, and SSUs to support. The last update was that they're working on the issue trying to reproduce and fix it. While I've followed up regularly, the communication has been okay at best.

In the meantime, Intel released OS Independent drivers v24 and NVM firmware v7 on their Download page. After applying driver v24, the BSOD episodes went down from about once every 2-4 hours to once every 9-10 days. I applied the NVM v7 firmware on two hosts just last evening (June 19th), so I don't yet have any sense of whether the firmware will make much of a difference.

Meanwhile, I've kept two hosts back on the old drivers included in Windows Server 2019 (which, unfortunately, don't support RDMA). Not a single BSOD, and it's been over 45 days.

In short, the problem seems to have gotten much, much better in v24. However, seeing a BSOD every 9-10 days is still unacceptable. Hopefully, we'll see further progress out of Intel Support.
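To put "much, much better" into numbers, here is a rough back-of-the-envelope sketch using only the intervals quoted above (one BSOD every 2-4 hours on the earlier driver, one every 9-10 days on v24). The helper name `interval_ratio` is made up for illustration, and the result is only as precise as those informal estimates.

```python
# Rough comparison of the BSOD intervals reported in this thread.
HOURS_PER_DAY = 24


def interval_ratio(old_hours, new_days):
    """Return how many times longer the new interval is than the old one."""
    return (new_days * HOURS_PER_DAY) / old_hours


# Earlier driver: about one BSOD every 2-4 hours.
# Driver v24: about one BSOD every 9-10 days.
smallest_gain = interval_ratio(old_hours=4, new_days=9)   # 216 h / 4 h
largest_gain = interval_ratio(old_hours=2, new_days=10)   # 240 h / 2 h

print(f"Interval grew by roughly {smallest_gain:.0f}x to {largest_gain:.0f}x")
# prints "Interval grew by roughly 54x to 120x"
```

Even at the optimistic end, that still works out to roughly three crashes per host per month.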

Please, if you're encountering this issue, submit a Support Ticket to Intel. I suggest collecting an SSU output (https://downloadcenter.intel.com/download/25293/Intel-System-Support-Utility-for-Windows-) and sending your Windows Crash Mini Dump.
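For gathering the mini dumps into a single attachment, a minimal sketch along these lines may help. It assumes the dumps are `*.dmp` files in one folder (on a default Windows install that is usually `C:\Windows\Minidump`, but check your own configuration); the function name `bundle_minidumps` and the archive name are purely illustrative and not something Intel Support asks for.

```python
import zipfile
from pathlib import Path


def bundle_minidumps(dump_dir, archive_path):
    """Zip every *.dmp file in dump_dir and return the names that were added."""
    dump_dir = Path(dump_dir)
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # sorted() keeps the archive order stable between runs
        for dump in sorted(dump_dir.glob("*.dmp")):
            # Store only the file name, not the full path, inside the zip
            archive.write(dump, arcname=dump.name)
            added.append(dump.name)
    return added
```

For example, `bundle_minidumps(r"C:\Windows\Minidump", "minidumps.zip")` would write `minidumps.zip` to the current directory. Reading that folder normally requires an elevated prompt.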

pave · ‎06-20-2019

Ok. Thank you for the response.

Bad news for me :) I have a large number of new servers with this NIC and plan to use them with RDMA.

I'll follow your suggestion and submit a Support request to Intel.

ty again.

THigl · ‎06-20-2019

Absolutely! If you'd like to keep in touch about the issue and/or any progress that Intel makes with us, feel free to drop me an email (thigley [@] ette.biz).

THigl · ‎09-26-2019

Did you ever find a solution?

THigl · ‎06-27-2019

By way of an update, this past weekend, we upgraded all hosts to v24 and NVM7 firmware. However, once we placed significant traffic onto the cards, the BSODs returned. At this point, we've disabled the cards entirely pending a fix.

We've tried following up with Support, but all communication from Intel ceased as of June 13, 2019.

THigl · ‎09-26-2019

We've exchanged further information with Intel for a few more months, with no real luck yet in resolving this issue. We've reverted to using old, non-RDMA Intel NICs for Storage Spaces Direct in the meantime.

ISIT · ‎08-11-2020

I am having the same issue with a new Lenovo SR550 server that has dual X722 interfaces on the mainboard.

The latest drivers make no difference to stability. Under stress testing (copying large VHDX files over the LAN in a repeating loop), the system predictably crashes within 1-2 hours.
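For anyone wanting to generate a comparable load, a repeating copy loop could be sketched roughly as below. This is not the exact procedure used in the test above, which is only described as copying large VHDX files over the LAN in a loop. The function name `copy_loop` and its parameters are hypothetical, and `dest_dir` is assumed to be a folder on a network share so that the traffic actually crosses the NIC.

```python
import shutil
import time
from pathlib import Path


def copy_loop(source, dest_dir, iterations):
    """Copy source into dest_dir repeatedly; return per-pass throughput in MB/s."""
    source = Path(source)
    target = Path(dest_dir) / source.name
    size_mb = source.stat().st_size / (1024 * 1024)
    rates = []
    for i in range(iterations):
        start = time.perf_counter()
        shutil.copyfile(source, target)  # overwrites the copy from the previous pass
        elapsed = time.perf_counter() - start
        # Guard against a zero reading on very small files
        rates.append(size_mb / elapsed if elapsed > 0 else float("inf"))
        print(f"pass {i + 1}: {rates[-1]:.1f} MB/s")
    return rates
```

Pick an iteration count large enough to cover the 1-2 hour window in which the crash was observed. If the host bluescreens, the loop simply stops, so the last printed pass gives a rough idea of how long the system survived.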

One difference from the original poster's setup is that my tests were done with no adapter teaming configured.

This means the fault lies with either the X722 design, supporting chipset, firmware or drivers. Either way, the issue is 100% Intel's problem and they need to get on this and produce a solution.

I have also found the solution to be to disable the onboard NICs and install a PCIe Broadcom NIC.