Beginner

I219-LM silent data corruption, Linux

Dear forum members,

The issue was first detected while transferring a large 3 TB file over scp. Because the SSH protocol has built-in data integrity checks, the transfer failed repeatedly, often towards the end, with the error "message authentication code incorrect". I tried different encryption schemes and the result was the same error.

In order to troubleshoot the problem I performed the following steps:

1. Installed the newest driver from the Intel website (3.8.4-NAPI) on top of a stock Debian kernel 4.19.0-10-amd64. The issue persisted.

2. Used the "socat" tool to eliminate potential issues with scp. I ran a raw TCP transfer via socat and piped the output to sha256sum to get a SHA-256 fingerprint. The fingerprint of the transferred file was incorrect in most cases (only one transfer out of around six completed with the correct fingerprint).

3. Installed a different Ethernet card based on a different chip. Transfers completed with no errors (tried 3 times).

4. Installed Windows 10 and ran the same test using Cygwin-based socat on the Intel chip: no errors (repeated 5 times).

 

The problem happens on a Supermicro X11SCA-F motherboard; the chip comes with the motherboard. I run the current stable version of Debian. I tried replacing the motherboard, but the replacement had the same issue.

There are no errors in the log files. The software does not detect any issues with the device and apparently believes that all the data passed up the TCP/IP stack is correct, while in fact some bits in this 4 TB-long stream are flipped. It seems the error happens, on average, about once per 1 TB of data.
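
As a rough sanity check on that figure, the arithmetic can be sketched in Python. It assumes decimal terabytes, a single flipped bit per corruption event, and independent errors; none of these is established in the thread, so treat the result as an order-of-magnitude estimate only.

```python
import math

TB = 10**12  # bytes in a decimal terabyte (assumption)

def bit_error_rate(flipped_bits, bytes_transferred):
    """Flipped bits divided by the total number of bits transferred."""
    return flipped_bits / (bytes_transferred * 8)

def clean_transfer_probability(ber, bytes_transferred):
    """Chance that a transfer of this size sees no flipped bit at all,
    assuming bit errors are independent (a simplification)."""
    bits = bytes_transferred * 8
    # exp(bits * log(1 - ber)) is numerically safer than (1 - ber) ** bits
    return math.exp(bits * math.log1p(-ber))

ber = bit_error_rate(1, 1 * TB)  # one flipped bit per 1 TB (assumption)
print(f"bit error rate: {ber:.3e}")
print(f"P(clean 4 TB transfer): {clean_transfer_probability(ber, 4 * TB):.3f}")
```

Under these assumptions a clean multi-terabyte transfer would be the exception, which is broadly in line with most test transfers producing a bad fingerprint.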

 

Any advice on how to proceed would be greatly appreciated.

Thanks!

Tomasz

Moderator

Hi Tomasz,

Thank you for posting in our Intel® Ethernet Communities Page.

Kindly provide us with the output of this command: ethtool -i ethx, where ethx is the Ethernet port.


We look forward to hearing from you. If we do not get your reply, we will follow up after 3 business days.



Best Regards,

Alfred S

Intel® Customer Support


Beginner

Hello Alfred:

 

Here is the output:

driver: e1000e
version: 3.8.4-NAPI
firmware-version: 0.5-4
expansion-rom-version:
bus-info: 0000:00:1f.6
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

 

Best,

Tomasz

Moderator

Hi Tomasz,

Thank you for providing that information.

Please allow us some time to check on this.

While we are checking, it would be helpful to our analysis if you can provide this information:

1. May we know the brand and model of the Ethernet cards that worked and the Ethernet controller chips embedded in them?

2. Have you updated the firmware of the adapter, or is it the version that has been in use from the very start?


We will get back to you no later than 3 business days from now.



Best Regards,

Alfred S

Intel® Customer Support


Beginner

Hello Alfred,

 

The external card is Rosewill RC-411v3 with the following chipset according to lspci:

Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)

More details are available on the Amazon web page (ASIN B004F34ONC).

I keep testing the Intel card on Linux. I had three successful transfers of the same 3.5 TB file after I ran the Windows tests. I started to believe that Windows had fixed something, but then another transfer failed (i.e. it completed successfully but produced a wrong fingerprint).

Thank you for your support,

Tomasz

Beginner

The three transfers that completed successfully were performed in a similar way to the tests on Windows: the received data was piped directly to sha256sum like this:

socat TCP:152.3.169.3:1111 STDOUT | pv -t -e -r -b -a | sha256sum

It might be that the test fails when I write the data to a disk at the same time. The tests that usually fail go like this:

socat TCP:152.3.169.3:1111 STDOUT | pv -t -e -r -b -a | tee filename | sha256sum

 

So the failures are not caused by hard drive read errors, as the checksum is generated on the fly in parallel with saving the file, but they might be related to electrical interference. I have 10 hard drives: 8 occupy all of the motherboard's SATA ports and the other 2 are attached to two SATA cards. All of them are configured as software RAID6.
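
The tee variant can be reproduced in a few lines of Python with one extra step: after the transfer, the saved file is hashed again. If the on-the-fly digest is already wrong, the data was corrupted before it reached the hashing process; if only the second digest differs, the write or read-back path is suspect. This is a sketch only: the function names are made up for illustration, and a read-back immediately after writing may be served from the page cache rather than from the drives.

```python
import hashlib

CHUNK = 1 << 20  # 1 MiB per read (arbitrary choice)

def hash_and_save(stream, path, chunk_size=CHUNK):
    """Write a binary stream to `path` while hashing the bytes as they
    arrive, similar to `tee filename | sha256sum`."""
    digest = hashlib.sha256()
    with open(path, "wb") as out:
        while True:
            block = stream.read(chunk_size)
            if not block:
                break
            digest.update(block)
            out.write(block)
    return digest.hexdigest()

def hash_file(path, chunk_size=CHUNK):
    """Hash the saved file again by reading it back."""
    digest = hashlib.sha256()
    with open(path, "rb") as saved:
        for block in iter(lambda: saved.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

In a real test, `stream` would be `sys.stdin.buffer` at the end of the socat pipeline, and both digests would be compared with the fingerprint computed on the sending machine.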

Best,

Tomasz

Moderator

Hi Tomasz,

Thank you for providing that information.

We will continue checking this, and the information that you provided is invaluable to us.

Please allow us some time to check on this. We hope for your understanding regarding this.

We will get back to you no later than 3 business days from now.



Best Regards,

Alfred S

Intel® Customer Support 


Moderator

Hi Tomasz,

Thank you for waiting for our update.

While we were closely checking your concern, we found that we need a more extensive look at your system configuration.

Please download and run our Intel® System Support Utility from this page: https://downloadcenter.intel.com/download/26735/Intel-System-Support-Utility-for-the-Linux-Operating.... After running it, you will be given an option to save the logs to a text file. Please do so and attach the file to your reply.

We look forward to your reply. Should we not get your reply, we will follow up after three business days.


Best Regards,

Alfred S

Intel Customer Support


Beginner

Hello Alfred,

 

The output is attached. Please let me know if you need any specific detail; I can provide all the information you need.

 

Thank you,

Tomasz

Moderator

Hi Tomasz,

Thank you for providing that information.

Please allow us some time to check on this. 

We will get back to you no later than 3 business days from now.



Best Regards,

Alfred S

Intel® Customer Support 


Moderator

Hi Tomasz,

Thank you for waiting for our update.

We are still investigating your concern. We would like to ask for your cooperation in providing any logs or packet captures (if you have any available) showing the errors that occurred while you were transferring the files.

We will try to determine the cause of the error from there.

We look forward to your reply. Should we not get your reply, we will follow up after three business days.



Best Regards,

Alfred S

Intel® Customer Support


Beginner

Hello Alfred,

 

That would require capturing terabytes of data. Even if I ran tcpdump on the interface and captured all the packets, I am not sure how I could identify the corrupted one. If, on the other hand, you wanted to do it yourselves, then we would have to transfer around 4 TB of data. Is this what you want?
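
One way to narrow this down without a packet capture is to compare per-chunk hashes of the file on the sender with those of the saved copy on the receiver, and then diff only the chunk that disagrees. A minimal sketch, assuming both machines can read their copy as a buffered binary stream; the helper names and the 64 MiB chunk size are illustrative choices, not anything from the thread:

```python
import hashlib

def chunk_digests(stream, chunk_size=64 * 1024 * 1024):
    """Return one SHA-256 hex digest per fixed-size chunk of a buffered
    binary stream (for example a file opened with mode "rb")."""
    digests = []
    while True:
        block = stream.read(chunk_size)
        if not block:
            break
        digests.append(hashlib.sha256(block).hexdigest())
    return digests

def mismatched_chunks(sent, received):
    """Indices of chunks whose digests differ or that exist on one side only."""
    longest = max(len(sent), len(received))
    return [i for i in range(longest)
            if i >= len(sent) or i >= len(received) or sent[i] != received[i]]

def flipped_bits(original, corrupted):
    """(byte offset, XOR mask) for every differing byte of two
    equal-length chunks; set bits in the mask are the flipped bits."""
    return [(offset, a ^ b)
            for offset, (a, b) in enumerate(zip(original, corrupted))
            if a != b]
```

Only the digest lists need to be exchanged between the two machines. Once a mismatching chunk is known, that single chunk can be copied again and passed to `flipped_bits` to see how many bits changed and where.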

 

Best,

Tomasz

Moderator

Hi Tomasz,

Thank you for providing that information.

It looks like getting the logs will be a difficult task. We will just proceed with our investigation with what we have at the moment.

We will continue checking this, and the information that you previously provided is invaluable to us.

Please allow us some time to check on this. We hope for your understanding regarding this.

We will get back to you no later than 3 business days from now.



Best Regards,

Alfred S

Intel® Customer Support


Moderator

Hi Tomasz,

Thank you for your patience in waiting for an update.

Due to the complexity of the issue, please allow us more time to check on your concern.

We will try to get back to you with an answer no later than 3 business days from now; although, we will reach out to you in case we need more information or if we already have developments regarding your concern.



Best Regards,

Alfred S

Intel® Customer Support


Beginner

Dear Alfred,

 

Thank you for the update. I keep testing the motherboard and I started test transfers via the external card again. As you probably remember, my past tests (around 4 or 5 of them) completed successfully. But I just had one transfer via the external network card, which completed this morning, and it also produced a bad fingerprint. This was the 3rd test in this batch.

I am stumped. If the error happens with a different card, then the cause could be the board itself, the chipset, the memory or the CPU. But I do not even know where to start. Since this is the second motherboard from Supermicro, I consider it unlikely that I have a bad unit. Could it be the memory or the CPU itself? If so, what would be the best procedure to verify this? Since the fingerprints are calculated on the fly, before the data is saved to the hard drive, it seems unlikely that the hard drives have any influence, other than that they require the CPU to perform extra work in the background.
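
A dedicated memory tester such as memtest86+ is the usual first step for this kind of question. As a cruder check from within the running system, the same in-memory buffer can be hashed many times to see whether the digest ever changes; no network or disk is involved, so any mismatch would point at memory or the CPU. This is only a sketch with a hypothetical function name, and a clean run proves little given that the observed error rate is about one event per terabyte.

```python
import hashlib
import os

def repeated_hash_failures(size_bytes, rounds):
    """Hash one random in-memory buffer `rounds` times and count how often
    the digest differs from the first result. On healthy hardware the
    count is always 0."""
    buffer = os.urandom(size_bytes)
    reference = hashlib.sha256(buffer).hexdigest()
    failures = 0
    for _ in range(rounds):
        if hashlib.sha256(buffer).hexdigest() != reference:
            failures += 1
    return failures
```

To exercise RAM rather than the CPU caches, the buffer would need to be much larger than the last-level cache, and the total volume hashed (size times rounds) would need to reach several terabytes before a zero result means much.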

 

Best,

Tomasz

Moderator

Hi Tomasz,

Thank you for your initiative in doing more tests and for sharing the information with us.

This is a surprising development since, from the way your testing went, the issue may now lie in the board, the CPU, the memory or the operating system. Do you have access to spare components? That way, you can isolate which one is causing the issue.


We look forward to your reply. Should we not get your reply, we will follow up after three business days.


Best Regards,

Alfred S

Intel® Customer Support


Beginner

Hello Alfred,

 

Since this is my second board from Supermicro, it is unlikely to be a bad unit. Another possibility is a design flaw, in which case the error would affect the whole batch.

Memory has ECC and the system should be protected from errors. If this is the OS, don't Intel developers contribute to the development of the Linux kernel? Is it possible that support for the chipset/CPU is not fully developed in the Linux kernel? They are pretty recent hardware units.

Is it possible that this chipset, if the memory is faulty and produces multi-bit ECC errors, operates in such a way that the system is not notified about unrecoverable ECC errors?
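
On Linux, ECC events that the kernel knows about are exposed through the EDAC subsystem in sysfs, as `ce_count` (corrected) and `ue_count` (uncorrected) files per memory controller. The sketch below reads them. Whether an EDAC driver exists and is loaded for this particular chipset is an assumption that would need checking, and an empty result means no memory controller was registered, not that no errors occurred.

```python
from pathlib import Path

# Standard EDAC sysfs location; it is only populated when an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")

def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce_count": n, "ue_count": m}} for every
    memory controller directory (mc0, mc1, ...) found under `root`."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # EDAC not available on this system
    for controller in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = controller / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[controller.name] = entry
    return counts

if __name__ == "__main__":
    print(read_edac_counts() or "no EDAC memory controllers found")
```

Comparing the counters before and after a failed transfer would show whether the kernel recorded any ECC event during that window.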

I do not have access to spare parts, as I purchased parts for a single machine only. I would need to buy them, but the CPU is not very cheap.

 

Best,

Tomasz
