Tomasz
Beginner
921 Views

I219-LM silent data corruption, Linux

Dear forum members,

The issue was first detected while transferring a large 3 TB file over scp. Since the ssh protocol has built-in data integrity checks, the transfer failed repeatedly, often towards the end, complaining about "message authentication code incorrect". I tried different encryption schemes and the result was the same error.

In order to troubleshoot the problem I performed the following steps:

1. Installed the newest driver from the Intel website (3.8.4-NAPI) on top of a stock Debian kernel 4.19.0-10-amd64. The issue persists.

2. Used the "socat" tool to eliminate potential issues with scp. I used a raw TCP/IP transfer via socat and piped the output to get a sha256 fingerprint. The fingerprint of the transferred file was incorrect in most cases (out of around six transfers, only one completed with the correct fingerprint).

3. Installed a different Ethernet card based on a different chip; transfers completed with no errors (tried 3 times).

4. Installed Windows 10 and ran the same test using Cygwin-based socat on the Intel chip: no errors (repeated 5 times).

 

The problem happens on a Supermicro X11SCA-F motherboard; the chip is integrated on the board. I run the current stable version of Debian. I tried replacing the motherboard, but the replacement had the same issue.

There are no errors in the log files. The software does not detect any issues with the device, and it apparently believes that all the data passed up the TCP/IP stack is correct, while in fact some bits in this 4 TB-long stream are flipped. It seems the error happens, on average, about once per 1 TB of data.
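
Besides the log files, the NIC's own statistics counters can be checked for errors. The following is a minimal sketch, not a definitive diagnostic: it filters the output of "ethtool -S" for non-zero counters whose names suggest errors or drops. The interface name eth0 and the function name nonzero_error_counters are placeholders, and counter names differ between drivers, so the name pattern is only an assumption about what is relevant.

```shell
# Read `ethtool -S` output on stdin and print only the counters whose
# name hints at a problem (err/drop/miss/crc) and whose value is non-zero.
# The name pattern is a guess; adjust it to the counters your driver reports.
nonzero_error_counters() {
    awk -F: 'tolower($1) ~ /err|drop|miss|crc/ && $2 + 0 != 0 {
        gsub(/^[ \t]+/, "", $1)   # strip the indentation ethtool adds
        print $1 ":" $2
    }'
}

# Hypothetical usage (eth0 is a placeholder interface name):
#   ethtool -S eth0 | nonzero_error_counters
```

An empty result would be consistent with the observation that the driver reports nothing wrong, although it does not prove that the data is intact.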

 

Any advice on how to proceed would be greatly appreciated.

Thanks!

Tomasz

24 Replies
AlfredoS_Intel
Moderator
748 Views

Hi Tomasz,

Thank you for posting in our Intel® Ethernet Communities Page.

Kindly provide us the results of this command: ethtool -i ethx where ethx is the Ethernet port.


We look forward to hearing from you. If we do not get your reply, we will follow up after 3 business days.



Best Regards,

Alfred S

Intel® Customer Support


Tomasz
Beginner
742 Views

Hello Alfred:

 

Here is the output:

driver: e1000e
version: 3.8.4-NAPI
firmware-version: 0.5-4
expansion-rom-version:
bus-info: 0000:00:1f.6
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

 

Best,

Tomasz

AlfredoS_Intel
Moderator
737 Views

Hi Tomasz,

Thank you for providing that information.

Please allow us some time to check on this.

While we are checking, it would be helpful to our analysis if you can provide this information:

1. May we know the brand and model of the Ethernet cards that worked, and the Ethernet controller chip embedded in them?

2. Have you updated the firmware of the adapter, or is it what has been in use from the very start?


We will get back to you no later than 3 business days from now.



Best Regards,

Alfred S

Intel® Customer Support


Tomasz
Beginner
730 Views

Hello Alfred,

 

The external card is Rosewill RC-411v3 with the following chipset according to lspci:

Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)

More details are available on the Amazon web page (ASIN B004F34ONC).

I keep testing the Intel card on Linux. I had three successful transfers of the same 3.5 TB file after I ran the Windows tests. I started to believe that Windows had fixed something, but then another transfer failed (i.e. it completed successfully but produced a wrong fingerprint).

Thank you for your support,

Tomasz

Tomasz
Beginner
727 Views

The three transfers that completed successfully were performed in a similar way to the tests on Windows: the received data was piped directly to sha256sum like this:

socat TCP:152.3.169.3:1111 STDOUT | pv -t -e -r -b -a | sha256sum

It might be that the test fails when I write the data to a disk at the same time. The tests that usually fail go like this:

socat TCP:152.3.169.3:1111 STDOUT | pv -t -e -r -b -a | tee filename | sha256sum

 

So the failures are not related to hard drive read errors, as the checksum is generated on the fly in parallel with saving the file, but they might be related to electrical interference. I have 10 hard drives: 8 saturate the motherboard's SATA ports and 2 more sit on two extra SATA cards. All of this is configured as software RAID6.
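
To separate in-flight corruption from corruption on the write path, the fingerprint taken on the fly can be compared with a second fingerprint taken by reading the saved file back. This is only a sketch under assumptions: the function name save_and_verify is made up, and the read-back may be served from the page cache rather than from the disks unless the cache is dropped first.

```shell
# Save stdin to a file while hashing it on the fly, then hash the file
# again by reading it back, and report both fingerprints.
# Returns non-zero if the two fingerprints differ.
save_and_verify() {
    # $1 = destination file (placeholder name chosen by the caller)
    inflight=$(tee "$1" | sha256sum | awk '{ print $1 }')
    # Note: this read may come from the page cache, not the physical disks.
    ondisk=$(sha256sum < "$1" | awk '{ print $1 }')
    echo "in-flight: $inflight"
    echo "on-disk:   $ondisk"
    [ "$inflight" = "$ondisk" ]
}

# Hypothetical usage, mirroring the failing pipeline above:
#   socat TCP:152.3.169.3:1111 STDOUT | pv -t -e -r -b -a | save_and_verify filename
```

If both fingerprints agree with each other but not with the sender's, the data was already wrong when it left the socket; if they disagree with each other, the storage path becomes a suspect as well.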

Best,

Tomasz

AlfredoS_Intel
Moderator
722 Views

Hi Tomasz,

Thank you for providing that information.

We will continue checking this, and the information that you provided is invaluable to us.

Please allow us some time to check on this. We hope for your understanding regarding this.

We will get back to you no later than 3 business days from now.



Best Regards,

Alfred S

Intel® Customer Support 


AlfredoS_Intel
Moderator
699 Views

Hi Tomasz,

Thank you for waiting for our update.

While closely checking your concern, we found that we need a more extensive look at your system configuration.

Please download and run our Intel® System Support Utility from this page, https://downloadcenter.intel.com/download/26735/Intel-System-Support-Utility-for-the-Linux-Operating.... After running it, you will be given an option to save the logs to a text file. Please do so and attach the file to your reply.

We look forward to your reply. Should we not get your reply, we will follow up after three business days.


Best Regards,

Alfred S

Intel Customer Support


Tomasz
Beginner
692 Views

Hello Alfred,

 

The output is attached. Please let me know if you need to know any specific detail, I can provide all the information you need.

 

Thank you,

Tomasz

AlfredoS_Intel
Moderator
686 Views

Hi Tomasz,

Thank you for providing that information.

Please allow us some time to check on this. 

We will get back to you no later than 3 business days from now.



Best Regards,

Alfred S

Intel® Customer Support 


AlfredoS_Intel
Moderator
669 Views

Hi Tomasz,

Thank you for waiting for our update.

We are still investigating your concern. We would appreciate your cooperation: could you provide any logs or packet captures (if you have any available) showing the errors that occurred while you were transferring the files?

We will try to determine the cause of the error from there.

We look forward to your reply. Should we not get your reply, we will follow up after three business days.



Best Regards,

Alfred S

Intel® Customer Support


Tomasz
Beginner
663 Views

Hello Alfred,

 

That would require capturing terabytes of data. Even if I ran tcpdump on the interface and captured all the packets, I am not sure how I could identify the corrupted one. If, on the other hand, you wanted to do it yourselves, then we would have to transfer around 4 TB of data. Is this what you want?
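
One way to narrow down where the corruption sits without keeping a full packet capture is to fingerprint the stream block by block on both ends and compare the two lists. The sketch below assumes GNU coreutils (the --filter option of split) and shell access to the sending machine; the function names, manifest file names and the 1G block size are illustrative choices, not anything prescribed by the tools.

```shell
# Print one "index sha256" line per fixed-size block of stdin.
# Relies on GNU split's --filter, which runs the command once per block.
block_hashes() {
    # $1 = block size, e.g. 1G (any size accepted by split -b)
    split -b "$1" --filter='sha256sum' - | awk '{ print NR - 1, $1 }'
}

# Print the indices of blocks whose hashes differ between two manifests.
bad_blocks() {
    # $1 = sender manifest, $2 = receiver manifest ("index hash" per line)
    paste -d' ' "$1" "$2" | awk '$2 != $4 { print $1 }'
}

# Hypothetical usage (file names are placeholders):
#   sender:   block_hashes 1G < bigfile > sent.manifest
#   receiver: socat TCP:152.3.169.3:1111 STDOUT | block_hashes 1G > received.manifest
#   then:     bad_blocks sent.manifest received.manifest
```

A mismatching block index gives an approximate byte offset (index times block size), which would limit any later packet capture or byte-level comparison to a single block instead of the whole transfer.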

 

Best,

Tomasz

AlfredoS_Intel
Moderator
656 Views

Hi Tomasz,

Thank you for providing that information.

It looks like getting the logs will be a difficult task. We will just proceed with our investigation with what we have at the moment.

We will continue checking this, and the information that you provided previously is invaluable to us.

Please allow us some time to check on this. We hope for your understanding regarding this.

We will get back to you no later than 3 business days from now.



Best Regards,

Alfred S

Intel® Customer Support


AlfredoS_Intel
Moderator
640 Views

Hi Tomasz,

Thank you for your patience in waiting for an update.

Due to the complexity of the issue, please allow us more time to check on your concern.

We will try to get back to you with an answer no later than 3 business days from now; although, we will reach out to you in case we need more information or if we already have developments regarding your concern.



Best Regards,

Alfred S

Intel® Customer Support


Tomasz
Beginner
634 Views

Dear Alfred,

 

Thank you for the update. I keep testing the motherboard and I started test transfers via the external card again. As you probably remember, my past tests (around 4 or 5 of them) have completed successfully. But I just had one transfer via the external network card, which completed this morning, and it also produced a bad fingerprint. This was the 3rd test in this batch.

I am stumped. If the error happens with a different card, then it could be the board itself, the chipset, the memory or the CPU. But I do not even know where to start. Since this is the second motherboard from Supermicro, I consider it unlikely that I have a bad unit. Could it be the memory or the CPU itself? If yes, what would be the best procedure to verify this? Since the fingerprints are calculated on the fly, before the data is saved to the hard drive, it seems unlikely that the hard drives have any influence, other than that they require the CPU to perform extra work in the background.
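
A crude first check of the CPU and memory, with the network taken out of the picture, is to hash the same local file many times and see whether the fingerprint ever changes. This is a rough sketch and not a replacement for a dedicated tester such as memtest86+; the function name repeat_hash and the pass count are arbitrary.

```shell
# Hash the same file repeatedly and stop at the first fingerprint that
# differs from the first pass. A file that fits in RAM mostly exercises
# the page cache and the CPU; a larger one also exercises the disks.
repeat_hash() {
    # $1 = file to hash, $2 = total number of passes
    ref=$(sha256sum < "$1" | awk '{ print $1 }')
    pass=2
    while [ "$pass" -le "$2" ]; do
        cur=$(sha256sum < "$1" | awk '{ print $1 }')
        if [ "$cur" != "$ref" ]; then
            echo "mismatch on pass $pass: $cur (expected $ref)"
            return 1
        fi
        pass=$((pass + 1))
    done
    echo "all $2 passes agree: $ref"
}

# Hypothetical usage (filename and pass count are placeholders):
#   repeat_hash filename 20
```

Given an error rate of roughly one event per terabyte, the total volume hashed would need to reach several terabytes before a clean result means much.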

 

Best,

Tomasz

AlfredoS_Intel
Moderator
625 Views

Hi Tomasz,

Thank you for your initiative in doing more tests and for sharing the information with us.

This is a surprising development, since from the way your testing went, the issue may now lie with the board, the CPU, the memory or the operating system. Do you have access to spare components? That way, you can isolate which one is causing the issue.


We look forward to your reply. Should we not get your reply, we will follow up after three business days.


Best Regards,

Alfred S

Intel® Customer Support


Tomasz
Beginner
621 Views

Hello Alfred,

 

Since this is my second board from Supermicro, it is unlikely to be a bad unit. Another possibility might be a bad design, with the error happening across the whole batch.

The memory has ECC, so the system should be protected from memory errors. If this is the OS, don't Intel developers contribute to the development of the Linux kernel? Is it possible that support for the chipset/CPU is not fully developed in the Linux kernel? They are pretty recent hardware units.

Is it possible that this chipset, if the memory is faulty and has multi-bit ECC errors, operates in a way that the system is not notified about unrecoverable ECC errors?
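
On Linux, corrected and uncorrected ECC events are normally exposed through the EDAC subsystem in sysfs, provided an EDAC driver for the memory controller is loaded. Whether that is the case on this board is an assumption. The sketch below simply prints the per-controller counters; the function name edac_counts is made up, and the optional argument exists only so a different directory can be inspected.

```shell
# Print corrected (ce) and uncorrected (ue) ECC error counts for every
# memory controller that the kernel's EDAC subsystem exposes.
# Prints nothing if no controller directory exists (no EDAC driver loaded).
edac_counts() {
    base=${1:-/sys/devices/system/edac/mc}
    for mc in "$base"/mc*; do
        [ -d "$mc" ] || continue
        printf '%s ce=%s ue=%s\n' "${mc##*/}" \
            "$(cat "$mc/ce_count")" "$(cat "$mc/ue_count")"
    done
}

# Hypothetical usage:
#   edac_counts
```

No output at all would suggest that the kernel is not monitoring ECC on this platform, in which case zero reported errors would not say much about the memory.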

I do not have access to spare parts, as I purchased parts for a single machine only. I would need to buy them, but the CPU is not very cheap.

 

Best,

Tomasz

AlfredoS_Intel
Moderator
609 Views

Hi Tomasz,

Thank you for providing that information.

We will continue checking this, and the information that you provided is invaluable to us.

Please allow us some time to check on this. We hope for your understanding regarding this.

We will get back to you no later than 3 business days from now.



Best Regards,

Alfred S

Intel® Customer Support 


AlfredoS_Intel
Moderator
560 Views

Hi Tomasz,

Thank you for waiting for our update.

Here are the results of our investigation:

The issue that you are experiencing is not Ethernet related and might be CPU related.

This is more of a system question, since the issue can be reproduced with a third-party network adapter.

With SCP you could consider throttling the transfer rate to avoid filling the page cache before the disk has been able to catch up, see for example https://stackoverflow.com/questions/30020519/broken-pipe-error-on-scp.

You could also try opening a thread on SourceForge and making your inquiry there.

We look forward to your reply. Should we not get your reply, we will follow up after three business days.



Best Regards,

Alfred S

Intel® Customer Support


AlfredoS_Intel
Moderator
358 Views

Hi Tomasz,

We are just following up.

It looks like you need more time to assess the answers that we have provided.

We will follow up again after 3 business days. Should we not hear from you, our system may automatically close the thread.



Best Regards,

Alfred S

Intel Customer Support 

