We are having reliability issues with a couple of Intel i211ATs connected to an Intel Atom processor on our motherboard. Essentially, we're seeing a small chance (about one in a thousand power cycles) that an i211AT will not be detected during boot.
The motherboard is running Linux with kernel version 4.8.3. The operating system was built using the Yocto Project with the meta-intel layer. The network interfaces are managed by systemd-networkd. This board is based on the Minnowboard Turbot, so the firmware is release 0.95 from https://firmware.intel.com/projects/minnowboard-max.
One NIC is connected to PCIe port 1 and another is connected to PCIe port 2. The other two PCIe ports are unconnected on our design.
Both NICs appear to have an equal probability of not being detected on boot. I have a script that checks if both interfaces are up, logs the result, and reboots. I have let the system reboot thousands of times and see the failure around 0.1% of the time. Under the failure condition, the PCIe driver typically has not detected the device. When checking with "ls /sys/bus/pci/devices/", either PCI device 0000:03:00.0 or 0000:02:00.0 is missing, depending on which interface failed. I'm concerned that the i211 occasionally does not come out of reset.
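A minimal sketch of the kind of presence check described above; it is not the poster's actual script. It assumes the two PCI addresses named in the post and the standard sysfs directory; the function name and the `sysfs_root` parameter are hypothetical, and logging and the reboot step are left out.

```python
from pathlib import Path

# PCI addresses taken from the post; other boards will differ.
EXPECTED_NICS = ("0000:02:00.0", "0000:03:00.0")


def missing_pci_devices(expected=EXPECTED_NICS, sysfs_root="/sys/bus/pci/devices"):
    """Return the expected PCI addresses that have no entry under sysfs_root."""
    root = Path(sysfs_root)
    # Each enumerated device appears as an entry named after its address.
    return [addr for addr in expected if not (root / addr).exists()]


if __name__ == "__main__":
    missing = missing_pci_devices()
    print("missing:", ", ".join(missing) if missing else "none")
```

Checking sysfs directly, rather than whether the interface is up, separates "device never enumerated" from "driver or network configuration failed", which is the distinction that matters for a suspected reset problem.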
We've had the design go through Intel's design review for the overall board. I'm fairly sure that the scope of the review was only concerned with direct connections to the SoC, which would just be the PCIe connections. Does Intel provide a separate review of the i211AT circuit/layout?
Could you please provide the following information?
1) Are the NICs manufactured by your company or by a third party company?
2) Do the NICs show the same behavior if they are tested on a different board?
3) Does the issue occur if only one NIC is connected?
4) Have you compared the output of dmesg before and after the failure? If not, please do so, and make sure to enable the debug option of the igb driver used by the i211.
5) Have you replicated the issue on a different Linux distribution? If not, please try it, as it is necessary to rule out an OS-specific error.
Regarding your question about a layout and design review for i211 devices, I will consult with the people in charge of the review process.
I will be waiting for your feedback.
1) The motherboard is manufactured by our company and the i211ATs are directly soldered down to the board
2) We currently have around 50 boards manufactured. I have found this issue follows at least 3 boards that I've been able to test. The probabilities of failure are approximately equal.
3) Since both NICs are directly soldered down, it is not easy to test each i211 in isolation. I could turn off one of the PCIe ports in the firmware and run tests if that is valuable.
4) The output of dmesg for igb only differs in the presence of a block like so:
[ 3.233233] igb 0000:03:00.0: added PHC on eth1
[ 3.233235] igb 0000:03:00.0: Intel(R) Gigabit Ethernet Network Connection
[ 3.233238] igb 0000:03:00.0: eth1: (PCIe:2.5Gb/s:Width x1) 70:b3:d5:44:60:75
[ 3.233242] igb 0000:03:00.0: eth1: PBA No: FFFFFF-0FF
[ 3.233244] igb 0000:03:00.0: Using MSI-X interrupts. 2 rx queue(s), 2 tx queue(s)
When there is a failure, a block like this will be missing. I will reinsert the module with a higher debug level and report back.
5) I will attempt to reproduce with a Debian-based distro. This will take some time, but I will report back after testing.
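The dmesg comparison in answer 4 could be automated along these lines. This is a sketch only: the regular expression is an assumption built from the igb log lines quoted above (driver version 5.3.0-k), and the function names are hypothetical. Other driver versions may format the line differently.

```python
import re

# Matches the per-device probe line quoted in the post, for example:
# "[ 3.233238] igb 0000:03:00.0: eth1: (PCIe:2.5Gb/s:Width x1) 70:b3:d5:44:60:75"
PROBE_LINE = re.compile(
    r"igb ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): eth\d+: \(PCIe:"
)


def probed_igb_devices(dmesg_text):
    """Return the set of PCI addresses for which igb logged a probe line."""
    return set(PROBE_LINE.findall(dmesg_text))


def unprobed(dmesg_text, expected=("0000:02:00.0", "0000:03:00.0")):
    """Return the expected addresses that igb never reported, sorted."""
    return sorted(set(expected) - probed_igb_devices(dmesg_text))
```

Feeding it the text of `dmesg` from each boot gives a list of the NICs whose block is missing, which can be logged alongside the reboot count.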
Thank you for contacting Intel Embedded Community.
In order to better understand this situation, we would like to ask the following questions:
Could you please tell us which documents or resources you used to develop the affected i211-AT design?
Could you please send us pictures of the top-side markings of the affected Intel(R) Ethernet Controllers I211-AT?
By the way, please follow the procedure stated on the Design Review Services web site (https://edc.intel.com/Tools/Design-Review/Default.aspx?language=en) to submit your design to be reviewed by Intel.
We hope that this information helps, and we will be waiting for your answers to the previous questions.
The overall board is based on a modified Minnowboard Max design where the i211 sections were based on the "I210-AT_I211-AT 1G-BASE-T REFERENCE DESIGN" found in the Intel® Ethernet Controller I210 and I211 Family: Technical Library: https://www.intel.com/content/www/us/en/embedded/products/networking/ethernet-controller-i210-i211-f...
Here is a top side picture of one of the i211s on board:
As for the Design Review Service, we had our board reviewed using this service back in February. That review did not cover the i211 in particular, just the PCIe lines between the Atom processor and the i211. I do not see an option for a more focused i211 review. I did see that section 2.7 of the FAQ says to "Submit your design review request through Intel Premier Support (IPS) against the I210/ I211 product", so that might be our next step here.
Hello, JCandy:
Thanks for your update.
We would like to validate the modifications that you have made, but the information needed to do so is handled only by the manufacturer of the design that you are using as a reference.
For this reason, we suggest that you send your design to be reviewed and validated by its developer through the channels listed on their Mailing List page: https://minnowboard.org/community/mailing-list
We hope that this information may help you.
I've tried a couple of methods for turning on debug output in the igb driver while in the failure condition, and I'm not seeing any difference in the output (Question 4). The following output is from a boot where enp3s0 succeeded but enp2s0 failed to be detected. Note that igb is set to the highest debug level in both the kernel command line and a modprobe.d file.
dmesg | grep igb
[ 0.000000] Command line: LABEL=Boot root=PARTUUID=be089a16-9bff-430c-8e90-68f0cebacbc7 rootwait rootfstype=ext4 console=tty0 quiet loglevel=2 systemd.show_status=0 i915.fastboot=1 igb.debug=16
[ 0.000000] Kernel command line: LABEL=Boot root=PARTUUID=be089a16-9bff-430c-8e90-68f0cebacbc7 rootwait rootfstype=ext4 console=tty0 quiet loglevel=2 systemd.show_status=0 i915.fastboot=1 igb.debug=16
[ 3.070448] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.3.0-k
[ 3.070451] igb: Copyright (c) 2007-2014 Intel Corporation.
[ 3.108609] igb 0000:03:00.0: added PHC on eth0
[ 3.108612] igb 0000:03:00.0: Intel(R) Gigabit Ethernet Network Connection
[ 3.108615] igb 0000:03:00.0: eth0: (PCIe:2.5Gb/s:Width x1) 70:b3:d5:44:60:75
[ 3.108618] igb 0000:03:00.0: eth0: PBA No: FFFFFF-0FF
[ 3.108621] igb 0000:03:00.0: Using MSI-X interrupts. 2 rx queue(s), 2 tx queue(s)
[ 3.140842] igb 0000:03:00.0 enp3s0: renamed from eth0
Contents of the modprobe.d file:
options igb debug=16
I've also tried disabling one of the PCIe interfaces in the firmware to see if that made a difference in reliability over reboots (Question 3). I have found that the remaining active interface is still occasionally not being detected.
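With a failure rate near 0.1%, it helps to know how many clean reboots are needed before a configuration change can be judged to have made a difference. The sketch below is a worked calculation, assuming each power cycle fails independently with the same probability (an assumption, not something established in this thread); the function name is hypothetical.

```python
import math


def cycles_for_confidence(failure_rate, confidence=0.95):
    """Smallest N with (1 - failure_rate)**N <= 1 - confidence.

    If N consecutive power cycles pass, a per-cycle failure rate of
    `failure_rate` or higher can be ruled out at the given confidence,
    assuming independent cycles.
    """
    if not 0 < failure_rate < 1 or not 0 < confidence < 1:
        raise ValueError("failure_rate and confidence must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


if __name__ == "__main__":
    # 0.001 is the roughly 0.1% failure rate reported in the first post.
    print(cycles_for_confidence(0.001))
```

For the rate reported here, this comes to roughly 3,000 consecutive clean power cycles at 95% confidence, so a test run of a few hundred reboots cannot show that a change helped.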