"Mystery" GbE Controller in 6th Gen PCH (SKU PCH-U Premium) Interrupt misbehavior

GregBailey · ‎10-08-2021

Hello. I am, among many other things, a system programmer who has supplied my company's customers with high-reliability operating systems, programming environments, and application code for many decades, all of which we have written from scratch. I'm accustomed to programming the Intel GbE chips and have had no trouble with them ... until now.

I'm in a box with 6th gen i3-6100U, and 6th gen PCH with device ID 9D48 and SKU of PCH-U Premium. In addition the box has an i211AT GbE chip.

The PCH has what I consider a "Mystery" GbE controller because neither this PCH nor any others I've seen recently actually identifies the controller, and none has documentation on same. All we have in the case of most of these PCHes is the identity of the PHY they like to use with the Mystery controller. For this particular configuration, the PHY in question is i219LM. For reference, device ID for the Mystery controller is 156F, although most of the services for identifying that device refer to the i219LM as though it were a GbE controller and not simply a PHY.

My problem is that the Mystery controller is not behaving correctly with its interrupts. In this configuration we are using the plain old 8259 PICs for interrupts. IRQ11 is shared by all of the interrupting devices in the PCH. It is correctly set up for level triggering and it works perfectly with the AHCI SATA controller and with my code for the i211.
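For readers who want to check the same setup, here is a minimal sketch of decoding the trigger mode and the request state of an IRQ on a PC-compatible pair of 8259 PICs. It assumes the conventional port assignments (ELCR at 0x4D0/0x4D1, PIC command ports at 0x20/0xA0) and that the caller has already read the raw bytes with its own port I/O routines. The helper names are hypothetical and are not taken from the original code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Conventional PC/AT port assignments (assumed for this platform). */
#define PIC_MASTER_CMD    0x20u   /* write OCW3 here, then read back the IRR */
#define PIC_SLAVE_CMD     0xA0u
#define PIC_OCW3_READ_IRR 0x0Au   /* OCW3 value selecting the IRR for the next read */
#define ELCR_MASTER       0x4D0u  /* edge/level control for IRQ0-7  */
#define ELCR_SLAVE        0x4D1u  /* edge/level control for IRQ8-15 */

/* Combine the master (IRQ0-7) and slave (IRQ8-15) bytes into one 16-bit map. */
static inline uint16_t pic_combine(uint8_t master, uint8_t slave)
{
    return (uint16_t)(master | ((uint16_t)slave << 8));
}

/* True if the ELCR bytes mark this IRQ as level triggered (bit set = level). */
static inline bool irq_is_level_triggered(uint8_t elcr_master, uint8_t elcr_slave,
                                          unsigned irq)
{
    if (irq > 15)
        return false;
    return ((pic_combine(elcr_master, elcr_slave) >> irq) & 1u) != 0;
}

/* True if the IRR bytes show a request pending on this IRQ input. */
static inline bool irq_is_requested(uint8_t irr_master, uint8_t irr_slave,
                                    unsigned irq)
{
    if (irq > 15)
        return false;
    return ((pic_combine(irr_master, irr_slave) >> irq) & 1u) != 0;
}
```

IRQ11 corresponds to bit 3 of the slave bytes. Because the line is shared, a set IRR bit only says that some device on IRQ11 is requesting service, not which one.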

I would like someone who understands things to explain to me why what I am about to describe does not indicate a significant design error in the Mystery controller.

After a period of apparently correct operation, the Mystery Controller (MC for short) stops producing interrupts. I examine the situation and find the following:

The interrupt mask register D0 is all 1's
Register 4100 seems to have been counting interrupts but is now not changing.
My own instrumentation within my interrupt code is showing no interrupts now being received either.
If I read register C0 (which is set for clear on read) I will typically get a reasonable value like x800000C0 on the first read and zero on the second. According to the manual for any Intel GbE controller in the 8257x / i350 / i211 range, bit 31 indicates that the IRQ line is being driven by this controller when 1, and not driven when 0. So reading register C0 should have deasserted IRQ11 and, shortly thereafter, more packets should arrive and it should be asserted again; and, because this PIC IRQ is level triggered, that assertion should continue resulting in interrupts until I clear C0 (and it stays cleared). And in fact that is what I see... except that IRQ11, by my instrumentation, was actually not asserted. Register 4100 does count a small number of what the chip evidently believed were interrupt assertions, but again on the CPU side of the PIC I see no interrupts. The PIC is working normally; both of the other two interrupting devices currently active on IRQ11 (the AHCI SATA HBA and the i211 GbE NIC) are functioning just fine.
If I arrange for an interrupt generated by another of these devices to reach my code for the MC, I dutifully process all the incoming packets and before I'm done will have put most of them back into receive descriptors. However, regardless of having done that, the MC will never again actually assert the IRQ11 line on the PIC.
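The observations above can be sketched as a small check. This is an illustration only: the register names (ICR, IMS, IAC) follow the 8257x-family convention and are assumed to apply to the undocumented controller, the struct and function names are hypothetical, and the counter at 4100 is assumed to accumulate between reads (if it turns out to be clear-on-read on this part, the second raw reading would be used instead of a difference).

```c
#include <stdbool.h>
#include <stdint.h>

/* Offsets as quoted in this thread; names are ASSUMED from the 8257x family. */
#define REG_ICR 0x00C0u  /* interrupt cause, clear on read          */
#define REG_IMS 0x00D0u  /* interrupt mask                          */
#define REG_IAC 0x4100u  /* counter of interrupt assertions         */
#define ICR_INT_ASSERTED UINT32_C(0x80000000)  /* bit 31 of ICR */

/* One observation: the device's own count and a software count kept by the ISR. */
struct mc_sample {
    uint32_t iac;          /* value read from REG_IAC                       */
    uint32_t isr_entries;  /* hypothetical counter incremented in the ISR   */
};

/* True if the controller reports that it is driving its interrupt line. */
static inline bool icr_claims_asserted(uint32_t icr)
{
    return (icr & ICR_INT_ASSERTED) != 0;
}

/* The cause bits with the "asserted" flag stripped off. */
static inline uint32_t icr_cause_bits(uint32_t icr)
{
    return icr & ~ICR_INT_ASSERTED;
}

/* True if the device counted assertions between the two samples while the
 * CPU side never entered the handler. Unsigned subtraction tolerates wrap. */
static inline bool lost_interrupt_suspected(struct mc_sample before,
                                            struct mc_sample after)
{
    uint32_t claimed = after.iac - before.iac;
    uint32_t seen = after.isr_entries - before.isr_entries;
    return claimed != 0 && seen == 0;
}
```

With the value quoted above, `icr_claims_asserted(0x800000C0)` is true and `icr_cause_bits(0x800000C0)` yields `0xC0`. On a shared IRQ, the software counter should count only handler entries in which this controller's ICR read back nonzero, otherwise activity from the other devices would mask the condition.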

It is my belief that there is some poorly designed circuitry between the desire of the MC to assert its interrupt, as evinced by register C0 bit 31, and its belief that it has done so as counted in register 4100, and the final act of driving the IRQ line. I find no additional mechanism in any of the documented GbE NICs that would cause such an effect, and I have dumped and compared 65536 bytes of register space from before and after the failure. I found no surprising differences.

I am about to red-tag this particular MC so my customers can have reliable systems, but would greatly appreciate communicating with someone who is familiar with the design to learn if Intel is aware of such a failure mode and, if so, whether they can shed more light on it.

PLEASE do not ask me if I have updated my drivers, O/S, or whatever. I am also a chip designer and what I describe above is not a situation my software is capable of creating, by anything I understand from reading a lot of pages of Intel documentation and by validating that understanding in many successful implementations ... until this one.

Thanks in advance for any help. I will be happy to provide direct contact info and will be in a position to make experiments on rapid turn-around if asked. Obviously I'd like to make this MC usable rather than red-tagging it!

Mike_Intel · ‎10-11-2021

Hello GregBailey,

Thank you for posting in Intel Ethernet Communities.

Based on your inquiry, we have a specific forum for these issues, and I will be transferring this thread there for a faster response.

Please wait for their reply within 1 to 2 business days.

Best regards,

Michael L.

Intel® Customer Support Technician

CarlosAM_INTEL · ‎10-12-2021

Hello, @GregBailey:

Thank you for contacting Intel Embedded Community.
Based on your previous communication, we want to address the following questions:
Could you please clarify if the design related to this situation has been designed by you or by a third-party company?
If it is a third-party design, could you please let us know the name of the manufacturer, the model, and where to find the information related to the affected design?
Could you please list the sources that you used to implement the affected design and tell us whether it has been verified by Intel?
We are waiting for your answer to these questions.
Best regards,

@CarlosAM_INTEL.

GregBailey · ‎10-12-2021

Thanks, Carlos.

If you're talking about hardware, the device containing your chips is an AVALUE EPC-SKLU-61A1-24R and I was booted and given control by an American Megatrends BIOS core version 5.11. I presume that BIOS did not leave any SMM code lying around to mess with the NICs but, of course, how would I know? PCB and box documentation attached.

As for sources, what sources do you refer to? The syndrome I observed above was observed by manual inspection and changes of NIC memory mapped register space, and manual interrogation of data recorded in memory by my interrupt code using my interactive development environment (derived from polyFORTH, of which I was one of the architects). I am of course guessing about the model embodied in the "Mystery" controller, but models such as the 8257x and i211/350 don't document any way to prevent the MC from asserting the assigned IRQ line on the PIC while register xC0:31 is "1". (PCI register interrupt disable line is always in the correct state and is never intentionally changed by my code during operation). The only place in the whole structure that knows or cares about edges is the counter in register x4100. I of course have no idea how to force this MC into the state I describe after it has run normally for a good while, but it sure is in that state eventually!
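As an illustration of the PCI-side check mentioned above, here is a minimal sketch that decodes the INTx-related bits from the dword at configuration offset 0x04 (Command in the low 16 bits, Status in the high 16 bits). The bit positions are those of the standard PCI Command and Status registers; whether this controller implements the Interrupt Status bit faithfully is an assumption, and the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_CMD_STATUS_OFFSET    0x04u        /* Command (low) + Status (high) */
#define PCI_COMMAND_INTX_DISABLE (1u << 10)   /* Command bit 10 */
#define PCI_STATUS_INTERRUPT     (1u << 3)    /* Status bit 3   */

struct pci_intx_state {
    bool intx_disabled;  /* software has masked INTx via the Command register */
    bool intx_pending;   /* device reports an INTx request in Status          */
};

/* Decode the dword read from configuration offset 0x04. */
static inline struct pci_intx_state pci_decode_intx(uint32_t cmd_status)
{
    uint16_t command = (uint16_t)(cmd_status & 0xFFFFu);
    uint16_t status = (uint16_t)(cmd_status >> 16);
    struct pci_intx_state s;

    s.intx_disabled = (command & PCI_COMMAND_INTX_DISABLE) != 0;
    s.intx_pending = (status & PCI_STATUS_INTERRUPT) != 0;
    return s;
}
```

If `intx_pending` reads true with `intx_disabled` false while the PIC shows no request on IRQ11, that would point at the path between the function and the interrupt line rather than at the register programming.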

Greg Bailey, President, ATHENA Programming, Inc. (971) 235-2385

CarlosAM_INTEL · ‎10-12-2021

Hello, @GregBailey:

Thanks for your reply.

Based on the provided information, you should contact the manufacturer of the affected device because we do not have the information of their implementations to give you support.

For reference, you can contact them using the channels listed on the following website:

https://www.avalue.com.tw/distributors

Best regards,

@CarlosAM_INTEL.