E3930 PCIe memory access issues

MZolo1 · ‎02-10-2018

Hi,

The tested E3930 SoC (microcode: sig=0x506c9, pf=0x1, revision=0x2c) has issues accessing the 32-bit non-prefetchable memory resource of a Xilinx PCIe endpoint (Vendor 10EE, Device 0007). Within each block of 80-84 bytes, only the first four dwords are read correctly; the next dword appears to be derived from its dword address (its last byte equals the dword index), and the remaining data reads as zero. The real memory contents are random non-zero data.

The problem appears only on the E3930 (in the form of a Qseven module), while the same PCIe endpoint works fine with the E3845, E3815, and x5-E8000. It is reproducible with Linux (4.8, 4.14) as well as with the EFI Shell, and changing the PCI/PCIe configuration doesn't appear to help. No errors are reported by the endpoint or the root port.

EFI Shell dump (zeroed data returned is incorrect):

Memory Address 0000000091200000 256 Bytes
91200000: D7 16 2A 88 05 24 64 21-64 80 01 03 00 4C 00 04 *..*..$d!d....L..*
91200010: 00 00 25 04 00 00 00 00-00 00 00 00 00 00 00 00 *..%.............*
91200020: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
91200030: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
91200040: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
91200050: 00 00 00 00 06 00 20 85-00 00 35 16 09 80 1E 21 *...... ...5....!*
91200060: A7 88 46 20 00 00 31 19-00 00 00 00 00 00 00 00 *..F ..1.........*
91200070: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
91200080: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
91200090: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
912000A0: 00 00 00 00 00 00 00 00-E2 90 90 0A 00 00 21 2B *..............!+*
912000B0: 00 00 25 2C 00 28 AD 01-00 00 25 2E 00 00 00 00 *..%,.(....%.....*
912000C0: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
912000D0: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
912000E0: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
912000F0: 00 00 00 00 00 00 00 00-00 00 00 00 C3 B0 0C 71 *...............q*
91200100: 28 42 84 42 0A 20 00 87-08 81 49 00 20 55 54 81 *(B.B. ....I. UT.*
91200110: 00 00 25 44 00 00 00 00-00 00 00 00 00 00 00 00 *..%D............*
91200120: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
91200130: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
91200140: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
91200150: 00 00 00 00 20 8A 9C 81-00 00 35 56 08 21 C9 57 *.... .....5V.!.W*
91200160: 00 08 88 81 00 00 31 59-00 00 00 00 00 00 00 00 *......1Y........*
91200170: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
91200180: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
91200190: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
912001A0: 00 00 00 00 00 00 00 00-A5 10 00 18 00 00 21 6B *..............!k*
912001B0: 00 00 25 6C 01 04 C4 40-00 00 25 6E 00 00 00 00 *..%l...@..%n....*
912001C0: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
912001D0: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
912001E0: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
912001F0: 00 00 00 00 00 00 00 00-00 00 00 00 2C 40 42 10 *............,@B.*
91200200: 00 12 84 0D 10 26 0C 01-79 C0 00 12 4A E0 8E 12 *.....&..y...J...*
91200210: 00 00 35 04 00 00 00 00-00 00 00 00 00 00 00 00 *..5.............*
91200220: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
91200230: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
91200240: 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 *................*
91200250: 00 00 00 00 81 21 *.....!*
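
For anyone who wants to check a longer dump for the same pattern, here is a minimal Python sketch. It parses EFI Shell style dump lines and labels each aligned dword as zero, real data, or an "index" dword. The heuristic for an index dword (two zero bytes, any third byte, and a last byte equal to the low byte of the dword index) is only an assumption inferred from the dump above, and the names `parse_dump` and `classify` are made up for this illustration.

```python
import re

# A few lines copied from the dump above; the ASCII column is optional.
DUMP = """\
91200000: D7 16 2A 88 05 24 64 21-64 80 01 03 00 4C 00 04
91200010: 00 00 25 04 00 00 00 00-00 00 00 00 00 00 00 00
91200050: 00 00 00 00 06 00 20 85-00 00 35 16 09 80 1E 21
91200060: A7 88 46 20 00 00 31 19-00 00 00 00 00 00 00 00
"""


def parse_dump(text):
    """Return {address: byte value} for every byte listed in the dump."""
    mem = {}
    for line in text.splitlines():
        m = re.match(r"\s*([0-9A-Fa-f]+):\s*(.*)", line)
        if not m:
            continue  # skips blank lines and the "Memory Address ..." header
        base = int(m.group(1), 16)
        hex_part = m.group(2).split("*")[0]  # drop the ASCII column, if any
        for i, tok in enumerate(re.findall(r"[0-9A-Fa-f]{2}", hex_part)):
            mem[base + i] = int(tok, 16)
    return mem


def classify(mem, bar_base):
    """Label each complete, aligned dword as 'zero', 'index' or 'data'."""
    labels = {}
    for addr in sorted(a for a in mem if a % 4 == 0):
        dword = [mem.get(addr + k) for k in range(4)]
        if None in dword:
            continue  # incomplete dword, e.g. at the end of the dump
        offset = addr - bar_base
        if dword == [0, 0, 0, 0]:
            labels[offset] = "zero"
        elif dword[0] == 0 and dword[1] == 0 and dword[3] == (offset // 4) & 0xFF:
            # Assumed signature of the address-derived dword.
            labels[offset] = "index"
        else:
            labels[offset] = "data"
    return labels


labels = classify(parse_dump(DUMP), 0x91200000)
index_offsets = [hex(off) for off, kind in labels.items() if kind == "index"]
print(index_offsets)
```

Running the same script over a dump from a working CPU should show no "index" dwords if the heuristic is right, which would make the two dumps easy to compare.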

CarlosAM_INTEL · ‎02-12-2018

Hello, mzolotarov:

Thank you for contacting the Intel Embedded Community.

To make sure we are on the same page, could you please tell us whether the affected design (Qseven module) was designed by you or by a third-party company? If it is a third-party design, could you please give me all the information related to it? If it was designed by you, could you please confirm whether it has been reviewed by Intel?

Please give us the information that may answer the previous questions.

We really appreciate your collaboration.

Best regards,

Carlos_A.

MZolo1 · ‎02-12-2018

Hola, Carlos_A

We use Qseven modules available on the market from different suppliers, expecting maximum interchangeability. In this particular case, it was the Q7-AL from MSC (its latest revision and BIOS).

CarlosAM_INTEL · ‎02-12-2018

Hello, mzolotarov:

Thanks for your update.

Based on your previous message, could you please give us all the information about these modules (such as manufacturer name, part number, model, and other details)?

Also, could you please tell us whether the affected designs have implemented the guidelines stated in the Qseven Rev. 2.1 - Apollo Lake SoC 3.3V IO Compatibility Analysis document (# 561514, https://cdrd.intel.com/v1/dl/getcontent/561514)?

Waiting for your reply to these questions.

Best regards,

Carlos_A.

MZolo1 · ‎02-12-2018

Carlos_A,

here is the product URL:

https://www.msc-technologies.eu/products-solutions/products/boards/qseven/msc-q7-al.html (MSC Q7-AL: MSC Technologies, an Avnet Company)

We just use the E3930 version of this module.

Our reasons for thinking that the problem lies in Apollo Lake, not in the Qseven module itself, are the following:

- the PCIe0..3 lanes of Apollo Lake are routed directly to the Qseven connector (i.e. they leave the module);

- no errors are reported by Apollo Lake at the PCIe level (see the attachment). Any kind of failure on the link, data, or transaction layer should have been reported. Since no error is reported, we assume that the PCIe connection between Apollo Lake and the endpoint (FPGA) works fine;

- the Datasheet Addendum of the Intel Atom Processor E3900 states register-level compatibility with the Intel Pentium and Celeron Processor N- and J-Series. Thus, there should be nothing for PCIe drivers to change. However, the notable difference from previous Atom families (which work fine) is the introduction of Time Coordinated Computing (TCC) technology, which is said to synchronize the CPU with peripherals to achieve maximum throughput.

In this connection, we would very much appreciate a hint on how to force backward compatibility of the E3900 family. One rough idea to try is disabling some of the TCC features (but how?).

Thanks and waiting for your reply.

RWata1 · ‎02-16-2018

Can you provide a corresponding dump from a working CPU to compare with?

Lots of FPGAs take a while to power up and work properly (load their configuration).

One way to mitigate this is to delay booting until the FPGA is configured, by holding the module in reset using the Q7 PWGIN signal.

Another method would be to force the PCIe root port to retrain after the FPGA has loaded its configuration.
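
As a rough illustration of the retrain approach on Linux, the sketch below sets the Retrain Link bit (bit 5 of the Link Control register in the PCI Express capability) through the root port's sysfs `config` file. The register offsets follow the PCI Express capability layout; the function names and the device path in the usage note are hypothetical, the script needs root, and it has not been tried on the hardware discussed here.

```python
PCI_CAP_PTR = 0x34        # Capabilities Pointer in the config header
PCI_CAP_ID_EXP = 0x10     # PCI Express capability ID
PCI_EXP_LNKCTL = 0x10     # Link Control offset inside the PCIe capability
PCI_EXP_LNKCTL_RL = 0x20  # Retrain Link bit


def find_capability(cfg, cap_id):
    """Walk the capability list of a config-space image; return offset or None."""
    ptr = cfg[PCI_CAP_PTR] & 0xFC
    seen = set()
    while ptr and ptr not in seen and ptr + 1 < len(cfg):
        seen.add(ptr)  # guard against a looping capability list
        if cfg[ptr] == cap_id:
            return ptr
        ptr = cfg[ptr + 1] & 0xFC
    return None


def set_retrain_bit(cfg):
    """Set Retrain Link in a mutable image; return the Link Control offset."""
    cap = find_capability(cfg, PCI_CAP_ID_EXP)
    if cap is None:
        raise LookupError("no PCI Express capability found")
    off = cap + PCI_EXP_LNKCTL
    value = int.from_bytes(cfg[off:off + 2], "little") | PCI_EXP_LNKCTL_RL
    cfg[off:off + 2] = value.to_bytes(2, "little")
    return off


def retrain_root_port(config_path):
    """Write the updated Link Control word back to the root port (needs root)."""
    with open(config_path, "r+b", buffering=0) as f:
        cfg = bytearray(f.read(256))
        off = set_retrain_bit(cfg)
        f.seek(off)
        f.write(cfg[off:off + 2])
```

Usage would look like `retrain_root_port("/sys/bus/pci/devices/0000:00:13.0/config")`, where the bus/device/function is only a placeholder for the actual root port above the FPGA.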

Hope that helps,

Ross

MZolo1 · ‎02-20-2018

rosswatanabe, thanks a lot for your response.

Indeed, the FPGA start-up is not fully smooth with PCIe hotplug on some modules. We work around this by programming the FPGA at an early stage, followed by a system hot reset (causing the boot process to restart). The dump you've seen was made from the EFI Shell started after that hot reset. Just to make sure, I repeated the whole test with an extra 10-second delay after programming completed (and before the reboot); the FPGA was running, according to its LED indications. Unfortunately, it didn't help.

I am attaching the dump made with the congatec Qseven module conga-QA3 with the E3845, which works just fine: https://www.congatec.com/en/products/qseven/conga-qa3.html (Qseven Computer-On-Modules - congatec AG)

The archive includes:

E3845/E3930-MemBAR-dump.txt - memory BAR dump (initial 256-byte portion) for E3845 (conga-QA3) and E3930 (MSC Q7-AL) respectively;

E3845/E3930-FPGA-Endpoint.txt - PCI configuration space dumps for PCIe endpoint (FPGA);

E3845/E3930-Root-PCIe.txt - PCI configuration space dumps for corresponding PCIe Root Port.

I would really appreciate further ideas and comments.

Thanks again,

Mikhail

RWata1 · ‎02-20-2018

Since I happen to work for congatec, I might suggest you try this on a conga-QA5.

From looking at the dumps, nothing obvious stands out. However, you might try disabling relaxed ordering to see if that makes a difference.
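
For reference, relaxed ordering is controlled by bit 4 of the Device Control register, which sits at offset 0x08 inside the PCI Express capability. The sketch below only decodes a Device Control value taken from a config-space dump and computes the value with the bit cleared; the function names are made up, and the sample value 0x2810 is an arbitrary example, not a value from the attached dumps.

```python
DEVCTL_RELAXED_ORDERING = 0x0010  # bit 4: Enable Relaxed Ordering
DEVCTL_NO_SNOOP = 0x0800          # bit 11: Enable No Snoop


def decode_devctl(value):
    """Decode the ordering and size fields of a 16-bit Device Control value."""
    return {
        "relaxed_ordering": bool(value & DEVCTL_RELAXED_ORDERING),
        "no_snoop": bool(value & DEVCTL_NO_SNOOP),
        # Size fields encode 128 << n bytes for n = 0..5; 6 and 7 are reserved.
        "max_payload_bytes": 128 << ((value >> 5) & 0x7),
        "max_read_request_bytes": 128 << ((value >> 12) & 0x7),
    }


def clear_relaxed_ordering(value):
    """Return the Device Control value with relaxed ordering disabled."""
    return value & ~DEVCTL_RELAXED_ORDERING & 0xFFFF


sample = 0x2810  # arbitrary example value
print(decode_devctl(sample))
print(hex(clear_relaxed_ordering(sample)))
```

Comparing the decoded fields for the endpoint and the root port on the working and failing modules is a quick way to spot a configuration difference before changing anything.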

Good luck,

Ross

MZolo1 · ‎03-04-2018

Ross, thanks for the hint. We have tried playing around with the PCI configuration, including relaxed ordering, but that didn't help.

Right, we're planning to swap individual components one at a time (Qseven module, base board, FPGA), hoping to gain more information rather than just getting it to work by coincidence.

rshal2 · ‎08-15-2018

Hi Mzolotarov,

We seem to have quite a similar issue with the congatec MA5, but I don't see that your problem has been resolved.

Is it resolved?

Thanks,

Ran

CarlosAM_INTEL · ‎08-15-2018

Hello, ranchu:

Thank you for contacting the Intel Embedded Community.

To receive the proper information for the affected device, please address your questions to the channels listed in the contact information on the congatec AG support website: https://www.congatec.com/us/support.html

We hope that this information is useful to you.

Best regards,

Carlos_A.

rshal2 · ‎08-19-2018

Hello Carlos,

We already contacted congatec, and they say it is the Apollo Lake chip.

That's why I asked the opener of this ticket whether this issue has been resolved.

Best Regards,

Ran

CarlosAM_INTEL · ‎08-20-2018

Hello, ranchu:

Thanks for your reply.

We suggested contacting the manufacturer of your device since we can provide only generic information that should be verified and/or confirmed by them.

To make sure we are on the same page, could you please let us know how many units are affected by this problem? Could you please provide the part number and SKU of the affected processors? Also, please send us pictures of the top-side markings of the affected processors.

Finally, please give us a detailed description of the problem and the steps to follow to detect it.

Waiting for the information that should answer these questions.

Best regards,

Carlos_A.

rshal2 · ‎08-21-2018

Hello Carlos,

I use the following:

congatec AG L132118 , intel atom E3930 , SR33Q

pn: 048012

Carrier conga-MEVAL;

congatec AG L132015

PN: 065400

Regards,

Ran

CarlosAM_INTEL · ‎08-22-2018

Hello, ranchu:

Thanks for your update.

We still need the detailed description of the problem and the steps to detect it. Could you please provide this information?

Also, could you please let us know the Operating System (OS) and BIOS related to this situation?

Waiting for your reply with the information that should answer these questions.

Best regards,

Carlos_A.

JKOEN · ‎03-09-2018

Hi,

Is the incorrect data stable, or does it change on subsequent reads? I'm asking because I recently had an issue with the CPU's variable MTRR setup on Apollo Lake: in some configurations, the BIOS configures these MSRs incorrectly if there are PCIe devices that request prefetchable memory.
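
On Linux, one quick way to check this is to look at `/proc/mtrr` and see which variable MTRR entries cover the BAR address. The sketch below is a minimal example of that check; the sample text is invented for illustration (it is not from the affected system), the regular expression assumes the usual `regNN: base=..., size=..., count=N: type` line layout, and `covering_entries` is a made-up name.

```python
import re

# Invented sample in the usual /proc/mtrr layout; not from the affected system.
SAMPLE = """\
reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
reg01: base=0x090000000 ( 2304MB), size=  256MB, count=1: write-combining
"""

UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
LINE = re.compile(
    r"reg(\d+): base=0x([0-9a-fA-F]+).*?size=\s*(\d+)([KMG])B.*?"
    r"(uncachable|write-combining|write-through|write-protect|write-back)"
)


def covering_entries(mtrr_text, address):
    """Return [(register number, memory type)] for entries containing address."""
    hits = []
    for line in mtrr_text.splitlines():
        m = LINE.search(line)
        if not m:
            continue
        base = int(m.group(2), 16)
        size = int(m.group(3)) * UNITS[m.group(4)]
        if base <= address < base + size:
            hits.append((int(m.group(1)), m.group(5)))
    return hits


result = covering_entries(SAMPLE, 0x91200000)
print(result)
```

On a real system, the text would come from `open("/proc/mtrr").read()`. A non-prefetchable BAR would normally be expected to be uncached; if a write-back or write-combining entry covers it, that would point in the direction described above. An address covered by no variable entry falls back to the default memory type, which this sketch does not report.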

Regards

JK