Re: Can you provide more details on the SEU testing results performed by Intel ? (https://www.intel.com/content/www/us/en/programmable/support/quality...

ARB · ‎07-03-2019

The SEU reference information (https://www.intel.com/content/www/us/en/programmable/support/quality-and-reliability/seu.html) states the following assumptions:

SEUs do not induce latch-up in Intel FPGAs

No SEU errors have been observed in hard CRC circuit and I/O registers

An SEU causes only single-bit errors within the configuration memory for products up to 65 nm and possibly multibit errors for 40 nm and beyond

The CRC circuit can detect all single-bit and multi-bit errors within the configuration memory

There's a Mean Time Between Functional Interrupt (MTBFI) of hundreds of years, even for very large, high-density FPGAs

My interest is limited to low-density FPGAs (Cyclone V, Cyclone IV).

- Can you provide details on the test procedures you have performed (JESD-89) ?

- Can you provide numerical data - or some sort of report of your testing ?

- Can you elaborate on the statement to the error detection capability of the CRC circuit ?

I am aware that any suitably chosen CRC is guaranteed to detect any single-bit error, but I fail to understand the claim to "detect all single and multi-bit errors".

Are you referring to all multi-bit errors you observed during testing, or to all possible multi-bit errors ?

JohnT_Intel · ‎07-04-2019

Hi,

- Can you provide details on the test procedures you have performed (JESD-89) ?

We perform the testing by bringing our board with the FPGA into a JESD-89 compliant lab, exposing the FPGA device to external factors such as alpha particles, and observing the resulting SEUs.

- Can you provide numerical data - or some sort of report of your testing ?

Please refer to https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/rr/rr.pdf?gsa_pos=1&wt.oss_r=1&wt.oss=altera&_ga=2.6507250.1501909099.1553290013-1866113875.1553290013 for all FPGA device reliability results.

- Can you elaborate on the statement to the error detection capability of the CRC circuit ?

For each Quartus design compilation, the bitstream contains a pre-computed CRC signature. This signature is compared against the readback stream of the CRAM bits to detect whether there is any error and to determine how many bit errors occurred.

Are you referring to all multi-bit errors you observed during testing, or to all possible multi-bit errors ?

The device only flags whether it is a single-bit error or a multiple-bit error.

ARB · ‎07-04-2019

Hi,

- Can you provide details on the test procedures you have performed (JESD-89) ?

This question is sufficiently answered.

- Can you provide numerical data - or some sort of report of your testing ?

The reliability report you are referring to does not contain the data I was looking for.

The reliability report contains the data for lifetime testing under temperature and humidity cycling (mainly JESD22, as declared in Table II and Table II of the mentioned document).

It however does not contain results of the JESD-89 testing.

I am looking for a report explicitly dealing with the SEU testing mentioned on the website in my initial post.

- Can you elaborate on the statement to the error detection capability of the CRC circuit ?

I am not sure I fully understand your arguments.

I am quite aware of how the CRC function within the FPGA works: it reads back the bitfile and constantly compares the CRC stored within the bitfile (desired value) with the one computed (actual value).

In case of a mismatch between those two, an error is raised.
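
For illustration, a minimal sketch of that compare step. It uses zlib.crc32 as a stand-in only: the actual CRC polynomial and readback order of the device are not stated in this thread, and the names golden_crc, scan_cram and image are hypothetical.

```python
import zlib


def golden_crc(cram: bytes) -> int:
    """CRC computed once at compile time and stored with the bitfile (desired value)."""
    return zlib.crc32(cram)


def scan_cram(cram: bytes, expected: int) -> bool:
    """One readback pass: True if the recomputed CRC (actual value) matches."""
    return zlib.crc32(cram) == expected


# Hypothetical 1 KiB configuration image standing in for the CRAM contents.
image = bytearray(range(256)) * 4
expected = golden_crc(bytes(image))

clean_ok = scan_cram(bytes(image), expected)

# Simulate a single event upset by flipping one bit of the image.
image[517] ^= 0x08
upset_ok = scan_cram(bytes(image), expected)

print(clean_ok, upset_ok)  # True False
```

The check yields only a pass/fail result, which is all that is needed if the mitigation is always a reconfiguration.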

As an "end user" I do not mind whether the circuitry actually counts the number of errors or not. For any number of errors the mitigation is identical. The bitfile has to be (re-)programmed into the device.

A CRC, however, has (like any other error detection algorithm) a limited detection capability.

The configuration RAM within the device contains a large number of bits. An EP4CE40 device has an uncompressed raw binary size of 9'534'304 bits.

The theoretical worst case would therefore be 9'534'304 errors (meaning every single bit has flipped).

Mathematically, a CRC is only able to detect certain types of errors. It can, for example, detect any single-bit error. That is, regardless of which of the 9'534'304 bits flips, the CRC will detect it.

The situation changes, however, when multiple bits are flipped. In this case the error detection is, in general, only able to detect the change with a certain probability.

The exact performance depends not only on the CRC polynomial but also on the length of the bit file stream and the location of the flipped bits (you may refer to https://users.ece.cmu.edu/~koopman/crc/ for some details).

If a single event can cause multiple bits to flip, you need to apply some additional constraints to be able to claim a 100% detection ratio.

Such an additional constraint might be that the flipped bits are located close to each other.

This is the kind of information I was asking for.
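
The point can be shown with a small sketch. It assumes a plain 16-bit CRC with the generator x^16 + x^12 + x^5 + 1, chosen purely for illustration; the polynomial used in the device is not stated in this thread. Every single-bit flip and every sampled burst no longer than the CRC width is detected, while a 4-bit error whose pattern equals the generator polynomial (spanning 17 bits) goes unnoticed.

```python
import random

GEN = 0x11021  # x^16 + x^12 + x^5 + 1, illustrative generator only
WIDTH = 16


def crc(bits: int, nbits: int) -> int:
    """Remainder of bits(x) * x^WIDTH divided by GEN over GF(2)."""
    reg = bits << WIDTH
    for pos in range(nbits + WIDTH - 1, WIDTH - 1, -1):
        if (reg >> pos) & 1:
            reg ^= GEN << (pos - WIDTH)
    return reg


random.seed(1)
NBITS = 256  # tiny stand-in for the 9'534'304-bit image
msg = random.getrandbits(NBITS)
ref = crc(msg, NBITS)

# Any single flipped bit changes the CRC.
single_detected = all(crc(msg ^ (1 << i), NBITS) != ref for i in range(NBITS))

# Sampled error bursts no longer than WIDTH bits also change the CRC.
bursts_detected = all(
    crc(msg ^ (pattern << shift), NBITS) != ref
    for shift in range(0, NBITS - WIDTH + 1, 8)
    for pattern in (0x3, 0x81, 0x8001, 0xFFFF)
)

# Four flipped bits forming a shifted copy of GEN leave the CRC unchanged.
missed = crc(msg ^ (GEN << 100), NBITS) == ref

print(single_detected, bursts_detected, missed)  # True True True
```

So a claim of 100% detection for multi-bit upsets only holds under a constraint such as "all flipped bits lie within a window no wider than the CRC".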

JohnT_Intel · ‎07-04-2019

Hi,

- Can you provide numerical data - or some sort of report of your testing ?

Unfortunately, we do not have numbers or data for this. The reason is that not every CRC error is critical. If the SEU occurs at a location that has no function, it is safe to continue what you are doing, as the device is not affected. Our testing only verifies that a CRC error is triggered whenever a real SEU occurs. The testing that has been performed therefore does not show a real-world result, as that depends on which CRAM bit is hit.

- Can you elaborate on the statement to the error detection capability of the CRC circuit ?

I understand your concern. But if all the bits are changed, the CRC signature will also mismatch and it will be flagged as a multiple-bit error. Most importantly, if an SEU error occurs and you observe a functional failure, we recommend that you recover from the SEU, as it has already affected the CRAM bits of your design on the Cyclone IV/V series device. Higher-end devices can isolate the exact location and determine whether it is a critical bit or not; on those you should act on this information rather than on the CRC_Error signal alone. If the upset occurs in a non-critical bit area, I would recommend that you continue operating the device, as nothing is affected.

On the Cyclone IV/V series, the SEU detection is able to detect a multiple-bit error but is not able to fully determine which bits are affected.

If you look into AN866 (https://www.intel.com/content/www/us/en/programmable/documentation/drj1530911544883.html), chapter "SEU FIT Parameters Report", we provide a report of SEU occurrence based on your design rather than on the full FPGA. The reason is that we do not want to trigger a false alarm when an SEU occurs in unused logic.

ARB · ‎07-08-2019

Hi,

Please initiate a private message, as the issue is now becoming device- and design-specific.

JohnT_Intel · ‎07-08-2019

A private message has been sent.
