CRAM SEU detection and mitigation support with OpenCL/oneAPI

LennartVH
New Contributor I

Hi there,

I'm doing a large and long-running FPGA supercomputing project, with an average of 20 FPGAs running in parallel over the course of 5 months.

Our project is very sensitive to cosmic ray interference, so we want to take all possible steps to ensure our result is correct. We've already made sure to collect errors from our M20K blocks, and we wish to do the same for the CRAM. Is CRAM error detection/scrubbing built into OpenCL and/or oneAPI?

Kind regards,

Lennart

LennartVH
New Contributor I

Searching through the Intel OpenCL headers, I came across this function, which seems to suggest that ECC-based CRAM SEU detection is built into the BSP. If that is the case, that would be amazing.

clSetDeviceExceptionCallbackIntelFPGA(
    cl_uint                   num_devices,
    const cl_device_id *      devices,
    CL_EXCEPTION_TYPE_INTEL   listen_mask,
    void (CL_CALLBACK *       pfn_exception_notify)(
        CL_EXCEPTION_TYPE_INTEL   exception_type,
        const void *              private_info,
        size_t                    cb,
        void *                    user_data),
    void *                    user_data);

I cannot find anything in the docs about this function, so perhaps you can help me.
A. Does this detect M20K ECC errors and/or CRAM ECC errors? I believe the M20Ks in the OpenCL kernel are not covered, as they are configured in Quad-Port mode. I'm mostly interested in CRAM ECC detection.
B. Is the listen_mask inclusive or exclusive? This is important to know and impossible to test. What should I pass to make sure I capture any and all detected CRAM ECC faults?
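
To make question B concrete, here is a minimal sketch of what an "inclusive" listen_mask would mean. It assumes, without confirmation from the header or the docs, that exception types are single bits in an unsigned integer and that the runtime delivers an exception when its bit is set in the mask. The type name exception_mask_t, the two EXC_HYPOTHETICAL_* constants, EXC_LISTEN_ALL, and would_notify() are made up for illustration; none of them come from cl_ext_intelfpga.h.

```c
#include <stdint.h>

/* Stand-in for CL_EXCEPTION_TYPE_INTEL. The real typedef lives in
 * cl_ext_intelfpga.h and is not reproduced here. */
typedef uint32_t exception_mask_t;

/* HYPOTHETICAL bit assignments, for illustration only. */
#define EXC_HYPOTHETICAL_CORRECTABLE   ((exception_mask_t)1u << 0)
#define EXC_HYPOTHETICAL_UNCORRECTABLE ((exception_mask_t)1u << 1)

/* Every bit set. Under the inclusive assumption this subscribes to all
 * exception types, including ones whose names are not known. */
#define EXC_LISTEN_ALL ((exception_mask_t)~(exception_mask_t)0)

/* Inclusive semantics: notify when the raised exception shares at least
 * one bit with the listen mask. Returns 1 to notify, 0 to stay silent. */
static int would_notify(exception_mask_t listen_mask, exception_mask_t raised)
{
    return (listen_mask & raised) != 0;
}
```

If the mask were exclusive instead, the test in would_notify() would be inverted and an all-ones mask would suppress everything, which is exactly why the answer to question B matters.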

Kind regards,
Lennart

BoonBengT_Intel
Moderator

Hi LennartVH,


Thank you for posting in the Intel community forum; I hope all is well.

May I ask which device you are using?

Also, is the CRAM you mention built into the device, or is it connected separately?

As far as I understand, ECC is not built into OpenCL/oneAPI; allow me to confirm that.

As for the code snippet you mention, in which header file did you find it?

Hope to hear from you soon.


Best Wishes

BB


LennartVH
New Contributor I

Hi again BoonBengT, thank you for responding.

My apologies, I should have specified my hardware. I'm working with the Intel Stratix 10 GX 2800 on a Bittware N520 board. There are two types of errors that I think are the most prevalent and important (if I'm leaving out an important class, please let me know):
- M20K errors. These are handled through the built-in ECC in the M20K blocks. I handle these myself.
- Configuration RAM (CRAM) errors, i.e. errors in the FPGA fabric configuration. These are the ones I wish to detect. Intel documentation indicates that there should be dedicated hardware for automatic scanning. As I understand it, this hardware then drives the CRC_ERROR pin.

I've looked in Bittware's documentation and confirmed that this BSP does handle the CRC_ERROR pin. The Quartus project generated by aoc also has the periodic integrity checking flag enabled.

I couldn't find CRAM error handling mentioned anywhere in the OpenCL documentation, though. What the documentation does mention is the -ecc flag (https://www.intel.com/content/www/us/en/docs/programmable/683846/21-4/compiling-your-kernel-with-memory-error.html), which adds M20K error detection to OpenCL code, but it does not mention CRAM error detection.

I was able to find the above function in the Intel header files at:
.../intelFPGA_pro/21.4.0/hld/host/include/CL/cl_ext_intelfpga.h

I think this callback is invoked when the CRC_ERROR pin is asserted, but I don't know exactly how it behaves. As you can see, testing this function is not exactly trivial. (I don't think they'll let me into the data center with a gamma-ray cannon.)
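
Even without a way to trigger a real upset, the handler side can at least be exercised by calling it by hand. Below is a minimal sketch of such a handler, assuming the callback may run on a runtime-owned thread and that exception_type is a 32-bit bitfield; neither assumption is confirmed anywhere in this thread. The names exception_type_t, seu_state_t, on_device_exception(), and results_are_suspect() are hypothetical, and the CL_CALLBACK macro is left out so the sketch compiles without the OpenCL headers.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for CL_EXCEPTION_TYPE_INTEL (assumed to be a 32-bit bitfield). */
typedef uint32_t exception_type_t;

/* State shared between the callback and the host main loop. */
typedef struct {
    atomic_uint_fast32_t seen_bits; /* OR of every exception_type received */
    atomic_uint_fast32_t count;     /* number of callback invocations */
} seu_state_t;

/* Same parameter list as pfn_exception_notify in the declaration quoted
 * earlier, minus the CL_CALLBACK calling-convention macro. */
static void on_device_exception(exception_type_t exception_type,
                                const void *private_info,
                                size_t cb,
                                void *user_data)
{
    (void)private_info; /* layout is undocumented, so it is ignored here */
    (void)cb;
    seu_state_t *state = (seu_state_t *)user_data;
    if (state == NULL) {
        return;
    }
    /* Record the raw bits so that no exception type is silently dropped. */
    atomic_fetch_or(&state->seen_bits, exception_type);
    atomic_fetch_add(&state->count, 1);
}

/* The host calls this before trusting a batch of results.
 * Returns 1 if any exception was reported since the state was initialised. */
static int results_are_suspect(seu_state_t *state)
{
    return atomic_load(&state->count) != 0;
}
```

The handler only records that something happened and leaves the decision (discard the batch, reprogram the device, rerun) to the main loop. A pointer to one seu_state_t per device would be passed as user_data at registration time.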

If you have any documentation or info on this, then please let me know. I can't afford even a single undetected error in this project.

Kind regards,
Lennart

BoonBengT_Intel
Moderator

Hi @LennartVH,


Thanks for the explanation; I think you have both M20K and CRAM covered.

As for the -ecc flag in the document you mention, as far as I understand it is only supported for M20K or eSRAM blocks.

For CRAM, the device instead uses on-chip EDC (error detection and correction) hardware. More detail on implementing it can be found in the link below:

- https://www.intel.com/content/www/us/en/docs/programmable/683602/21-3/mitigation-techniques-for-cram.html

As for the testing options, agreed that gamma cannons are prohibited in the data center.

I would suggest testing the CRC by injecting errors through JTAG; some steps are described in the link here:

- https://www.intel.com/content/www/us/en/docs/programmable/683602/21-3/using-the-fault-injection-debugger.html

Note: unfortunately, I did not see any document related to enabling CRAM error detection in OpenCL. I am still confirming that and trying to find the closest references. I will keep you posted; thank you for your patience.

Hope that clarifies.


Best Wishes

BB


LennartVH
New Contributor I

Yeah I do think M20K + CRAM together should cover most of them.

The one thing I'm looking for is documentation on this function:

clSetDeviceExceptionCallbackIntelFPGA(
    cl_uint                   num_devices,
    const cl_device_id *      devices,
    CL_EXCEPTION_TYPE_INTEL   listen_mask,
    void (CL_CALLBACK *       pfn_exception_notify)(
        CL_EXCEPTION_TYPE_INTEL   exception_type,
        const void *              private_info,
        size_t                    cb,
        void *                    user_data),
    void *                    user_data);

Specifically, whether the listen mask is inclusive or exclusive.
Which inputs allow me to receive ALL error signals?

I've not found any documentation on it. Could you perhaps put me in touch with one of the developers of the Intel OpenCL library?

BoonBengT_Intel
Moderator

Hi @LennartVH,

 

After some discussion with the internal team, we understand that it is not possible to monitor internal hardware circuitry inside the FPGA from OpenCL code.

However, as mentioned, to monitor CRAM errors you can monitor the CRC_ERROR pin in hardware (more details can be found in the link below):

https://www.intel.com/content/www/us/en/docs/programmable/683869/current/cram-error-detection-settings-reference.html

Note: As a reference on managing memory in OpenCL code, here is also some useful information.

 

Best Wishes

BB

 

BoonBengT_Intel
Moderator

Hi @LennartVH,


Good day, just checking in to see whether you have any further questions regarding this matter.

I hope we have clarified your doubts.


Best Wishes

BB


BoonBengT_Intel
Moderator

Hi @LennartVH,


Greetings. As we have not received any further questions about what was provided, we will assume the challenge has been overcome. For new queries, please feel free to open a new thread and we will be right with you. It was a pleasure having you here.


Best Wishes

BB


LennartVH
New Contributor I

Thank you for replying, @BoonBengT_Intel.

I have just returned from vacation. This issue has not been resolved. The documentation you linked does not cover what I asked for.

I'm aware of the CRC_ERROR pin, as mentioned in my post above. I've confirmed that the N520 board has hardware reading this pin; the only part left is Intel's host code. I'm asking specifically about a function in Intel's host code library for HLS development, shown below:

clSetDeviceExceptionCallbackIntelFPGA(
    cl_uint                   num_devices,
    const cl_device_id *      devices,
    CL_EXCEPTION_TYPE_INTEL   listen_mask,
    void (CL_CALLBACK *       pfn_exception_notify)(
        CL_EXCEPTION_TYPE_INTEL   exception_type,
        const void *              private_info,
        size_t                    cb,
        void *                    user_data),
    void *                    user_data);

This function can be found in the header file at the following path:
.../intelFPGA_pro/21.4.0/hld/host/include/CL/cl_ext_intelfpga.h

I am looking for documentation on this specific function. Searching on Google or with Intel's own search function yields no results. I have not been able to find it mentioned anywhere in the documentation either.

The question is: What arguments must I pass to this function to catch all CRC exceptions?

Kind regards,
Lennart
