Advice on CNN inference on Agilex 7 using oneAPI

Björne2 · ‎10-04-2024

Greetings everyone,

I'm tasked with porting a CNN trained with PyTorch to an Agilex 7 FPGA using HLS. I think the right tool for the job is oneAPI. Since this is not a completely novel task, I wonder if there are any existing implementations, libraries, or similar material I can reuse? I would prefer not to have to implement everything from scratch, from loading weights to max pooling layers to fixed-point numerics. I'd be happy with any pointers to materials or tutorials you might have.

Thanks in advance.

whitepau_altera · ‎10-04-2024

Do you have a board in mind? Keep in mind that if you wish to use oneAPI FPGA Acceleration, you should choose an acceleration card with a supported BSP. We have a list of vendor cards on our homepage:

https://cdrdv2.intel.com/v1/dl/getContent/824530

Instead of building a full oneAPI BSP, you can also use the oneAPI DPC++/C++ compiler to create IP that you can integrate into a Platform Designer system. We demonstrate this in the Platform Designer code sample and the Nios V reference design. You can learn more about IP interface customization by studying the HLS Flow Interfaces code samples as well.

Manually integrating your IP with Platform Designer (or SystemVerilog/VHDL if you are so inclined) gives you the ability to accelerate the embedded HPS, so you are not tied to an x86-64 host CPU.

Björne2 · ‎10-12-2024

> Do you have a board in mind? Keep in mind that if you wish to use
> oneAPI FPGA Acceleration, you should choose an acceleration card
> with a supported BSP. We have a list of vendor cards on our
> homepage:

Yeah, the board is a DE10 Agilex 7 from Terasic. The exact model is
AGF 7 014 B2E2_8GBx4.

> Instead of building a full oneAPI BSP, you can also use the oneAPI
> DPC++/C++ compiler to create IP that you can integrate using a
> platform designer system.

Well, I have a server license for Quartus Prime 21.2. Previously I
have used the aoc (Intel(R) FPGA SDK for OpenCL(TM) Kernel Compiler)
command to build FPGA bitstreams from OpenCL code so I think I already
have a suitable BSP installed. What I'm missing is how to "connect"
icpx (Intel(R) oneAPI DPC++/C++ Compiler) to the FPGA. It was easy
with OpenCL. I just compiled the kernel with aoc and then loaded it
onto the FPGA with OpenCL host code. It appears it is not that easy
with SYCL.

whitepau_altera · ‎10-14-2024

There is an additional step. With OpenCL (and earlier versions of oneAPI) we shipped some popular BSPs along with the tools, but we stopped doing that in 2022 to limit the installation size. You should be able to get a BSP from Terasic (it should have been provided when you purchased the board). Once you install the BSP, you can point the compiler to it when you compile your code. We explain how to do this in the code samples.

Managing an FPGA Board

FPGA Compile Code Sample

FYI: that BSP depends on an older version of Quartus, and unless Terasic updates the BSP, it will fall out of the support window.

Björne2 · ‎10-14-2024

Thanks for the advice. Does the BSP have to be specific to SYCL or is
a BSP for OpenCL sufficient? Anyway, I installed oneAPI as described
here:

https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?operatingsystem=linux&linux-install-type=offline

Then I sourced the ~/intel/oneapi/setvars.sh script to run the SYCL
tools and created the Vector Add project. Compiling it for cpu-gpu
works fine, but compiling it for FPGA targets does not:

icpx -fsycl -fintelfpga -Xshardware -Xstarget=Agilex7 -v -Wsycl-strict vector-add-buffers.cpp

I get the following error message:

llvm-foreach --out-ext=aocx --in-file-list=/tmp/icpx-003d458ab4/vector-add-buffers-bc1783.txt --in-replace=/tmp/icpx-003d458ab4/vector-add-buffers-bc1783.txt --out-file-list=/tmp/icpx-003d458ab4/vector-add-buffers-1dc7d3.aocx --out-replace=/tmp/icpx-003d458ab4/vector-add-buffers-1dc7d3.aocx --out-increment=a.prj -- /vol/opt/intelFPGA_pro/21.2/hld/bin/aoc -o /tmp/icpx-003d458ab4/vector-add-buffers-1dc7d3.aocx /tmp/icpx-003d458ab4/vector-add-buffers-bc1783.txt -sycl -dep-files=/tmp/icpx-003d458ab4/vector-add-buffers-f000eb.d -output-report-folder=a.prj -g -hardware -target=Agilex7
AOCL_TMP_DIR directory was specified at /home/bjourne/.cache/aocl.
Ensure Linux and Windows compiles do not share the same directory as files may be incompatible.
InvalidModule: Invalid SPIR-V module: unsupported SPIR-V version number 'unknown (66560)'. Range of supported/known SPIR-V versions is 1.0 (65536) - 1.3 (66304)
Error: SPIRV to LLVM IR FAILED

I'm using the Intel(R) FPGA SDK for OpenCL(TM), 64-Bit Offline
Compiler version 21.2.0. Perhaps there is some mismatch between what
the oneAPI tools expect and what the offline compiler is capable of?
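
For readers puzzling over the numbers in the error message above: per the SPIR-V specification, the version word stores the major number in bits 16-23 and the minor number in bits 8-15. The following is a minimal sketch (the function name `decode_spirv_version` is made up for illustration) that decodes the three values quoted in the log. It suggests that the module was emitted as SPIR-V 1.4, one minor version above what the 21.2 offline compiler reports it can read.

```python
def decode_spirv_version(word: int) -> str:
    """Decode a SPIR-V version word laid out as 0 | major | minor | 0 (one byte each)."""
    major = (word >> 16) & 0xFF
    minor = (word >> 8) & 0xFF
    return f"{major}.{minor}"


# Values copied from the error message in the post above.
for word in (65536, 66304, 66560):
    print(word, "->", decode_spirv_version(word))
```

The first two values decode to 1.0 and 1.3, matching the range printed by the tool, while 66560 decodes to 1.4.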

whitepau_altera · ‎10-15-2024

It looks like you have some mismatched tools installed. You don't need the FPGA SDK for OpenCL; here is what you need (from our homepage):

oneAPI DPC++ base toolkit 2024.2
FPGA support Package 2024.2 (the FPGA compiler is no longer distributed with the base toolkit since 2024.2)
Quartus Prime 21.2 (with Agilex 7 Device support)
BSP for DE10-Agilex card

For best results, use one of the FPGA code samples, as those are the ones we regression test:

https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL_FPGA/Tutorials/GettingStarted/fpga_compile

https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL_FPGA/Tutorials/GettingStarted/fpga_template

Björne2 · ‎10-15-2024

Thanks for your advice. With the FPGA support package I can now
compile SYCL code for FPGA targets. It seems aoc (from the FPGA
support package) uses the correct BSP:

$ aoc -list-boards
Board list:
  B1E1_8GBx4
     Board Package: /vol/opt/intelFPGA_pro/21.2/hld/board/de10_agilex

  B2E2_8GBx4 (default)
     Board Package: /vol/opt/intelFPGA_pro/21.2/hld/board/de10_agilex

  ...

I created a pre-synthesis report from the FPGA vector_add example like this:

$ icpx -v -fsycl -fintelfpga -Xshardware -Xstarget=Agilex7  -fsycl-link=early vector_add.cpp -o reportz

When I open the report it says "Report has invalid data. Ok to
proceed?" If I do so I get a report that is not quite right. In
particular, the global memory bandwidth estimates are wrong (see
screenshot). Is there something else I need to do? Like add something
to the icpx command to get it to use the right BSP?

whitepau_altera · ‎10-15-2024

The "Report has invalid data. Ok to proceed?" error is unrelated to the bandwidth notes you are seeing; this is a known defect in the reports that you may ignore.

The incorrect Global Memory Bandwidth estimates might be an issue with your BSP; does the board_spec.xml that you got from Terasic have any value specified for Global Memory Bandwidth?

Björne2 · ‎10-15-2024

I think so. /vol/opt/intelFPGA_pro/21.2/hld/board/de10_agilex/hardware/B2E2_8GBx4/board_spec.xml contains:

<!-- DDR4-2666 -->
<global_mem name="DDR" max_bandwidth="85312" interleaved_bytes="1024" config_addr="0x018">
    <interface name="board" port="kernel_mem0" type="slave" width="512" maxburst="16" address="0x00000000" size="0x200000000" latency="240" waitrequest_allowance="6"/>
    <interface name="board" port="kernel_mem1" type="slave" width="512" maxburst="16" address="0x200000000" size="0x200000000" latency="240" waitrequest_allowance="6"/>
    <interface name="board" port="kernel_mem2" type="slave" width="512" maxburst="16" address="0x400000000" size="0x200000000" latency="240" waitrequest_allowance="6"/>
    <interface name="board" port="kernel_mem3" type="slave" width="512" maxburst="16" address="0x600000000" size="0x200000000" latency="240" waitrequest_allowance="6"/>
</global_mem>

Moreover, if I compile an equivalent OpenCL kernel (aoc -bsp-flow=flat -rtl vector_add.cl) global memory bandwidth is estimated correctly (see screenshot).
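
As a sanity check on the board_spec.xml fragment above, here is a small sketch that parses it and compares the declared max_bandwidth with a back-of-the-envelope figure. It assumes that max_bandwidth is expressed in MB/s, that each interface is one DDR4-2666 channel with a 64-bit (8-byte) data bus, and that the comment in the file describes the memory accurately. The helper name `summarize_global_mem` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the board_spec.xml quoted in the post above.
BOARD_SPEC_FRAGMENT = """
<global_mem name="DDR" max_bandwidth="85312" interleaved_bytes="1024" config_addr="0x018">
    <interface name="board" port="kernel_mem0" type="slave" width="512" maxburst="16" address="0x00000000" size="0x200000000" latency="240" waitrequest_allowance="6"/>
    <interface name="board" port="kernel_mem1" type="slave" width="512" maxburst="16" address="0x200000000" size="0x200000000" latency="240" waitrequest_allowance="6"/>
    <interface name="board" port="kernel_mem2" type="slave" width="512" maxburst="16" address="0x400000000" size="0x200000000" latency="240" waitrequest_allowance="6"/>
    <interface name="board" port="kernel_mem3" type="slave" width="512" maxburst="16" address="0x600000000" size="0x200000000" latency="240" waitrequest_allowance="6"/>
</global_mem>
"""

DDR4_MEGATRANSFERS_PER_S = 2666  # from the "DDR4-2666" comment in the file
DDR4_BUS_BYTES = 8               # assumption: 64-bit data bus per channel


def summarize_global_mem(fragment: str) -> dict:
    """Return declared vs. estimated bandwidth and total capacity."""
    root = ET.fromstring(fragment.strip())
    interfaces = root.findall("interface")
    # int(..., 16) accepts the "0x" prefix used in the size attributes.
    total_bytes = sum(int(i.get("size"), 16) for i in interfaces)
    return {
        "channels": len(interfaces),
        "declared_mb_s": int(root.get("max_bandwidth")),
        "estimated_mb_s": len(interfaces) * DDR4_MEGATRANSFERS_PER_S * DDR4_BUS_BYTES,
        "total_gib": total_bytes / 2**30,
    }


print(summarize_global_mem(BOARD_SPEC_FRAGMENT))
```

Under these assumptions the declared value and the estimate agree (4 x 2666 x 8 = 85312), and the four 0x200000000-byte interfaces add up to 32 GiB, consistent with the 8GBx4 in the board name. So the BSP file itself looks plausible, and the odd numbers in the report more likely come from how the compiler was invoked or from the report generator.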

whitepau_altera · ‎10-15-2024

Thanks, Björne. I'll forward this to engineering; you may have found a bug in the reports.

Björne2 · ‎10-16-2024

Hello again. I might have solved my problem. Apparently the -Xstarget option should name a specific board and not an FPGA family. So with -Xstarget=B2E2_8GBx4 the generated report looks much better. When synthesis is done in a few hours I'll check if I can run the generated bitstream on the FPGA.

Feng-Y-28 · ‎11-10-2024

Hi Bjorne

Could you share your environment setup for the DE10-Agilex?

I have been stuck on my issue for half a month without progress.

Could you give me the following environment information for a setup that can run the generated bitstream on the FPGA:

oneAPI DPC++ base toolkit version
Quartus Prime version

I tried the 24.2 and 24.1 base toolkits, but both failed.

Regards

Feng

Feng-Y-28 · ‎10-30-2024

Hi whitepau

I also use the DE10-Agilex card; the generated binary files have been invalid since I upgraded oneAPI to 24.2. I found the document "Intel® oneAPI DPC++/C++ Compiler System Requirements", which says the supported Quartus versions are 22.3 to 24.2, not 22.1.

I want to know whether a system with the following configuration can work with the DE10-Agilex card:

oneAPI DPC++ base toolkit 2024.2
FPGA support Package 2024.2 (the FPGA compiler is no longer distributed with the base toolkit since 2024.2)
Quartus Prime 21.2 (with Agilex 7 Device support)
BSP for DE10-Agilex card

Regards

Feng

whitepau_altera · ‎10-31-2024

I replied to your other thread.

https://www.intel.com/content/www/us/en/developer/articles/release-notes/intel-oneapi-compiler-fpga-add-on-release-notes-2024.html#release-notes-2024-2-release-notes

According to the release notes, we deprecated Quartus® Prime 21.4 and earlier in 2024.2 (support is only removed in 2025.0), so the DE10 Agilex™ should still work with that release.

haoyanwa · ‎10-04-2024

Thank you for reaching out! I recommend checking out HLS4ML: HLS4ML GitHub Repository.

It is a framework designed to convert machine learning models from popular libraries like PyTorch and Keras into FPGA binaries. It integrates seamlessly with oneAPI by utilizing the DPC++/C++ compiler in the backend to generate IP blocks that represent the different components of your model, such as layers, activation functions, and more.

While HLS4ML is still a work in progress, it offers a good starting point that can save you considerable time compared with implementing everything from scratch, including handling weights, pooling layers, and fixed-point arithmetic.
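
To make the fixed-point part concrete, here is a minimal sketch of quantizing a floating-point weight to a signed fixed-point value with `total_bits` bits, of which `int_bits` (including the sign bit) sit before the binary point. This is an illustration only, not HLS4ML's implementation: the function name `quantize_fixed` is hypothetical, and the sketch assumes round-to-nearest with saturation, whereas HLS fixed-point types often default to truncation and wrap-around.

```python
import math


def quantize_fixed(x: float, total_bits: int = 8, int_bits: int = 2) -> float:
    """Quantize x to signed fixed point (assumed: round-to-nearest, saturating)."""
    frac_bits = total_bits - int_bits
    scale = 1 << frac_bits                    # one LSB equals 1 / scale
    q = math.floor(x * scale + 0.5)           # round half up to the nearest step
    q_min = -(1 << (total_bits - 1))          # most negative raw integer
    q_max = (1 << (total_bits - 1)) - 1       # most positive raw integer
    q = max(q_min, min(q_max, q))             # saturate instead of wrapping
    return q / scale


for w in (0.7, 3.0, -5.0):
    print(w, "->", quantize_fixed(w))
```

With 8 total bits and 2 integer bits, the representable range is -2.0 to 1.984375 in steps of 1/64, so 0.7 becomes 0.703125 while 3.0 and -5.0 saturate at the range limits.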

For a step-by-step guide on how to get started, you can explore their tutorials here: HLS4ML Tutorials. These Jupyter Notebooks walk you through the process from building and training a model to emulating it on an FPGA.

Let us know if you have any further questions.

BoonBengT_Altera · ‎10-22-2024

Hi @Björne2,

Good to know that it is working now. As we have seen no further questions on this thread, it will be transitioned to community support. If you need further help from me, please log in to https://supporttickets.intel.com, view the details of the desired request, and post a response within the next 15 days. After 15 days, this thread will be transitioned to community support.

Thank you for the questions and, as always, it is a pleasure having you here.

Best Wishes

BB
