oneAPI on Cyclone10gx

StefanoC · ‎12-26-2023

Hi all,

the official Intel fpga requirement page says the Cyclone10gx fpga is supported by oneAPI so I downloaded the latest version on my Ubuntu20 (Quartus Prime also installed), I tried to compile a sample-adder, the compiler (targeting the fpga) works but then when I run simple-add-buffer.fpga I get:

tetto@ubuntuoffice:~/simple-add/build$ ./simple-add-buffers.fpga
An exception is caught while computing on device.
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
what(): No device of requested type available. Please check https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-dpcpp-system-requirements.html -1 (PI_ERROR_DEVICE_NOT_FOUND)
Aborted (core dumped)

yuguen · ‎02-08-2024

Hey Stefano,

When you compile using an FPGA family (such as "Cyclone10GX"), the compiler enters an HLS flow: it only generates an IP that you need to manually integrate into your own RTL pipeline.

The fpga binary that you obtained is not executable: it is only produced for you to inspect the performance of the IP after quartus compiled it (fmax, resource usage, etc.)

Here is a code sample demonstrating how one can integrate an HLS IP into an RTL pipeline to be able to run such IP on an FPGA: https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL_FPGA/Tutorials/Tools/platform_designer

You were expecting the compiler to produce a binary that could be directly executed on the FPGA: to do so, the compiler needs to understand the interface between the IP and your FPGA. This is what we call the "BSP".

Some FPGA board vendors do provide BSPs with their FPGA boards, which would have allowed you to compile your program using "icpx ... -Xstarget=<path to your BSP> ..." rather than "-Xstarget=Cyclone10GX".

In that case, and in that case only, the FPGA binary produced could have been run on the FPGA natively.

You can have a look at this documentation page to better understand the difference between the "FPGA acceleration flow" and the "HLS flow": https://www.intel.com/content/www/us/en/docs/oneapi-fpga-add-on/developer-guide/2024-0/intel-oneapi-fpga-development-flow.html

StefanoC · ‎02-18-2024

@yuguen that really clarified everything, thanks!

In the document you pointed:

https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL_FPGA/Tutorials/Tools/platform_designer

something is unclear to me, you create a project in Quartus but it seems you do not create a Top-level entity, since when I do "Start Analysis and Elaboration" I get an error about that, I wonder if a top-level entity should be created anyway and what should look like. In one screenshot from the previous github I see that the top level entity is named "add" when you create the project, but then nothing is mentioned about it (what should just contain?)

yuguen · ‎02-19-2024

Hey @StefanoC

This sample demonstrates how to integrate a generated IP into an existing RTL pipeline.

In this case, the existing RTL files are located in https://github.com/oneapi-src/oneAPI-samples/blob/master/DirectProgramming/C%2B%2BSYCL_FPGA/Tutorials/Tools/platform_designer/add-quartus-sln

The "add" top level entity is found in the add.sv file.

"Step 2." tells you to copy this "add.sv" file into your Quartus project folder:

cp add-quartus-sln/add.sv add-quartus

Then, to add it to your Quartus project (step 2.v):

Following these steps (with all the other steps), Quartus should be able to understand that the "add" top level module is in this file.

StefanoC · ‎02-19-2024

Thanks @yuguen for clarifying the content of add.sv

However I had followed the instructions and after I issue "make report" the following files are generated (there is no add.sv but rather add_report_di.sv which internally has a different module name and subsequently Quartus complains about missing top node. Why do I get add_report_di.sv rather than add.sv?

-rw-rw-r-- 1 tetto tetto   527 Feb 19 17:08 sys_description.txt
-rw-rw-r-- 1 tetto tetto  3831 Feb 19 17:08 sys_description.legend.txt
-rw-rw-r-- 1 tetto tetto  1860 Feb 19 17:08 sys_description.hex
-rw-rw-r-- 1 tetto tetto   105 Feb 19 17:08 opencl.ipx
-rw-rw-r-- 1 tetto tetto  4758 Feb 19 17:08 kernel_system.v
-rw-rw-r-- 1 tetto tetto  3388 Feb 19 17:08 kernel_system.tcl
-rw-rw-r-- 1 tetto tetto  7882 Feb 19 17:08 kernel_system.qip
-rw-rw-r-- 1 tetto tetto    31 Feb 19 17:08 kernel_system_import.tcl
-rw-rw-r-- 1 tetto tetto    70 Feb 19 17:08 kernel_report.tcl
-rw-rw-r-- 1 tetto tetto  1197 Feb 19 17:08 ipinterfaces.xml
-rw-rw-r-- 1 tetto tetto    72 Feb 19 17:08 ip_include.tcl
-rw-rw-r-- 1 tetto tetto    28 Feb 19 17:08 compiler_metrics.out
-rw-rw-r-- 1 tetto tetto  1197 Feb 19 17:08 board_spec.xml
-rw-rw-r-- 1 tetto tetto     0 Feb 19 17:08 add_report.v
-rw-rw-r-- 1 tetto tetto  3759 Feb 19 17:08 add_report_sys.v
-rw-rw-r-- 1 tetto tetto 18924 Feb 19 17:08 add_report_sys_hw.tcl
-rw-rw-r-- 1 tetto tetto   231 Feb 19 17:08 add_report.log
-rw-rw-r-- 1 tetto tetto 19180 Feb 19 17:08 add_report_di.sv
-rw-rw-r-- 1 tetto tetto  1307 Feb 19 17:08 add_report_di_inst.v
-rw-rw-r-- 1 tetto tetto 17892 Feb 19 17:08 add_report_di_hw.tcl
-rw-rw-r-- 1 tetto tetto  8393 Feb 19 17:08 add_report.bc.xml
drwxrwxr-x 2 tetto tetto  4096 Feb 19 17:08 ip
drwxrwxr-x 3 tetto tetto  4096 Feb 19 17:08 reports
drwxrwxr-x 3 tetto tetto  4096 Feb 19 17:08 kernel_hdl
drwxrwxr-x 3 tetto tetto  4096 Feb 19 17:08 include
drwxrwxr-x 3 tetto tetto  4096 Feb 19 17:08 linux64

Also in the instruction it is sad to copy these:

cp add-quartus-sln/add.sv add-quartus
$> cp add-quartus-sln/jtag.sdc add-quartus

but after "make report" the folder add-quartus-sln is not there, so also miss jtag.sdc

yuguen · ‎02-20-2024

Hey @StefanoC,

When doing "make report", you are generating RTL for the SYCL code that is in the add-oneapi folder.

This RTL top module can indeed be found in "add_report_di.sv".

This is the IP that you need to integrate into an existing RTL pipeline.

The "add_quartus_sln" folder already contains RTL, and is there to mimic your own RTL pipeline. So the "add.sv" file is already there, before you do "make report" as this is not a generated file, this is the existing RTL pipeline. You can peak into this file and see it is making a led turn on on the FPGA based on another signal. This is not possible to express using SYCL.

This tutorial shows how to connect the generated RTL from SYCL (the add_report_di.sv IP) with the existing "add.sv" RTL pipeline.

So "add" from add.sv is the top level module, that depends on the SYCL generated RTL.

The steps in the README tells you to:

1/ generate the SYCL IP

Create a Quartus project with the existing RTL files:

This also sets the top level module to "add" which is contained in the add.sv that was just copied from add-quartus-sln

Then, import the SYCL generated IP:

Then connects the two in the following steps, etc.

StefanoC · ‎02-21-2024

Ok @yuguen now I better understand the workflow, and I could make progress thanks to your explanation.

Unfortunately I am stuck at the last step of the tutorial as I am on the Cyclone10 (while the tutorial is for Arria) so when setting the pins I really don't know which ones to select (PIN_AM10 does not exist on the Cyclone10). Also I feel I would need the a file jtag.sdc made for Cyclone10 (I cannot seem to find it here: https://github.com/altera-opensource/ghrd-socfpga)

On a slightly different note, I tried to modify the C++ source file adding arrays to be added (rather than primitive int), in my new add_kernel_wrapper I observe I got an "Avalon Memory Mapped Host" while previously (like in the tutorial) I only had "Avalon Memory Mapping Agent"); I wonder how the new "Avalon Memory Mapped Host" should be connected, as generating the HDL I see warning about "Avalon Memory Mapped Host" must be connected to an Avalon-MM agent or exported.

whitepau_altera · ‎02-22-2024

Hi Stefano,

Can you please share what board you are using?

I will assume you are using the Terasic Cyclone 10 GX devkit. There are some sample designs here: https://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=253&No=1147&PartNo=3#contents

According to the user guide in there, there are 2 100MHz clocks you can use, C10_CLKUSR and C10_REFCLK2:

As far as the JTAG.sdc, I can't find one to use for that devkit, but it is not strictly necessary for functionality. The timing analyzer will complain about timing failure since it will try to constrain the JTAG lines but I think you can ignore those warnings.

StefanoC · ‎02-22-2024

Hi @whitepau_altera

yes I am using the Cyclone 10GX, the link you provided contains various things but I do not see any oneAPI example. So far I think I am near to having a working sample using the "add" tutorial (https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL_FPGA/Tutorials/Tools/platform_designer/add-oneapi/src based on the Arria board), I will just provide the pin clock you suggested and bypass the jtag.sdc

One clarification I need is about the avalon memory mapping, while using the "add" tutorial I added in the C source file a few arrays to be added/multiplied rather the int a and int b. It seems adding some arrays in the C code changed the way the IP got generated, see the attached picture, I have avm_mem_gmem0_0port_0_0rw. What is the nature of this additional host? I can't connect the Avalon Agent to two hosts, shall I drop the JTAG to Avalon Master Bridge Intel FPGA IP host in favor of the avm_mem_gmem0_0port_0_0rw?

thanks a lot!

p.s. in the picture read nbody as add (I had edited the name)

whitepau_altera · ‎02-22-2024

if you use arrays as your kernel arguments, the compiler will map them to an mm host interface, that reaches out to a mm agent interface. for more info see this code sample: https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL_FPGA/Tutorials/Features/ip_authoring_interfaces/component_interfaces_comparison

You need to add some kind of memory for the host to connect to. You are getting into system design that is beyond the scope of oneAPI SYCL HLS /IP creation

There is a sample here that shows a simple system with an on-chip memory:
https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL_FPGA/ReferenceDesigns/niosv

StefanoC · ‎02-22-2024

@whitepau_altera my source code is exactly as the "Naive" solution:

https://github.com/oneapi-src/oneAPI-samples/blob/master/DirectProgramming/C%2B%2BSYCL_FPGA/Tutorials/Features/ip_authoring_interfaces/component_interfaces_comparison/naive/src/vector_add.cpp

I can compile it with oneAPI and produce IPs; I "just" miss how to how to deal with the avalon thing, as the link below fairly well describes what to do for primitive data type only: https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL_FPGA/Tutorials/Tools/platform_designer

is there an equivalent platform design, module connection for the vectorial add ("naive implementation")?

StefanoC

whitepau_altera · ‎02-22-2024

is there an equivalent platform design, module connection for the vectorial add ("naive implementation")?

No, as I mentioned, the naive solution in that link requires some memory-mapped agent to read the input vectors from and write the output vectors to. You can add an on-chip memory IP to the Platform Designer sample. The IP will have one or more Avalon Agent (slave) interfaces. You need to connect the host (master) interface from your vector_add IP to this agent, and then connect the master interface from the jtag avalon IP to the on-chip memory agent as well; you can use the JTAG interface to fill the on-chip memory with data and then configure the vector_add ip to access it. This is of course a lot of JTAG commands!

An easier solution is the design in the Nios® V softcore processor sample that I shared:
https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL_FPGA/ReferenceDesigns/niosv

This uses a soft CPU in place of all the JTAG commands i mentioned, so it fills the on-chip memory with some data and configures and starts the oneAPI IP. In the niosv sample, the oneAPI IP does a simple memory copy rather than a vector add.

The JTAG UART IP in this design is similar to the JTAG Avalon Master IP in the Platform Designer sample. It allows the Nios soft processor to be controlled through a JTAG interface.

of course, you don't have to use on-chip memory; if you want you can use an EMIF IP to connect to the DRAM chips on your Cyclone 10 GX board (but I have no experience using that so someone else will need to help you with that :))

StefanoC · ‎02-22-2024

@whitepau_altera I think the niosV will fit my case, I will give a try today, one last clarification, as that userguide mention "..demonstrates how to simulate an FPGA IP produced with the Intel® oneAPI DPC++/C++" and scrolling down I see that towards the end it actually "Generate Testbench System". Will I be able to synthetize on the fpga rather than just simulate?

thanks!

StefanoC

StefanoC · ‎02-22-2024

Forgot to say that Cyclone10gx is not mentioned in the previous niosV thing, as you indicated that link, shall I assume that approach will work on my board?

whitepau_altera · ‎02-22-2024

yeah that project has no board-specific settings.

like I said: watch out for the on-chip block RAM getting too big.

StefanoC · ‎02-22-2024

@whitepau_altera I will check it out, assuming simulation will work, I could them synthetize on my board, right?

StefanoC

whitepau_altera · ‎02-22-2024

If you set the pins and the BRAM size, it should work on a board. You will need to use the nios tools to connect to the niosv and monitor it over the JTAG bus. There are guides for that though

https://cdrdv2-public.intel.com/784469/an-784468-784469.pdf

StefanoC · ‎02-22-2024

@whitepau_alterathe simulation does work. I will edit the sample dma to actually do something useful rather than copying/comparing elements. You mentioned I will have to use the nios tools to connect to the soft core; however I was planning on running my code on the computer CPU with some part of my algorithm memory mapped to an IP that will result from oneAPI compiler. So in this case, if I understand correctly, I don't need to connect to the niosV (unless I want to inspect/debug), am I right?

thanks!

StefanoC · ‎02-24-2024

Hi @whitepau_altera I am still running the plain niosv sample simple dma (as it is in the repository without any modification), trying to run in on the hardware and have the cpu (C) interact with the fpga (I confirm the simulation, as detailed in the github repository works).

I added this top node entity:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity test_system is
  port (
    clk : in STD_LOGIC;
    rst : in STD_LOGIC
  );
end entity test_system;

architecture Behavioral of test_system is
    component pd_system is
        port (
            clk_clk                                          : in  std_logic                     := 'X';
            reset_reset                                      : in  std_logic                     := 'X';
            simple_dma_accelerator_device_exception_bus_data : out std_logic_vector(63 downto 0)
        );
    end component pd_system;

    signal pd_system_clk : std_logic;
    signal pd_system_rst : std_logic;

begin
    u0 : pd_system
        port map (
            clk_clk                                          => pd_system_clk,
            reset_reset                                      => pd_system_rst,
            simple_dma_accelerator_device_exception_bus_data => open
        );
    pd_system_clk <= clk;
    pd_system_rst <= rst;

end architecture Behavioral;

I synthesized the project onto my board, I got a few warnings:

1) Critical Warning(12677): No exact pin location assignment(s) for 1 pins of 2 total pins. For the list of pins please refer to the I/O Assignment Warnings table in the fitter report

2) No user constrained base clocks found in the design. Calling "derive_clocks -period 1.0"

3)Timing requirements not met

clk -6.334 -18110.791 7371 Slow 900mV 100C Model 1
altera_reserved_tck -1.788 -410.899 380 Slow 900mV 100C Model 2

About 1) I think it's referring to the clock/reset signals, I tried to put clock location in the pin planner as you suggested "C10_CLKUSR" but that value it's not accepted, scrolling the dropdown menu I selected "PIN_C10 I/O Bank 2k"; I haven't yet assigned the reset that's why one location is not assigned.

I then used the USB blaster JTAG to program the board (successfully).

then I compiled /kernels/simple_dma/ using "make fpga" rather than "make report" and I got an executable, but when I run it I got this:

tetto@ubuntuoffice:~/oneAPI-samples/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/niosv/kernels/simple_dma/build$ ./simple_dma.fpga
Running on device: SimulatorDevice : Multi-process Simulator (aclmsim0)
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Invalid device program image: size is zero -30 (PI_ERROR_INVALID_VALUE)
Aborted (core dumped)
tetto@ubuntuoffice:~/oneAPI-samples/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/niosv/kernels/simple_dma/build$ sudo env "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" ./simple_dma.fpga
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  No device of requested type available. Please check https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-dpcpp-system-requirements.html -1 (PI_ERROR_DEVICE_NOT_FOUND)
Aborted

I tried with sudo as I suspected it couldn't find the board; I exported a variable because without it could complain about a missing library, however it screams about a runtime error.

Am I missing something to run this sample on the board and have the C interact with it?

thanks!!

StefanoC · ‎02-26-2024

@whitepau_altera

I successfully synthesized the niosV example (as it is), adding this top node:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity test_system is
  port (
    clk : in STD_LOGIC;
    rst : in STD_LOGIC
  );
end entity test_system;

architecture Behavioral of test_system is
    component pd_system is
        port (
            clk_clk                                          : in  std_logic                     := 'X';
            reset_reset                                      : in  std_logic                     := 'X';
            simple_dma_accelerator_device_exception_bus_data : out std_logic_vector(63 downto 0)
        );
    end component pd_system;

    signal pd_system_clk : std_logic;
    signal pd_system_rst : std_logic;

begin
    u0 : pd_system
        port map (
            clk_clk                                          => pd_system_clk,
            reset_reset                                      => pd_system_rst,
            simple_dma_accelerator_device_exception_bus_data => open
        );
    pd_system_clk <= clk;
    pd_system_rst <= rst;

end architecture Behavioral;

However when I tried (a few things) to run the software companion produced by oneAPI I got:

tetto@ubuntuoffice:~/oneAPI-samples/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/niosv/kernels/simple_dma/build$ ./simple_dma.fpga
Running on device: SimulatorDevice : Multi-process Simulator (aclmsim0)
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
tetto@ubuntuoffice:~/oneAPI-samples/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/niosv/kernels/simple_dma/build$
tetto@ubuntuoffice:~/oneAPI-samples/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/niosv/kernels/simple_dma/build$ sudo ./simple_dma.fpga
[sudo] password for tetto:
./simple_dma.fpga: error while loading shared libraries: libdspba_mpir.so.23: cannot open shared object file: No such file or directory
tetto@ubuntuoffice:~/oneAPI-samples/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/niosv/kernels/simple_dma/build$ sudo LD_LIBRARY_PATH=$LD_LIBRARY_PATH  ./simple_dma.fpga
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  No device of requested type available. Please check https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-dpcpp-system-requirements.html -1 (PI_ERROR_DEVICE_NOT_FOUND)
Aborted
tetto@ubuntuoffice:~/oneAPI-samples/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/niosv/kernels/simple_dma/build$

Coming back to this example, I was also playing with:

https://github.com/oneapi-src/oneAPI-samples/blob/master/DirectProgramming/C%2B%2BSYCL_FPGA/Tutorials/GettingStarted/fpga_compile/part3_dpcpp_lambda_usm/src/vector_add.cpp

oneAPI seems to produce a quartus project completed with the top node (that I can compile and synthesize), but again when I run the executable I encounter a runtime error. In that vector_add.src it seems this "unified memory" would just do the job, but I recall you mentioned earlier one has to edit the RTL design (although in this part3 the top node seems properly generated and needed IPs/avalon devices are there and instantiated).

whitepau_altera · ‎02-26-2024

Stefano, let me refer you once again to this comment: https://community.intel.com/t5/Intel-High-Level-Design/oneAPI-on-Cyclone10gx/m-p/1573813/highlight/true#M3497

The code you got from the code sample is the dark green boxes in this picture (Host code and Application Kernel). without all the other stuff in the middle, that generated executable file will not work.

The executable that oneAPI emits will not work without a supported BSP. Since you have selected -Xstarget=Cyclon10GX, you have created an IP, which is just the Application Kernel in that picture. This means that when you use the Application Kernel, it is treated like any other IP that was written using Verilog or VHDL.

To 'run' your IP, you will need to program the Nios-based design you created onto your board and execute Nios code to control the IP.

oneAPI seems to produce a quartus project completed with the top node (that I can compile and synthesize), but again when I run the executable I encounter a runtime error. In that vector_add.src it seems this "unified memory" would just do the job, but I recall you mentioned earlier one has to edit the RTL design (although in this part3 the top node seems properly generated and needed IPs/avalon devices are there and instantiated).

That generated Intel® Quartus® Prime project is only for estimating fMAX of your IP. The point of it is to assign the pins of the IP to virtual pins so that Intel® Quartus® Prime will place and route it without actually connecting it to physical pins. Here is an explanation of virtual pin assignments https://www.youtube.com/watch?v=QET0lC-jdAQ

If you wish to build a custom BSP so that you can install your C10GX PCIe card into a computer and communicate with it through the OpenCL runtime (and have oneAPI host code control it), we have guidance on custom BSP creation here:

Getting started : https://ofs.github.io/latest/hw/common/user_guides/oneapi_asp/ug_oneapi_asp/

Reference Manual: https://ofs.github.io/latest/hw/common/reference_manual/oneapi_asp/oneapi_asp_ref_mnl/

BoonBengT_Altera · ‎02-12-2024

Hi Stefano,

Following through the previous clarification to see if there are any further doubts in regards to this matter.

Hope your doubts have been clarified.

Best Wishes

BB

High Level Synthesis (BSP | Compiler | IP Integration)