Intel® High Level Design
Support for Intel® High Level Synthesis Compiler, DSP Builder, OneAPI for Intel® FPGAs, Intel® FPGA SDK for OpenCL™

HLS i++ style IP generation with oneAPI for RTL IP integration in Quartus

romabo
Beginner

Hello,

I am new to the toolchain and was wondering whether there is a way to generate RTL IP for integration in Quartus Prime Pro, similar to the HLS i++ compiler but with oneAPI, since the former is being deprecated.

I just need standalone soft IP cores to integrate into an RTL design in Quartus, so I don't need the kernel formalism and would simply like to generate a .ip from multiple functions or files.

I can't find much documentation on that specific RTL-oriented approach, so I would appreciate some help. Basically, starting from the fpga_template sample, I would want to get a simple VectorAdd IP with a, b inputs instead of the wrapped kernel generated by icpx, which integrates the stream interface.

If I simply use HLS syntax and compile with icpx -fsycl or dpcpp, I get a warning because no kernel is specified, and the generated .prj shows an empty description (only basic signals such as reset/clock appear in the instantiation).

Thank you for your time.

whitepau
Employee

Thanks for your question.

The SYCL boilerplate is required even for the SYCL HLS IP-authoring mode of our oneAPI DPC++/C++ compiler. We have a number of code samples that can help you create and customize IP.

To quickly get started, there is a platform_designer code sample that walks you through a minimal design that can be exported to Platform Designer. A video walkthrough is available on our YouTube channel: https://www.youtube.com/watch?v=-f2LA_dyQg0

 

We have a more complete customer journey in the design hub:
oneAPI Guided Journey

 

You may be interested in the code samples that pertain to HLS interface customization:
HLS Flow Interfaces Code Samples

romabo
Beginner

Hello,

Thank you for the clarification and links.

I took some time to play with the examples and look at the different interfaces, but I am still uncertain about which interface would best fit my application.

I basically want to use the HLS tools to generate optimized computation modules that operate and communicate only with RTL. The main idea is to save development time by having HLS automatically enable the DSP registers and synchronize the intermediate computations. I find this more practical than generating the DSP IPs in Quartus, which would require writing some HDL to synchronize intermediate computations and estimating how many registers to enable for the target fmax. With HLS I can simply describe the operations to be computed and quickly obtain the module's latency and fmax.

To take a concrete example, I adapted the add_oneapi example to compute a single-precision floating-point multiply-add (which would use only one DSP in Quartus with the dedicated IP). To the best of my knowledge, the invocation and data interfaces should be streaming pipes, but this creates some extra logic shown in the report (presumably to handle the pipes or Avalon MM), whereas with the directive -Xsdsp-mode=prefer-dsp I would expect it to use the internal registers of the DSP and perform the whole computation in the DSP slice.

So I was wondering which streaming data interface is best suited for RTL communication with minimal logic inference (basically just registered inputs/outputs with start/done control), instead of the pipe implementing the avalon_mm protocol from the example.

 

Thank you for your time.

BoonBengT_Intel
Moderator

Hi @romabo,


Greetings, just checking in to see if there are any further doubts regarding this matter.

Hope your doubts have been clarified.


Best Wishes

BB


BoonBengT_Intel
Moderator

Hi @romabo,

Thanks for sharing your knowledge on how you did it.

That would be a great help to others who may have similar challenges; we appreciate it.


Please do let us know if there are any further doubts regarding this matter.

Hope your doubts have been clarified.


Best Wishes

BB


whitepau
Employee

@romabo wrote:

To take a concrete example, I adapted the add_oneapi example to compute a single-precision floating-point multiply-add (which would use only one DSP in Quartus with the dedicated IP). To the best of my knowledge, the invocation and data interfaces should be streaming pipes, but this creates some extra logic shown in the report (presumably to handle the pipes or Avalon MM), whereas with the directive -Xsdsp-mode=prefer-dsp I would expect it to use the internal registers of the DSP and perform the whole computation in the DSP slice.

 

Can you please share the code you wrote? I am not sure I understand what you are trying to describe.

 

In general though, in i++ you could describe a simple adder like this:

component
int add(int a, int b) {
    return a + b;
}

This would give you an IP with two inputs synchronized to a start/busy handshake, and a single output synchronized to a done/stall handshake.


If you want to get a similar IP with SYCL HLS, you specify a streaming invocation interface using a kernel property, and streaming data interfaces using pipes.


romabo
Beginner

Sorry for the late reply; I didn't receive email notifications.

The code I used corresponds to the basic example shown in the video you linked, adapted to single-precision floating-point multiply-add.

#include <iostream>

// oneAPI headers
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <sycl/sycl.hpp>

#include "exception_handler.hpp"

using PipeOutProps = decltype(sycl::ext::oneapi::experimental::properties(
  sycl::ext::intel::experimental::protocol<
  sycl::ext::intel::experimental::protocol_name::avalon_mm_uses_ready>));

class PipeOutResID;
using PipeOutRes = sycl::ext::intel::experimental::pipe<PipeOutResID, float, 0, PipeOutProps>;

class MulAddFp32ID; // Kernel name
struct MulAddFp32 {
  float a;
  float b;
  float c;

  void operator()() const {
    float res = a * b + c;
    PipeOutRes::write(res);
  }

  auto get(sycl::ext::oneapi::experimental::properties_tag) {
    return sycl::ext::oneapi::experimental::properties{
        sycl::ext::intel::experimental::streaming_interface_remove_downstream_stall};
  }
};

int main() {
  bool passed = false;

  try {

    // Use compile-time macros to select either:
    //  - the FPGA emulator device (CPU emulation of the FPGA)
    //  - the FPGA device (a real FPGA)
    //  - the simulator device
#if FPGA_SIMULATOR
    auto selector = sycl::ext::intel::fpga_simulator_selector_v;
#elif FPGA_HARDWARE
    auto selector = sycl::ext::intel::fpga_selector_v;
#else  // #if FPGA_EMULATOR
    auto selector = sycl::ext::intel::fpga_emulator_selector_v;
#endif

    sycl::queue q(selector, fpga_tools::exception_handler,
                  sycl::property::queue::enable_profiling{});

    auto device = q.get_device();

    // make sure the device supports USM host allocations
    if (!device.has(sycl::aspect::usm_host_allocations)) {
      std::cerr << "This design must either target a board that supports USM "
                   "Host/Shared allocations, or IP Component Authoring. "
                << std::endl;
      std::terminate();
    }

    std::cout << "Running on device: "
              << device.get_info<sycl::info::device::name>().c_str()
              << std::endl;

    // Kernel inputs
    float a = 10.1;
    float b = 20.1;
    float c = 30.1;

    q.single_task<MulAddFp32ID>(MulAddFp32{a,b,c}).wait();
    float res = PipeOutRes::read(q);

    // verify that the result is correct
    passed = true;

    float expected = a*b + c;
    if (res != expected) {
      std::cout << "result " << res << ", expected ("
                << expected
                << ") A=" << a << " * B=" << b << " + C=" << c << std::endl;
      passed = false;
    }

    std::cout << (passed ? "PASSED" : "FAILED") << std::endl;
  } catch (sycl::exception const &e) {
    // Catches exceptions in the host code.
    std::cerr << "Caught a SYCL host exception:\n" << e.what() << "\n";

    // Most likely the runtime couldn't find FPGA hardware!
    if (e.code().value() == CL_DEVICE_NOT_FOUND) {
      std::cerr << "If you are targeting an FPGA, please ensure that your "
                   "system has a correctly configured FPGA board.\n";
      std::cerr << "Run sys_check in the oneAPI root directory to verify.\n";
      std::cerr << "If you are targeting the FPGA emulator, compile with "
                   "-DFPGA_EMULATOR.\n";
    }
    std::terminate();
  }

  return passed ? EXIT_SUCCESS : EXIT_FAILURE;
}

Which gives the following estimated resource utilization summary:

Name            ALMs  ALUTs  FFs  MLABs  RAMs  DSPs
Pipe resources    12      3    3      0     0     0
MulAddFp32ID      40     80  146      0     1     1

* -Xstarget=Agilex5 -Xsclock=200MHz -Xsdsp-mode=prefer-dsp

The RAM is apparently used for cluster logic, and extra logic is used for the computation, as shown below:

romabo_1-1729930752555.png

When running the full FPGA compile (Quartus fitter), I get the following resource utilization summary:

Name                          ALMs  ALUTs  FFs  MLABs  RAMs  DSPs
Quartus Fitter: Device Image   129    180  565      0     0     0
MulAddFp32ID                   119    165  510      0     0     0

* -Xstarget=Agilex5 -Xsclock=200MHz -Xsdsp-mode=prefer-dsp

 

Leaving aside the fact that the DSP is not inferred by the fitter, I don't really understand the schedule view, what exactly the cluster corresponds to, or the numerous write cycles.

romabo_2-1729931019413.png

I also tried I/O pipes, which appeared better suited to my application, but I get similar behavior.

 

// Pipe In (each IO pipe needs its own unique id)
struct io_pipe_read_a_id    { static constexpr unsigned id = 0; };
struct io_pipe_read_b_id    { static constexpr unsigned id = 1; };
struct io_pipe_read_c_id    { static constexpr unsigned id = 2; };
using PipeReadA  = sycl::ext::intel::kernel_readable_io_pipe<io_pipe_read_a_id, float, 0>;
using PipeReadB  = sycl::ext::intel::kernel_readable_io_pipe<io_pipe_read_b_id, float, 0>;
using PipeReadC  = sycl::ext::intel::kernel_readable_io_pipe<io_pipe_read_c_id, float, 0>;

// Pipe Out
struct io_pipe_write_res_id { static constexpr unsigned id = 3; };
using PipeWriteRes = sycl::ext::intel::kernel_writeable_io_pipe<io_pipe_write_res_id, float, 0>;

class MulAddFp32ID; // Kernel name
struct MulAddFp32 {
  auto get(sycl::ext::oneapi::experimental::properties_tag) {
    return sycl::ext::oneapi::experimental::properties{
        sycl::ext::intel::experimental::streaming_interface<>};
  }

  void operator()() const {
    // Read the operands inside the kernel body: member initializers would
    // run on the host when the functor is constructed, not on the device.
    float a = PipeReadA::read();
    float b = PipeReadB::read();
    float c = PipeReadC::read();
    float res = a * b + c;
    PipeWriteRes::write(res);
  }
};

@whitepau wrote:


If you want to get a similar IP with SYCL HLS, you specify a streaming invocation interface using a kernel property, and streaming data interfaces using pipes.


Thank you for confirming. I was mostly interested in the details of which streaming interface/data I should use to get a minimal "kernel" operation, so that the generated IP basically operates like an RTL module: registered inputs/outputs with start/done signals.

Thank you for your time.

0 Kudos
whitepau
Employee
342 Views

@romabo wrote:

I don't really understand the schedule view, what exactly the cluster corresponds to, or the numerous write cycles.

You can learn about clusters in the Scheduling section of the FPGA Concepts chapter of our documentation. You can force stall-enabled clusters using the use_stall_enable_clusters kernel attribute.
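
As a fragment sketch of where the attribute could go (the attribute name is the one from the documentation; placing it on the functor's operator() is my assumption, and this fragment requires the oneAPI FPGA toolchain and the surrounding pipe declarations to compile):

```cpp
struct MulAddFp32 {
  float a, b, c;

  // Request stall-enabled clusters for this kernel's logic
  [[intel::use_stall_enable_clusters]] void operator()() const {
    PipeOutRes::write(a * b + c);
  }
};
```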

 

As for the 24-cycle pipe write: this is a common issue with any of the HyperFlex™-enabled FPGAs (Stratix™ 10, Agilex™ 7, and Agilex™ 5). To optimize for the high fMAX these chips are capable of, the compiler tends to insert many pipeline registers, which can negatively impact latency, particularly in small designs like yours. The compiler supports different optimization strategies; you can bias it to prefer low latency over high fMAX with the -Xsoptimize=latency compiler flag.
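
For example, appending the flag to the earlier invocation might look like this (command-line sketch; the source and output file names are placeholders, and the other -Xs flags are the ones used earlier in this thread):

```shell
# Bias the scheduler for low latency rather than high fMAX
icpx -fsycl -fintelfpga -Xstarget=Agilex5 -Xsclock=200MHz \
     -Xsdsp-mode=prefer-dsp -Xsoptimize=latency \
     mul_add.cpp -o mul_add
```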

Here is the schedule view after biasing for latency:

whitepau_0-1730210828880.png

As for the DSP block not being used: that is concerning. I've filed a bug report on this.

romabo
Beginner

Thank you very much for the resources and explanations, that answered all my questions on that matter!

BoonBengT_Intel
Moderator

Hi Paul,

Noted with thanks for adding in additional inputs.


Hi @romabo,

Greetings. As we have not received any further clarification/updates on the matter, we will assume the challenge has been overcome. Please log in to ‘https://supporttickets.intel.com’, view the details of the relevant request, and post a response within the next 15 days to allow me to continue supporting you. After 15 days, this thread will be transitioned to community support. For new queries, please feel free to open a new thread and we will be right with you. Pleasure having you here.


Best Wishes

BB

