Hello,
I am new to the toolchain and was wondering whether there is a way to generate RTL IP for integration in Quartus Prime Pro with oneAPI, similar to the HLS i++ compiler, since the latter is being deprecated.
I just need standalone soft IP cores to integrate into an RTL design in Quartus, so I don't need the kernel formalism and would like to simply generate .ip files from multiple functions or files.
I can't find much documentation on that specific RTL-oriented approach, so I would appreciate some help. Basically, starting from the fpga_template sample, I would like to get a simple VectorAdd IP with a and b inputs, instead of the wrapped kernel generated by icpx that integrates the stream interface.
If I simply use HLS syntax and compile with icpx -fsycl or dpcpp, I get a warning because no kernel is specified, and the generated .prj shows an empty description (only basic signals such as reset/clock appear in the instantiation).
Thank you for your time.
Thanks for your question.
The SYCL boilerplate is required even for the SYCL HLS IP-authoring mode of our oneAPI DPC++/C++ compiler. We have a number of code samples that can help you create and customize IP.
To get started quickly, there is a platform_designer code sample that walks you through a minimal design that can be exported to Platform Designer. A video walkthrough is available on our YouTube channel: https://www.youtube.com/watch?v=-f2LA_dyQg0
We have a more complete customer journey in the design hub:
oneAPI Guided Journey
You may be interested in the code samples that pertain to HLS interface customization:
HLS Flow Interfaces Code Samples
Hello,
Thank you for the clarification and links.
I took some time to play with the examples and look at the different interfaces, but I am still uncertain about which interface would be best for my applications.
I basically want to use the HLS tools to generate optimized computation modules that operate and communicate only with RTL. The main idea is to save development time by having HLS automatically enable the DSP registers and synchronize the intermediate computations. I find this more practical than generating the DSP IPs in Quartus, which would require writing some HDL to synchronize intermediate computations and estimating how many registers to enable for the target fMAX. With HLS I can simply describe the operations to be computed and quickly obtain the module's latency and fMAX.
To take a concrete example, I adapted the add_oneapi example to compute a single-precision floating-point multiply-add (which would use only one DSP in Quartus with the dedicated IP). To the best of my knowledge, the invocation and data interfaces should then be streaming pipes, but this creates some extra logic shown in the report (presumably to handle the pipes or Avalon MM), whereas with the directive -Xsdsp-mode=prefer-dsp I would expect the internal registers of the DSP to be used and all computation to be performed in the DSP slice.
So I was wondering which streaming data interface is best suited for RTL communication with minimal logic inference (basically just registered inputs/outputs with start/done control), instead of the pipe implementing the Avalon MM protocol from the example.
Thank you for your time.
Hi @romabo,
Greetings, just checking in to see if there are any further doubts regarding this matter.
Hope your doubts have been clarified.
Best Wishes
BB
Hi @romabo,
Thanks for sharing your knowledge on how you did it.
That will be a great help to others who may face similar challenges; we appreciate it.
Please do let us know if there are any further doubts regarding this matter.
Hope your doubts have been clarified.
Best Wishes
BB
> To take a concrete example, I adapted the add_oneapi example to compute a single precision floating-point mul add (that would use only 1 DSP in Quartus using the dedicated IP). So to the best of my knowledge, the invocation and data interface should be streaming pipes, but it creates some extra logic shown in the report (to handle pipes or avalon mm supposedly as with directive -Xsdsp-mode=prefer-dsp it should use internal registers of the DSP and perform all computation in the DSP slice).
Can you please share the code you wrote? I am not sure I understand what you are trying to describe.
In general though, in i++, you could describe a simple adder like this:
component
int add(int a, int b) {
return a + b;
}
This would give you an IP that had two inputs synchronized to a start/busy handshake, and a single output synchronized to a done/stall handshake.
If you want to get a similar IP with SYCL HLS, you specify a streaming invocation interface using a kernel property, and streaming data interfaces using pipes.
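To make the handshake concrete, here is a purely illustrative software model of the protocol described above (plain C++, not generated RTL and not the compiler's actual output): the component accepts inputs only when it is not busy, and the output transfer completes only when done is high and the consumer is not stalling.

```cpp
#include <optional>

// Behavioral sketch (NOT generated RTL) of a start/busy, done/stall
// handshake around a simple adder with a fixed one-cycle latency.
struct AddComponent {
  bool busy = false;    // high while a computation is in flight
  bool done = false;    // high while a result is waiting to be taken
  int a = 0, b = 0, result = 0;
  int cycles_left = 0;  // models the fixed latency

  // Producer side: returns true if the inputs were accepted this cycle
  // (start asserted while busy is low).
  bool offer(int a_in, int b_in) {
    if (busy || done) return false;  // can't accept while working or holding
    a = a_in;
    b = b_in;
    busy = true;
    cycles_left = 1;
    return true;
  }

  // Consumer side: the transfer completes only when done && !stall.
  std::optional<int> take(bool stall) {
    if (!done || stall) return std::nullopt;
    done = false;
    return result;
  }

  // Advance one clock cycle.
  void tick() {
    if (busy && --cycles_left == 0) {
      result = a + b;
      busy = false;
      done = true;
    }
  }
};
```

For example, after `offer(2, 3)` and one `tick()`, `take(false)` yields 5, while `take(true)` (downstream stall) leaves the result held until the stall is released.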
Sorry for the late reply, I didn't receive email notifications.
The code I used corresponds to the basic example shown in the video you linked, adapted to a single-precision floating-point multiply-add.
#include <iostream>

// oneAPI headers
#include <sycl/ext/intel/fpga_extensions.hpp>
#include <sycl/sycl.hpp>

#include "exception_handler.hpp"

using PipeOutProps = decltype(sycl::ext::oneapi::experimental::properties(
    sycl::ext::intel::experimental::protocol<
        sycl::ext::intel::experimental::protocol_name::avalon_mm_uses_ready>));

class PipeOutResID;
using PipeOutRes =
    sycl::ext::intel::experimental::pipe<PipeOutResID, float, 0, PipeOutProps>;

class MulAddFp32ID;  // Kernel name

struct MulAddFp32 {
  float a;
  float b;
  float c;

  void operator()() const {
    float res = a * b + c;
    PipeOutRes::write(res);
  }

  auto get(sycl::ext::oneapi::experimental::properties_tag) {
    return sycl::ext::oneapi::experimental::properties{
        sycl::ext::intel::experimental::streaming_interface_remove_downstream_stall};
  }
};

int main() {
  bool passed = false;

  try {
    // Use compile-time macros to select either:
    //  - the FPGA emulator device (CPU emulation of the FPGA)
    //  - the FPGA device (a real FPGA)
    //  - the simulator device
#if FPGA_SIMULATOR
    auto selector = sycl::ext::intel::fpga_simulator_selector_v;
#elif FPGA_HARDWARE
    auto selector = sycl::ext::intel::fpga_selector_v;
#else  // #if FPGA_EMULATOR
    auto selector = sycl::ext::intel::fpga_emulator_selector_v;
#endif

    sycl::queue q(selector, fpga_tools::exception_handler,
                  sycl::property::queue::enable_profiling{});

    auto device = q.get_device();

    // Make sure the device supports USM host allocations.
    if (!device.has(sycl::aspect::usm_host_allocations)) {
      std::cerr << "This design must either target a board that supports USM "
                   "Host/Shared allocations, or IP Component Authoring. "
                << std::endl;
      std::terminate();
    }

    std::cout << "Running on device: "
              << device.get_info<sycl::info::device::name>().c_str()
              << std::endl;

    // Kernel inputs
    float a = 10.1f;
    float b = 20.1f;
    float c = 30.1f;

    q.single_task<MulAddFp32ID>(MulAddFp32{a, b, c}).wait();
    float res = PipeOutRes::read(q);

    // Verify that the result is correct.
    passed = true;
    float expected = a * b + c;
    if (res != expected) {
      std::cout << "result " << res << ", expected (" << expected
                << ") A=" << a << " * B=" << b << " + C=" << c << std::endl;
      passed = false;
    }

    std::cout << (passed ? "PASSED" : "FAILED") << std::endl;
  } catch (sycl::exception const &e) {
    // Catches exceptions in the host code.
    std::cerr << "Caught a SYCL host exception:\n" << e.what() << "\n";

    // Most likely the runtime couldn't find FPGA hardware!
    if (e.code().value() == CL_DEVICE_NOT_FOUND) {
      std::cerr << "If you are targeting an FPGA, please ensure that your "
                   "system has a correctly configured FPGA board.\n";
      std::cerr << "Run sys_check in the oneAPI root directory to verify.\n";
      std::cerr << "If you are targeting the FPGA emulator, compile with "
                   "-DFPGA_EMULATOR.\n";
    }
    std::terminate();
  }

  return passed ? EXIT_SUCCESS : EXIT_FAILURE;
}
Which gives the following estimated resource utilization summary:
Name | ALMs | ALUTs | FFs | MLABs | RAMs | DSPs |
Pipe resources | 1 | 2 | 33 | 0 | 0 | 0 |
MulAddFp32ID | 40 | 80 | 146 | 0 | 1 | 1 |
* -Xstarget=Agilex5 -Xsclock=200MHz -Xsdsp-mode=prefer-dsp
The RAM is apparently used for cluster logic, and extra logic is used for the computation, according to the report.
When running the full FPGA compilation through the Quartus fitter, I get the following resource utilization summary:
Name | ALMs | ALUTs | FFs | MLABs | RAMs | DSPs |
Quartus Fitter: Device Image | 129 | 180 | 565 | 0 | 0 | 0 |
MulAddFp32ID | 119 | 165 | 510 | 0 | 0 | 0 |
* -Xstarget=Agilex5 -Xsclock=200MHz -Xsdsp-mode=prefer-dsp
Setting aside the fact that the DSP is not inferred by the fitter, I don't really understand the schedule view, what the cluster corresponds to exactly, or the numerous write cycles.
I also tried I/O pipes, which appear to be better suited to my application, but I get similar behavior.
// Pipe In (each io pipe needs a unique id)
struct io_pipe_read_a_id { static constexpr unsigned id = 0; };
struct io_pipe_read_b_id { static constexpr unsigned id = 1; };
struct io_pipe_read_c_id { static constexpr unsigned id = 2; };
using PipeReadA =
    sycl::ext::intel::kernel_readable_io_pipe<io_pipe_read_a_id, float, 0>;
using PipeReadB =
    sycl::ext::intel::kernel_readable_io_pipe<io_pipe_read_b_id, float, 0>;
using PipeReadC =
    sycl::ext::intel::kernel_readable_io_pipe<io_pipe_read_c_id, float, 0>;

// Pipe Out
struct io_pipe_write_res_id { static constexpr unsigned id = 3; };
using PipeWriteRes =
    sycl::ext::intel::kernel_writeable_io_pipe<io_pipe_write_res_id, float, 0>;

class MulAddFp32ID;  // Kernel name

struct MulAddFp32 {
  auto get(sycl::ext::oneapi::experimental::properties_tag) {
    return sycl::ext::oneapi::experimental::properties{
        sycl::ext::intel::experimental::streaming_interface<>};
  }

  void operator()() const {
    // Read the operands inside the kernel body (reading them in member
    // initializers would run on the host when the functor is constructed).
    float a = PipeReadA::read();
    float b = PipeReadB::read();
    float c = PipeReadC::read();
    float res = a * b + c;
    PipeWriteRes::write(res);
  }
};
@whitepau wrote:
If you want to get a similar IP with SYCL HLS, you specify a streaming invocation interface using a kernel property, and streaming data interfaces using pipes.
Thank you for confirming. I was mostly interested in the details of which streaming invocation/data interfaces to use to get a minimal "kernel", so that the generated IP basically operates as an RTL module with registered inputs/outputs and start/done signals.
Thank you for your time.
> I don't really understand the schedule view and what the cluster exactly corresponds to along with the numerous write cycles.
You can learn about clusters in the Scheduling section of the FPGA Concepts chapter of our documentation. You can force stall-enabled clusters using the use_stall_enable_clusters kernel attribute.
As far as the 24-cycle pipe write goes, this is a common issue with any of the Hyperflex-enabled FPGAs (Stratix™ 10, Agilex™ 7 and Agilex™ 5). To optimize for the high fMAX these chips are capable of, the compiler tends to insert lots of pipeline registers, which can hurt latency (particularly in small designs like yours). The compiler supports different optimization strategies: you can bias it to prefer low latency over high fMAX using the -Xsoptimize=latency compiler flag.
The schedule view looks much better after biasing for latency.
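For reference, a hardware compile combining that flag with the options quoted earlier in this thread might look like the sketch below (the source and output file names are placeholders; only the -Xs flags come from this thread):

```shell
# Sketch: IP-authoring hardware compile biased for latency rather than fMAX.
# -Xstarget / -Xsclock / -Xsdsp-mode are the flags quoted earlier in this
# thread; mul_add.cpp and mul_add.fpga are placeholder names.
icpx -fsycl -fintelfpga -Xshardware \
     -Xstarget=Agilex5 -Xsclock=200MHz \
     -Xsdsp-mode=prefer-dsp -Xsoptimize=latency \
     mul_add.cpp -o mul_add.fpga
```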
As far as the DSP block not being used: that is concerning. I've filed a bug report on this.
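One related numerical note: once the multiply-add does map to a fused DSP operation, it rounds only once, so the result can differ in the last bit from a separate multiply and add, and the exact-equality check in the host code above may then fail. A small plain-C++ illustration (no SYCL involved; the helper name is mine, and the inputs in the usage below are chosen specifically to expose the single-rounding difference):

```cpp
#include <cmath>

// Computes a*a + c both ways in single precision.
struct MulAddResults {
  float separate;  // rounded multiply, then rounded add (two roundings)
  float fused;     // fused multiply-add (one rounding)
};

MulAddResults mul_add_both_ways(float a, float c) {
  // 'volatile' keeps the compiler from contracting the separate path
  // into an fma on its own.
  volatile float prod = a * a;
  float separate = prod + c;
  float fused = std::fma(a, a, c);
  return {separate, fused};
}
```

With a = 1 + 2⁻¹² and c = −(1 + 2⁻¹¹): the exact product is 1 + 2⁻¹¹ + 2⁻²⁴, which the separate path rounds to 1 + 2⁻¹¹ (the 2⁻²⁴ term is a half-ulp tie, rounded to even), so it returns 0, while the fused path keeps the 2⁻²⁴ term. Same source expression, two different correct answers, which is why a tolerance-based check is safer here.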
Thank you very much for the resources and explanations, that answered all my questions on that matter!
Hi Paul,
Noted with thanks for adding the additional inputs.
Hi @romabo,
Greetings. As we have not received any further clarification/updates on this matter, we will assume the challenge has been overcome. Please log in to ‘https://supporttickets.intel.com’, view the details of the desired request, and post a response within the next 15 days to allow me to continue to support you. After 15 days, this thread will be transitioned to community support. For new queries, please feel free to open a new thread and we will be right with you. Pleasure having you here.
Best Wishes
BB