Intel® High Level Design
Support for Intel® High Level Synthesis Compiler, DSP Builder, OneAPI for Intel® FPGAs, Intel® FPGA SDK for OpenCL™

Intel OpenCL compiler (aoc) does not coalesce global memory reads anymore

Björne2
Novice
627 Views

The two screenshots say it all. The old screenshot was generated with aoc 21.2.0. Note how it coalesces the 16 float reads into one 512-bit DDR read. The new screenshot was generated with aoc 2024.2.1. It does not coalesce the 16 float reads and instead creates 16 individual read ports. Afaict, that is quite bad for performance and wastes a lot of hardware resources.

 

Is there a way to make aoc 2024.2.1 coalesce, exactly like the old compiler did?

8 Replies
yuguen
Employee
493 Views

The OpenCL SDK for Intel FPGAs has not been distributed since 22.4.

Therefore, 22.4 is the last version of the compiler to officially support OpenCL as an input language.

 

How did you get a 2024.2.1 version of aoc? I'm guessing that you got that binary from a oneAPI SYCL compiler for FPGA install.

The SYCL compiler uses aoc internally, but is not expected to work as a standalone.

BoonBengT_Intel
Moderator
434 Views

Hi @Björne2,


Good day, just following up on the previous clarification.

By any chance, did you manage to look into it?

Hope to hear from you soon.


Best Wishes

BB


BoonBengT_Intel
Moderator
358 Views

Hi @Björne2,


Greetings, just checking in to see if there are any further doubts regarding this matter.

I hope your doubts have been clarified.


Best Wishes

BB


BoonBengT_Intel
Moderator
308 Views

Hi @Björne2,


Greetings. As we have not received any further clarification or updates on this matter, we will assume the challenge has been overcome. Please log in to ‘https://supporttickets.intel.com’, view the details of the desired request, and post a response within the next 15 days to allow me to continue to support you. After 15 days, this thread will be transitioned to community support. For new queries, please feel free to open a new thread and we will be right with you. It was a pleasure having you here.


Best Wishes

BB


Björne2
Novice
199 Views

Hi BoonBengT and yuguen. I know that oneAPI only supports SYCL. But SYCL is horrible for FPGA work so I'm sticking with OpenCL. Besides, SYCL is just a thin layer on top of OpenCL anyway. I solved my issue by removing the "volatile" keyword. Apparently, in recent versions, volatile prevents memory coalescing.
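
For reference, a minimal sketch of the kind of kernel being discussed. The kernel name `sum_blocks`, its arguments, and the host-compatibility macros are hypothetical and are not taken from the original design. The inner loop reads 16 consecutive floats (16 x 32 bits = 512 bits) per iteration, which is the access pattern described in the question. Whether a given aoc version merges these into one wide load is not guaranteed; the only data point in this thread is that adding `volatile` to the input pointer prevented coalescing with aoc 2024.2.1.

```c
/* Compiles as OpenCL C (for aoc) and as plain C99 (for a host-side check).
 * The empty macros below are an assumption made only so that a regular
 * C compiler accepts the OpenCL qualifiers. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#endif

#define VEC 16 /* 16 floats * 32 bits = 512 bits per block */

/* Hypothetical kernel: out[b] = sum of the b-th block of 16 floats.
 * Note the input pointer is NOT volatile. According to this thread,
 * declaring it as `__global volatile const float *` stopped aoc 2024.2.1
 * from coalescing the 16 reads. */
__kernel void sum_blocks(__global const float *restrict in,
                         __global float *restrict out,
                         const int n_blocks)
{
    for (int b = 0; b < n_blocks; b++) {
        float acc = 0.0f;
        /* Full unroll exposes 16 adjacent loads to the compiler. */
        #ifdef __OPENCL_VERSION__
        #pragma unroll
        #endif
        for (int i = 0; i < VEC; i++) {
            acc += in[b * VEC + i];
        }
        out[b] = acc;
    }
}
```

The generated report (as in the screenshots) is the place to confirm whether one 512-bit load or 16 separate read ports were produced for a particular compiler version.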

yuguen
Employee
167 Views

There is no guarantee that anything coming out of aoc will be functional as this tool is now deprecated.

What are your complaints about SYCL compared to OpenCL?

Björne2
Novice
147 Views

I know that, but the replacement SYCL tools are that bad. The main problem is that icpx embeds the FPGA image into the host code so you can't have one binary that switches between multiple images via command line parameters. Nor one binary with kernels for multiple different devices. icpx also takes 10 seconds for simple examples which compile instantly in OpenCL. It wouldn't be so bad if you could use a regular C++ compiler for the host code and just use icpx for the device code, but I haven't found any (easy) way of accomplishing that.
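
For comparison, the OpenCL workflow referred to here usually amounts to reading an image file chosen at run time and passing it to `clCreateProgramWithBinary`. Below is a minimal sketch in plain C of the file-handling part only. The helper names `read_image` and `select_image` are hypothetical, and the OpenCL calls appear only in a comment so that the snippet compiles without the OpenCL headers.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads a whole file (e.g. an .aocx image) into a malloc'd buffer.
 * Returns NULL on failure; on success *size_out holds the byte count
 * and the caller owns the buffer. */
unsigned char *read_image(const char *path, size_t *size_out)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long len = ftell(f);
    if (len < 0) { fclose(f); return NULL; }
    rewind(f);

    unsigned char *buf = malloc(len > 0 ? (size_t)len : 1);
    if (!buf) { fclose(f); return NULL; }
    if (fread(buf, 1, (size_t)len, f) != (size_t)len) {
        free(buf);
        fclose(f);
        return NULL;
    }
    fclose(f);
    *size_out = (size_t)len;
    return buf;
}

/* Picks the image path from the first command line argument,
 * falling back to a default when none is given. */
const char *select_image(int argc, char **argv, const char *fallback)
{
    return (argc > 1) ? argv[1] : fallback;
}

/* In a real host program the buffer would then go to the OpenCL runtime,
 * roughly like this (context and device setup omitted):
 *
 *   cl_program prog = clCreateProgramWithBinary(
 *       context, 1, &device, &size,
 *       (const unsigned char **)&image, &binary_status, &err);
 *   err = clBuildProgram(prog, 1, &device, "", NULL, NULL);
 */
```

With this layout, the same host executable can be pointed at a different image just by changing the argument, which is the flexibility the post says is missing when the image is embedded in the host binary.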

yuguen
Employee
132 Views

@Björne2 wrote:

The main problem is that icpx embeds the FPGA image into the host code so you can't have one binary that switches between multiple images via command line parameters. 



You can extract the aocx from the produced binary: 
https://www.intel.com/content/www/us/en/docs/oneapi-fpga-add-on/developer-guide/2024-1/extracting-the-fpga-hardware-configuration-aocx.html

 


@Björne2 wrote:

Nor one binary with kernels for multiple different devices.


This is the purpose of oneAPI (one API to target multiple devices), so I'm not sure what you are referring to here.


@Björne2 wrote:

icpx also takes 10 seconds for simple examples which compile instantly in OpenCL. 


The legacy OpenCL compiler uses the same internal compiler as the SYCL compiler. So it should not be much slower.
I have two example designs:
- In SYCL: "time icpx -fsycl -fsycl-link=early [...]"

real 0m9.221s
user 0m7.064s
sys 0m0.814s

- In OpenCL: "time aoc -rtl [...]"

real 0m7.636s
user 0m3.490s
sys 0m0.724s

 

There is a difference (about 1.6 s of wall-clock time here), but it is not as large as 10 seconds versus instant.

 


@Björne2 wrote:

It wouldn't be so bad if you could use a regular C++ compiler for the host code and just use icpx for the device code, but I haven't found any (easy) way of accomplishing that.


SYCL is a superset of C++, so the host code written in SYCL is basically C++.

You can have a look at the SYCL code samples: https://github.com/oneapi-src/oneAPI-samples/tree/development/DirectProgramming/C%2B%2BSYCL_FPGA

In particular, you can have a look at the GettingStarted/fpga_compile sample to see a step-by-step C++ to SYCL example.
The SYCL example is 90% C++: https://github.com/oneapi-src/oneAPI-samples/blob/development/DirectProgramming/C%2B%2BSYCL_FPGA/Tutorials/GettingStarted/fpga_compile/part2_dpcpp_functor_usm/src/vector_add.cpp

 
