Hi everyone,
I have an issue when I run my oneAPI kernel while passing its arguments as "stable" annotated_arg. When I use those "stable" arguments as bounds in a "for" loop, the simulation is very slow and doesn't work very well, whereas when I use a classic "int" declared inside the kernel instead of an argument variable, the "for" loop simulates fine and fast. Do you have an idea of what the issue could be? Thank you!
DorianL
Hi DorianL,
You can check out the loop optimization reference in the guide below:
https://www.intel.com/content/www/us/en/docs/oneapi-fpga-add-on/optimization-guide/2023-1/loops.html
Thanks.
Regards,
Aik Eu
Hi @aikeu,
Thank you for your reply. I already checked the documentation, and I think the issue is that my inner loop bound is dynamic because it is an input of my IP (the number of columns of the picture to process). I tried to solve the issue by bounding the loop with a random number, as is done in the example in the documentation, but my kernel doesn't accept the rand() function because it's a function external to SYCL. Do you have an idea of how I could make my kernel work?
Thank you !
DorianL
Hi @DorianL,
Greetings. Just to further understand the inner-loop issue you mentioned: would you be able to share the code example you have for the kernels, and also what errors/issues you are seeing? Is it correct to assume that the error happens at compilation?
If you can provide the compilation command, it would also be very helpful.
That would help us better understand the situation.
Thanks,
Regards
BB
Hi @BoonBengT_Intel,
template <typename flux_in, typename flux_tempo>
struct travail_sur_voisinage {
    sycl::ext::oneapi::experimental::annotated_arg<
        int, decltype(sycl::ext::oneapi::experimental::properties{stable})>
        taille_h;
    sycl::ext::oneapi::experimental::annotated_arg<
        int, decltype(sycl::ext::oneapi::experimental::properties{stable})>
        taille_v;

    auto get(sycl::ext::oneapi::experimental::properties_tag) {
        return sycl::ext::oneapi::experimental::properties{
            streaming_interface<>};
    }

    void operator()() const {
        // Line/pixel counters
        // Input/output
        [[intel::fpga_register]]
        unsigned int pixel_a_traiter;
        [[intel::fpga_register]]
        unsigned int pixel_a_envoyer;
        [[intel::fpga_register]]
        unsigned int pixel_apres_traitement;
        // Delay line (line buffer)
        [[intel::fpga_memory("BLOCK_RAM")]]
        unsigned int line_buffer[5][NB_COLONNE_MAX];
        // Neighborhood
        [[intel::fpga_register]]
        unsigned int fenetre[5][5];

        [[intel::initiation_interval(1)]]
        for (int num_lig = 0; num_lig < taille_v + 2; num_lig++) {
            [[intel::initiation_interval(1)]]
            //[[intel::speculated_iterations(0)]]
            for (int num_col = 0; num_col < taille_h + 2; num_col++) {
                if (num_lig < taille_v && num_col < taille_h) {
                    pixel_a_traiter = flux_in::read();
                    // Delay line handling
                    fpga_tools::UnrolledLoop<0, 4>([&](auto l) {
                        line_buffer[l][num_col] = line_buffer[l + 1][num_col];
                    });
                    line_buffer[4][num_col] = pixel_a_traiter;
                    // End of delay line handling
                    // Sliding video window
                    fpga_tools::UnrolledLoop<0, 5>([&](auto li) {
                        // #pragma unroll
                        fpga_tools::UnrolledLoop<0, 4>([&](auto co) {
                            fenetre[li][co] = fenetre[li][co + 1];
                        });
                        fenetre[li][4] = line_buffer[li][num_col];
                    });
                    // End of sliding video window
                }
                pixel_apres_traitement = traitement_5x5(fenetre);
                if ((num_lig >= 2) && (num_col >= 2)) {
                    pixel_a_envoyer = 0;
                    if ((num_lig >= 4) && (num_lig < taille_v) &&
                        (num_col >= 4) && (num_col < taille_h)) {
                        pixel_a_envoyer = pixel_apres_traitement;
                    }
                    flux_tempo::write(pixel_a_envoyer);
                }
            }
        }
    }
};
The compilation command is:
tp3_video.fpga_sim: kernel_sim.o
icpx -fsycl -fintelfpga -Xsclock=400MHz -Xsoptimize=latency -Xssimulation -Xsghdl=0 -Xstarget=Agilex7 -Xsv $^ -o $@
kernel_sim.o: src/tp3_video.cpp
icpx -fsycl -O3 -g -std=c++17 -Wall -I include -v -fintelfpga -Xsprofile -Xssimulation -DFPGA_SIMULATOR -o $@ -c src/tp3_video.cpp
Have a great day,
DorianL
Hi DorianL,
I'm not sure whether it is directly related to your application request based on your previous feedback, but you can check out the method below for getting a random number:
Thanks.
Regards,
Aik Eu
Hi DorianL,
As we have not received any response from you on the previous question/reply/answer we provided, please log in to 'https://supporttickets.intel.com', view the details of the desired request, and post a response within the next 15 days to allow me to continue supporting you. After 15 days, this thread will be transitioned to community support. The community users will be able to help you with your follow-up questions.
Thanks.
Regards,
Aik Eu
Hi @aikeu ,
I was waiting for an answer following my last post with my code and compilation commands, as asked by @BoonBengT_Intel in a previous post. Could we wait for an answer from him before closing the thread? Thank you!
DorianL
Hi @DorianL,
Appreciate your patience. Regarding the first part, where you are facing challenges with loop optimization: the Loop Analysis Report provides good insight into the loop structure you have written and can point to where the bottleneck is. More details on the report can be found at the link below:
- https://www.intel.com/content/www/us/en/docs/oneapi-fpga-add-on/developer-guide/2024-1/loop-analysis.html
On the other hand, regarding the inner-loop optimization, where you hit another issue when trying to use the external function: you can use 'std::default_random_engine' from C++ instead.
More details of the implementation of inner loop optimization could be found in our git repo below:
- https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/C%2B%2BSYCL_FPGA/Tutorials/DesignPatterns/optimize_inner_loop
Hope that clarifies.
Regards
BB
Hi @BoonBengT_Intel,
Thank you for the information, but the issue is that in the kernel code I sent before, the variables "taille_v" and "taille_h" are inputs of my IP, and this increases the latency of the loop compared to when both variables are defined as constants instead of inputs. Is this an issue you have ever faced? Thank you!
DorianL
Hi @DorianL
I think that the piece of documentation you referred to is a bit confusing: https://www.intel.com/content/www/us/en/docs/oneapi-fpga-add-on/developer-guide/current/optimize-inner-loop-throughput.html
The recommendation in that document is to bound your loop with a constant, not a random number. The rand() function is just used to indicate a value that is not known at compile time, and is equivalent to your taille_h and taille_v kernel arguments.
Your line 49 loop could be written like this:
for (int num_col = 0; (num_col < 64) && (num_col < taille_h + 2); num_col++)
This code looks like you are trying to create a local filter. We have a convolution 2d reference design that may be helpful to you: https://github.com/oneapi-src/oneAPI-samples/tree/development/DirectProgramming/C%2B%2BSYCL_FPGA/ReferenceDesigns/convolution2d
Importantly, this design replaces your loop nest with a single loop. Instead of
for (int row = 0; row < num_rows; row++) {
    for (int col = 0; col < num_cols; col++) {
        <...>
    }
}
the convolution design uses a while(1) control loop, and updates the current row/column variables depending on sideband signals that are inserted when the input image is read. This also makes for a design that is able to recover from defective packets (e.g. when a camera is unexpectedly unplugged).
If you prefer the simplicity of a double loop nest and don't need to recover from stream interruptions, you can coalesce the loop nest with the loop_coalesce attribute.
Hi @whitepau,
Thank you for your answer. I tried to bound the loop with a constant number and it doesn't change anything in the simulation: the read_rdy signal of the pipe is sometimes low, whereas it is always high when I use a constant image size instead of kernel arguments.
The issue is still the same: the inner loop is very slow when I compare the kernel-argument version with the constant-value version.
I tried to coalesce the loop nest, but I get worse results in simulation than when I don't coalesce the loops.
I already tried the convolution_2d GitHub example, and I can't use a picture with larger dimensions than the original examples (test_0.bmp, test_1.bmp, ...). Is it possible to use other pictures?
Thank you,
DorianL
Your screenshot mixes the ready/data signals from flux_in with the valid signal from flux_out.
It almost looks like the compiler was not able to schedule your loop with II=1. You can upload your report.zip so we can look at it.
https://www.intel.com/content/www/us/en/docs/oneapi-fpga-add-on/developer-guide/current/review-the-report-html-file.html
You may want to consider the max_reinvocation_delay attribute (code sample)
You can replace the images, but make sure they have the same filenames as the originals (test_0.bmp, test_1.bmp, etc.). I think all test images must be the same size too, since the testbench passes a sequence of images but only sets the image dimensions once.
Hi @whitepau,
Thank you for your answer. I was just talking about the read_rdy from flux_in, which is not constant. I have uploaded the report with this post, if you can find something wrong in it; surprisingly, the compiler reports II=1, whereas that is clearly not the case in simulation.
I tried the max_reinvocation_delay attribute, but it doesn't change anything and the loop is still slow.
Thank you for the information about the convolution_2d example; I will try replacing the images and keep you posted!
DorianL
Thanks for sharing the report, @DorianL .
It looks like the loop at line 137 was pipelined with II=1, but it was constrained to serial execution.
This means that this outer loop is effectively un-pipelined. This doesn't explain the gaps you are seeing in the simulation waveform though.
I also see that you are getting a memory system with lots of arbitration:
After some experimenting, I discovered that the warning about the variable 'fenetre' is a bit of a red herring here. I would expect a line buffer like the one you are describing to have a memory system with multiple banks, with each bank having a dedicated load/store unit (LSU). From the image above, we can see that the memory system is not efficiently selecting banks. I tried using the bank_bits attribute to constrain this, but the compiler appears to ignore this attribute now.
I was able to get the compiler to partition your 2d array by swapping the dimensions (transposing) so that the dimension to be split into banks (i.e. accessed simultaneously by different unrolled loop iterations) was in the least significant place. This appears to result in the desired memory system (don't forget to swap the accesses too!!)
OLD:
// Delay line
[[intel::fpga_memory("BLOCK_RAM")]] // memory
unsigned int line_buffer[5][NB_COLONNE_MAX];
NEW:
// Delay line
[[intel::fpga_memory("BLOCK_RAM")]] // memory
unsigned int line_buffer[NB_COLONNE_MAX][8];
* Note that I changed the dimension from 5 to 8: the compiler complains if you try to create a memory system with a non-power-of-2 number of banks. Changing to 8 is ok because the compiler sees that the extra 3 banks aren't used and it optimizes them away.
The new memory system looks a lot better now:
The sim looks a lot better too:
I think I know how to solve these 2-cycle dips, but I'm still waiting for the test to finish.
I suspect it's a side-effect of using a loop nest instead of using a while(1) loop to iterate across image pixels.
Hi @whitepau,
Thank you for your work and answer, everything is working good now ! I never thought it could be a memory issue on that, I will check the kernel memory viewer more often now. The simulation is perfect now and the rdy signal is as it should be. Have a great day !
DorianL
Wait... no, those blips are intentional: in your inner loop you have a conditional statement that deliberately causes two stalls at the end of each row of pixels.
for (int num_col = 0; (num_col < NB_COLONNE_MAX) && (num_col < taille_h + 2); num_col++) {
    if (num_lig < taille_v && num_col < taille_h) {
        pixel_a_traiter = flux_in::read();
        <...>
    }
}
Glad it's working, @DorianL!
When I was looking for an answer for you, I noticed that we don't actually have a code sample demonstrating any of these things. Would it be alright if we used the code you sent as the basis for a sample demonstrating kernel memory optimization?
Also, please mark my post as the solution to close the thread.
Hi @whitepau,
Of course you can use it as an example for kernel memory optimization. Thank you !
DorianL