Hi everyone,
I have an issue when I try to run my oneAPI kernel while passing my arguments as "stable" annotated_arg. When I use those "stable" arguments as the variables of a "for" loop, the simulation is very slow and doesn't work very well, whereas when I use a classic "int" declared inside the kernel (without using an argument variable) the "for" loop has no such issue and the simulation runs fine and fast. Do you have an idea of what the issue could be? Thank you!
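For reference, the pattern I mean looks roughly like this. It is a minimal sketch rather than my real kernel, and the property/namespace spellings follow the Intel FPGA kernel argument interface extension (they may differ slightly between compiler versions):

#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>

namespace exp = sycl::ext::oneapi::experimental;
namespace iexp = sycl::ext::intel::experimental;

struct MyKernel {
  // Slow case: the loop bound is a kernel argument marked "stable",
  // i.e. the host promises not to change it while the kernel is running.
  exp::annotated_arg<int, decltype(exp::properties{iexp::conduit, iexp::stable})> nb_colonne;
  unsigned int *out;

  void operator()() const {
    // Fast case is the same loop, but with a plain "int" declared here
    // inside the kernel instead of the annotated argument.
    for (int c = 0; c < nb_colonne; c++) {
      out[c] = c;
    }
  }
};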
DorianL
Thanks for sharing the report, @DorianL .
It looks like the loop at line 137 was pipelined with II=1, but it was constrained to serial execution.
This means that this outer loop is effectively un-pipelined. This doesn't explain the gaps you are seeing in the simulation waveform though.
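To illustrate the "serial execution" part (a generic sketch, not your code): when the body of an outer loop contains a whole inner loop, a new outer iteration can only start once the inner loop has drained, so the outer loop's II=1 never turns into overlapped outer iterations.

constexpr int kRows = 480;  // assumed sizes, purely for illustration
constexpr int kCols = 640;

void copy_frame(const unsigned int *src, unsigned int *dst) {
  for (int row = 0; row < kRows; row++) {    // reported "pipelined, II=1" but serial
    for (int col = 0; col < kCols; col++) {  // the inner loop is the real pipeline
      dst[row * kCols + col] = src[row * kCols + col];
    }
  }
}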
I also see that you are getting a memory system with lots of arbitration:
After some experimenting, I discovered that the warning about the variable 'fenetre' is a bit of a red herring here. I would expect a line buffer like the one you are describing to have a memory system with multiple banks, each bank with a dedicated load/store unit (LSU). From the image above, we can see that the memory system is not efficiently selecting banks. I tried using the bank_bits attribute to constrain this, but it appears the compiler is ignoring that attribute now.
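For reference, the attempt looked roughly like this. It is a reconstruction rather than your exact code, and the bit positions are an assumption that only holds if NB_COLONNE_MAX is 1024, since bank_bits must name log2(numbanks) consecutive bits of the word address:

// Assumed value, only so the bank_bits positions below are self-consistent
constexpr int NB_COLONNE_MAX = 1024;

// Inside the kernel body:
[[intel::fpga_memory("BLOCK_RAM"),
  intel::numbanks(8),
  intel::bank_bits(12, 11, 10)]]  // bank select from the most significant index
unsigned int line_buffer[8][NB_COLONNE_MAX];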
I was able to get the compiler to partition your 2D array by swapping the dimensions (transposing) so that the dimension to be split into banks (i.e. the one accessed simultaneously by different unrolled loop iterations) is in the least significant place. This appears to result in the desired memory system (don't forget to swap the accesses too; see the sketch after the note below!).
OLD:
// Delay line
[[intel::fpga_memory("BLOCK_RAM")]] // memory
unsigned int line_buffer[8][NB_COLONNE_MAX];
NEW:
// Delay line
[[intel::fpga_memory("BLOCK_RAM")]] // memory
unsigned int line_buffer[NB_COLONNE_MAX][8];
* Note that I changed the banked dimension from 5 to 8: the compiler complains if you try to create a memory system with a non-power-of-two number of banks. Changing it to 8 is fine because the compiler sees that the extra 3 banks are never used and optimizes them away.
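The accesses then need to follow the transpose, along these lines (the names 'col' and 'window' are assumptions, since I'm paraphrasing rather than quoting your code):

// OLD access: window[k] = line_buffer[k][col];
#pragma unroll
for (int k = 0; k < 8; k++) {
  window[k] = line_buffer[col][k];  // each unrolled iteration now hits its own bank
}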
The new memory system looks a lot better now:
The sim looks a lot better too:
I think I know how to solve these 2-cycle dips, but I'm still waiting for the test to finish.
I suspect it's a side-effect of using a loop nest instead of using a while(1) loop to iterate across image pixels.
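For reference, this is the kind of restructuring I have in mind; it is a sketch with assumed names (kRows, kCols, process_pixel), not a drop-in replacement for your kernel:

constexpr int kRows = 480, kCols = 640;  // assumed frame size
void process_pixel(int row, int col);    // stands in for the real per-pixel work

void loop_nest_version() {
  // What I suspect is happening: the pipeline drains and refills at every row
  // boundary of the nest, which shows up as short dips in the waveform.
  for (int row = 0; row < kRows; row++) {
    for (int col = 0; col < kCols; col++) {
      process_pixel(row, col);
    }
  }
}

void flattened_version() {
  // A single flattened loop over all pixels with manually maintained indices,
  // so there is only one pipeline to keep full. A while(1) loop pulling pixels
  // from a pipe/stream achieves the same thing in a streaming kernel.
  for (int i = 0, row = 0, col = 0; i < kRows * kCols; i++) {
    process_pixel(row, col);
    col++;
    if (col == kCols) {
      col = 0;
      row++;
    }
  }
}

The [[intel::loop_coalesce(2)]] attribute on the outer loop asks the compiler to do this flattening for you, which may be enough on its own.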
Thanks!
Hi whitepau,
Thanks for the help and contribution to the issue!
Hi DorianL,
I'm glad that your question has been addressed, and I will be transitioning this thread to community support. If you have a new question, please log in to 'https://supporttickets.intel.com/s/?language=en_US', view the details of the desired request, and post a response within the next 15 days to allow me to continue supporting you. After 15 days, this thread will be fully transitioned to community support, and community users will be able to help you with your follow-up questions.
Thanks.
Regards,
Aik Eu
