I am trying to build the matrix multiplication example from the oneAPI samples for FPGA. When I compile the report, it says I am using more RAM and DSPs than the board has. By my calculations I should not be using that much memory. If I run it with only the parallel_for, nothing looks wrong, but when I run it with the single_task the memory usage is much higher. What am I doing wrong?
Hi,
Can you please share a screenshot of the error message?
Thank you,
I only get a warning that the design uses more than the maximum RAM capacity. I get an error if I remove the -Xsdont-error-if-large-area-est flag. Here is a picture of the area estimation from the report. I have also done some calculations, and with 8 GB of RAM the board should be able to hold all three matrices at their largest size: 4-byte floats x 4096^2 elements x 3 matrices = 201326592 bytes < 8 GB, so I don't understand why the report says I'm using 264 GB of RAM.
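For reference, the global-memory footprint in that estimate can be checked with a few lines of C++. This is only a sketch of the arithmetic; the helper name `matrix_footprint_bytes` is made up for illustration and is not part of the oneAPI sample.

```cpp
#include <cstdint>

// Assumption: float is 4 bytes, as in the estimate above.
static_assert(sizeof(float) == 4, "this sketch assumes 4-byte floats");

// Bytes needed to hold `count` square float matrices of dimension n x n.
// Hypothetical helper, for illustration only.
constexpr std::uint64_t matrix_footprint_bytes(std::uint64_t n,
                                               std::uint64_t count) {
    return n * n * count * sizeof(float);
}

// 8 GB of global memory, taken here as 8 * 2^30 bytes.
constexpr std::uint64_t kGlobalMemoryBytes = 8ULL << 30;
```

For n = 4096 and three matrices this comes to 192 MiB, far below `kGlobalMemoryBytes`. Note that this only covers data held in global memory; it says nothing about the on-chip resources that the area report counts.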
Hi,
The 8 GB you're talking about is the size of the global memory, and that doesn't seem to be the problem here. The problem is that you are exceeding the on-chip BRAM blocks available on your FPGA, which are far smaller than 8 GB. Assuming you are using an Arria 10 or Stratix 10 device, the BRAM is on the order of megabytes. You are also over budget on DSPs, not just memory. Looking at your code, the design has excessive loop unrolling, which pushes it past the available hardware. Loops over "P" or "M", like this one (for (int jway = 0; jway < P; jway++)), cannot be fully unrolled on your FPGA device. Fully unrolling this loop means building 4096 copies of the loop body in parallel, replicating everything in it, including memory and DSPs. For instance, a single floating-point multiplication inside this loop would require 4096 DSP slices for this one unrolled loop alone, and you have several fully unrolled loops.
Also, the compiler may implement a cache for your memory accesses using BRAM blocks, so your unrolling can blow up the memory usage as well.
You need to specify an unroll factor and tune it so that the design fits on the FPGA device. You can remove all the unrolling first to see how the design fits, and then optimize it by tuning the unroll factor.
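As a rough illustration of what an explicit unroll factor means, here is a minimal host-side C++ sketch. It is not the sample's kernel: `replicated_multipliers` and `dot_partial_unroll` are hypothetical names, and the linear scaling of multipliers with the unroll factor is a simplifying assumption. In FPGA kernel code the same effect is usually requested with `#pragma unroll N` on the loop; here the N parallel copies are modeled by a fixed-trip-count inner loop so the sketch compiles with any C++ compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rough count of multiplier instances created by unrolling a loop.
// Assumption: every unrolled copy of the body gets its own hardware, so the
// count grows linearly with the unroll factor (1 means no unrolling).
constexpr std::size_t replicated_multipliers(std::size_t unroll_factor,
                                             std::size_t mults_per_iteration) {
    return unroll_factor * mults_per_iteration;
}

// Dot product that handles kUnroll elements per outer iteration.
// The inner loop has a compile-time trip count and stands in for the
// kUnroll copies of the loop body that partial unrolling would create.
template <std::size_t kUnroll>
float dot_partial_unroll(const std::vector<float>& a,
                         const std::vector<float>& b) {
    static_assert(kUnroll > 0, "unroll factor must be positive");
    assert(a.size() == b.size());
    assert(a.size() % kUnroll == 0);  // sketch only: no remainder loop
    float sum = 0.0f;
    for (std::size_t k = 0; k < a.size(); k += kUnroll) {
        for (std::size_t u = 0; u < kUnroll; ++u) {
            sum += a[k + u] * b[k + u];
        }
    }
    return sum;
}
```

With one multiplication in the body, `replicated_multipliers(4096, 1)` corresponds to the fully unrolled case described above, while a factor such as 8 or 16 keeps the count small. The result of `dot_partial_unroll<4>` and `dot_partial_unroll<1>` is the same for inputs where the summation order does not matter; only the amount of work done per outer iteration changes.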
Best regards,
Daouda
Hi,
I need some details about your project: are you working on your own design or on one of the examples?
If it is possible to share your project files, please do. That would help identify the issue.
Thanks,
https://jupyter.oneapi.devcloud.intel.com/user/u166468/doc/tree/Martin's%20folder/GEMM
I forgot to reply to your answer. Hope this works!
Hi,
I think you have your solution. If you don't have any further questions, I'll close the case. Please confirm.
Hi,
Thanks for the confirmation. I am now closing the case. If you need to reopen it, please follow the link below.
Please log in to https://supporttickets.intel.com