Hello,
I have a question about performance on the Arria 10 FPGA board.
I'm using DPC++ and have adapted the matrix multiplication sample code
to run on FPGAs, as described in
https://pp4fpgas.readthedocs.io/en/latest/devcloud.html
The kernel is the following:
h.parallel_for(range(M, P), [=](auto index) {
    // Get global position in Y direction.
    int row = index[0];
    // Get global position in X direction.
    int col = index[1];
    float sum = 0.0f;
    // Compute the result of one element of c
    for (int i = 0; i < width_a; i++) {
        sum += a[row][i] * b[i][col];
    }
    c[index] = sum;
});
Varying the matrix size from 128 to 4096 and running the kernel on GPUs, CPUs, and FPGAs, I observed that performance on the Arria 10 is always below 1 GFLOPS, while GPUs and CPUs reach far better performance.
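For reference, a minimal sketch of how a GFLOPS figure like the one above is commonly derived for this kind of kernel, assuming the usual convention of counting one multiply and one add per inner-loop iteration. The function names `matmul_flops` and `gflops` are hypothetical helpers for illustration, not part of the sample code, and the elapsed time is assumed to be measured separately around the kernel.

```cpp
#include <cstddef>

// Floating-point operations for C (m x p) = A (m x width_a) * B (width_a x p)
// with the naive kernel: each of the m * p outputs performs width_a
// multiplies and width_a adds. (Assumed counting convention.)
constexpr double matmul_flops(std::size_t m, std::size_t width_a, std::size_t p) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(width_a) *
           static_cast<double>(p);
}

// Achieved rate in GFLOPS, given the measured kernel time in seconds.
constexpr double gflops(std::size_t m, std::size_t width_a, std::size_t p,
                        double seconds) {
    return matmul_flops(m, width_a, p) / seconds / 1e9;
}

// Sanity check: a 128 x 128 x 128 product is 2 * 128^3 operations.
static_assert(matmul_flops(128, 128, 128) == 4194304.0,
              "2 * 128^3 floating-point operations");
```

Under this convention, a 4096 x 4096 x 4096 product is about 1.37e11 operations, so any run that stays below 1 GFLOPS takes more than roughly 137 seconds of kernel time.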
I recently read in
https://software.intel.com/content/www/us/en/develop/download/oneapi-fpga-optimization-guide.html
(Section 4.2.2) that I probably need to specify the work-group size manually, but I still get performance below 1 GFLOPS.
Could you please tell me if I need to change some lines in the kernel or if I need to use particular compilation flags for optimization?
Many thanks. Any suggestion is very welcome.
Hi,
Please let us know which CPU and GPU you are comparing the A10 against.
Also, Section 4 of the optimization guide describes several optimization methods. Do you get the same results when you apply those optimizations? When you say less than 1 GFLOPS, does the figure vary, for example 800 MFLOPS or 900 MFLOPS?
Please let us know further details.
Thanks and Regards
Anil
