Intel® High Level Design
Support for Intel® High Level Synthesis Compiler, DSP Builder, OneAPI for Intel® FPGAs, Intel® FPGA SDK for OpenCL™

Performance on Arria 10




I would like to ask a question about performance on the Arria 10 FPGA board.


I'm using DPC++ and have adapted the matrix multiplication sample code

to run on FPGAs, as described in

The kernel is the following:

h.parallel_for(range(M, P), [=](auto index) {
  // Get global position in Y direction.
  int row = index[0];
  // Get global position in X direction.
  int col = index[1];
  float sum = 0.0f;
  // Compute the result of one element of c
  for (int i = 0; i < width_a; i++) {
    sum += a[row][i] * b[i][col];
  }
  c[index] = sum;
});


Varying the matrix size from 128 to 4096 and running the kernel on GPUs, CPUs and FPGAs, I have observed that performance on the Arria 10 is always below 1 GFLOPS, while on GPUs and CPUs I reach far better performance.


I've recently found in

(Section 4.2.2) that I probably need to specify the work-group size manually, but I still always get performance below 1 GFLOPS.
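For reference, specifying the work-group size manually in SYCL generally means launching the kernel over an nd_range and annotating it with a required work-group size. The following is only a minimal sketch, assuming SYCL 2020 with the `<sycl/sycl.hpp>` header and 2-D float buffers. The 16x16 work-group, the function name `matmul_nd_range` and the helper `divides_evenly` are illustrative assumptions, not values or names taken from the guide, and nothing in this thread establishes that this change alone improves throughput on the Arria 10. The SYCL part is compiled only when a SYCL header is available.

```cpp
#include <cstddef>
#include <stdexcept>

// Work-group edge length. 16 is an illustrative assumption, not a tuned value.
constexpr std::size_t kWG = 16;

// An nd_range needs every global dimension to be a multiple of the local one.
constexpr bool divides_evenly(std::size_t global, std::size_t local) {
  return local != 0 && global % local == 0;
}

#if __has_include(<sycl/sycl.hpp>)
#include <sycl/sycl.hpp>

// Hypothetical wrapper: computes c = a * b with an explicit kWG x kWG work-group.
void matmul_nd_range(sycl::queue &q, sycl::buffer<float, 2> &a_buf,
                     sycl::buffer<float, 2> &b_buf,
                     sycl::buffer<float, 2> &c_buf) {
  const std::size_t M = a_buf.get_range()[0];
  const std::size_t N = a_buf.get_range()[1];
  const std::size_t P = b_buf.get_range()[1];
  if (!divides_evenly(M, kWG) || !divides_evenly(P, kWG))
    throw std::invalid_argument("matrix size must be a multiple of kWG");

  q.submit([&](sycl::handler &h) {
    sycl::accessor a(a_buf, h, sycl::read_only);
    sycl::accessor b(b_buf, h, sycl::read_only);
    sycl::accessor c(c_buf, h, sycl::write_only);

    h.parallel_for(
        sycl::nd_range<2>(sycl::range<2>(M, P), sycl::range<2>(kWG, kWG)),
        [=](sycl::nd_item<2> item) [[sycl::reqd_work_group_size(kWG, kWG)]] {
          const std::size_t row = item.get_global_id(0);
          const std::size_t col = item.get_global_id(1);
          float sum = 0.0f;
          // Same dot product as the original kernel, one output per work-item.
          for (std::size_t i = 0; i < N; i++) {
            sum += a[row][i] * b[i][col];
          }
          c[row][col] = sum;
        });
  });
}
#endif
```

The divisibility check matters because an nd_range launch is invalid when the global range is not a multiple of the work-group size. Sizes such as 128 or 4096 pass it for a 16x16 group.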


Could you please tell me if I need to change some lines in the kernel or if I need to use particular compilation flags for optimization?


Many thanks. Any suggestion is very welcome.


1 Reply

Hi,

Please let us know which CPU and GPU you are comparing the A10 against.

Also, Section 4 of the optimization guide describes several optimization methods. Are you getting the same results when you apply those optimizations? When you say less than 1 GFLOPS, does the figure vary, for example between 800 MFLOPS and 900 MFLOPS?
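To make the reported numbers comparable, it may help to state how the throughput figure is derived. Below is a small sketch of the usual calculation, assuming the standard count of 2*M*N*P floating-point operations for an (M x N) by (N x P) product and a measured kernel time in seconds. The function names `matmul_flops` and `gflops` are hypothetical helpers, not part of any Intel sample.

```cpp
#include <cassert>
#include <cstddef>

// Floating-point operations for C = A * B with A (M x N) and B (N x P):
// each of the M * P outputs needs N multiplies and N additions.
double matmul_flops(std::size_t M, std::size_t N, std::size_t P) {
  return 2.0 * static_cast<double>(M) * static_cast<double>(N) *
         static_cast<double>(P);
}

// Throughput in GFLOPS for a given operation count and elapsed time.
double gflops(double flops, double seconds) {
  assert(seconds > 0.0);  // a zero or negative time is a measurement error
  return flops / seconds / 1e9;
}
```

For a 4096 x 4096 problem this gives 2 * 4096^3, about 1.37e11 operations, so a rate of 1 GFLOPS corresponds to roughly 137 seconds of kernel time. Whether the timed interval covers only kernel execution or also data transfers changes the figure, so it is worth saying which one was measured.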

Please let us know further details.

Thanks and Regards