
Performance on Arria 10




I would like to ask a question about performance on the Arria 10 FPGA board.


I'm using DPC++ and have adapted the matrix multiplication sample code to run on FPGAs, as described in

The kernel is the following:

h.parallel_for(range(M, P), [=](auto index) {
  // Get global position in Y direction.
  int row = index[0];
  // Get global position in X direction.
  int col = index[1];
  float sum = 0.0f;
  // Compute the result of one element of c
  for (int i = 0; i < width_a; i++) {
    sum += a[row][i] * b[i][col];
  }
  // Store the accumulated dot product once, after the loop.
  c[index] = sum;
});


Varying the matrix size from 128 to 4096 and running the kernel on GPUs, CPUs, and FPGAs, I have observed that performance on the Arria 10 is always below 1 GFLOPS, while on GPUs and CPUs I reach far better performance.
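To make figures like "below 1 GFLOPS" comparable across devices, it helps to state how they are derived. The following is a minimal sketch, assuming the usual operation count for a dense matrix product: each of the M x P output elements needs width_a multiplications and width_a additions. The function names are hypothetical, and how the kernel time is measured (for example, whether data transfers are included) is left to the caller.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations in an (M x width_a) by (width_a x P) product:
// each of the M * P output elements needs width_a multiplies and width_a adds.
inline double matmul_flop_count(std::uint64_t m, std::uint64_t width_a,
                                std::uint64_t p) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(width_a) *
           static_cast<double>(p);
}

// Throughput in GFLOPS for a measured execution time given in seconds.
inline double matmul_gflops(std::uint64_t m, std::uint64_t width_a,
                            std::uint64_t p, double seconds) {
    assert(seconds > 0.0);  // a non-positive time has no meaningful throughput
    return matmul_flop_count(m, width_a, p) / seconds / 1e9;
}
```

For the largest case mentioned here, 4096 x 4096 x 4096, the count is about 1.37e11 operations, so staying below 1 GFLOPS corresponds to a run time of more than roughly 137 seconds.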


I've recently found in

(Section 4.2.2) that I probably need to specify the work-group size manually, but I still always get performance below 1 GFLOPS.
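In SYCL, the work-group size is specified by launching the kernel over a sycl::nd_range (global range plus local range) instead of a plain range, and the global size must be a multiple of the local size in every dimension. The following is a minimal host-side sketch, in plain C++, of choosing a local size that satisfies that constraint. The helper names are hypothetical, and the upper bound of 16 used in the example is only an assumption, not a recommendation for the Arria 10.

```cpp
#include <cassert>
#include <cstddef>

// True when `local` is a valid work-group extent for `global` in one
// dimension, i.e. the global size is an exact multiple of the local size.
inline bool divides_evenly(std::size_t global, std::size_t local) {
    return local != 0 && global % local == 0;
}

// Largest extent not exceeding `max_local` that divides `global` evenly.
// Falls back to 1, which always divides.
inline std::size_t pick_local_size(std::size_t global, std::size_t max_local) {
    assert(global > 0 && max_local > 0);
    std::size_t start = max_local < global ? max_local : global;
    for (std::size_t l = start; l > 1; --l) {
        if (global % l == 0) return l;
    }
    return 1;
}
```

With M = P = 4096 and an upper bound of 16, pick_local_size returns 16 for both dimensions, which would correspond to a 16 x 16 local range. For a size such as 100, it returns 10, because 16 does not divide 100.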


Could you please tell me if I need to change some lines in the kernel or if I need to use particular compilation flags for optimization?


Many thanks. Any suggestion is very welcome.



Hi,

Please let us know which CPU and GPU you are comparing the A10 against.

Also, Section 4 of the optimization guide describes several optimization methods. Are you getting the same results when you apply those optimizations? When you say less than 1 GFLOPS, does the figure vary, for example 800 MFLOPS, 900 MFLOPS, etc.?

Please let us know further details.

Thanks and Regards

