Community
cancel
Showing results for 
Search instead for 
Did you mean: 
amrmesh
Beginner
473 Views

report shows no DSP usage for OpenCL kernel

Hello, I'm new to OpenCL for fpgas, I wrote this code for matrix and vector multiplication, after compiling with aoc 19.3, the report shows that no DSP is used for calculation of the kernel, anyone know what I'm doing wrong?

 

__kernel void matvec(global float* restrict matrix_a, global float* restrict vectors_b, global float* restrict result, uint n, uint vec_count) {   float localmatrix[BLOCK_SIZE][BLOCK_SIZE]; float localvector[BLOCK_SIZE]; float localresult[BLOCK_SIZE]; float presult[II_CYCLES+1];   #pragma unroll 16 for(uint e=0 ; e<n ; e++) result[e] = 0;   //iteration over matrix blocks rows uint bi = 0; for(uint bi=0 ; bi<n ; bi+=BLOCK_SIZE) {   //initializing the localresult #pragma unroll for(uint e=0 ; e<n ; e++) localresult[e] = 0;   //iteration over matrix blocks colomns uint bj = 0; for(uint bj=0 ; bj<n ; bj+=BLOCK_SIZE) { //loading block of matrix to local #pragma unroll 16 for(uint ei=0 ; ei<BLOCK_SIZE ; ei++) for(uint ej=0 ; ej<BLOCK_SIZE ; ej++) localmatrix[ei][ej] = matrix_a[(bi+ei)*n+(bj+ej)];   //itteration over vectors for(uint k=0 ; k<vec_count ; k++) { //loading one block of one vector to local #pragma unroll 16 for(uint e=0 ; e<BLOCK_SIZE ; e++) localvector[e] = vectors_b[k*n+(bj+e)];   //***localresult[j] += localmatrix[i][j] * localvector[j]*** //iteration over matrix colomns for(uint i=0 ; i<BLOCK_SIZE ; i++) { //initializing presult #pragma unroll for(uint e=0 ; e<II_CYCLES+1 ; e++) presult[e] = 0;   //iteration over matrix rows for(uint j=0 ; j<BLOCK_SIZE ; j++) { //localresult[i] += localmatrix[i][j] * localvector[j]; presult[II_CYCLES] = localmatrix[i][j] * localvector[j] + presult[0];   #pragma unroll for(uint e = 0 ; e<II_CYCLES+1 ; e++) presult[e] = presult[e+1]; } #pragma unroll for(uint e=0 ; e<II_CYCLES+1 ; e++) localresult[i] += presult[e]; } } }   //Writing the block of result back to main memory #pragma unroll 16 for(uint e=0 ; e<BLOCK_SIZE ; e++) result [bi+e] = localresult[e];   } }

 

Tags (2)
0 Kudos
10 Replies
MEIYAN_L_Intel
Employee
171 Views

Hi,

I would need some time to check internally about the information of DSP block implementation.

Thanks

amrmesh
Beginner
171 Views

HRZ
Valued Contributor II
171 Views

Have you verified the functional correctness of your code using emulation? It seems the v19.3 compiler is optimizing out most of your code. However, v16.1.2 which I still use for my main development behaves differently and does not optimize out your code but it generates very helpful warnings that could help finding the problem in your code:

test.cl:28: Compiler Warning: Full unrolling of the loop is requested but the loop bounds cannot be determined. The loop is not unrolled in kernel matvec test.cl:67: Compiler Warning: removing out-of-bounds accesses to presult

Even thought v19.3 also generates the first warning, it does not generate the second one which could in fact be the source of your problem. It is possible that v19.3 is assigning a value of zero to the out-of-bound index and since you are shifting the buffer, it is assuming the whole buffer is being zeroed out and hence, it is optimizing out the computation in your kernel. Maybe @MeiYanL_Intel​ can elaborate why the newer versions of the compiler are excluding such critical warnings, forcing programmers to run in circles trying to discover issues in their code.

amrmesh
Beginner
171 Views

Hi,

Thank you very much for your time testing my code,

for functional correctness, the host code has a method to verify it, but apparently, that code also has some problems since it passed that test.

I got your point, and thanks, I will change the inner loop to see if the it will fix the problem,

a fast test (removing the shift register) proves your point is correct, by removing shift register, the report shows some DSP usage,

 

The strange thing is that even in the report there is no warning about this out-of-bound access !

 

amrmesh
Beginner
171 Views

Hello again,

Sorry, I don't understand why at line 67 the compiler complains about out-of-bounds, do you know which part causes out-of-bounds access?

at first I thought I'm mixing shift register size with block size, but I don't see any problem with that, do you know which part causing it?

 

HRZ
Valued Contributor II
171 Views

Sorry, I had to manually define BLOCK_SIZE and II_CYCLES to test your code and forgot to adjust the line numbers accordingly. You should deduct 3 from the line numbers I posted above to match your code. The second warning is on line 64 on your code where "presult[e+1]" would be out of bounds for e=II_CYCLES.

amrmesh
Beginner
171 Views

Thanks for your reply,

After fixing the out of bound access, report generation takes long time,

"aoc: Optimizing and doing static analysis of code..."

at this stage the compiler seems having a hard time trying to optimize the code, do you know why this is happ

#define II_CYCLES 16 #define BLOCK_SIZE 64   __kernel void matvec(global float* restrict matrix_a, global float* restrict vectors_b, global float* restrict result, uint n, uint vec_count) {   float localmatrix[BLOCK_SIZE][BLOCK_SIZE]; float localvector[BLOCK_SIZE]; float localresult[BLOCK_SIZE]; float presult[II_CYCLES+1];   #pragma unroll 16 for(uint e=0 ; e<n ; e++) result[e] = 0;   //iteration over matrix blocks rows uint bi = 0; for(uint bi=0 ; bi<n ; bi+=BLOCK_SIZE) {   //initializing the localresult #pragma unroll for(uint e=0 ; e<BLOCK_SIZE ; e++) localresult[e] = 0;   //iteration over matrix blocks colomns uint bj = 0; for(uint bj=0 ; bj<n ; bj+=BLOCK_SIZE) { //loading block of matrix to local #pragma unroll 16 for(uint ei=0 ; ei<BLOCK_SIZE ; ei++) for(uint ej=0 ; ej<BLOCK_SIZE ; ej++) localmatrix[ei][ej] = matrix_a[(bi+ei)*n+(bj+ej)];   //itteration over vectors for(uint k=0 ; k<vec_count ; k++) { //loading one block of one vector to local #pragma unroll 16 for(uint e=0 ; e<BLOCK_SIZE ; e++) localvector[e] = vectors_b[k*n+(bj+e)];   //***localresult[j] += localmatrix[i][j] * localvector[j]*** //iteration over matrix colomns for(uint i=0 ; i<BLOCK_SIZE ; i++) { //initializing presult #pragma unroll for(uint e=0 ; e<II_CYCLES+1 ; e++) presult[e] = 0;   //iteration over matrix rows for(uint j=0 ; j<BLOCK_SIZE ; j++) { //localresult[i] += localmatrix[i][j] * localvector[j]; presult[II_CYCLES] = presult[0] + localmatrix[i][j] * localvector[j];   #pragma unroll for(uint e = 0 ; e<II_CYCLES ; e++) presult[e] = presult[e+1]; } #pragma unroll for(uint e = 0 ; e<II_CYCLES ; e++) localresult[i] += presult[e]; } } }   //Writing the block of result back to main memory #pragma unroll 16 for(uint e=0 ; e<BLOCK_SIZE ; e++) result [bi+e] = localresult[e];   } }

ening? and is this long report generation time normal? or I'm doing something wrong again?

HRZ
Valued Contributor II
171 Views

I just put your new code into the v19.3 compiler on my environment targeting Arria 10 and the OpenCL compilation finished in less than a minute and the report was generated. How long does it take on your side? I have had cases where the OpenCL compilation takes even more than half an hour, but that is for some very specific cases. The compilation time will also depend on your processor speed.

 

P.S. What FPGA are you targeting and what compiler version are you using?

amrmesh
Beginner
171 Views

The emulation takes about 5 min and the actual synthesis takes about 4 hours,

I'm using a board with Intel Stratix 10,

however the reports looks fine now, except 1 problem:

The initialization at line 29 on localresult and the store operation at line 71 seems to cause conflict when doing pipelining

also the load operation on localresult at line 79 seems to conflict with store operation at line 71,

 

Do you have any idea how can I solve these conflict? do I need to change my algorithm completely?

HRZ
Valued Contributor II
171 Views

You should not expect to be able to pipeline the loops over block rows and block columns unless you allocate separate on-chip resources for every block row and every block column which would be impossible (area is limited and the loop bounds are not fixed). The unroll on line 36 is in the wrong place; you should unroll the consecutive inner loop at line 38, and NOT the outer loop at line 37. The load on line 47 is unnecessary, you can directly load the data from the global vectors_b buffer in line 63. You should partially unroll the j loop over the matrix rows to get the same effect as the load on 47. Doing so, however, will likely conflict with the presult shift register. To fix this, use the manual unrolling described in Section 3.2.2.1/Fig. 3-5 from the following document:

 

https://arxiv.org/abs/1810.09773

 

You should eventually be able to pipeline all the loops except the loops over block rows and block columns. When you reach that state, then you can go ahead and manually (i.e. do NOT use pargma unroll) unroll the loop over block columns and replicate on-chip resources based on the unroll factor until you saturate the device resources. You will, however, need to come up with a systolic array-style data movement to avoid hitting the external memory bandwidth bottleneck. Another approach to achieve block-level parallelism is to use the autorun kernel type alongside with blocking channels.

 

This document is also a very good source on effective loop transformations on FPGAs (though mostly targeted at Xilinx HLS but should also work fine with Intel HLS/OpenCL):

 

https://arxiv.org/abs/1805.08288v5

Reply