Intel® Quartus® Prime Software

How does the AOC compiler map fixed-point MAC operations to DSP IP blocks?

mvemp
Novice

Hello,

I am using an Arria 10 GX 1150 FPGA board, which contains 1518 DSP blocks.

I am trying to perform MAC operations on 16-bit data using the "short" data type, as shown in the program below:

 

typedef short DTYPE;

__kernel
__attribute__((task))
__attribute__((max_global_work_dim(0)))
void multiply_input(
    // Param ports
    __global volatile DTYPE *restrict a_in,
    __global volatile DTYPE *restrict b_in,
    __global volatile DTYPE *restrict c_out
    )
{
    int partial_sum[8];

    for (uint i = 0; i < 8; i++) {
        partial_sum[i] = 0;  // initialize the accumulator
        #pragma unroll
        for (int j = 0; j < 512; j++) {
            partial_sum[i] += a_in[j] * b_in[j];
        }
        c_out[i] = 0xFFFF & (partial_sum[i] >> 0x01);
    }
}
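For reference, the per-output computation can be modeled on the host (a sketch in plain C; `reference_output` is a hypothetical helper, not part of the kernel, and it assumes the accumulator starts at zero):

```c
#include <assert.h>
#include <stdint.h>

/* Host-side reference model (illustrative only) of one iteration of
 * the kernel's outer loop: an n-element MAC in a 32-bit accumulator,
 * halved and truncated to 16 bits, mirroring
 * c_out[i] = 0xFFFF & (partial_sum[i] >> 1). */
static uint16_t reference_output(const int16_t *a, const int16_t *b, int n)
{
    int32_t sum = 0;
    for (int j = 0; j < n; j++)
        sum += (int32_t)a[j] * (int32_t)b[j];  /* full-width product */
    return (uint16_t)(0xFFFF & (sum >> 1));    /* halve, keep low 16 bits */
}
```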

 

 

With j = 512, the loop would require 256 DSP blocks to perform the MAC operations, and the AOC compiler maps it perfectly.

 

With j = 1024, the AOC compiler should map 512 DSP blocks to perform 1024 16-bit MAC operations. But the compiler fails to do that, and logic utilization increases dramatically! Why does this happen?

 

The compiler fails to infer DSP blocks when j > 759 (the total number of DSPs is 1518, and 759 × 2 = 1518). Really strange!
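One way the 759 boundary lines up arithmetically (a back-of-the-envelope cost model only; the compiler's real heuristics are unknown, and the per-multiply costs below are assumptions):

```c
#include <assert.h>

/* Hypothetical DSP cost model for an Arria 10 GX 1150 (1518 blocks):
 * - a 32-bit multiply costs 2 DSP blocks;
 * - two 16-bit multiplies can share 1 DSP block. */
enum { TOTAL_DSPS = 1518 };

/* DSP blocks consumed if every multiply is treated as 32-bit. */
static int dsps_for_32bit_muls(int muls) { return muls * 2; }

/* DSP blocks consumed if 16-bit multiplies pack two per block. */
static int dsps_for_16bit_muls(int muls) { return (muls + 1) / 2; }

/* Largest unroll factor that fits if every MUL costs 2 DSPs. */
static int threshold_32bit(void) { return TOTAL_DSPS / 2; }
```

Under this model, the 759 cutoff is exactly where 32-bit-costed multiplies exhaust the 1518-block budget.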

 

j = 759 report (as intended):

 

!===========================================================================

! The report below may be inaccurate. A more comprehensive      

! resource usage report can be found at conv_pipe/reports/report.html   

!===========================================================================

 

+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                    ;
+----------------------------------------+---------------------------+
; Resource                               ; Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ; 31%                       ;
; ALUTs                                  ; 16%                       ;
; Dedicated logic registers              ; 16%                       ;
; Memory blocks                          ; 25%                       ;
; DSP blocks                             ; 25%                       ;
+----------------------------------------+---------------------------+

aoc: First stage compilation completed successfully.

Compiling for FPGA. This process may take a long time, please be patient.

 

j = 1024 report:

 

!===========================================================================

! The report below may be inaccurate. A more comprehensive      

! resource usage report can be found at conv_pipe/reports/report.html   

!===========================================================================

 

+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                    ;
+----------------------------------------+---------------------------+
; Resource                               ; Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ; 82%                       ;
; ALUTs                                  ; 56%                       ;
; Dedicated logic registers              ; 32%                       ;
; Memory blocks                          ; 30%                       ;
; DSP blocks                             ; 25%                       ;
+----------------------------------------+---------------------------+

 

DSP block usage does not go beyond 380 for some reason.

 

This also happens when I use char (8-bit). Am I missing some kind of mask to convince the compiler? Any suggestions for solving this case?

 

A similar problem was reported in a previous post, but it looks like the issue has not been fixed between v16.1 and v18.1:

 

https://forums.intel.com/s/question/0D50P00003yyTf2SAE/how-to-share-dsp-correctly-

 

 

 

 

 

 

3 Replies
HRZ
Valued Contributor III

As I also mentioned in the other thread (though the usernames have not been carried over after the forum was moved from Altera to Intel), you CANNOT do two MACs per DSP on Arria 10, regardless of bit width, since each DSP has only one adder. You can, however, do two 18-bit (or smaller) MULs per DSP.

 

In your case, since your reduction variable (partial_sum) is int, all the computation is treated as int regardless of DTYPE. The integer additions are mapped to logic, while each integer multiplication uses 2 DSPs. Hence, j = 512 uses 1024 DSPs in reality, and j > 759 will fully utilize the DSPs, with all the extra MULs mapped to logic. This is also exactly what the report from v16.1.2 shows. The report in 17+ seems to be broken when it comes to estimating DSP usage in such cases (add this to the huge list of things that were broken in the transition from 16.1 to 17). I tried to see whether I could get the DSP usage down by bit masking, but in my quick test, it did not seem to work. The only thing that did work was defining partial_sum as short. In that case, I got one MUL per DSP (but not two).

 

At the end of the day, the numbers in the report are estimations. The mapper might actually be able to pack the operations correctly and reduce the number of DSPs that are used. You should try placing and routing the design to see what happens. You might also have better luck using the new Arbitrary Precision Integers extension (Programming Guide, Section 5.6).
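The numerical trade-off of narrowing partial_sum to short can be seen with a small host-side model (plain C, not OpenCL; the function names are illustrative, and the wrap-around on conversion assumes the usual two's-complement behavior):

```c
#include <assert.h>
#include <stdint.h>

/* A 16 x 16 product needs up to 32 bits, so a 16-bit accumulator can
 * wrap around, while a 32-bit one keeps the exact sum. */
static int32_t mac_int_acc(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];  /* exact */
    return acc;
}

static int16_t mac_short_acc(const int16_t *a, const int16_t *b, int n)
{
    int16_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = (int16_t)(acc + (int16_t)(a[i] * b[i]));  /* may wrap */
    return acc;
}
```

So the short-accumulator version trades accuracy for (potentially) better DSP packing; it is only safe when the true sum is known to fit in 16 bits.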

mvemp
Novice

You mentioned that the compiler doesn't take the DTYPE declaration into account because partial_sum is int. When I change DTYPE to int and try j = 128, the resource estimation is as follows:

+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                    ;
+----------------------------------------+---------------------------+
; Resource                               ; Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ; 24%                       ;
; ALUTs                                  ; 13%                       ;
; Dedicated logic registers              ; 12%                       ;
; Memory blocks                          ; 21%                       ;
; DSP blocks                             ; 17%                       ;
+----------------------------------------+---------------------------+

DSP blocks: 256, as two DSP blocks are required to perform a 32 × 32 multiplication.

 

When I change DTYPE to short with j = 128, the resource estimation is as follows:

+--------------------------------------------------------------------+
; Estimated Resource Usage Summary                                    ;
+----------------------------------------+---------------------------+
; Resource                               ; Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ; 20%                       ;
; ALUTs                                  ; 10%                       ;
; Dedicated logic registers              ; 10%                       ;
; Memory blocks                          ; 15%                       ;
; DSP blocks                             ; 4%                        ;
+----------------------------------------+---------------------------+

DSP blocks: 64, as one DSP block performs two 16 × 16 multiplications with one addition. Based on the above results, I think changing DTYPE does make a difference in the number of DSPs used.

 

With j = 512, after the complete compilation, top.fit.summary shows the DSP usage as 256 blocks, not 1024, using aoc v17.1. This is exactly what we expected, as each DSP block is capable of performing two 18 × 18 multiplications while accumulating a 32-bit result (in our case, partial_sum). This is demonstrated in Section 3.1.2 of the following document:

(https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug_nfp_dsp.pdf). Do you think the compiler also uses the same mode for mapping the DSP blocks?
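The sum-of-two-multipliers mode referenced above can be sketched functionally (a model only; `dsp_sum_of_two` is a hypothetical stand-in for the hardware block, not its actual interface):

```c
#include <assert.h>
#include <stdint.h>

/* Rough functional model of a DSP block in "sum of two multipliers"
 * mode: one block produces a0*b0 + a1*b1 per cycle, which is why
 * 512 16-bit multiplies can map to 512 / 2 = 256 blocks. */
static int64_t dsp_sum_of_two(int32_t a0, int32_t b0,
                              int32_t a1, int32_t b1)
{
    return (int64_t)a0 * b0 + (int64_t)a1 * b1;
}

/* A 16-bit dot product built from pairs of multiplies, two per
 * modeled block (n assumed even). */
static int64_t dot16(const int16_t *a, const int16_t *b, int n)
{
    int64_t acc = 0;
    for (int i = 0; i + 1 < n; i += 2)
        acc += dsp_sum_of_two(a[i], b[i], a[i + 1], b[i + 1]);
    return acc;
}
```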

 

 

But this doesn't happen when j > 759. Somehow the compiler gets confused and maps the remaining multiplications to logic after using 379.5 DSPs (1518 / 4). I checked after a complete compilation (j = 768). First, the aoc compiler increases logic utilization, so the synthesis step takes a long time (in my case, one day). top.fit.place.rpt also shows the same number of DSP blocks. I expect throughput would also decrease, since the extra multiplications are performed in logic.

 

I tried the arbitrary precision integer extension and changed partial_sum to int27_t. However, aoc does not seem to allow arbitrary precision data types in kernel argument declarations. No luck again :(

 

The AOC compiler performs well at mapping floating-point multiplications. But have you come across programs that were able to use more than 50% of the DSP blocks with 8-bit or 16-bit fixed-point multiplications on Arria 10 or any other FPGA device?

 

HRZ
Valued Contributor III

With 16.1.2, whether DTYPE is short or int, as long as partial_sum is int, I get 1024 DSPs for j=512.

 

Based on what you are saying, it seems the OpenCL compiler (v17+) does not correctly instantiate the IPs and infers 32-bit MULs instead of 16-bit ones (while the estimation in the report is based on 16-bit MULs). For j > 759, since it thinks the DSPs are fully utilized, it instantiates the logic-based IP instead of the DSP-based one. For j <= 759, the mapper seems to be smart enough to pack the operations properly and reduce DSP usage, but for j > 759, since the compiler has already instantiated the logic-based IP, the mapper does not convert it back to the DSP-based IP, and you get high logic usage but low DSP utilization.

 

I tested all the way to v18.1. With DTYPE=short and partial_sum defined as int, the DSP usage gets capped at 379 (379.5, actually!) for j > 759, and the rest of the operations are mapped to logic. However, with both defined as short, 17.0+ seems to correctly map two MULs per DSP, but DSP usage gets capped at 50% this time. It seems there is still some code somewhere in the compiler (probably the linker) that has not been updated for the new way DSP mapping happens and is still using the old (pre-17.0) behavior: two DSPs per MUL for DTYPE=short with partial_sum=int, and one DSP per MUL for DTYPE=short with partial_sum=short.

 

It is certainly possible to use all the DSPs on Arria 10 with 16-bit arithmetic. For example, you can look at the paper from Altera which claims to do so in OpenCL and another one which does so using an HDL library wrapped in OpenCL (both linked in the other thread). The behavior you are observing here seems to be a bug in the OpenCL compiler.
