How to share DSPs correctly?

Altera_Forum
Honored Contributor II
3,287 Views

An 8-bit or 16-bit multiply should use only 0.5 DSP.

I am trying to implement a MAC on two char operands, with the result stored in an int.

The DSP usage should be 64x16/2 = 1024/2 = 512, and report.html also reports that I use 512 DSPs.

However, after compiling to an aocx, the DSP usage is 1024, so no DSP sharing takes place.

Also, although report.html says the kernel shares DSPs, when I increase the parallelism to 64x32, the DSP usage should be 64x32/2 = 1024. That should fit on an Arria 10 1150, which has 1518 DSPs.

However, report.html says that I use 759 DSPs, which is half of 1518, plus a lot of "add" and "and" logic. I guess the compiler uses logic to implement the multiplies that would have needed the remaining (1024-759) DSPs.

So although report.html knows that I want DSP sharing, the compiler is not able to do it, and the report underestimates the DSP usage.

I get the same result with Quartus 17.0 and 17.1.

How do I share DSPs correctly?

 

__kernel
__attribute__((task))
void mul(){

    int sum = 0;
    int partial_sum;
    char w_in[64][16];
    char d_in[16];

    for(){  // loop bounds omitted
        partial_sum = 0;
        #pragma unroll
        for(int i=0 ; i<64 ; i++){
            #pragma unroll
            for(int j=0 ; j<16 ; j++){
                partial_sum += (short) (w_in[i][j] * d_in[j]);
            }
        }
        sum += partial_sum;
    }
}
Altera_Forum
Honored Contributor II
1,978 Views

You cannot do two MACs per DSP, regardless of your data size, since each DSP has only one adder. If your data size is smaller than 18 bits, you can do a maximum of two MULs per DSP, or a dot product like "±(ax * ay) + (bx * by)". Refer to the following document for the different modes of computation for each DSP:

 

https://www.altera.com/en_us/pdfs/literature/ug/ug_nfp_dsp.pdf
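
As a rough illustration of the "sum of two products" shape mentioned above, here is a minimal sketch in plain host C (not OpenCL kernel code). The function names are made up for this example, and nothing here guarantees that the OpenCL compiler will map the pattern onto a single DSP. One detail worth noting: the sum of two char products can reach 32768, one more than a signed 16-bit value can hold, so the sketch keeps the result in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* One "sum of two products" step: (ax * ay) + (bx * by).
 * Operands are signed 8-bit; the result is kept in 32 bits because
 * the sum can reach 32768 when all four operands are -128. */
int32_t sum_of_two_products(int8_t ax, int8_t ay, int8_t bx, int8_t by)
{
    return (int32_t)ax * ay + (int32_t)bx * by;
}

/* Dot product of n elements written as n/2 sum-of-two-products steps.
 * n must be even. The accumulation across pairs is a separate add. */
int32_t dot_paired(const int8_t *w, const int8_t *d, int n)
{
    assert(n % 2 == 0);
    int32_t acc = 0;
    for (int j = 0; j < n; j += 2) {
        acc += sum_of_two_products(w[j], d[j], w[j + 1], d[j + 1]);
    }
    return acc;
}
```

Each call to sum_of_two_products covers two multiplies and the one add between them. The running accumulation into acc is an extra add per pair, which is consistent with the point above that one DSP cannot hold two full MACs.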
Altera_Forum
Honored Contributor II
1,978 Views

Thanks, HRZ.

Did you mean that I have to change the code like this?

And do you know why report.html says my DSP usage is 512 when I use the 64x16 unrolled MAC?

 

 

__kernel
__attribute__((task))
void mul(){
    // declarations and outer loop as in my first post
    #pragma unroll
    for(int i=0 ; i<64 ; i++){
        #pragma unroll
        for(int j=0 ; j<8 ; j++){
            partial_sum += (short) ((w_in[i][j] * d_in[j]) + (w_in[i][j+8] * d_in[j+8]));
        }
    }
}
Altera_Forum
Honored Contributor II
1,978 Views

What I was trying to say was that what you want to do is likely not possible due to hardware limitations of the DSP. I modified your original example to prevent the compiler from optimizing everything out (since the kernel has no arguments) and I got a DSP usage of 1024 in the report. I am not sure how or why you got 512. Can you archive and attach the report folder with 512 DSPs?

Altera_Forum
Honored Contributor II
1,978 Views

I get the same result with the Quartus OpenCL SDK 17.0 and 17.1.

The file 32bit_256 is:

int partial_sum;
char w_in, d_in;
#pragma unroll
for(int j=0 ; j<256 ; j++){
    partial_sum += (w_in * d_in);
}

The compiler automatically converts w_in and d_in to the same type as partial_sum, so report.html shows 32-bit Mul and Add.

However, the DSP usage is still half of 256: the report shows 128 DSPs.

The file 32bit_800 increases the parallelism to 800, so it should use 400 DSPs.

However, report.html shows that I have 759 Muls with a DSP usage of 379.5, and that LEs are used to implement the rest of the Muls.

This means the maximum number of 32-bit Muls is only half of the 1518 total DSPs.

The report charges 0.5 DSP per 32-bit Mul, but in reality each one costs 2 DSPs.

So report.html underestimates the DSP usage.

 

 

 

 

I modified the code to force 16-bit Muls with a parallelism of 800.

That works well: the report shows 800 Muls and a DSP usage of 400.

But when I increase the parallelism further to 1600, the same thing happens: I get 1518 Muls, and LEs are used to implement the rest.

The report charges 0.5 DSP per 16-bit Mul, but in reality each one costs 1 DSP.

int partial_sum;
char w_in, d_in;
#pragma unroll
for(int j=0 ; j<256 ; j++){
    partial_sum += (short)(w_in * d_in);
}
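
A side note on the (short) cast used above: the product of two signed 8-bit values always lies between -16256 and 16384, so narrowing it to 16 bits loses nothing. The following plain C sketch (host code, with a hypothetical function name) checks this exhaustively. It only concerns the arithmetic and says nothing about how the compiler packs multipliers into DSPs.

```c
#include <assert.h>
#include <stdint.h>

/* Counts how many of the 65536 products of two signed 8-bit values
 * change when cast to a signed 16-bit value. */
int count_lossy_short_casts(void)
{
    int lossy = 0;
    for (int a = INT8_MIN; a <= INT8_MAX; a++) {
        for (int b = INT8_MIN; b <= INT8_MAX; b++) {
            int full = a * b;                  /* operands promoted to int */
            int16_t narrowed = (int16_t)full;  /* the (short) cast */
            /* Range of an 8-bit x 8-bit signed product. */
            assert(full >= -16256 && full <= 16384);
            if (narrowed != full) {
                lossy++;
            }
        }
    }
    return lossy;
}
```

Because no product is changed by the cast, forcing the multiply result to 16 bits does not alter the value that gets accumulated.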

 

 

My question is: if my data type is char or short, can I treat the DSP resources as twice what the device provides?

I have seen some papers using the Arria 10 GX 1150 where the DSP count is 1518 when the data type is float, and 3036 when the data type is FP16 or FP8.

How do I use the DSPs this way correctly?
Altera_Forum
Honored Contributor II
1,978 Views

This is an interesting observation. However, it seems there is some issue with resource estimation in Quartus v17/17.1. Using 16.1.2, I get 512 DSPs for 32bit_256 in the report, and full DSP utilization for 32bit_800. Also I get 800 DSPs for 16bit_800, and full DSP utilization for 16bit_1600. 

 

Based on Intel's documentation, each DSP on Arria 10 can do a maximum of one 27-bit x 27-bit MUL, or two independent 18-bit x 18-bit MULs. This means that multiplying two 32-bit integers requires two DSPs. However, it should be possible to perform two 16-bit x 16-bit multiplications per DSP, but for some reason, the compiler is failing to correctly infer this. 
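
The DSP arithmetic in this thread can be written down as a small C model. This is only a sketch of the limits quoted above (two 18x18 MULs per DSP, one 27x27 MUL per DSP, two DSPs for a 32x32 MUL). The function names are invented for this example, treating every width from 28 to 32 bits as costing two DSPs is an assumption, and the model describes what the hardware could hold in principle, not what the OpenCL compiler actually achieves after placement and routing.

```c
#include <assert.h>

/* Cost of one integer multiply, in half-DSP units, for a given operand
 * width in bits. Widths above 32 bits are not modeled. */
int half_dsps_per_mul(int operand_bits)
{
    assert(operand_bits > 0 && operand_bits <= 32);
    if (operand_bits <= 18) {
        return 1;   /* two 18x18 MULs share one DSP: 0.5 DSP each */
    }
    if (operand_bits <= 27) {
        return 2;   /* one 27x27 MUL per DSP */
    }
    return 4;       /* assumption: 28..32-bit MULs take two DSPs */
}

/* DSP blocks needed for num_muls independent multiplies, rounded up. */
int dsps_needed(int num_muls, int operand_bits)
{
    int halves = num_muls * half_dsps_per_mul(operand_bits);
    return (halves + 1) / 2;
}

/* Number of multiplies that fit into a budget of total_dsps blocks. */
int muls_that_fit(int total_dsps, int operand_bits)
{
    return (2 * total_dsps) / half_dsps_per_mul(operand_bits);
}
```

With a budget of 1518 DSPs, this model allows 759 32-bit MULs, which matches the limit observed for 32bit_800, and 3036 16-bit MULs, which is the figure asked about earlier in the thread.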

 

Intel has added a new extension in v17 for Arbitrary Precision Integers (Programming Guide, Section 5.6). You might be able to do what you want using that extension. Just make sure to follow the instructions for casting the variables from the documentation. 

 

Finally, regarding the case with 3036 MULs, are you talking about Intel's own paper here? 

 

https://arxiv.org/abs/1701.03534 

 

That paper uses fixed-point with shared exponent, and inferring that data type probably requires complex bit masking. In fact, it is possible that they also used undocumented features of the compiler to achieve that behavior. 

 

P.S. It is also possible that the "printf" in your kernel is preventing the mapper from correctly packing the DSPs. For the sake of completeness, you should try removing the printf and see what happens.
Altera_Forum
Honored Contributor II
1,978 Views

I have removed the printf, and the result is the same.

I am talking about

https://wicil.ece.wisc.edu/wp-content/uploads/2017/02/jzhang_fpga17_cnn.pdf

The authors used SDK 16.0.211, so I think it may work well with SDK 16.0.211.

If report.html can show two 16-bit x 16-bit multiplications per DSP, it shouldn't require complex bit masking.

However, my BSP only supports Quartus 17.0, so I can't try 16.0.211.

By the way, the authors say they limited the maximum fan-out to 100 to increase the frequency, and they got a very high Fmax of 385 MHz. Do you know how they achieved that?

I always get an Fmax of 200 ~ 300 MHz. I also tried running "aoc -c" first and then using Quartus to set the fan-out, but still have l
Altera_Forum
Honored Contributor II
1,978 Views

Oh, that paper... That is not really an OpenCL design. They have coded pretty much everything in SystemVerilog, and then packaged it into an OpenCL kernel as an HDL library. They could do two 16-bit MULs per DSP since they were describing their computation in a low-level language, and that is also how they managed to achieve such a high operating frequency. The paper from Intel, however, claims to describe the design purely in OpenCL.

 

And yes, the OpenCL report claims it is implementing 16-bit x 16-bit MUL, but as you saw yourself, it is still not actually capable of packing two such MULs into one DSP and after placement and routing, you still get only one MUL per DSP. That is why I think achieving such behavior in OpenCL might require bit masking.
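
For readers wondering what "bit masking" could look like, one well-known trick is to pack two narrow operands into one wide word, multiply once, and then split the result with a shift and a mask. The plain C sketch below shows it for unsigned 8-bit values that share one operand. It is an assumption that this is the kind of technique meant here, the function name is hypothetical, signed operands would need an extra correction step that is not shown, and it has not been verified that the OpenCL compiler maps the wide multiply onto a single DSP.

```c
#include <assert.h>
#include <stdint.h>

/* Computes a*c and b*c with a single wide multiply (unsigned only). */
void packed_mul_u8(uint8_t a, uint8_t b, uint8_t c,
                   uint16_t *a_times_c, uint16_t *b_times_c)
{
    assert(a_times_c != 0 && b_times_c != 0);
    /* Place a 16 bits above b. Since b*c <= 255*255 = 65025 < 2^16,
     * the low product can never spill into the high product. */
    uint32_t packed = ((uint32_t)a << 16) | b;  /* 24-bit operand */
    uint32_t wide = packed * (uint32_t)c;       /* one 24-bit x 8-bit MUL */
    *b_times_c = (uint16_t)(wide & 0xFFFFu);    /* mask out the low product */
    *a_times_c = (uint16_t)(wide >> 16);        /* shift down the high product */
}
```

The wide multiply is 24 bits by 8 bits, which is within the 27-bit x 27-bit mode mentioned earlier in the thread. The restriction is that both products must share the operand c.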
mvemp
Novice
1,978 Views

I am facing exactly the same issue. Was the problem solved? How did you increase the number of 32-bit multiplications beyond 1518? Is there any way? Were you using hard HDL IP?
