Altera_Forum

Honored Contributor I

04-10-2018
09:46 AM

How to share DSPs correctly?

A 16-bit or 8-bit multiply should use only 0.5 DSP.

I tried to implement a MAC on two char operands, with the result stored in an int. The DSP usage should be 64x16/2 = 1024/2 = 512, and report.html also reports that I use 512 DSPs. However, after compiling to aocx, the DSP usage is 1024, so there is no DSP sharing.

Also, although report.html reports that the kernel shares DSPs, when I increase the parallelism to 64x32 the DSP usage should be 64x32/2 = 1024. That should fit on an Arria 10 GX 1150, which has 1518 DSPs. However, report.html reports that I use 759 DSPs, which is half of 1518, plus a lot of "add" and "and" logic. I guess the compiler uses logic to implement the remaining (1024 - 759) DSPs. So although report.html knows that I want DSP sharing, the compiler is not able to do it, and the report underestimates the DSP usage. I use Quartus 17.0 and also 17.1; the result is the same.

How do I share DSPs correctly?

```
__kernel __attribute__((task)) void mul(){
    int sum = 0;
    int partial_sum;
    char w_in[64][16];
    char d_in[16];
    for(){
        partial_sum = 0;
        #pragma unroll
        for(int i=0 ; i<64 ; i++){
            #pragma unroll
            for(int j=0 ; j<16 ; j++){
                partial_sum += (short) (w_in[i][j] * d_in[j]);
            }
        }
        sum += partial_sum;
    }
}
```

8 Replies

Altera_Forum

Honored Contributor I

04-10-2018
10:58 AM

You **cannot** do two MACs per DSP, regardless of your data size, since each DSP has only one adder. If your data size is smaller than 18 bits, you can do a maximum of two MULs per DSP, or a dot product like "±(ax * ay) + (bx * by)". Refer to the following document for the different modes of computation for each DSP:

Altera_Forum

Honored Contributor I

04-10-2018
11:26 AM

Thanks HRZ

Did you mean I have to change the code like this? And do you know why report.html says that my DSP usage is 512 when I use the 64x16 unrolled MAC?

```
__kernel __attribute__((task)) void mul(){
    #pragma unroll
    for(int i=0 ; i<64 ; i++){
        #pragma unroll
        for(int j=0 ; j<8; j++){
            partial_sum += (short) ((w_in
```

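For reference, the "(ax * ay) + (bx * by)" pattern mentioned in the reply above can be sketched in plain C as a dot product that consumes two elements per step. This is only an illustration of the arithmetic: the function names `dot_mac` and `dot_paired` are made up for this example, and nothing in this thread confirms that the OpenCL compiler maps the paired form onto a single DSP. The sketch accumulates in `int` rather than casting each pair to `short`, because two char products can sum to 32768, which is one more than a `short` can hold.

```c
#include <stddef.h>

/* Straightforward form: one multiply-accumulate per element. */
int dot_mac(const signed char *w, const signed char *d, size_t n) {
    int sum = 0;
    for (size_t j = 0; j < n; j++) {
        sum += w[j] * d[j];
    }
    return sum;
}

/* Paired form: each step computes (ax * ay) + (bx * by) and then
   adds that pair to the running sum. Assumes n is even. */
int dot_paired(const signed char *w, const signed char *d, size_t n) {
    int sum = 0;
    for (size_t j = 0; j + 1 < n; j += 2) {
        int pair = w[j] * d[j] + w[j + 1] * d[j + 1];
        sum += pair;
    }
    return sum;
}
```

Both functions return the same value for even `n`; the paired form only changes how the additions are grouped.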
Altera_Forum

Honored Contributor I

04-10-2018
06:52 PM

Altera_Forum

Honored Contributor I

04-11-2018
05:16 AM

I get the same result with Quartus OpenCL SDK 17.0 and 17.1.

The file 32bit_256 is:

```
int partial_sum;
char w_in, d_in;
#pragma unroll
for(int j=0 ; j<256; j++){
    partial_sum += (w_in * d_in);
}
```

The compiler automatically converts w_in and d_in to the same type as partial_sum, so report.html shows a 32-bit Mul and Add. However, the reported DSP usage is still half of 256: it shows 128 DSPs.

The file 32bit_800 increases the parallelism to 800, so it should use 400 DSPs. However, report.html shows that I have 759 Muls with a DSP usage of 379.5, and it uses LEs to implement the rest of the Muls. This means the maximum number of 32-bit Muls is only half of the total 1518 DSPs. While the report shows that each 32-bit Mul costs 0.5 DSP, it actually costs 2 DSPs, so report.html underestimates the DSP usage.

I then modified the code to force 16-bit Muls with a parallelism of 800. It works well, and the report shows 800 Muls and a DSP usage of 400. But when I increase the parallelism further to 1600, the same thing happens: I get 1518 Muls, and LEs are used to implement the rest. While the report shows that each 16-bit Mul costs 0.5 DSP, it actually costs 1 DSP.

```
int partial_sum;
char w_in, d_in;
#pragma unroll
for(int j=0 ; j<256; j++){
    partial_sum += (short)(w_in * d_in);
}
```
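The figures in this post are consistent with an actual cost of 2 DSPs per 32-bit Mul and 1 DSP per 16-bit Mul on a device with 1518 DSPs. A small C sketch of that arithmetic follows. The helper names are hypothetical, and the cost model is an assumption inferred from the numbers reported here, not taken from Intel documentation. Costs are expressed in half-DSPs so that 0.5 DSP can be an integer.

```c
/* Number of multipliers that fit into DSP blocks when each one costs
   half_dsps_per_mul half-DSPs (1 = 0.5 DSP, 2 = 1 DSP, 4 = 2 DSPs). */
int muls_in_dsps(int total_dsps, int half_dsps_per_mul, int requested_muls) {
    int capacity = (total_dsps * 2) / half_dsps_per_mul;
    return requested_muls < capacity ? requested_muls : capacity;
}

/* Multipliers that do not fit and would have to be built from logic. */
int muls_in_logic(int total_dsps, int half_dsps_per_mul, int requested_muls) {
    return requested_muls
         - muls_in_dsps(total_dsps, half_dsps_per_mul, requested_muls);
}
```

With 1518 DSPs, 800 requested Muls at 2 DSPs each gives 759 in DSPs and 41 in logic; 1600 requested Muls at 1 DSP each gives 1518 in DSPs and 82 in logic, which matches the Mul counts seen above.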

My question is: if my data type is char or short, can I consider the DSP resource to be twice what the device provides? I have seen a paper using the Arria 10 GX 1150 where the DSP count is 1518 when the data type is float, and 3036 when the data type is FP16 or FP8. How do I use the DSPs this way correctly?

Altera_Forum

Honored Contributor I

04-11-2018
04:38 PM

This is an interesting observation. However, it seems there is some issue with resource estimation in Quartus v17/17.1. Using 16.1.2, I get 512 DSPs for 32bit_256 in the report, and full DSP utilization for 32bit_800. Also I get 800 DSPs for 16bit_800, and full DSP utilization for 16bit_1600.

Based on Intel's documentation, each DSP on Arria 10 can do a maximum of one 27-bit x 27-bit MUL, or two independent 18-bit x 18-bit MULs. This means that multiplying two 32-bit integers requires
Altera_Forum

Honored Contributor I

04-11-2018
05:20 PM

I have removed the printf, and the result is the same.

I am talking about https://wicil.ece.wisc.edu/wp-content/uploads/2017/02/jzhang_fpga17_cnn.pdf. The authors use SDK 16.0.211, so I think maybe it works well with SDK 16.0.211. If report.html can show two 16-bit x 16-bit multiplications per DSP, it shouldn't require complex bit masking. However, my BSP only supports Quartus 17.0, so I can't try 16.0.211.

By the way, the authors say they limit the maximum fan-out to 100 to increase frequency, and they got a very high Fmax of 385 MHz. Do you know how they achieve that? I always get an Fmax of 200 ~ 300 MHz. I also tried using "aoc -c" first and then using Quartus to set the fan-out, but still have l

Altera_Forum

Honored Contributor I

04-11-2018
06:31 PM

Oh, that paper... That is not really an OpenCL design. They have coded pretty much everything in SystemVerilog, and then packaged it into an OpenCL kernel as an HDL library. They could do two 16-bit MULs per DSP since they were describing their computation in a low-level language, and that is also how they managed to achieve such a high operating frequency. The paper from Intel, however, claims to describe the design purely in OpenCL.

And yes, the OpenCL report claims it is implementing 16-bit x 16-bit MUL, but as you saw yourself, it is still not actually capable of packing two such MULs into one DSP and after placement and routing, you still get only one MUL per DSP. That is why I think achieving such behavior in OpenCL might require bit masking.
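To illustrate what such bit masking could look like, here is a minimal C sketch of one well-known packing trick: two unsigned 8-bit values that share a common factor are packed into one wide operand, multiplied once, and the two products are then masked out of the result. This is an assumption about what might work, not something tested in this thread. The function names are made up, the trick as written only holds for unsigned operands that share one factor, and whether the OpenCL compiler would place the wide multiply in a single DSP is not confirmed here.

```c
#include <stdint.h>

/* Pack a and b into one 24-bit operand and multiply once by c.
   Result layout: bits 31..16 hold a*c, bits 15..0 hold b*c.
   b*c is at most 255*255 = 65025, so it never carries into bit 16. */
uint32_t packed_mul(uint8_t a, uint8_t b, uint8_t c) {
    uint32_t packed = ((uint32_t)a << 16) | b;
    return packed * c;
}

/* Mask out the low product, b*c. */
uint16_t low_product(uint32_t p) {
    return (uint16_t)(p & 0xFFFFu);
}

/* Shift out the high product, a*c. */
uint16_t high_product(uint32_t p) {
    return (uint16_t)(p >> 16);
}
```

The packed operand is 24 bits wide and the shared factor is 8 bits, so the single multiply stays within the 27-bit x 27-bit size mentioned earlier in the thread. Signed operands would need an extra correction step that this sketch does not cover.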
mvemp

Novice

10-12-2018
05:54 PM
