Can anyone tell me why my resource usage increases non-linearly when I double the parallelism?

I have set KERNEL_PARALLEL=64 and FILTER_PARALLEL=8, so I expect 64x8=512 MACs every clock. report.html reports the following:

| | ALUTs | FF | RAM | DSP |
|---|---|---|---|---|
| 32-bit Integer Add (x640) | 15981 | 0 | 0 | 0 |
| 32-bit Integer Multiply (x512) | 0 | 0 | 0 | 256 |

As expected, there are 512 multiplies. **My question is: why are only 256 DSPs used, and why are there 640 32-bit adds?**

Furthermore, when I set KERNEL_PARALLEL=64 and FILTER_PARALLEL=16, the resource usage becomes very large, which I can't understand. How can this usage be explained?

| | ALUTs | FF | RAM | DSP |
|---|---|---|---|---|
| Add (x13986) | 456398 | 0 | 0 | 0 |
| And (x13248) | 145728 | 0 | 0 | 0 |
| Mul (x610) | 0 | 0 | 0 | 305 |

```
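// KERNEL_PARALLEL and FILTER_PARALLEL are set at compile time (64 and 8 above).
typedef struct{
    char kk[KERNEL_PARALLEL];
} kernel_parallel;
typedef struct{
    kernel_parallel ff[FILTER_PARALLEL];
} filter_parallel;
// Kernel name is a placeholder; the arguments and the code that loads
// w_in and data_in were left out of the original post.
__kernel void mac_kernel(){
    int result_buffer[KERNEL_PARALLEL];
    filter_parallel w_in, data_in;
    #pragma unroll
    for(int i=0; i<FILTER_PARALLEL; i++){
        #pragma unroll
        for(int j=0; j<KERNEL_PARALLEL; j++){
            // One MAC per (i, j) pair; index j is accumulated across all i.
            result_buffer[j] += w_in.ff[i].kk[j] * data_in.ff[i].kk[j];
        }
    }
}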
```


Each DSP has two 18x18 multipliers. Since your data is *char*, two multiplications fit in one DSP; hence, 512 multiplications require only 256 DSPs. Apart from that, you have a reduction on result_buffer, and each index j of this buffer is written in every iteration of i. So in addition to the adders inside the j loop, you need extra adders between the iterations of the i loop to get the final value of result_buffer at each index. Based on my calculations, the total number of adders should be 64 x 8 + 64 = 576. However, the compiler is for some reason instantiating 64 extra adders, which gives the 640 in the report.
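The counting argument in this reply can be checked with a few lines of host-side C. This is only a sketch of the arithmetic: the two-multiplies-per-DSP packing is taken from the reply, and the helper names `expected_dsps` and `expected_adders` are made up for illustration and are not part of any Intel tool.

```c
/* Assumption from the reply: one DSP block holds two 18x18 multipliers,
 * so two char x char products can share a single DSP. */
int expected_dsps(int kernel_parallel, int filter_parallel)
{
    int multiplies = kernel_parallel * filter_parallel;
    return (multiplies + 1) / 2;   /* round up if the count is odd */
}

/* Adder estimate from the reply: one adder per MAC, plus one extra
 * adder per result_buffer index for the reduction across the i loop. */
int expected_adders(int kernel_parallel, int filter_parallel)
{
    return kernel_parallel * filter_parallel + kernel_parallel;
}
```

For KERNEL_PARALLEL=64 and FILTER_PARALLEL=8 this gives 256 DSPs and 576 adders, against 256 DSPs and 640 adders in report.html. For FILTER_PARALLEL=16 the same packing would predict 512 DSPs, against the 305 reported.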


Does this coding style result in a tree reduction?

    #pragma unroll
    for(int i=0; i<FILTER_PARALLEL; i++){
        #pragma unroll
        for(int j=0; j<KERNEL_PARALLEL; j++){
            result_buffer[j] += w_in.ff[i].kk[j] * data_in.ff[i].kk[j];
        }
    }

Is there any way to reduce the resource usage further? Also, is it possible that the compiler builds multipliers out of ALUTs once the number of MACs per clock exceeds a certain range? I ask because 64x8 uses 256 DSPs, so 64x16 should use 512 DSPs, but it only uses 305.

For KERNEL_PARALLEL=64, FILTER_PARALLEL=16, I changed the data type of result_buffer from int to short, and the resource usage became smaller (but the result would be wrong when a short is used to store the MAC of two chars). The total DSP usage is 512, but it does not look like two multiplications are sharing one DSP. I still can't explain how the resource usage is calculated.

| | ALUTs | FF | RAM | DSP |
|---|---|---|---|---|
| 16-bit Integer Add (x959) | 8196 | 0 | 0 | 319 |
| 16-bit Integer Mul (x193) | 0 | 0 | 0 | 193 |

I have followed the Best Practices Guide and tried to use a mask to save some resources when result_buffer is int. I only need 25 bits to fully hold my data:

    result_buffer[j] += 0x01ffffff & w_in.ff[i].kk[j] * data_in.ff[i].kk[j];

report.html still reports that I am using 32-bit adds and 32-bit multiplies, even if I change my mask to 0x0000ffff. The sign bit is also masked off. Do you have more information about how to do this?
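The two numeric concerns in this post, the short accumulator going wrong and the mask removing the sign, can be reproduced on a host in plain C. The sketch below is not FPGA-specific and says nothing about what bit width the compiler infers. The helper names `signed_bits_needed`, `mask25`, and `sign_extend25` are hypothetical, and the code assumes the kernel's char is a signed 8-bit value (written as a worst-case product of 16384 here).

```c
#include <stdint.h>

/* Smallest two's-complement width that holds any sum of n products of two
 * signed 8-bit values. The worst case per product is (-128)*(-128) = 16384. */
int signed_bits_needed(int n_products)
{
    int64_t max_sum = (int64_t)16384 * n_products;
    int bits = 1;                                /* start with the sign bit */
    while ((((int64_t)1 << (bits - 1)) - 1) < max_sum)
        bits++;
    return bits;
}

/* The mask from the post: keeps bits 0..24 and clears bits 25..31, so a
 * negative value comes back as a large positive number. */
int32_t mask25(int32_t x)
{
    return (int32_t)((uint32_t)x & 0x01ffffffu);
}

/* One possible workaround (an assumption, not taken from the post or the
 * guide): after masking, treat bit 24 as the sign bit and extend it. */
int32_t sign_extend25(int32_t x)
{
    uint32_t low  = (uint32_t)x & 0x01ffffffu;
    uint32_t sign = 0x01000000u;                 /* bit 24 */
    return (int32_t)(low ^ sign) - (int32_t)sign;
}
```

`signed_bits_needed(2)` is already 17, which is why a short accumulator cannot be trusted here, and `signed_bits_needed(512)` is 25. `mask25(-6)` returns 33554426 instead of -6, while `sign_extend25(-6)` returns -6. Whether either form makes the compiler narrow the adders is not something this sketch can show.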
