Re: abnormal hardware resource usage

Altera_Forum · 03-14-2018

Can anyone tell me why my resource usage increases non-linearly when I double the parallelism?

I have set KERNEL_PARALLEL=64 and FILTER_PARALLEL=8, so I expect 64x8=512 MACs per clock.

and report.html reports:

ALUTs, FF, RAM, DSP

32-bit Integer Add (x640) 15981 0

32-bit Integer Multiply (x512) 0 256

As expected there are 512 multiplies. My question is: why are only 256 DSPs used, and why are there 640 32-bit adds?

------------------------------------------

Furthermore, when I set KERNEL_PARALLEL=64, FILTER_PARALLEL=16, the resource usage becomes very large, which I can't understand.

How can this usage be explained?

Add(x13986) 456398 0 0 0

And(x13248) 145728 0 0 0

Mul(x00610) 000000 0 0 305

typedef struct{
    char kk[KERNEL_PARALLEL];              // one 8-bit value per kernel lane
} kernel_parallel;
typedef struct{
    kernel_parallel ff[FILTER_PARALLEL];   // one group of lanes per filter
} filter_parallel;
__kernel(){
    int result_buffer[KERNEL_PARALLEL];    // one accumulator per index j
    filter_parallel w_in,data_in;
    #pragma unroll
    for(int i=0; i<FILTER_PARALLEL; i++){
        #pragma unroll
        for(int j=0; j<KERNEL_PARALLEL; j++){
            result_buffer[j] += w_in.ff[i].kk[j] * data_in.ff[i].kk[j];
        }
    }
}

Altera_Forum · 03-14-2018

Each DSP has two 18x18 multipliers. Since your data is char, two multiplications fit in one DSP, so 512 multiplications require only 256 DSPs.

As for the adders: you have a reduction on result_buffer, and each index j of this buffer is written in every iteration of i. So apart from the addition inside the j loop, you also need extra adders between the iterations of the i loop to get the final value of result_buffer for each index. By my calculation the total number of adders should be 64 x 8 + 64 = 576, but the compiler is for some reason instantiating 64 extra adders, which gives the 640 in the report.
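To make the arithmetic above concrete, here is a small sketch in plain C (not OpenCL) that reproduces the counts. It only encodes the two assumptions stated in this reply: two char multiplications share one DSP, and the adder count is 64 x 8 + 64. The function names are made up for illustration and say nothing about how the compiler actually allocates resources.

```c
/* Multiplications per clock for the fully unrolled loop nest. */
int mults_per_clock(int kernel_parallel, int filter_parallel) {
    return kernel_parallel * filter_parallel;
}

/* DSP blocks needed, assuming two narrow multiplications share one DSP. */
int dsps_needed(int mults) {
    return (mults + 1) / 2;
}

/* Adders expected by the reasoning above: one per multiply-accumulate
   plus one per result_buffer index. */
int adders_expected(int kernel_parallel, int filter_parallel) {
    return kernel_parallel * filter_parallel + kernel_parallel;
}
```

For KERNEL_PARALLEL=64 and FILTER_PARALLEL=8 these give 512 multiplications, 256 DSPs and 576 adders, against the 640 adders shown in the report.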

Regarding the excessive usage with FILTER_PARALLEL=16, I am not sure what is happening there. Maybe there is some device limitation with respect to the carry chains that is increasing the usage.

Altera_Forum · 03-14-2018

Will this coding style produce a tree reduction?

#pragma unroll
for(int i=0; i<FILTER_PARALLEL; i++){
    #pragma unroll
    for(int j=0; j<KERNEL_PARALLEL; j++){
        result_buffer[j] += w_in.ff[i].kk[j] * data_in.ff[i].kk[j];
    }
}
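For reference, the difference between a serial accumulation chain and a tree reduction can be sketched in plain C. This only illustrates the two summation orders; it does not show which one the compiler generates for the loop above. N and the function names are assumptions chosen for the example.

```c
#define N 8  /* assumed number of products; must be a power of two */

/* Serial chain: each addition depends on the previous one (depth N). */
int sum_chain(const int prod[N]) {
    int acc = 0;
    for (int i = 0; i < N; i++)
        acc += prod[i];
    return acc;
}

/* Pairwise tree: adds neighbours level by level (depth log2(N)). */
int sum_tree(const int prod[N]) {
    int level[N];
    for (int i = 0; i < N; i++)
        level[i] = prod[i];
    for (int width = N; width > 1; width /= 2)
        for (int i = 0; i < width / 2; i++)
            level[i] = level[2 * i] + level[2 * i + 1];
    return level[0];
}
```

Both functions return the same sum and use the same number of additions (N - 1); only the dependency depth differs.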

Is there any way to further reduce resource usage?

And is it possible that the compiler builds multipliers from ALUTs when the number of MACs per clock exceeds a certain range?

I ask because 64*8 uses 256 DSPs, so 64*16 should use 512 DSPs, but it only uses 305.

For KERNEL_PARALLEL=64, FILTER_PARALLEL=16,

I changed the result_buffer data type from int to short, and the resource usage became smaller (but the result would be wrong when a short is used to store the MAC of two chars).

The total DSP usage is 512, but it doesn't look like two multiplications share one DSP. I still can't explain how the resource usage is calculated.
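The wrong results with short follow from the value ranges alone. Here is a plain C sketch of the worst case, assuming signed 8-bit inputs (as OpenCL char is) and a given number of products summed into one accumulator. The function names are illustrative only.

```c
#include <stdint.h>

/* Largest product of two signed 8-bit values: (-128) * (-128). */
int32_t max_char_product(void) {
    return (int32_t)INT8_MIN * (int32_t)INT8_MIN;
}

/* Worst-case sum of `count` such products. */
int32_t max_accumulated(int count) {
    return max_char_product() * count;
}

/* Returns 1 if the worst-case sum still fits in a signed 16-bit short. */
int fits_in_short(int count) {
    return max_accumulated(count) <= INT16_MAX;
}
```

A single worst-case product is 16384, so two of them (32768) already exceed the short maximum of 32767, and 16 of them reach 262144.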

16-bit Integer Add (x959) 8196 0 0 319

16-bit Integer Mul (x193) 0 0 0 193

I have followed the Best Practices Guide and tried to use a mask to save some resources when result_buffer is int.

I only need 25 bits to fully hold my data.

result_buffer[j] += 0x01ffffff & w_in.ff[i].kk[j] * data_in.ff[i].kk[j];

report.html still reports that I am using 32-bit adds and 32-bit multiplies, even when I change my mask to 0x0000ffff.

Also, the sign bit gets masked out as well.

Do you have more information about how to do this?
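On the sign-bit problem: a plain mask does discard the sign, because a negative value becomes a large positive one. One common software idiom is to mask to 25 bits and then sign-extend from bit 24. Below is a plain C sketch of that arithmetic only; whether the offline compiler then infers narrower adders is not something this example shows. The macro and function names are hypothetical.

```c
#include <stdint.h>

#define ACC_BITS 25                               /* width stated in the post */
#define ACC_MASK ((UINT32_C(1) << ACC_BITS) - 1)  /* 0x01ffffff */
#define ACC_SIGN (UINT32_C(1) << (ACC_BITS - 1))  /* bit 24 */

/* Keep the low 25 bits of x and sign-extend them back to 32 bits. */
int32_t wrap_to_25_bits(int32_t x) {
    uint32_t low = (uint32_t)x & ACC_MASK;
    /* XOR with the sign bit, then subtract it: a portable sign extension. */
    return (int32_t)(low ^ ACC_SIGN) - (int32_t)ACC_SIGN;
}
```

With the plain mask, -1 turns into 33554431, while wrap_to_25_bits(-1) stays -1. Values outside the 25-bit range wrap around instead of being preserved.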