Intel® Quartus® Prime Software

Fixed point optimization

Altera_Forum
Honored Contributor II

Hello,

I have written two kernels to compare fixed-point and floating-point operations.

a)

__kernel
__attribute__((task))
void test_multiplier(global char *restrict in, global char *restrict weights, global int *restrict out) {
    int output = 0;

    #pragma unroll 100
    for (int i = 0; i < VEC_SIZE; i++) {
        output += in[i] * weights[i];
    }

    *out = output;
}

b) 

__kernel
__attribute__((task))
void test_multiplier(global float *restrict in, global float *restrict weights, global float *restrict out) {
    int output = 0;

    #pragma unroll 100
    for (int i = 0; i < VEC_SIZE; i++) {
        output += in[i] * weights[i];
    }

    *out = output;
}

Both kernels use the same number of DSPs, i.e. 100 (the unroll factor). I was expecting 25 DSPs in the 8-bit (char argument) case. Does the aoc compiler optimize well for fixed-point quantization?
3 Replies
Altera_Forum
Honored Contributor II

Quartus/AOC v16.1.2 and below do not seem to be able to infer 8-bit and 16-bit operations correctly. Your first code example only uses 50 DSPs in 17.0.2 and above. However, it is probably best to define "out" and "output" as short rather than int.
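
As a rough sketch of that suggestion, the first kernel could be rewritten with a 16-bit accumulator and output as below. The kernel name test_multiplier_short, the value of VEC_SIZE and the small shim at the top are assumptions made for illustration; the shim only exists so the same file can be built as plain C99 to check the arithmetic on the host. Whether this changes the DSP count in a given aoc version is not something this sketch can confirm.

```c
/* Shim (an assumption of this sketch, not part of the original post): lets the
   kernel also build as plain C99 for a host-side check. An OpenCL compiler
   defines __OPENCL_VERSION__ and skips these empty definitions. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define global
#endif

#ifndef VEC_SIZE
#define VEC_SIZE 100 /* assumed value; the post does not show the definition */
#endif

/* Same dot product as kernel (a), but the accumulator and the result are
   16-bit. Caveat: one char*char product fits in a short, yet the sum of
   VEC_SIZE products can exceed the 16-bit range for large inputs. */
__kernel
__attribute__((task))
void test_multiplier_short(global const char *restrict in,
                           global const char *restrict weights,
                           global short *restrict out) {
    short output = 0;

    #pragma unroll 100
    for (int i = 0; i < VEC_SIZE; i++) {
        /* Narrow each product to 16 bits before accumulating. */
        output += (short)(in[i] * weights[i]);
    }

    *out = output;
}
```

A short accumulator is only safe if the data range guarantees the sum fits in 16 bits; otherwise a wider accumulator is still needed even if the multipliers themselves are narrow.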

Altera_Forum
Honored Contributor II

I used aoc 17.1.2. The initial report after static analysis predicted 50 DSPs. After synthesis, the Quartus compilation report shows the following:

 

Kernel 1 - 8-bit (char) resource usage according to Quartus:
Total registers: 68,810
Total pins: 173 / 960 (18%)
Total virtual pins: 0
Total block memory bits: 1,983,656 / 55,562,240 (4%)
Total DSP Blocks: 100 / 1,518 (7%)
Total HSSI RX channels: 8 / 72 (11%)
Total HSSI TX channels: 8 / 72 (11%)
Total PLLs: 78 / 144 (54%)

 

Kernel 2 - 32-bit (float) resource usage according to Quartus:
Logic utilization (in ALMs): 128,593 / 427,200 (30%)
Total registers: 157,318
Total pins: 173 / 960 (18%)
Total virtual pins: 0
Total block memory bits: 10,365,736 / 55,562,240 (19%)
Total DSP Blocks: 100 / 1,518 (7%)
Total HSSI RX channels: 8 / 72 (11%)
Total HSSI TX channels: 8 / 72 (11%)
Total PLLs: 78 / 144 (54%)

 

Why does the resource usage increase from static analysis to synthesis? Are there any directives to restrict the number of DSPs?
Altera_Forum
Honored Contributor II

I see. I remember someone else reporting a similar situation before. This is indeed strange. Try using short or char for "output" and "out" and see what happens. I suspect that using int for these variables "promotes" all the multiplications to int, since the output is int. You can also look at the "Intel FPGA SDK for OpenCL Best Practices Guide", Section 3.3.1, "Floating-Point versus Fixed-Point Representations", and follow its guidelines on masking out bits to see if you get the desired result. If none of this helps, I recommend opening a ticket with Altera directly.
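
A minimal sketch of the masking idea, under some loud assumptions: the data is treated as unsigned 8-bit (the original kernel uses signed char, where plain masking would change the values), VEC_SIZE is taken to be 100, and the kernel name test_multiplier_masked and the shim at the top are made up for illustration. The shim only lets the file build as plain C99 so the arithmetic can be checked on the host. The masks state how many bits each intermediate result can occupy; whether aoc then maps the multipliers to fewer DSPs has to be checked in the Quartus report.

```c
/* Shim (an assumption of this sketch, not part of the original post): lets the
   kernel also build as plain C99 for a host-side check. An OpenCL compiler
   defines __OPENCL_VERSION__ and skips these empty definitions. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define global
#endif

#ifndef VEC_SIZE
#define VEC_SIZE 100 /* assumed value; the post does not show the definition */
#endif

__kernel
__attribute__((task))
void test_multiplier_masked(global const unsigned char *restrict in,
                            global const unsigned char *restrict weights,
                            global unsigned int *restrict out) {
    unsigned int output = 0;

    #pragma unroll 100
    for (int i = 0; i < VEC_SIZE; i++) {
        /* Operands are 8-bit; the masks restate that after widening. */
        unsigned int a = 0xFFu & in[i];
        unsigned int b = 0xFFu & weights[i];
        /* 255 * 255 = 65025 fits in 16 bits; with VEC_SIZE = 100 the sum
           stays below 2^23, so a 23-bit mask never drops a set bit. */
        output = 0x7FFFFFu & (output + (0xFFFFu & (a * b)));
    }

    *out = output;
}
```

The mask widths depend on VEC_SIZE and the operand range, so they need to be recomputed if either changes.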
