
Fixed point optimization

Altera_Forum
Honored Contributor II
884 Views

Hello,

I have written two kernels to compare the resource usage of fixed-point and floating-point operations.

 

a)  

__kernel
__attribute__((task))
void test_multiplier(global char *restrict in, global char *restrict weights, global int *restrict out) {
    int output = 0;

    // Multiply-accumulate over 8-bit inputs, unrolled 100 times
    #pragma unroll 100
    for (int i = 0; i < VEC_SIZE; i++) {
        output += in[i] * weights[i];
    }

    *out = output;
}

b) 

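__kernel
__attribute__((task))
void test_multiplier(global float *restrict in, global float *restrict weights, global float *restrict out) {
    // Accumulator is declared int as posted; float was probably intended
    int output = 0;

    // Multiply-accumulate over 32-bit float inputs, unrolled 100 times
    #pragma unroll 100
    for (int i = 0; i < VEC_SIZE; i++) {
        output += in[i] * weights[i];
    }

    *out = output;
}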

Both kernels use the same number of DSPs, i.e. 100 (the unroll factor). I was expecting 25 DSPs in the 8-bit (char argument) case. Does the aoc compiler optimize well for fixed-point quantization?
3 Replies
Altera_Forum
Honored Contributor II
141 Views

Quartus/AOC v16.1.2 and below do not seem to be able to infer 8-bit and 16-bit operations correctly. Your first code example only uses 50 DSPs in 17.0.2 and above. However, it is probably best to define "out" and "output" as short rather than int.
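
For reference, the suggestion above could look roughly like the sketch below. It is not a verified fix: the kernel name test_multiplier_short, the TASK_ATTR macro, the host-side shim and the value VEC_SIZE = 100 are all assumptions made for illustration, and whether aoc actually maps this to fewer DSPs would have to be checked in the compilation report.

```c
/* Host-side shim (an assumption of this sketch, not part of the original
   post): lets the kernel also build as plain C for a quick functional check. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define global
#define TASK_ATTR
#else
#define TASK_ATTR __attribute__((task))
#endif

#ifndef VEC_SIZE
#define VEC_SIZE 100 /* assumed; the post does not state the value */
#endif

__kernel
TASK_ATTR
void test_multiplier_short(global char *restrict in,
                           global char *restrict weights,
                           global short *restrict out) {
    short output = 0; /* 16-bit accumulator instead of int */

    #pragma unroll 100
    for (int i = 0; i < VEC_SIZE; i++) {
        /* The cast keeps each 8-bit x 8-bit product explicitly 16 bits wide */
        output += (short)(in[i] * weights[i]);
    }

    *out = output;
}
```

One caveat to check before narrowing the accumulator: a single product of two signed 8-bit values can reach 16,384 in magnitude, so the sum of 100 such products can reach 1,638,400, which does not fit in a short (maximum 32,767). The sketch only returns the exact sum when the input range is small enough.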

Altera_Forum
Honored Contributor II
141 Views

I used aoc 17.1.2. The initial report after static analysis predicted 50 DSPs. After synthesis, the Quartus compilation report shows the following:

 

Resource usage according to Quartus:

Resource                      Kernel 1, 8-bit (char)           Kernel 2, 32-bit (float)
Logic utilization (in ALMs)   (not posted)                     128,593 / 427,200 (30 %)
Total registers               68,810                           157,318
Total pins                    173 / 960 (18 %)                 173 / 960 (18 %)
Total virtual pins            0                                0
Total block memory bits       1,983,656 / 55,562,240 (4 %)     10,365,736 / 55,562,240 (19 %)
Total DSP blocks              100 / 1,518 (7 %)                100 / 1,518 (7 %)
Total HSSI RX channels        8 / 72 (11 %)                    8 / 72 (11 %)
Total HSSI TX channels        8 / 72 (11 %)                    8 / 72 (11 %)
Total PLLs                    78 / 144 (54 %)                  78 / 144 (54 %)

 

Why does the DSP usage increase from the static analysis estimate (50) to the synthesis result (100)? Are there any directives to restrict the number of DSPs?
Altera_Forum
Honored Contributor II
141 Views

I see. I remember someone else reporting a similar situation before. This is indeed strange. Try using short or char for "output" and "out" and see what happens. I suspect that using int for these variables may "promote" all the multiplications to int, since the output is int. You can also take a look at "Intel FPGA SDK for OpenCL Best Practices Guide, Section 3.3.1 Floating-Point versus Fixed-Point Representations" and follow its guidelines on masking out bits to see whether you get the desired results. If none of this helps, I recommend opening a ticket with Altera directly.
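
A rough sketch of what masking out bits could look like for this kernel is given below. It is an illustration only and not the guide's own code: it assumes unsigned 8-bit data (signed char inputs would need separate sign handling), and the kernel name test_multiplier_masked, the TASK_ATTR macro, the host-side shim and VEC_SIZE = 100 are hypothetical. The masks do not change the result for in-range values; they only state the significant bit widths explicitly. Whether this brings the DSP count down would need to be confirmed in the Quartus report.

```c
/* Host-side shim (an assumption of this sketch): lets the kernel also
   build as plain C for a quick functional check. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define global
#define TASK_ATTR
#else
#define TASK_ATTR __attribute__((task))
#endif

#ifndef VEC_SIZE
#define VEC_SIZE 100 /* assumed; the post does not state the value */
#endif

__kernel
TASK_ATTR
void test_multiplier_masked(global unsigned char *restrict in,
                            global unsigned char *restrict weights,
                            global unsigned int *restrict out) {
    unsigned int output = 0;

    #pragma unroll 100
    for (int i = 0; i < VEC_SIZE; i++) {
        unsigned int a = in[i] & 0xFFu;      /* 8 significant bits */
        unsigned int b = weights[i] & 0xFFu; /* 8 significant bits */
        output += (a * b) & 0xFFFFu;         /* 255 * 255 fits in 16 bits */
    }

    /* 100 * 255 * 255 = 6,502,500 < 2^23, so 23 bits hold the full sum */
    *out = output & 0x7FFFFFu;
}
```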