Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)
Fixed point optimization

Altera_Forum
Honored Contributor II

Hello,

I have written two kernels to compare fixed-point and floating-point operations.

a)

__kernel
__attribute__((task))
void test_multiplier(global char *restrict in, global char *restrict weights, global int *restrict out) {
    int output = 0;

    // Multiply-accumulate over the 8-bit inputs, unrolled 100 times.
    #pragma unroll 100
    for (int i = 0; i < VEC_SIZE; i++) {
        output += in[i] * weights[i];
    }

    *out = output;
}

b)

__kernel
__attribute__((task))
void test_multiplier(global float *restrict in, global float *restrict weights, global float *restrict out) {
    int output = 0;

    // Multiply-accumulate over the 32-bit float inputs, unrolled 100 times.
    #pragma unroll 100
    for (int i = 0; i < VEC_SIZE; i++) {
        output += in[i] * weights[i];
    }

    *out = output;
}

Both kernels use the same number of DSPs: 100, which equals the unroll factor. I was expecting 25 DSPs in the 8-bit (char argument) case. Does the aoc compiler optimize well for fixed-point quantization?
3 Replies
Altera_Forum
Honored Contributor II

Quartus/AOC v16.1.2 and below do not seem to be able to infer 8-bit and 16-bit operations correctly. Your first code example only uses 50 DSPs in 17.0.2 and above. However, it is probably best to define "out" and "output" as short rather than int.
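
One caveat when narrowing "out" and "output" to short: a single char-by-char product always fits in 16 bits (at most 128 * 128 = 16384), but the sum of 100 such products can reach 1,638,400, which does not. The following is a minimal host-side sketch in plain C, not an OpenCL kernel, that illustrates only the arithmetic. The function names and the VEC_SIZE value of 100 are assumptions made for illustration, and the sketch says nothing about how aoc maps either version to DSPs.

```c
#include <stdint.h>

#define VEC_SIZE 100  /* assumption: equal to the unroll factor in the question */

/* Dot product of 8-bit inputs with a 32-bit accumulator, as in kernel (a). */
int32_t dot_i32(const int8_t *in, const int8_t *weights) {
    int32_t acc = 0;
    for (int i = 0; i < VEC_SIZE; i++) {
        acc += in[i] * weights[i];  /* operands are promoted to int first */
    }
    return acc;
}

/* Same dot product with a 16-bit accumulator.
 * The addition is done in int; converting back to int16_t is
 * implementation-defined when the value does not fit (common
 * compilers wrap modulo 2^16), so large sums are not preserved. */
int16_t dot_i16(const int8_t *in, const int8_t *weights) {
    int16_t acc = 0;
    for (int i = 0; i < VEC_SIZE; i++) {
        acc = (int16_t)(acc + in[i] * weights[i]);
    }
    return acc;
}
```

With small inputs the two functions agree; with inputs near the ends of the char range the 16-bit accumulator cannot hold the result. Whether a short accumulator is acceptable therefore depends on the value range of the actual data.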

Altera_Forum
Honored Contributor II

I used aoc 17.1.2. The initial report after static analysis predicted 50 DSPs. After synthesis, the Quartus compilation report shows the following:

Kernel 1 - 8-bit (char) resource usage according to Quartus:

Total registers: 68810
Total pins: 173 / 960 (18 %)
Total virtual pins: 0
Total block memory bits: 1,983,656 / 55,562,240 (4 %)
Total DSP Blocks: 100 / 1,518 (7 %)
Total HSSI RX channels: 8 / 72 (11 %)
Total HSSI TX channels: 8 / 72 (11 %)
Total PLLs: 78 / 144 (54 %)

Kernel 2 - 32-bit (float) resource usage according to Quartus:

Logic utilization (in ALMs): 128,593 / 427,200 (30 %)
Total registers: 157318
Total pins: 173 / 960 (18 %)
Total virtual pins: 0
Total block memory bits: 10,365,736 / 55,562,240 (19 %)
Total DSP Blocks: 100 / 1,518 (7 %)
Total HSSI RX channels: 8 / 72 (11 %)
Total HSSI TX channels: 8 / 72 (11 %)
Total PLLs: 78 / 144 (54 %)
 

Why does the DSP usage increase from static analysis to synthesis? Are there any directives to restrict the number of DSPs?
Altera_Forum
Honored Contributor II

I see. I remember someone else reporting a similar situation before. This is indeed strange. Try using short or char for "output" and "out" and see what happens. I would expect that using int for these variables might "promote" all the multiplications to int, since the output is int. You can also look at the "Intel FPGA SDK for OpenCL Best Practices Guide", Section 3.3.1, "Floating-Point versus Fixed-Point Representations", and follow its guidelines for masking out bits to see if you get the desired result. If none of this helps, I recommend opening a ticket with Altera directly.
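
To make the two ideas in this reply concrete, here is a small host-side sketch in plain C, not an OpenCL kernel. It first shows that a product of two 8-bit operands already has type int because of integer promotion, and then shows explicit masking that states how many bits of each operand and of the result are meaningful. The function names are hypothetical, the masking treats the values as unsigned (signed char data would need different handling), and the exact masking form that the Best Practices Guide recommends should be checked in the guide itself. Whether aoc uses fewer DSPs for such code is not confirmed here.

```c
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of the product of two 8-bit operands.
 * Integer promotion converts both operands to int before the
 * multiplication, so the result has the size of int. */
size_t product_size(void) {
    int8_t a = 3, b = 4;
    return sizeof(a * b);  /* sizeof does not evaluate the expression */
}

/* Multiply using only the low 8 bits of each operand and keep the
 * low 16 bits of the product. The masks make the intended bit
 * widths explicit in the source. */
uint32_t mul_masked_u8(uint32_t a, uint32_t b) {
    return ((a & 0xFFu) * (b & 0xFFu)) & 0xFFFFu;
}
```

For example, mul_masked_u8(0x1234, 2) ignores the upper byte of the first operand and multiplies 0x34 by 2.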
