Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

Fixed point optimization

Altera_Forum

Hello,

I have written two kernels to compare fixed-point and floating-point operations.


a)

__kernel
__attribute__((task))
void test_multiplier(global char *restrict in, global char *restrict weights, global int *restrict out) {
    int output = 0;

    #pragma unroll 100
    for (int i = 0; i < VEC_SIZE; i++) {
        output += in[i] * weights[i];
    }

    *out = output;
}

b)

__kernel
__attribute__((task))
void test_multiplier(global float *restrict in, global float *restrict weights, global float *restrict out) {
    int output = 0;

    #pragma unroll 100
    for (int i = 0; i < VEC_SIZE; i++) {
        output += in[i] * weights[i];
    }

    *out = output;
}

Both kernels use the same number of DSPs, i.e. 100 (the unroll factor). I was expecting 25 DSPs in the 8-bit (char argument) case. Does the aoc compiler optimize well for fixed-point quantization?
Altera_Forum

Quartus/AOC v16.1.2 and below do not seem to be able to infer 8-bit and 16-bit operations correctly. Your first code example only uses 50 DSPs in 17.0.2 and above. However, it is probably best to define "out" and "output" as short rather than int.
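A minimal sketch of the suggestion above, with "out" and "output" declared as short. The kernel name test_multiplier_short, the VEC_SIZE value of 100, and the preprocessor shim (which only exists so the same source can be checked as plain C on a host) are assumptions for illustration and are not from the original post. Whether this changes the DSP count depends on the aoc version and has to be confirmed in the Quartus report.

```c
/* Hypothetical shim: lets this file build as plain C for a quick host check.
   An OpenCL compiler defines __OPENCL_VERSION__, so the kernel qualifiers
   are used unchanged there. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define global
#define TASK
#else
#define TASK __attribute__((task))
#endif

#ifndef VEC_SIZE
#define VEC_SIZE 100 /* assumed; the original post does not give the value */
#endif

__kernel TASK
void test_multiplier_short(global const char *restrict in,
                           global const char *restrict weights,
                           global short *restrict out) {
    /* 16-bit accumulator instead of int. It can overflow for large inputs:
       one char*char product already needs up to 15 bits plus sign. */
    short output = 0;

    #pragma unroll 100
    for (int i = 0; i < VEC_SIZE; i++) {
        output += (short)(in[i] * weights[i]);
    }

    *out = output;
}
```

Because the accumulator is only 16 bits wide, this version is only equivalent to the int version when the running sum stays within the range of short.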

Altera_Forum

I used aoc 17.1.2. The initial report after static analysis predicted 50 DSPs. After synthesis, the Quartus compilation report shows the following:

Kernel 1 - 8-bit (char) resource usage according to Quartus

Total registers 68810
Total pins 173 / 960 ( 18 % )
Total virtual pins 0
Total block memory bits 1,983,656 / 55,562,240 ( 4 % )
Total DSP Blocks 100 / 1,518 ( 7 % )
Total HSSI RX channels 8 / 72 ( 11 % )
Total HSSI TX channels 8 / 72 ( 11 % )
Total PLLs 78 / 144 ( 54 % )

Kernel 2 - 32-bit (float) resource usage according to Quartus

Logic utilization (in ALMs) 128,593 / 427,200 ( 30 % )
Total registers 157318
Total pins 173 / 960 ( 18 % )
Total virtual pins 0
Total block memory bits 10,365,736 / 55,562,240 ( 19 % )
Total DSP Blocks 100 / 1,518 ( 7 % )
Total HSSI RX channels 8 / 72 ( 11 % )
Total HSSI TX channels 8 / 72 ( 11 % )
Total PLLs 78 / 144 ( 54 % )

Why does the DSP usage increase from the static-analysis estimate (50) to the synthesis result (100)? Are there any directives to restrict the number of DSPs?
Altera_Forum

I see. I remember someone else reporting a similar situation before. This is indeed strange. Try using short or char for "output" and "out" and see what happens. I would expect that using int for these variables might "promote" all the multiplications to int, since the output is int. You can also look at the "Intel FPGA SDK for OpenCL Best Practices Guide, Section 3.3.1 Floating-Point versus Fixed-Point Representations" and follow its guidelines for masking out bits to see if you get the desired results. If none of this helps, I recommend opening a ticket with Altera directly.
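The bit-masking idea could look roughly like the sketch below. It is an illustration in the spirit of that guideline, not code taken from the guide. It assumes the data can be treated as unsigned 8-bit values, since masking a signed char changes the value of negative inputs. The kernel name test_multiplier_masked, the VEC_SIZE value of 100, and the host-build shim are hypothetical, and there is no guarantee that aoc will map the masked multiplies to fewer DSPs.

```c
/* Hypothetical shim: lets this file build as plain C for a quick host check.
   An OpenCL compiler defines __OPENCL_VERSION__, so the kernel qualifiers
   are used unchanged there. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define global
#define TASK
#else
#define TASK __attribute__((task))
#endif

#ifndef VEC_SIZE
#define VEC_SIZE 100 /* assumed; the original post does not give the value */
#endif

__kernel TASK
void test_multiplier_masked(global const unsigned char *restrict in,
                            global const unsigned char *restrict weights,
                            global unsigned int *restrict out) {
    unsigned int output = 0;

    #pragma unroll 100
    for (int i = 0; i < VEC_SIZE; i++) {
        /* Masks state the precision explicitly: 8-bit operands,
           16-bit product (255 * 255 = 65025 fits in 16 bits). */
        unsigned int a = in[i] & 0xFF;
        unsigned int w = weights[i] & 0xFF;
        output += (a * w) & 0xFFFF;
    }

    /* With VEC_SIZE = 100 the sum is at most 100 * 65025, below 2^24. */
    *out = output & 0xFFFFFF;
}
```

The final 24-bit mask is only safe for the assumed VEC_SIZE of 100; a longer vector needs a wider mask on the accumulator.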
