Intel® Quartus® Prime Software

Short vs Int vs Floating Point usage in Kernels

Altera_Forum
Honored Contributor II

I'm wondering if anyone else has successfully used smaller fixed-point representations for data in their kernels, compared to floating point. By that I mean I've been experimenting with converting my algorithm from floating point to fixed-point integer. As expected, this lowered the resource usage, since integer arithmetic is a simpler operation for the FPGA. The issue appears when I try to move to types smaller than 32-bit integers. I actually only need about 8 bits of resolution for my data, so I was hoping to save logic by changing my kernel data from 'int' to 'short' and finally 'char'. Instead, I'm finding the exact opposite: whenever I move to a smaller data type, it takes up more resources.

To make sure there was no overhead from my algorithm, I isolated the issue with a few tests and got some weird results. I created a simple kernel to do matrix addition and/or matrix dot product:

 

 

#define N_VALUES 100
#define TYPE float

typedef TYPE DATATYPE;

__kernel __attribute__((reqd_work_group_size(N_VALUES, 1, 1)))
__attribute__((num_compute_units(50)))
__attribute__((num_simd_work_items(1)))
void DotProductKernel(__global DATATYPE * restrict a,
                      __global DATATYPE * restrict b,
                      __global DATATYPE * restrict result)
{
    int id = get_global_id(0);

    // Now let's just do the dot product
    result[id] = a[id] * b[id];
}

 

I then increased the number of compute units so that the hardware usage would be significant enough to see, and changed the '#define TYPE float' to 'int', 'short', and 'char'. Here are the results reported during compilation:

 

 

Data Type                         | Logic Elements | Flip Flops | RAMs | DSPs | Logic Utilization % | Dedicated Logic Register % | Memory Block % | DSP %
----------------------------------|----------------|------------|------|------|---------------------|----------------------------|----------------|------
Floating Point Matrix Addition    | 168561         | 221098     | 2200 | 0    | 88%                 | 39%                        | 101%           | 0%
Floating Point Matrix Dot Product | 155611         | 206598     | 2100 | 50   | 84%                 | 37%                        | 97%            | 20%
Integer Matrix Addition           | 141311         | 181498     | 2000 | 0    | 79%                 | 34%                        | 93%            | 0%
Integer Matrix Dot Product        | 140661         | 192398     | 2000 | 100  | 80%                 | 35%                        | 93%            | 39%
Short Matrix Addition             | 176911         | 208448     | 2000 | 0    | 89%                 | 37%                        | 93%            | 0%
Short Matrix Dot Product          | 176511         | 209248     | 2000 | 50   | 89%                 | 37%                        | 93%            | 20%
Char Matrix Addition              | 252311         | 265398     | 2000 | 0    | 109%                | 43%                        | 93%            | 0%
Char Matrix Dot Product           | 252111         | 265798     | 2000 | 50   | 109%                | 43%                        | 93%            | 20%

 

 

 

 

Is the compiler introducing some overhead by defaulting to 32-bit integers and then masking away the excess bits? Or can anyone explain why the resource usage is so much higher for the smaller bit-width operations?
Altera_Forum
Honored Contributor II

Is the resource usage from after place and route, or is it just the estimate?

The kernel has only one arithmetic operation, whose resource usage is much smaller than the kernel's fixed overhead. Try more arithmetic operations in the kernel. BTW, more CUs will not help.
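For example, something along these lines (a hypothetical variant reusing the DATATYPE typedef from your kernel; the unroll factor and expression are arbitrary):

__kernel void DotProductKernelWide(__global DATATYPE * restrict a,
                                   __global DATATYPE * restrict b,
                                   __global DATATYPE * restrict result)
{
    int id = get_global_id(0);
    DATATYPE acc = 0;
    // Fully unrolled chain of 16 multiply-adds, so the datapath
    // is dominated by arithmetic rather than the fixed overhead.
    #pragma unroll
    for (int k = 1; k <= 16; k++)
        acc += (a[id] + (DATATYPE)k) * (b[id] + (DATATYPE)k);
    result[id] = acc;
}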
Altera_Forum
Honored Contributor II

Thanks for your reply, wzk6_3_8. 

 

 

--- Quote Start ---  

Is the resource usage from after place and route, or is it just the estimate? 

--- Quote End ---  

 

 

This is just the estimate. I'm in the process of running the full compilations to compare; in my experience, though, kernels with estimated usage over 100% typically fail hardware compilation. 

 

 

--- Quote Start ---  

The kernel has only one arithmetic operation, whose resource usage is much smaller than the kernel's fixed overhead. Try more arithmetic operations in the kernel. BTW, more CUs will not help. 

--- Quote End ---  

 

 

But if everything else remains constant (only the data type changes), why would the kernel's overhead change?
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

But if everything else remains constant (only the data type changes), why would the kernel's overhead change? 

--- Quote End ---  

 

 

The width of the arithmetic operations affects the DSP usage, as seen in your experiments. 

The difference in ALM usage (i.e., logic) is not in the kernel datapath; it is in the load/store units that access memory. The alignment of loads/stores affects the ALM usage. With char* pointers, each load/store access is only 1-byte aligned, which leaves little room for optimization. With short* pointers, each address is 2-byte aligned (i.e., the least significant address bit is always zero), which allows Quartus to perform some optimizations. The difference for each load/store unit is a few hundred ALMs, depending on the alignment. With 3 load/store units times 50 copies, this overhead becomes big (at, say, ~300 ALMs each, that is 3 × 50 × 300 ≈ 45,000 ALMs), considering there is nothing else in the kernel.
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

The difference in ALM usage (i.e., logic) is not in the kernel datapath; it is in the load/store units that access memory. [...] With 3 load/store units times 50 copies, this overhead becomes big, considering there is nothing else in the kernel. 

--- Quote End ---  

 

 

That would also explain why the same algorithm with the boolean operator exploded in size. Would it be better to optimize fixed-point kernels by loading/storing the data as 32-bit integers (4 chars packed together) and unpacking them only for the kernel's internal arithmetic, keeping the alignment at 4 bytes? Or would a char4 vector data type accomplish the same thing?
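For concreteness, a rough sketch of the packing idea (untested; the kernel name is just a placeholder):

__kernel void DotProductKernelPacked(__global const uint * restrict a,
                                     __global const uint * restrict b,
                                     __global uint * restrict result)
{
    int id = get_global_id(0);
    uint pa = a[id];   // one 4-byte-aligned load carries 4 chars
    uint pb = b[id];
    uint packed = 0;
    // Unpack each byte, multiply, and repack the low 8 bits.
    #pragma unroll
    for (int i = 0; i < 4; i++) {
        uchar ca = (pa >> (8 * i)) & 0xFF;
        uchar cb = (pb >> (8 * i)) & 0xFF;
        packed |= (uint)((uchar)(ca * cb)) << (8 * i);
    }
    result[id] = packed;   // one 4-byte-aligned store
}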
Altera_Forum
Honored Contributor II

In the optimization guide there is a section on fixed-point operations (page 14) which suggests statically masking your 32-bit integers down to the desired precision. If I understood it correctly, the AOC can then disregard the extra bits during hardware generation, reducing the amount of logic; minimal in this case, but still a reduction. You may want to look into that and see if it helps you out any. I was curious, so I ran the example listed in the guide for 17-bit precision, and I saw an increase in logic for the fixed-point version over the straight 32-bit version, which I don't think should be the case. I wouldn't think the load/store units that access memory would be an issue here, as Outku suggested they are for the original poster. Any insight would be appreciated. 
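The idiom I tried looks roughly like this (a sketch for 17-bit precision, not the guide's exact listing):

__kernel void FixedPoint17(__global const uint * restrict a,
                           __global const uint * restrict b,
                           __global uint * restrict result)
{
    int id = get_global_id(0);
    // 0x1FFFF keeps the low 17 bits; masking both operands should
    // let the compiler infer a 17x17 multiplier instead of 32x32.
    result[id] = (a[id] & 0x1FFFF) * (b[id] & 0x1FFFF);
}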

 

Thanks,  

Rudy
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

That would also explain why the same algorithm with the boolean operator exploded in size. Would it be better to optimize fixed-point kernels by loading/storing the data as 32-bit integers (4 chars packed together) and unpacking them only for the kernel's internal arithmetic, keeping the alignment at 4 bytes? Or would a char4 vector data type accomplish the same thing? 

--- Quote End ---  

 

 

The boolean operator case is different. Because of the logical dependence, the second load operation has a control dependence on the first one, and this requires a different (and more expensive) type of load/store unit. 

Yes, loading/storing larger types (int, or char4) would solve the alignment problem at the expense of wasted memory.
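For example, a char4 version of the original kernel would look roughly like this (a sketch):

__kernel void DotProductKernelChar4(__global const char4 * restrict a,
                                    __global const char4 * restrict b,
                                    __global char4 * restrict result)
{
    int id = get_global_id(0);
    // Each access moves 4 bytes at once, so loads/stores are
    // 4-byte aligned; the multiply is applied component-wise.
    result[id] = a[id] * b[id];
}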
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

I ran the example listed in the guide for 17-bit precision, and I saw an increase in logic for the fixed-point version over the straight 32-bit version, which I don't think should be the case. [...] 

Thanks, 
Rudy 

--- Quote End ---  

 

 

I think the example from the optimization guide is too small to show any benefit from masking. There is really nothing in the kernel other than two loads, one multiply, and one store. If the datapath were bigger, I believe you would start to see the expected impact.
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

I think the example from the optimization guide is too small to show any benefit from masking. There is really nothing in the kernel other than two loads, one multiply, and one store. If the datapath were bigger, I believe you would start to see the expected impact. 

--- Quote End ---  

 

 

Actually, I would expect a slight increase in logic (due to the masking operations) and a reduction in DSP usage due to the narrower multiply operation.
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

The boolean operator case is different. Because of the logical dependence, the second load operation has a control dependence on the first one, and this requires a different (and more expensive) type of load/store unit. 

Yes, loading/storing larger types (int, or char4) would solve the alignment problem at the expense of wasted memory. 

--- Quote End ---  

 

 

Why would there be wasted memory if you're packing 4 char values into an integer? Or do you mean wasted memory in terms of the logic elements used to convert (mask) from 32 bits down to chars and back? 

 

I was curious, so I expanded my experiment to the vector data types (char4) as well as the solution of packing 4 chars into a 32-bit integer. The vector solution is attached as matrixmult_char4.txt (I couldn't upload a .cl file for some reason), and the packing-into-int solution is attached as matrixmult_int.txt. 

 

Compiling these for the above tests (dot product and simple addition), I get the following: 

 

 

Data Type                       | Logic Elements | Flip Flops | RAMs | DSPs | Logic Utilization % | Dedicated Logic Register % | Memory Block % | DSP %
--------------------------------|----------------|------------|------|------|---------------------|----------------------------|----------------|------
Char Matrix Addition Compact    | 143611         | 181098     | 2000 | 0    | 79%                 | 34%                        | 93%            | 0%
Char Matrix Dot Product Compact | 164211         | 271998     | 2000 | 200  | 94%                 | 44%                        | 93%            | 78%
Char4 Vector Matrix Addition    | 141311         | 181498     | 2000 | 0    | 79%                 | 34%                        | 93%            | 0%
Char4 Vector Matrix Dot Product | 141461         | 192698     | 2000 | 200  | 80%                 | 36%                        | 93%            | 78%

 

 

 

I can't exactly explain why the DSP usage increased for the dot product, other than that each kernel now performs four multiplications instead of the int version's one. However, the results do favor Outku's explanation of the load/store alignments. 

 

 

--- Quote Start ---  

In the optimization guide there is a section on fixed-point operations (page 14) which suggests statically masking your 32-bit integers down to the desired precision. [...] 

--- Quote End ---  

 

 

Thanks for the suggestion. I did try using static masks on larger data types to get the 8-bit (char) and 16-bit (short) cases, but I ended up with the same result: it fixed the alignment issue, but a lot of space was wasted loading/storing all 32 bits while using only 8 of them. Using the vector data types, though, seems to have solved the issue. 

 

Thanks everyone for your help!
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Why would there be wasted memory if you're packing 4 char values into an integer? Or do you mean wasted memory in terms of the logic elements used to convert (mask) from 32 bits down to chars and back? 

I can't exactly explain why the DSP usage increased for the dot product 

--- Quote End ---  

 

 

I was just referring to storing chars as integers in memory, i.e., using only 8 bits out of every 32. 

Integer addition does not use DSP blocks, but multiplication does; hence the difference in DSP usage.
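For example (an illustrative sketch with int data):

__kernel void AddKernel(__global const int * restrict a,
                        __global const int * restrict b,
                        __global int * restrict c)
{
    int id = get_global_id(0);
    c[id] = a[id] + b[id];   // adder built from logic carry chains, no DSPs
}

__kernel void MulKernel(__global const int * restrict a,
                        __global const int * restrict b,
                        __global int * restrict c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];   // 32x32 multiply maps to hard DSP blocks
}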