I'm wondering if anyone else has successfully used fixed point representations smaller than 32 bits for data in their kernels. I'm experimenting with converting my algorithm from floating point to fixed point integer. As expected, this lowered resource usage, since integer operations are simpler for the FPGA. The problem appears when I try to move below 32-bit integers: I only need about 8 bits of resolution for my data, so I was hoping to save logic blocks by changing my kernel data from 'int' to 'short' and finally 'char', but I'm finding the exact opposite. Every move to a smaller data type uses more resources. To make sure there was no overhead from my algorithm itself, I isolated the issue with a few tests and got some odd results. I created a simple kernel to do matrix addition and/or a matrix dot product:
#define N_VALUES 100
#define TYPE float
typedef TYPE DATATYPE;

__kernel
__attribute__((reqd_work_group_size(N_VALUES, 1, 1)))
__attribute__((num_compute_units(50)))
__attribute__((num_simd_work_items(1)))
void DotProductKernel(__global DATATYPE * restrict a,
                      __global DATATYPE * restrict b,
                      __global DATATYPE * restrict result)
{
    int id = get_global_id(0);
    // Now let's just do the dot product
    result[id] = a[id] * b[id];
}

I then ramped up the number of compute units so that the hardware usage would be significant enough to see, and changed '#define TYPE float' to 'int', 'short', and 'char'. Here are the results reported during compilation:

Data Type                          Logic Elements  Flip Flops  RAMs  DSPs  Logic Util.  Registers  Memory Blocks  DSP Util.
Floating Point Matrix Addition     168561          221098      2200  0     88%          39%        101%           0%
Floating Point Matrix Dot Product  155611          206598      2100  50    84%          37%        97%            20%
Integer Matrix Addition            141311          181498      2000  0     79%          34%        93%            0%
Integer Matrix Dot Product         140661          192398      2000  100   80%          35%        93%            39%
Short Matrix Addition              176911          208448      2000  0     89%          37%        93%            0%
Short Matrix Dot Product           176511          209248      2000  50    89%          37%        93%            20%
Char Matrix Addition               252311          265398      2000  0     109%         43%        93%            0%
Char Matrix Dot Product            252111          265798      2000  50    109%         43%        93%            20%

Is there some overhead being introduced by the compiler, which uses default 32-bit integers and then masks away the excess bits? Or can anyone explain why resource usage is so much higher for the smaller bit widths?
Is that resource usage from after place and route, or is it just the compiler's estimate?

The kernel has only one arithmetic operation, whose resource usage is much smaller than the kernel overhead. Try more arithmetic operations in the kernel. BTW, more CUs won't help.
Thanks for your reply, wzk6_3_8.

--- Quote Start --- Is the resource usage gotten after the "place and route"? or just the estimation? --- Quote End ---

This is just the estimate. I'm in the process of running the full compilations to compare, but in my experience kernels with estimated usage over 100% typically fail hardware compilation.

--- Quote Start --- the kernel only has one arithmetic operation, whose resource usage is much smaller than that of the kernel overhead. Try more arithmetic operations in the kernel. BTW, more CUs cannot help. --- Quote End ---

But if everything else remains constant (only the data type changes), why would the kernel overhead change?
--- Quote Start --- Thanks for your reply wzk6_3_8. This is just the estimation. I'm in the process of running the full compilations to compare, however in my experience kernels with estimated usages of over 100% typically fail hardware compilation. But if everything else remains constant (only the type of the data changes), why would the overhead of the kernel change? --- Quote End ---

The width of the arithmetic operations affects DSP usage, as seen in your experiments. The difference in ALM usage (i.e. logic) is not in the kernel datapath; it is in the load/store units that access memory. The alignment of the loads/stores affects ALM usage. With char* pointers, each load/store access is only 1-byte aligned, which does not allow much optimization. With short* pointers, each address is 2-byte aligned (i.e. the least significant address bit is zero), which lets Quartus perform some optimizations. The difference for each load/store unit is a few hundred ALMs (depending on the alignment). With 3 load/store units × 50 copies, this overhead becomes significant, considering there is nothing else in the kernel.
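The alignment argument above can be illustrated with a tiny host-side C sketch (this is just the address arithmetic, not the actual Quartus load/store hardware): for an array of 2-byte elements, the low bit of every element's byte address is constant, so the hardware never needs to decode it, while 1-byte elements can produce any address bit pattern.

```c
#include <stdint.h>

/* Byte address of element i in an array of a given element size,
   starting from a base address. For elem_size 2 (short), bit 0 of
   the result is always equal to bit 0 of the base; for elem_size 1
   (char), every address bit can toggle. */
uint32_t byte_addr(uint32_t base, uint32_t elem_size, uint32_t i)
{
    return base + elem_size * i;
}
```

With a 4-byte-aligned base, short accesses always leave address bit 0 at zero and int accesses leave bits 1:0 at zero, which is the property the load/store units can exploit.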
--- Quote Start --- The width of the arithmetic operations impacts the DSP usage as seen in your experiments. The difference in ALM usage (i.e. logic) is not in the kernel datapath, it is in the load/store units that access the memory. The alignment of loads/stores impacts the ALM usage. With char* pointers, each load/store access is only 1-byte aligned and this does not allow much optimization. With short* pointers, each address is 2-byte aligned (i.e. the least significant address bit is zero) and this allows Quartus to perform some optimizations. The difference for each load/store unit is a few hundred ALMs (depends on the alignment). With 3 load/store * 50 copies, this overhead becomes big, considering there is nothing else in the kernel. --- Quote End --- That would also explain why the same algorithm with the boolean operator exploded in size. Would it be better to optimize fixed point kernels by loading/storing them as 32-bit integers (as 4 chars packed together) and then separating them only for the internal arithmetic of the kernel to keep the alignment at 4 bytes? Or would a char4 vector data type accomplish the same task?
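The packing idea in the question can be sketched in plain C (hypothetical helper names; in an actual kernel this would be done with shifts on an int argument, or avoided entirely by using the built-in char4 vector type):

```c
#include <stdint.h>

/* Pack four 8-bit values into one 32-bit word (lane 0 in the
   low byte), so memory traffic stays 4-byte aligned. */
uint32_t pack4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint32_t)a | ((uint32_t)b << 8) |
           ((uint32_t)c << 16) | ((uint32_t)d << 24);
}

/* Extract lane n (0..3) from a packed word for the internal
   8-bit arithmetic. */
uint8_t unpack_lane(uint32_t w, int n)
{
    return (uint8_t)(w >> (8 * n));
}
```

The loads and stores then move 32-bit words, and only the datapath between them operates on 8-bit lanes.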
In the optimization guide there is a section on fixed point operations (page 14) that suggests statically masking your 32-bit integers to the desired precision. If I understood it right, the AOC can then disregard the extra bits during hardware generation, reducing the amount of logic (minimal in this case, but still a reduction). You may want to look into that and see if it helps you. I was curious, so I ran the example listed in the guide for 17-bit precision. I saw an increase in logic for the fixed point version over the straight 32-bit version, and I don't think that should be the case. I also wouldn't expect the load/store units that access memory to be the issue here, as Outku suggested they were for the original poster. Any insight would be appreciated.

Thanks, Rudy
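The masking technique described above looks roughly like this in plain C (a sketch of the principle, not the guide's exact listing): masking both operands before the multiply tells the compiler the upper bits are always zero, so it can generate a narrower multiplier.

```c
#include <stdint.h>

#define MASK_17 0x1FFFFu  /* keep only the low 17 bits */

/* Multiply two values whose useful precision is 17 bits. The static
   masks let a hardware compiler prove the upper 15 bits of each
   operand are zero and trim the multiplier accordingly. */
uint32_t mul17(uint32_t a, uint32_t b)
{
    return (a & MASK_17) * (b & MASK_17);
}
```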
--- Quote Start --- That would also explain why the same algorithm with the boolean operator exploded in size. Would it be better to optimize fixed point kernels by loading/storing them as 32-bit integers (as 4 chars packed together) and then separating them only for the internal arithmetic of the kernel to keep the alignment at 4 bytes? Or would a char4 vector data type accomplish the same task? --- Quote End --- The boolean operator case is different. Because of the logical dependence, the second load operation has a control dependence on the first one. This uses a different (and more expensive) type of load/store unit. Yes, loading/storing larger types (int, or char4) would solve the alignment problem at the expense of wasted memory.
--- Quote Start --- In the optimization guide there is a section on fixed point operation, page 14, which suggests statically masking your 32-bit integers to the desired precision. If I understood this right, the AOC would be able to disregard the extra bits during hardware generation thus will reduce the amount of logic (minimal in this case) but still a reduction. You may want to look into that and see if it helps you out any. I was curious so I ran the example listed in the guide for 17-bit precision. I had an increase in logic for the fixed point version over the straight 32-bit version and I don't think this should be the case. I wouldn't think the load/store units that access the memory wouldn't be an issue as Outku suggested in the case of the original poster. Any insight would be appreciated. Thanks, Rudy --- Quote End ---

I think the example from the optimization guide is too small to show any benefit from masking. There is really nothing in the kernel other than two loads, one multiply, and one store. If the datapath were bigger, I believe we would start seeing the same impact.
--- Quote Start --- I think the example from the optimization guide is too small to show any benefits of masking. There is really nothing in the kernel other than two loads, one multiply and one store. If the datapath was bigger, I believe we will start seeing same impact. --- Quote End ---

Actually, I see a slight increase in logic (due to the masking operations) and a reduction in DSP usage due to the narrower multiply operation.
--- Quote Start --- The boolean operator case is different. Because of the logical dependence, the second load operation has a control dependence on the first one. This uses a different (and more expensive) type of load/store unit. Yes, loading/storing larger types (int, or char4) would solve the alignment problem at the expense of wasted memory. --- Quote End ---

Why would there be wasted memory if you're packing 4 char values into an integer? Or do you mean wasted memory in terms of the logic elements used to convert (mask) from 32 bits down to chars and back?

I was curious, so I expanded my experiment to the vector data types (char4) as well as the solution of packing 4 chars into a 32-bit integer. The vector solution is attached as matrixmult_char4.txt (I couldn't upload a .cl file for some reason). The packed 'int' solution is attached as matrixmult_int.txt. Compiling these for the above tests (dot product and simple addition), I get the following:

Data Type                        Logic Elements  Flip Flops  RAMs  DSPs  Logic Util.  Registers  Memory Blocks  DSP Util.
Char Matrix Addition Compact     143611          181098      2000  0     79%          34%        93%            0%
Char Matrix Dot Product Compact  164211          271998      2000  200   94%          44%        93%            78%
Char4 Vector Matrix Addition     141311          181498      2000  0     79%          34%        93%            0%
Char4 Vector Matrix Dot Product  141461          192698      2000  200   80%          36%        93%            78%

I can't exactly explain why DSP usage increased for the dot product, other than that each kernel now performs four multiplications per packed word instead of one in the int version. The results do, however, favor Outku's explanation of the load/store alignments.

--- Quote Start --- In the optimization guide there is a section on fixed point operation, page 14, which suggests statically masking your 32-bit integers to the desired precision. If I understood this right, the AOC would be able to disregard the extra bits during hardware generation thus will reduce the amount of logic (minimal in this case) but still a reduction. You may want to look into that and see if it helps you out any. I was curious so I ran the example listed in the guide for 17-bit precision. I had an increase in logic for the fixed point version over the straight 32-bit version and I don't think this should be the case. I wouldn't think the load/store units that access the memory wouldn't be an issue as Outku suggested in the case of the original poster. Any insight would be appreciated. Thanks, Rudy --- Quote End ---

Thanks for the suggestion. I did attempt to use static masks on larger data types to get the 8-bit (char) and 16-bit (short) examples, but I ended up with the same result: it fixed the alignment issue, but there was a lot of wasted space loading/storing all 32 bits while using only 8. Using the vector data types, though, seems to have solved the issue. Thanks everyone for your help!
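Where the extra multiplies come from can be sketched in plain C (a hypothetical lane-wise helper, not the actual char4 hardware): each packed 32-bit word carries four 8-bit lanes, so an elementwise multiply over packed words performs four multiplications where the int version performed one, which is consistent with the higher DSP counts above.

```c
#include <stdint.h>

/* Lane-wise 8-bit multiply of two packed words: four independent
   multiplies per 32-bit word. Each lane result wraps modulo 256,
   as an OpenCL char4 * char4 would. */
uint32_t mul4x8(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t la = (uint8_t)(a >> (8 * lane));
        uint8_t lb = (uint8_t)(b >> (8 * lane));
        r |= (uint32_t)(uint8_t)(la * lb) << (8 * lane);
    }
    return r;
}
```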
--- Quote Start --- Why would there be wasted memory if you're packing 4 char values into an integer? Or do you mean wasted memory in terms of logic elements used to convert (mask) from the 32 bits down to the chars and back. I can't exactly explain why the DSP increased for the dot product --- Quote End ---

I was just referring to storing chars as integers in memory, i.e. using only 8 bits out of 32. Integer addition does not use DSP blocks, but multiplication does; hence the difference in DSP usage.