I'm wondering if anyone else has successfully used fixed point representations smaller than 32 bits for data in their kernels. I'm experimenting with converting my algorithm from floating point to fixed point integer. As expected, this lowered resource usage, since integer operations are simpler for the FPGA. The problem appears when I try to move below 32-bit integers: I only need about 8 bits of resolution for my data, so I was hoping to save logic blocks by changing my kernel data from 'int' to 'short' and finally 'char', but I'm finding the exact opposite. Every move to a smaller data type uses more resources. To make sure there was no overhead from my algorithm itself, I isolated the issue with a few tests and got some odd results. I created a simple kernel to do matrix addition and/or a matrix dot product:
#define N_VALUES 100
#define TYPE float
typedef TYPE DATATYPE;

__kernel
__attribute__((reqd_work_group_size(N_VALUES, 1, 1)))
__attribute__((num_compute_units(50)))
__attribute__((num_simd_work_items(1)))
void DotProductKernel(__global DATATYPE * restrict a,
                      __global DATATYPE * restrict b,
                      __global DATATYPE * restrict result)
{
    int id = get_global_id(0);
    // Now let's just do the dot product
    result[id] = a[id] * b[id];
}

I then ramped up the number of compute units so that the hardware usage would be significant enough to see, and changed '#define TYPE float' to 'int', 'short', and 'char'. Here are the results reported during compilation:

Data Type                          Logic Elements  Flip Flops  RAMs  DSPs  Logic Util.  Registers  Memory Blocks  DSP Util.
Floating Point Matrix Addition     168561          221098      2200  0     88%          39%        101%           0%
Floating Point Matrix Dot Product  155611          206598      2100  50    84%          37%        97%            20%
Integer Matrix Addition            141311          181498      2000  0     79%          34%        93%            0%
Integer Matrix Dot Product         140661          192398      2000  100   80%          35%        93%            39%
Short Matrix Addition              176911          208448      2000  0     89%          37%        93%            0%
Short Matrix Dot Product           176511          209248      2000  50    89%          37%        93%            20%
Char Matrix Addition               252311          265398      2000  0     109%         43%        93%            0%
Char Matrix Dot Product            252111          265798      2000  50    109%         43%        93%            20%

Is there some overhead being introduced by the compiler, which uses default 32-bit integers and then masks away the excess bits? Or can anyone explain why resource usage is so much higher for the smaller bit widths?
Is that resource usage from after place and route, or is it just the compiler's estimate?

The kernel has only one arithmetic operation, whose resource usage is much smaller than the kernel overhead. Try more arithmetic operations in the kernel. BTW, more CUs won't help.
Thanks for your reply, wzk6_3_8.

--- Quote Start --- Is the resource usage gotten after the "place and route"? or just the estimation? --- Quote End ---

This is just the estimate. I'm in the process of running the full compilations to compare, but in my experience kernels with estimated usage over 100% typically fail hardware compilation.

--- Quote Start --- the kernel only has one arithmetic operation, whose resource usage is much smaller than that of the kernel overhead. Try more arithmetic operations in the kernel. BTW, more CUs cannot help. --- Quote End ---

But if everything else remains constant (only the data type changes), why would the kernel overhead change?
--- Quote Start --- Thanks for your reply wzk6_3_8. This is just the estimation. I'm in the process of running the full compilations to compare, however in my experience kernels with estimated usages of over 100% typically fail hardware compilation. But if everything else remains constant (only the type of the data changes), why would the overhead of the kernel change? --- Quote End ---

The width of the arithmetic operations affects DSP usage, as seen in your experiments. The difference in ALM usage (i.e. logic) is not in the kernel datapath; it is in the load/store units that access memory. The alignment of the loads/stores affects ALM usage. With char* pointers, each load/store access is only 1-byte aligned, which does not allow much optimization. With short* pointers, each address is 2-byte aligned (i.e. the least significant address bit is zero), which lets Quartus perform some optimizations. The difference for each load/store unit is a few hundred ALMs (depending on the alignment). With 3 load/store units × 50 copies, this overhead becomes significant, considering there is nothing else in the kernel.
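The alignment argument above can be illustrated with a tiny host-side C sketch (this is just the address arithmetic, not the actual Quartus load/store hardware): for an array of 2-byte elements, the low bit of every element's byte address is constant, so the hardware never needs to decode it, while 1-byte elements can produce any address bit pattern.

```c
#include <stdint.h>

/* Byte address of element i in an array of a given element size,
   starting from a base address. For elem_size 2 (short), bit 0 of
   the result is always equal to bit 0 of the base; for elem_size 1
   (char), every address bit can toggle. */
uint32_t byte_addr(uint32_t base, uint32_t elem_size, uint32_t i)
{
    return base + elem_size * i;
}
```

With a 4-byte-aligned base, short accesses always leave address bit 0 at zero and int accesses leave bits 1:0 at zero, which is the property the load/store units can exploit.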
--- Quote Start --- The width of the arithmetic operations impacts the DSP usage as seen in your experiments. The difference in ALM usage (i.e. logic) is not in the kernel datapath, it is in the load/store units that access the memory. The alignment of loads/stores impacts the ALM usage. With char* pointers, each load/store access is only 1-byte aligned and this does not allow much optimization. With short* pointers, each address is 2-byte aligned (i.e. the least significant address bit is zero) and this allows Quartus to perform some optimizations. The difference for each load/store unit is a few hundred ALMs (depends on the alignment). With 3 load/store * 50 copies, this overhead becomes big, considering there is nothing else in the kernel. --- Quote End --- That would also explain why the same algorithm with the boolean operator exploded in size. Would it be better to optimize fixed point kernels by loading/storing them as 32-bit integers (as 4 chars packed together) and then separating them only for the internal arithmetic of the kernel to keep the alignment at 4 bytes? Or would a char4 vector data type accomplish the same task?
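The packing idea in the question can be sketched in plain C (hypothetical helper names; in an actual kernel this would be done with shifts on an int argument, or avoided entirely by using the built-in char4 vector type):

```c
#include <stdint.h>

/* Pack four 8-bit values into one 32-bit word (lane 0 in the
   low byte), so memory traffic stays 4-byte aligned. */
uint32_t pack4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint32_t)a | ((uint32_t)b << 8) |
           ((uint32_t)c << 16) | ((uint32_t)d << 24);
}

/* Extract lane n (0..3) from a packed word for the internal
   8-bit arithmetic. */
uint8_t unpack_lane(uint32_t w, int n)
{
    return (uint8_t)(w >> (8 * n));
}
```

The loads and stores then move 32-bit words, and only the datapath between them operates on 8-bit lanes.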
In the optimization guide there is a section on fixed point operations (page 14) that suggests statically masking your 32-bit integers to the desired precision. If I understood it right, the AOC can then disregard the extra bits during hardware generation, reducing the amount of logic (minimal in this case, but still a reduction). You may want to look into that and see if it helps you. I was curious, so I ran the example listed in the guide for 17-bit precision. I saw an increase in logic for the fixed point version over the straight 32-bit version, and I don't think that should be the case. I also wouldn't expect the load/store units that access memory to be the issue here, as Outku suggested they were for the original poster. Any insight would be appreciated.

Thanks, Rudy
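The masking technique described above looks roughly like this in plain C (a sketch of the principle, not the guide's exact listing): masking both operands before the multiply tells the compiler the upper bits are always zero, so it can generate a narrower multiplier.

```c
#include <stdint.h>

#define MASK_17 0x1FFFFu  /* keep only the low 17 bits */

/* Multiply two values whose useful precision is 17 bits. The static
   masks let a hardware compiler prove the upper 15 bits of each
   operand are zero and trim the multiplier accordingly. */
uint32_t mul17(uint32_t a, uint32_t b)
{
    return (a & MASK_17) * (b & MASK_17);
}
```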
--- Quote Start --- That would also explain why the same algorithm with the boolean operator exploded in size. Would it be better to optimize fixed point kernels by loading/storing them as 32-bit integers (as 4 chars packed together) and then separating them only for the internal arithmetic of the kernel to keep the alignment at 4 bytes? Or would a char4 vector data type accomplish the same task? --- Quote End --- The boolean operator case is different. Because of the logical dependence, the second load operation has a control dependence on the first one. This uses a different (and more expensive) type of load/store unit. Yes, loading/storing larger types (int, or char4) would solve the alignment problem at the expense of wasted memory.
--- Quote Start --- In the optimization guide there is a section on fixed point operation, page 14, which suggests statically masking your 32-bit integers to the desired precision. If I understood this right, the AOC would be able to disregard the extra bits during hardware generation thus will reduce the amount of logic (minimal in this case) but still a reduction. You may want to look into that and see if it helps you out any. I was curious so I ran the example listed in the guide for 17-bit precision. I had an increase in logic for the fixed point version over the straight 32-bit version and I don't think this should be the case. I wouldn't think the load/store units that access the memory wouldn't be an issue as Outku suggested in the case of the original poster. Any insight would be appreciated. Thanks, Rudy --- Quote End ---

I think the example from the optimization guide is too small to show any benefit from masking. There is really nothing in the kernel other than two loads, one multiply, and one store. If the datapath were bigger, I believe we would start seeing the same impact.
--- Quote Start --- I think the example from the optimization guide is too small to show any benefits of masking. There is really nothing in the kernel other than two loads, one multiply and one store. If the datapath was bigger, I believe we will start seeing same impact. --- Quote End ---

Actually, I see a slight increase in logic (due to the masking operations) and a reduction in DSP usage due to the narrower multiply operation.
--- Quote Start --- The boolean operator case is different. Because of the logical dependence, the second load operation has a control dependence on the first one. This uses a different (and more expensive) type of load/store unit. Yes, loading/storing larger types (int, or char4) would solve the alignment problem at the expense of wasted memory. --- Quote End ---

Why would there be wasted memory if you're packing 4 char values into an integer? Or do you mean wasted memory in terms of the logic elements used to convert (mask) from 32 bits down to chars and back?

I was curious, so I expanded my experiment to the vector data types (char4) as well as the solution of packing 4 chars into a 32-bit integer. The vector solution is attached as matrixmult_char4.txt (I couldn't upload a .cl file for some reason). The packed 'int' solution is attached as matrixmult_int.txt. Compiling these for the above tests (dot product and simple addition), I get the following:

Data Type                        Logic Elements  Flip Flops  RAMs  DSPs  Logic Util.  Registers  Memory Blocks  DSP Util.
Char Matrix Addition Compact     143611          181098      2000  0     79%          34%        93%            0%
Char Matrix Dot Product Compact  164211          271998      2000  200   94%          44%        93%            78%
Char4 Vector Matrix Addition     141311          181498      2000  0     79%          34%        93%            0%
Char4 Vector Matrix Dot Product  141461          192698      2000  200   80%          36%        93%            78%

I can't exactly explain why DSP usage increased for the dot product, other than that each kernel now performs four multiplications per packed word instead of one in the int version. The results do, however, favor Outku's explanation of the load/store alignments.

--- Quote Start --- In the optimization guide there is a section on fixed point operation, page 14, which suggests statically masking your 32-bit integers to the desired precision. If I understood this right, the AOC would be able to disregard the extra bits during hardware generation thus will reduce the amount of logic (minimal in this case) but still a reduction. You may want to look into that and see if it helps you out any. I was curious so I ran the example listed in the guide for 17-bit precision. I had an increase in logic for the fixed point version over the straight 32-bit version and I don't think this should be the case. I wouldn't think the load/store units that access the memory wouldn't be an issue as Outku suggested in the case of the original poster. Any insight would be appreciated. Thanks, Rudy --- Quote End ---

Thanks for the suggestion. I did attempt to use static masks on larger data types to get the 8-bit (char) and 16-bit (short) examples, but I ended up with the same result: it fixed the alignment issue, but there was a lot of wasted space loading/storing all 32 bits while using only 8. Using the vector data types, though, seems to have solved the issue. Thanks everyone for your help!
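Where the extra multiplies come from can be sketched in plain C (a hypothetical lane-wise helper, not the actual char4 hardware): each packed 32-bit word carries four 8-bit lanes, so an elementwise multiply over packed words performs four multiplications where the int version performed one, which is consistent with the higher DSP counts above.

```c
#include <stdint.h>

/* Lane-wise 8-bit multiply of two packed words: four independent
   multiplies per 32-bit word. Each lane result wraps modulo 256,
   as an OpenCL char4 * char4 would. */
uint32_t mul4x8(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint8_t la = (uint8_t)(a >> (8 * lane));
        uint8_t lb = (uint8_t)(b >> (8 * lane));
        r |= (uint32_t)(uint8_t)(la * lb) << (8 * lane);
    }
    return r;
}
```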
--- Quote Start --- Why would there be wasted memory if you're packing 4 char values into an integer? Or do you mean wasted memory in terms of logic elements used to convert (mask) from the 32 bits down to the chars and back. I can't exactly explain why the DSP increased for the dot product --- Quote End ---

I was just referring to storing chars as integers in memory, i.e. using only 8 bits out of 32. Integer addition does not use DSP blocks, but multiplication does; hence the difference in DSP usage.