Hello everyone,
I am just wondering if anyone knows the difference between compiling kernels on Linux versus Windows (in terms of run-time performance and total compilation time). I know I could use an .aocx file compiled on Windows to run on Linux, but there sometimes seems to be a difference in performance. I am trying to figure out which OS is better for compilation. Also, I am wondering whether kernels that use task parallelism (one work-item per work-group) usually take longer and use more memory to compile than kernels that use data parallelism (many work-items per work-group). In addition, what is the latency of accessing local memory in terms of clock cycles? Or is the latency heavily dependent on the size of the local memory and how it is accessed? Thanks! Ryan
Hi,
I do not have any data on Windows vs. Linux Quartus compilation times; this may already have been discussed in other Altera forums. Quartus may behave slightly differently (place/route) between platforms, which explains why you sometimes see different .aocx performance. I don't think there is a clear winner; the difference is more like running the same algorithm with different seeds. You can refer to other forums for possibly more detailed answers. As for the aoc compilation itself, my observation is that Linux could be faster simply because gcc does a better job. Otherwise, the aoc compiler should behave exactly the same way on Linux and on Windows, i.e. you will obtain the same Verilog output for the same input file.

The compilation flow is mostly the same whether the kernel is a task or an NDRange kernel. The most significant difference stems from loop-carried memory dependences. If your task kernel has a significant number of memory operations with memory dependences carried by loops, computing those dependences and generating the appropriate hardware may take some time. Eliminating these dependences will reduce the compile time and may also increase performance.

The latency of local memory accesses mainly depends on the complexity of the local memory system that is generated. If a load/store instruction is the only instruction connected to a local memory port, its latency will be minimal. However, if multiple local memory accesses share the same port, the latency increases to account for this "competition" (i.e. the interconnect). You can refer to the "Optimizing Local Memory Accesses" section in the "Best Practices Guide".
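To illustrate the kind of loop-carried memory dependence mentioned above, here is a minimal sketch (hypothetical single-work-item kernels, not from this thread). In the first version every iteration loads the value the previous iteration stored, so the compiler has to analyze and serialize that dependence; the second keeps the running value in a private register, leaving only independent stores.

// Loop-carried memory dependence: the load of data[i-1] depends on the
// store performed by the previous iteration.
__kernel void prefix_sum_dependent(__global float *restrict data, int n)
{
    for (int i = 1; i < n; i++) {
        data[i] = data[i] + data[i - 1];
    }
}

// Same computation, but the running value is carried in a private register,
// so there is no load-after-store dependence through memory.
__kernel void prefix_sum_register(__global float *restrict data, int n)
{
    float running = data[0];
    for (int i = 1; i < n; i++) {
        running += data[i];
        data[i] = running;     // stores only; no dependent load
    }
}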
Thank you so much for the help! I just have one quick question: the AOCL Programming Guide says that the number of accesses to local memory should be less than or equal to 4 for best efficiency. Does this mean I shouldn't access the local memory 4 times concurrently, or does it mean that in the entire kernel I should have 4 or fewer different local memory accesses in total?
Regards, Ryan
It is the latter; the programming guide recommends that "the entire kernel should have 4 or fewer different accesses to local memory."
Basically, each load/store instruction in the kernel becomes a client (i.e. a master) for the local memory. Because local memory has at most four ports, if you have four load/store instructions, each port will be connected to a single load/store, so the load/store instructions will not compete with each other. This guarantees the most efficient hardware. If you have three or fewer store instructions and many loads (loads + stores > 4), the compiler may choose to replicate the local memory. This also gives fast accesses to the local memory at the expense of RAMs. Further, the compiler performs some optimizations to partition the local memory based on the access patterns. Hence, even if there are more than four store instructions, you may still get efficient hardware. However, this depends on the complexity of your access patterns and may not always be possible.
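For a concrete picture, here is a minimal sketch (a hypothetical kernel, not from the discussion) whose local buffer has exactly four distinct accesses, one store and three loads, so each access can be given its own local memory port:

__attribute__((reqd_work_group_size(256, 1, 1)))   // assumes a fixed work-group size of 256
__kernel void stencil3(__global const float *restrict in,
                       __global float *restrict out)
{
    __local float tile[256];
    int lid = get_local_id(0);
    int gid = get_global_id(0);

    tile[lid] = in[gid];                        // access 1: store
    barrier(CLK_LOCAL_MEM_FENCE);

    float left   = tile[(lid + 255) & 255];     // access 2: load
    float center = tile[lid];                   // access 3: load
    float right  = tile[(lid + 1) & 255];       // access 4: load
    out[gid] = (left + center + right) / 3.0f;
}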
Thank you for the clarification! I am just wondering: if I have a few kernels in one .cl/.aocx file, where each kernel uses about the same amount of local memory, will the AOCL compiler be able to generate a single memory block and share it between the kernels, or will it just instantiate a different local memory for each kernel? The kernels I have will not run concurrently; the previous kernel must finish before the next can start, and I would like to allocate as much local memory as possible. I am asking because the AOCL release notes mention that "num_share_resources", "max_share_resources" and "max_unroll_loops" are deprecated, and no workaround was given.
Also, is there a way to tell the AOCL compiler explicitly that I won't run the different kernels contained in the same file concurrently so that it could optimize the hardware better, or does that not make any difference anyway? Regards, Ryan
Also, does it matter if I use the local memory in 2D (local_mem[i][j]) or flatten it to 1D (local_mem[i*block_size + j])?
There is currently no way to specify that kernels in a source file will be launched/run non-concurrently. The compiler always assumes that they may run simultaneously, so there is no resource sharing (either local memory or other resources) across distinct kernels. If you want to share resources, you can currently do this manually: define a single kernel with a local memory system, and then call sub-functions from that kernel that implement the functionality of the independent kernels. You can pass a local memory pointer to those "kernel" sub-functions, as sketched below.
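A minimal sketch of that workaround (hypothetical function and buffer names, assuming the working set fits in one local buffer): a single wrapper kernel owns the local memory and passes it to the sub-functions that used to be separate kernels.

// Former kernel #1, now a sub-function that receives the shared local buffer.
void stage_a(__local float *buf, __global const float *restrict in, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] = in[i] * 2.0f;
}

// Former kernel #2.
void stage_b(__local float *buf, __global float *restrict out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = buf[i] + 1.0f;
}

__kernel void fused(__global const float *restrict in,
                    __global float *restrict out,
                    int n)                        // assumes n <= 2048
{
    __local float shared_buf[2048];               // one local memory, shared by both stages
    stage_a(shared_buf, in, n);
    stage_b(shared_buf, out, n);
}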
Both 2D and 1D array accesses should produce similar results, although in some special (complex) cases the compiler might have an easier time analyzing/optimizing the 2D notation. Since 2D is probably your preferred implementation, I don't think there is any reason to flatten the accesses.
Thank you! I am just wondering: if I call a "sub" function from my "main" kernel multiple times (i.e. the function call is inside a "for" loop), would that cause higher overhead, or cause the compiler to use more hardware? Will this style of programming (using one kernel to call sub-functions) in general cause higher hardware utilization or degrade performance compared to having multiple smaller kernels?
Thank you! BTW: Is there a difference in latency/throughput between accessing a 2D local memory row by row or accessing it column by column?
--- Quote Start ---
Thank you! I am just wondering: if I call a "sub" function from my "main" kernel multiple times (i.e. the function call is inside a "for" loop), would that cause higher overhead, or cause the compiler to use more hardware?
--- Quote End ---

The compiler inlines functions into hardware instead of "calling" them, so the answer depends on whether the loop gets automatically unrolled. Auto-unrolling can occur if the loop body is relatively small and the loop has a small, fixed trip count. If the loop is unrolled, the function you're calling will be replicated multiple times, which uses additional hardware. Similarly, if you call the function multiple times from within a kernel, you will incur the overhead of the function being instantiated multiple times in the hardware. The compiler does this because replicating the hardware provides higher throughput when you have many work-items: different sets of work-items can use different instances of the function in parallel.

--- Quote Start ---
Will this style of programming (using one kernel to call sub-functions) in general cause higher hardware utilization or degrade performance compared to having multiple smaller kernels?
--- Quote End ---

Yes, it will most likely degrade the performance of all of the kernels. Local memory systems are optimized for the kernel they connect to. When you fuse the functionality of multiple kernels, each of which likely has a different type of memory access pattern, you end up with a complex local memory system that isn't optimized for any one of the kernels. I would expect more stalls/access conflicts on the memory, and a higher hardware cost. The single fused kernel would also be much more complex, which may prevent some compiler optimizations (especially memory optimizations) that could be performed on each of the smaller, simpler kernels.

Your previous question asked how to effectively share a local memory across kernels. That's the definition of a global memory, so if at all possible, I would suggest re-working your multiple kernels to either: (1) share global memory instead of local; or (2) find a reasonable balance between local memory sizes such that each kernel can have its own local memory. If you care about throughput/performance, and not just fitting multiple kernels that each require massive local memory onto the chip (at the cost of performance), then fusing kernels is probably not the best way to go.
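To make the inlining/replication point concrete, here is a minimal sketch (hypothetical helper and kernel names, not from the thread): a small helper called inside a fully unrolled loop is inlined once per unrolled iteration, i.e. four hardware copies of its logic.

float scale_add(float a, float acc)
{
    return a * 2.0f + acc;
}

__kernel void demo(__global const float *restrict in,
                   __global float *restrict out)
{
    int gid = get_global_id(0);
    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < 4; k++) {                // small fixed trip count: fully unrolled
        acc = scale_add(in[gid * 4 + k], acc);   // four inlined copies of scale_add
    }
    out[gid] = acc;
}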
--- Quote Start ---
Thank you! BTW: Is there a difference in latency/throughput between accessing a 2D local memory row by row or accessing it column by column?
--- Quote End ---

The compiler should do a good job of building and optimizing the local memory system regardless of how you access it; local memory doesn't incur any penalty from random accesses. If you're not accessing every single row or column (say you access every 4th row/column), then you might see an advantage by accessing column-wise because of the memory banking structure, but that's a bit of a corner case. You can run both in the profiler to see how they perform for your application.
What Mike is saying about the banking and column-wise relationship is true. However, if you are just iterating over the entire array by accessing consecutive elements, my money would be on the row-by-row access pattern. If you iterate over all of a row's elements and unroll this loop, the compiler will merge the consecutive accesses into very efficient wider accesses.
--- Quote Start ---
What Mike is saying about the banking and column-wise relationship is true. However, if you are just iterating over the entire array by accessing consecutive elements, my money would be on the row-by-row access pattern. If you iterate over all of a row's elements and unroll this loop, the compiler will merge the consecutive accesses into very efficient wider accesses.
--- Quote End ---

OK, I should point out that for this to work, unrolling the column loop is key; this creates consecutive accesses that the compiler can merge. E.g.:

for (row = 0; row < N; row++) {
    #pragma unroll
    for (col = 0; col < 4; col++) {
        A[row][col] = row + col;
    }
}

This essentially creates:

for (row = 0; row < N; row++) {
    A[row][0] = row;
    A[row][1] = row + 1;
    A[row][2] = row + 2;
    A[row][3] = row + 3;
}

which gets translated to something like this:

for (row = 0; row < N; row++) {
    A[row][0..3] = (int4)(row, row + 1, row + 2, row + 3);   // conceptually one very efficient wide access
}

I think this is mentioned in the Best Practices Guide.
--- Quote Start ---
Your previous question asked how to effectively share a local memory across kernels. That's the definition of a global memory, so if at all possible, I would suggest re-working your multiple kernels to either: (1) share global memory instead of local; or (2) find a reasonable balance between local memory sizes such that each kernel can have its own local memory. If you care about throughput/performance, and not just fitting multiple kernels that each require massive local memory onto the chip (at the cost of performance), then fusing kernels is probably not the best way to go.
--- Quote End ---

To the point of "data sharing between kernels", I can also add (3): use channel instructions (the most efficient way of communicating data between kernels) if the kernels have a producer-consumer type relationship, i.e. data produced by one kernel can be sent to another kernel via channels. However, if multiple kernels need to access/update the same address simultaneously, this is not the right model. You can find more information on how to use channel instructions in the "Programming Guide".
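For reference, a minimal producer-consumer sketch with channels might look like the following (hypothetical kernel and channel names; the exact pragma and built-in names depend on the SDK version, and the Altera-era cl_altera_channels extension is assumed here):

#pragma OPENCL EXTENSION cl_altera_channels : enable

channel float c0;                                  // on-chip FIFO between the two kernels

__kernel void producer(__global const float *restrict in, int n)
{
    for (int i = 0; i < n; i++)
        write_channel_altera(c0, in[i] * 2.0f);    // push data into the channel
}

__kernel void consumer(__global float *restrict out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = read_channel_altera(c0) + 1.0f;   // pop data from the channel
}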
Thank you so much for the help! I am just wondering: what if the number of iterations of the for loop cannot be determined at compile time (either the operation is conditional, or the size N changes between kernel invocations)? Am I forced to access local memory sequentially, or is there any other optimization that can be done?
I have another question regarding numerical accuracy. I noticed that when operating on the same set of data repeatedly (i.e. running an iterative algorithm), the difference between the CPU's and the FPGA's results gets quite big when the problem size and the number of iterations are large. When not using the -fp-relaxed or -fpc flags during compilation, the difference gets smaller but is still significant. Is this because Altera implements floating-point arithmetic differently than Intel? I thought they both conform to the IEEE 754 standard?
--- Quote Start ---
Thank you so much for the help! I am just wondering: what if the number of iterations of the for loop cannot be determined at compile time (either the operation is conditional, or the size N changes between kernel invocations)? Am I forced to access local memory sequentially, or is there any other optimization that can be done?
--- Quote End ---

You may try manually unrolling the inner loop as below. The compiler will again merge the stores in the inner loop into a wide access; the if-statements determine which bytes within the wide access are active. Although this gives efficient memory accesses, the kernel may not be as efficient as the earlier example where the loop bounds are known, because of the doubly-nested loop.
for (row = 0; row < N; row++) {
    for (col = 0; col < M; col++) {                              // M = (N + 3) / 4, i.e. ceil(N / 4)
        if (4*col + 0 < N) A[row][4*col + 0] = row + 4*col + 0;
        if (4*col + 1 < N) A[row][4*col + 1] = row + 4*col + 1;
        if (4*col + 2 < N) A[row][4*col + 2] = row + 4*col + 2;
        if (4*col + 3 < N) A[row][4*col + 3] = row + 4*col + 3;
    }
}
--- Quote Start ---
I have another question regarding numerical accuracy. I noticed that when operating on the same set of data repeatedly (i.e. running an iterative algorithm), the difference between the CPU's and the FPGA's results gets quite big when the problem size and the number of iterations are large. When not using the -fp-relaxed or -fpc flags during compilation, the difference gets smaller but is still significant. Is this because Altera implements floating-point arithmetic differently than Intel? I thought they both conform to the IEEE 754 standard?
--- Quote End ---

Read https://software.intel.com/sites/default/files/article/164389/fp-consistency-102511.pdf. Intel is not even consistent with itself. We are very consistent: we adhere to IEEE 754 with round-to-nearest.
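As a small illustration (a hypothetical example, not from the thread) of why evaluation order alone, such as the re-association allowed by -fp-relaxed or the CPU's use of extended precision/FMA, changes results even when every individual operation follows IEEE 754:

// Floating-point addition is not associative, so a different evaluation
// order gives a different correctly-rounded result.
float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
float left_to_right = (a + b) + c;   // 0.0f + 1.0f        = 1.0f
float reassociated  = a + (b + c);   // 1.0e8f + (-1.0e8f) = 0.0f  (c is absorbed into b)

Over many iterations such rounding differences accumulate, which is consistent with the gap you see growing with problem size and iteration count.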
Thank you! I'll try that. What if I have 2D work-groups instead of for loops, where each work-item copies one item from global memory to local memory? Would the compiler automatically merge the memory accesses? Is using "num_simd_work_items" the only way to optimize the kernel?
That's good to know! I tried using higher precision on the CPU and then rounding off to compare with the FPGA, and it turns out that the FPGA result is closer to the higher-precision CPU result.