Latency of the loop

Altera_Forum · ‎06-19-2018

Would the latency of the loop below differ between switch_loop = 0 and switch_loop = 1? The HTML report generated for this kernel includes the latency of the global and local memory accesses.

E.g., if the global memory access takes 5 cycles, then the local memory access starts at cycle 6, and hence the addition starts at cycle 9 (3 cycles for the local memory access). Doesn't the latency of the loop depend on the if condition, which is based on a kernel argument set by the host?

__kernel
__attribute__((task))
void dummy_kernel(__global float *restrict bottom,
                  __local float *restrict top,
                  __global float *restrict final,
                  uchar switch_loop)
{
    // Renamed from "private", which is a reserved address space qualifier in OpenCL C.
    float privatee;

    for (unsigned i = 0; i < 20; i++) {
        if (switch_loop == 0)
            privatee = bottom[i];   // global memory access
        else
            privatee = top[i];      // local memory access

        privatee = privatee + 1;
        final[i] = privatee;
    }
}

Altera_Forum · ‎06-20-2018

To make pipelining possible, conditional statements are converted into two parallel paths, both of which will be executed, and the correct output will be chosen by a multiplexer in the end. So, no, the loop latency does not depend on which condition is taken. However, the latency mentioned in the report is the MINIMUM latency WITHOUT considering stallable accesses (global memory, local memory, channels). The actual latency at run-time could be much higher due to pipeline stalls caused by such accesses.
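As a rough illustration of the conversion described above, here is a minimal sketch in plain C so that it can be run on a CPU. The function names step_branch and step_mux are hypothetical, and ordinary arrays stand in for global and local memory. The real transformation is performed by the compiler on the hardware datapath, so this only models the behavior, not the generated circuit.

```c
#include <stddef.h>

/* Branching form, as written in the kernel: one read per iteration. */
float step_branch(const float *bottom, const float *top,
                  size_t i, unsigned char switch_loop)
{
    float v;
    if (switch_loop == 0)
        v = bottom[i];
    else
        v = top[i];
    return v + 1.0f;
}

/* If-converted form: both reads are issued every iteration and a
 * 2-to-1 multiplexer (the ?: below) keeps only the selected result. */
float step_mux(const float *bottom, const float *top,
               size_t i, unsigned char switch_loop)
{
    float from_global = bottom[i];   /* path taken when switch_loop == 0 */
    float from_local  = top[i];      /* path taken otherwise */
    float v = (switch_loop == 0) ? from_global : from_local;
    return v + 1.0f;
}
```

Both functions return the same value for every input. The difference is that the second form has no control dependence, which is what allows a fixed pipeline schedule.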

Altera_Forum · ‎06-20-2018

Replicating the logic results in the kernel below. Would the latency of the kernel depend on the conditional statement in this case?

__kernel
__attribute__((task))
void dummy_kernel(__global float *restrict bottom,
                  __local float *restrict top,
                  __global float *restrict final,
                  uchar switch_loop)
{
    // Renamed from "private", which is a reserved address space qualifier in OpenCL C.
    float privatee;

    if (switch_loop == 0) {
        for (unsigned i = 0; i < 20; i++) {
            privatee = bottom[i];   // global memory access
            final[i] = privatee;
        }
    } else {
        for (unsigned i = 0; i < 20; i++) {
            privatee = top[i];      // local memory access
            final[i] = privatee;
        }
    }
}

Altera_Forum · ‎06-20-2018

Do you mean that the multiplexer would wait for the FALSE path even though the TRUE path has already completed? Or does the multiplexer have a fixed schedule to decide which condition is TRUE? I have attached the system viewer screenshots of kernel 1 (first post) and kernel 2. It looks like two parallel paths are not created as you described. The two conditions appear to be in series.

Altera_Forum · ‎06-20-2018

--- Quote Start ---

Do you mean that the multiplexer would wait for the FALSE path even though the TRUE path has already completed? Or does the multiplexer have a fixed schedule to decide which condition is TRUE?

--- Quote End ---

No, there is no "scheduling" (in single work-item kernels) or "waiting". The compiler will insert extra registers to make sure the latency of both paths is the same. This is the only way the operation can be pipelined correctly and is one of the most classical FPGA-based optimizations. Refer to slide 68 from Altera's own presentation:

https://cpufpga.files.wordpress.com/2016/04/opencl_for_fpgas_isca_2016.pdf

This should also be somewhere in their OpenCL documentation but I didn't find it.
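To make the register-balancing idea concrete, here is a minimal sketch in C. It is only a model under stated assumptions: the struct balance_t and the function balance_paths are hypothetical names, and the compiler's actual scheduling algorithm is not something this thread documents. The sketch simply pads the shorter path with registers until both paths reach the multiplexer in the same cycle.

```c
/* Result of balancing two branch paths that feed one multiplexer. */
typedef struct {
    unsigned balanced_latency; /* cycles until both results reach the mux */
    unsigned pad_true;         /* registers added to the TRUE path  */
    unsigned pad_false;        /* registers added to the FALSE path */
} balance_t;

/* Pad the shorter path so that both paths have the same latency.
 * Latencies are minimum (stall-free) cycle counts for each path. */
balance_t balance_paths(unsigned lat_true, unsigned lat_false)
{
    balance_t b;
    b.balanced_latency = (lat_true > lat_false) ? lat_true : lat_false;
    b.pad_true  = b.balanced_latency - lat_true;
    b.pad_false = b.balanced_latency - lat_false;
    return b;
}
```

With the numbers from the first post (5 cycles for the global access on the TRUE path, 3 cycles for the local access on the FALSE path), balance_paths(5, 3) adds 2 registers to the local path, and both results arrive at the multiplexer after 5 cycles regardless of switch_loop.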

--- Quote Start ---

I have attached the system viewer screenshots of kernel 1 (first post) and kernel 2. It looks like two parallel paths are not created as you described. The two conditions appear to be in series.

--- Quote End ---

Since the forum scales down images, I can hardly see anything and I am not sure if the way the system is drawn in the report is accurate enough for this discussion anyway. Furthermore, you are never writing anything to your local buffer and the compiler even gives a warning about this. It could be generating the circuit in a specific way because of this.

Altera_Forum · ‎06-20-2018

--- Quote Start ---

Originally posted by HRZ:

This should also be somewhere in their OpenCL documentation but I didn't find it.

--- Quote End ---

Not much is mentioned about conditional execution, but there is some information in the Best Practices Guide, p. 162, Chapter 9, "Strategies for Optimizing Intel Stratix® 10 OpenCL Designs". But do you think kernel 2, with the logic replicated, would give a different latency?

__kernel
__attribute__((task))
void dummy_kernel(uchar switch_loop,
                  __global float *restrict bottom,
                  __local float *restrict top,
                  __global float *restrict final)
{
    float privatee;

    if (switch_loop == 0) {
        for (unsigned i = 0; i < 20; i++) {
            privatee = bottom[i];   // global memory access
            final[i] = privatee;
        }
    } else {
        for (unsigned i = 0; i < 20; i++) {
            privatee = top[i];      // local memory access
            final[i] = privatee;
        }
    }
}

Altera_Forum · ‎06-21-2018

Actually I am not sure what would happen when you have a for loop in each of the branches. In your case, since the loop trip counts are known and are the same, the compiler probably just needs to make sure the latency of one iteration of each of the loops is the same and then the total latency of execution of both loops will also be the same. However, for cases where the trip counts are not the same or not known at compile-time, this certainly cannot be done. I would assume in such cases the branch is handled in the same way as when you have stallable accesses in one of the sides of the branch. Likely the compiler makes sure the minimum latency of both paths is the same (for a loop trip count of one) and if the two paths did not finish at the same time, the faster side will be stalled until the slower side finishes.
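The reasoning about trip counts can be checked with the usual stall-free estimate for a pipelined loop: pipeline depth plus (trip count - 1) times the initiation interval. The sketch below is a minimal C model of that estimate only. The function name min_loop_cycles is hypothetical, the depths of 5 and 3 cycles are borrowed from the example in the first post, and an initiation interval of 1 is an assumption, not something taken from the report.

```c
/* Minimum (stall-free) cycle count of a pipelined loop:
 * the first iteration needs `depth` cycles to drain through the
 * pipeline, and each further iteration starts `ii` cycles later. */
unsigned long min_loop_cycles(unsigned depth, unsigned ii, unsigned long trips)
{
    if (trips == 0)
        return 0;   /* loop body never runs */
    return (unsigned long)depth + (trips - 1) * (unsigned long)ii;
}
```

With equal trip counts, balancing the per-iteration depth is enough: min_loop_cycles(5, 1, 20) is 24 cycles, and a 3-cycle loop padded to depth 5 also takes 24 cycles. With different trip counts, for example 20 versus 30 iterations at the same depth, the estimates are 24 and 34 cycles, so padding alone cannot equalize the two sides.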