Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

Latency of the loop

Altera_Forum
Honored Contributor II
1,168 Views

Would the latency of the loop below be different for switch_loop = 0 and switch_loop = 1? The HTML report generated for the kernel below includes the latency of both the global and the local memory access.

 

E.g., if the global memory access takes 5 cycles, then the local memory access starts at cycle 6, and hence the addition starts at cycle 9 (3 cycles for the local memory access). Doesn't the latency of the loop depend on the if condition, which is based on a kernel argument set by the host?

 

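__kernel
__attribute__((task))
void dummy_kernel(
    __global float *restrict bottom,
    __local float *restrict top,
    __global float *restrict final,
    uchar switch_loop)
{
    /* "private" is an address space qualifier in OpenCL C, so the variable is named privatee */
    float privatee;

    for (unsigned i = 0; i < 20; i++) {
        if (switch_loop == 0)
            privatee = bottom[i];   /* global memory read */
        else
            privatee = top[i];      /* local memory read */

        privatee = privatee + 1;
        final[i] = privatee;
    }
}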
6 Replies
Altera_Forum
Honored Contributor II
345 Views

To make pipelining possible, conditional statements are converted into two parallel paths, both of which will be executed, and the correct output will be chosen by a multiplexer in the end. So, no, the loop latency does not depend on which condition is taken. However, the latency mentioned in the report is the MINIMUM latency WITHOUT considering stallable accesses (global memory, local memory, channels). The actual latency at run-time could be much higher due to pipeline stalls caused by such accesses.
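A minimal sketch of what that conversion means, written as a plain C software model rather than as anything the compiler actually emits. The function name dummy_kernel_model and the fixed size N are assumptions made for illustration. Both candidate values are produced on every iteration and a select, standing in for the multiplexer, picks one of them:

```c
#include <assert.h>
#include <stddef.h>

#define N 20  /* loop trip count from the kernel in the question */

/* Software model of if-conversion (hypothetical helper, not compiler output):
 * both sides of the branch are evaluated for every iteration, and a select
 * plays the role of the multiplexer that picks the result. */
static void dummy_kernel_model(const float *bottom, const float *top,
                               float *final, unsigned char switch_loop)
{
    assert(bottom != NULL && top != NULL && final != NULL);

    for (unsigned i = 0; i < N; i++) {
        float from_global = bottom[i];  /* value used when switch_loop == 0 */
        float from_local  = top[i];     /* value used otherwise */

        /* The "multiplexer": same work is done whichever side is selected. */
        float selected = (switch_loop == 0) ? from_global : from_local;

        final[i] = selected + 1.0f;
    }
}
```

Calling dummy_kernel_model with switch_loop set to 0 or to 1 changes which value reaches final, but not the amount of work done per iteration, which is the point being made about the loop latency.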

Altera_Forum
Honored Contributor II
345 Views

Replicating the logic results in the kernel below. Would the latency of the kernel depend on the conditional statement in this case?

 

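__kernel
__attribute__((task))
void dummy_kernel(
    __global float *restrict bottom,
    __local float *restrict top,
    __global float *restrict final,
    uchar switch_loop)
{
    /* "private" is an address space qualifier in OpenCL C, so the variable is named privatee */
    float privatee;

    if (switch_loop == 0) {
        for (unsigned i = 0; i < 20; i++) {
            privatee = bottom[i];   /* global memory read */
            final[i] = privatee;
        }
    } else {
        for (unsigned i = 0; i < 20; i++) {
            privatee = top[i];      /* local memory read */
            final[i] = privatee;
        }
    }
}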
Altera_Forum
Honored Contributor II
345 Views

Do you mean that the multiplexer would wait for the FALSE condition even though execution of the TRUE condition has completed? Or does the multiplexer have a fixed schedule to decide which condition is TRUE? I have attached the system viewer screenshots of kernel 1 (first post) and kernel 2. It looks like two parallel paths are not created as you described. The two conditions are in series.

Altera_Forum
Honored Contributor II
345 Views

 

--- Quote Start ---  

Do you mean that the multiplexer would wait for the FALSE condition even though execution of the TRUE condition has completed? Or does the multiplexer have a fixed schedule to decide which condition is TRUE?

--- Quote End ---  

 

 

No, there is no "scheduling" (in single work-item kernels) or "waiting". The compiler will insert extra registers to make sure the latency of both paths is the same. This is the only way the operation can be pipelined correctly and is one of the most classical FPGA-based optimizations. Refer to slide 68 from Altera's own presentation: 

 

https://cpufpga.files.wordpress.com/2016/04/opencl_for_fpgas_isca_2016.pdf 

 

This should also be somewhere in their OpenCL documentation but I didn't find it. 
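As a rough sketch of the register balancing described above, the following plain C helper computes how many delay registers each side of a branch would need so that both sides take the same number of cycles. The type and function names are hypothetical, and the cycle counts in the usage note are the illustrative numbers from the question, not values from a real report:

```c
#include <assert.h>

/* Latency in cycles of the two sides of a branch (illustrative values only). */
typedef struct {
    unsigned true_path;
    unsigned false_path;
} branch_latency;

/* Delay registers to add to each side, and the resulting common latency. */
typedef struct {
    unsigned pad_true;
    unsigned pad_false;
    unsigned balanced;
} balance_result;

/* Pad the shorter path with registers until it matches the longer one. */
static balance_result balance_paths(branch_latency in)
{
    balance_result r;
    r.balanced  = (in.true_path > in.false_path) ? in.true_path : in.false_path;
    r.pad_true  = r.balanced - in.true_path;
    r.pad_false = r.balanced - in.false_path;

    /* After padding, both sides reach the multiplexer in the same cycle. */
    assert(in.true_path + r.pad_true == in.false_path + r.pad_false);
    return r;
}
```

With a 5-cycle global read on one side and a 3-cycle local read on the other, this model pads the local side with 2 registers, so both sides take 5 cycles regardless of which one the multiplexer selects.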

 

 

--- Quote Start ---  

I have attached the system viewer screenshots of kernel 1 (first post) and kernel 2. It looks like two parallel paths are not created as you described. The two conditions are in series.

--- Quote End ---  

 

 

Since the forum scales down images, I can hardly see anything and I am not sure if the way the system is drawn in the report is accurate enough for this discussion anyway. Furthermore, you are never writing anything to your local buffer and the compiler even gives a warning about this. It could be generating the circuit in a specific way because of this.
Altera_Forum
Honored Contributor II
345 Views

--- Quote Start (HRZ) ---

This should also be somewhere in their OpenCL documentation but I didn't find it. 

--- Quote End ---  

 

 

Not much is mentioned about conditional execution, but there is some information in the Best Practices Guide, p. 162, Chapter 9, "Strategies for Optimizing Intel Stratix® 10 OpenCL Designs". But do you think Kernel 2, with the logic replicated, would give a different latency?

 

__kernel
__attribute__((task))
void dummy_kernel(
    uchar switch_loop,
    __global float *restrict bottom,
    __local float *restrict top,
    __global float *restrict final)
{
    float privatee;

    if (switch_loop == 0) {
        for (unsigned i = 0; i < 20; i++) {
            privatee = bottom[i];   /* global memory read */
            final[i] = privatee;
        }
    } else {
        for (unsigned i = 0; i < 20; i++) {
            privatee = top[i];      /* local memory read */
            final[i] = privatee;
        }
    }
}
Altera_Forum
Honored Contributor II
345 Views

Actually I am not sure what would happen when you have a for loop in each of the branches. In your case, since the loop trip counts are known and are the same, the compiler probably just needs to make sure the latency of one iteration of each of the loops is the same and then the total latency of execution of both loops will also be the same. However, for cases where the trip counts are not the same or not known at compile-time, this certainly cannot be done. I would assume in such cases the branch is handled in the same way as when you have stallable accesses in one of the sides of the branch. Likely the compiler makes sure the minimum latency of both paths is the same (for a loop trip count of one) and if the two paths did not finish at the same time, the faster side will be stalled until the slower side finishes.
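To put numbers on that guess, here is a toy C model. It is only a sketch of the assumption above, not of documented compiler behaviour. It assumes an ideal pipelined loop with no stalls, where the first result appears after `depth` cycles and each further iteration completes `ii` cycles later. The function names and the example values in the usage note are made up for illustration:

```c
#include <assert.h>

/* Ideal cycle count of a pipelined loop with no stalls:
 * fill the pipeline once, then finish one iteration every `ii` cycles. */
static unsigned long loop_cycles(unsigned depth, unsigned ii, unsigned long trips)
{
    assert(ii >= 1);
    if (trips == 0)
        return 0;
    return (unsigned long)depth + (unsigned long)ii * (trips - 1);
}

/* Under the assumption in the post, the branch as a whole is done only
 * when the slower side is done. */
static unsigned long branch_done(unsigned long side_a, unsigned long side_b)
{
    return (side_a > side_b) ? side_a : side_b;
}

/* Cycles the faster side would spend stalled, waiting for the slower side. */
static unsigned long stall_cycles(unsigned long side_a, unsigned long side_b)
{
    return (side_a > side_b) ? side_a - side_b : side_b - side_a;
}
```

In this model, two loops with the same trip count of 20 and a balanced depth of 5 cycles both finish after 24 cycles, so neither side waits. If one side instead ran 50 iterations, it would finish after 54 cycles and the faster side would be stalled for 30 cycles.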
