Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)
16596 Discussions

Complex loop exit condition

Altera_Forum
Honored Contributor II
1,638 Views

Hello, 

I have a very simple code with a few lines, but the outer loop is not pipelined due to "Loop exit condition unresolvable at iteration initiation". Inner loop is pipelined well with II=1. Can anybody suggest any idea?  

Thanks 

 

__attribute__ ((task)) 

kernel void compute_BFS0( 

__global const unsigned* restrict ovid_of_edge,  

__global const unsigned* restrict start_edge,  

__global const unsigned* restrict end_edge, 

__global unsigned* restrict node_data 

{  

unsigned ei; 

unsigned si; 

unsigned ovid; 

 

for (unsigned i = 0; i < 1000; i++ ) // iterates over graph nodes 

si = start_edge; // sequential read 

ei = end_edge; // sequential read 

 

for(unsigned j = si; j < ei; j++) // iterates over node's outgoing edges 

ovid = ovid_of_edge[j]; // child node. sequential read. 

node_data[ovid] = 1000; // random-access write 

} //kernel
0 Kudos
5 Replies
Altera_Forum
Honored Contributor II
678 Views

Your inner loop has a variable exit condition; hence, the outer loop is not pipelineable. If ei has a maximum limit, you can replace ei in the header of the inner loop with that maximum limit, and add a branch inside of the inner loop to discard cases where j >= ei. That will allow pipelining of both loops, at cost of redundant computation which could be significant depending on how far ei is from that maximum. If there is no maximum or it is too large, then an NDRange kernel would be more appropriate since the thread scheduler can minimize pipeline stalls bubbles at run-time.

0 Kudos
Altera_Forum
Honored Contributor II
678 Views

If the problem is with variable loop limits of inner loop, I wonder why below code doesn't have this problem too, and both loops are pipelined (according to report).  

 

In first problematic code (mentioned earlier), in a graph, a constant value is written to children nodes of every parent node, but in below code, value of children nodes are read, and their summation is written to their parent node. 

In earlier code, there is a random-access write, however in second one, we have random-access read. Can this be the source of pipelining problem? 

 

__attribute__ ((task)) 

 

__kernel void compute_pagerank( 

__global const unsigned* restrict ovid_of_edge,  

__global const unsigned* restrict start_edge,  

__global const unsigned* restrict end_edge, 

__global unsigned* restrict node_data, 

__global unsigned* restrict node_data2 

{  

unsigned acc = 0; 

unsigned ei; 

unsigned si; 

unsigned ovid; 

 

for (unsigned i = 0;i < 1000; i++ ) // iterates over graph nodes 

acc = 0; 

si = start_edge; // sequential read 

ei = end_edge; // sequential read 

 

for(unsigned j = si; j < ei; j++) // iterates over node's outgoing edges 

ovid = ovid_of_edge[j]; // child node. sequential read. 

acc += node_data[ovid]; // random-access read 

node_data2[i] = acc; // sequential write 

} //kernel 

 

thanks
0 Kudos
Altera_Forum
Honored Contributor II
678 Views

I am genuinely surprised to the that the second kernel is actually pipelineable. Other than the difference in read and write that you mentioned, I don't see any other difference between the two kernels that could cause the difference. However, I would say the strange case here is the second kernel, not the first one. I checked with newer versions of the compiler. 16.1.2 and 17.0 fully pipeline the second kernel, but 17.1 and 18.0 say: 

 

 

--- Quote Start ---  

II >=1 

II is an approximation due to variable inner loop trip count. 

--- Quote End ---  

 

 

I guess the compiler is doing some corner case optimization that allows pipelining in the second kernel. Other than that, I have no idea. 

 

P.S. You can put your code segments in a CODE tag to preserve indentation.
0 Kudos
Altera_Forum
Honored Contributor II
678 Views

I tested both of your suggestions:  

1- inner loop with fixed limit. Now it is pipeliable, but increase in run-time, due to wasted clocks in inner loop, is large. Because my upper limit is huge. 

2- I tested the ND-range. That really worked well! I don't know exactly what is the difference between implementation of pipelinging inside a task, or pipelining among work-items inside a ND-range, however, this is a good example for me that they are not similar. 

Thank you very much for following this discussion. that was really instructive. :)
0 Kudos
Altera_Forum
Honored Contributor II
678 Views

 

--- Quote Start ---  

2- I tested the ND-range. That really worked well! I don't know exactly what is the difference between implementation of pipelinging inside a task, or pipelining among work-items inside a ND-range, however, this is a good example for me that they are not similar. 

--- Quote End ---  

 

 

The difference is that the run-time scheduler in the NDRange version will avoid all the redundant computation. NDRange is preferred for unpipelineable loops because of this (while single work-item is preferred for pretty much every other case).
0 Kudos
Reply