Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)
How to deal with the Out-of-Order Loop Iterations in single work-item kernel?

Altera_Forum

Hi,

Today I tried writing a single work-item kernel that contains a nested loop. In the Loop Report, I found that my outer loop is not pipelined due to:

loop iteration ordering: iterations may get out of order with respect to the inner loop, as the number of iterations of the inner loop may be different for different iterations of this loop.

I understand the problem: my inner loop really does need a different number of iterations for each iteration of the outer loop. The "Out-of-Order Loop Iterations" section of the Best Practices Guide has an example that is very similar to my code:

__kernel void order( __global unsigned* restrict input,
                     __global unsigned* restrict output,
                     int N )
{
  unsigned sum = 0;
  for (unsigned i = 0; i < N; i++) {
    for (unsigned j = 0; j < i; j++)
      sum += input[i + j];
  }
  output[0] = sum;
}

But no solution is given there. How can I pipeline the outer loop, or otherwise work around this problem? Would splitting the work across multiple kernels help?
Altera_Forum

Sorry, I have just been thinking about using multiple kernels. Would that solve this problem?

Thanks in advance.
Altera_Forum

You can pipeline the outer loop by giving the inner loop a fixed trip count and guarding its body with the original condition:

__kernel void order( __global unsigned* restrict input,
                     __global unsigned* restrict output,
                     int N )
{
  unsigned sum = 0;
  for (unsigned i = 0; i < N; i++) {
    for (unsigned j = 0; j < N; j++)
      if (j < i)
        sum += input[i + j];
  }
  output[0] = sum;
}

However, the inner loop now runs N times for every outer iteration, which is N*N inner iterations in total instead of N*(N-1)/2. Depending on N, this code could actually be slower than the original because of the redundant iterations. For loops that cannot be pipelined this way, it is generally preferable to use NDRange kernels.
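To make the trade-off above concrete, here is a minimal sketch in plain C (not OpenCL C, so it can be compiled and run on a host). It models both loop nests and counts the inner-loop trips. The function names `sum_triangular` and `sum_padded` and the `trips` counter are hypothetical and only for illustration. The indexing `input[i + j]` assumes the buffer holds at least 2*n - 2 elements.

```c
#include <stddef.h>

/* Plain C stand-ins for the two kernel bodies. Names are hypothetical. */

/* Original nest: the inner trip count equals i, so it varies
 * from one outer iteration to the next. */
unsigned sum_triangular(const unsigned *input, unsigned n,
                        unsigned long long *trips)
{
    unsigned sum = 0;
    *trips = 0;
    for (unsigned i = 0; i < n; i++) {
        for (unsigned j = 0; j < i; j++) {
            sum += input[i + j];
            (*trips)++;
        }
    }
    return sum;
}

/* Padded nest: the inner loop always runs n times and the original
 * bound j < i becomes a predicate on the loop body. */
unsigned sum_padded(const unsigned *input, unsigned n,
                    unsigned long long *trips)
{
    unsigned sum = 0;
    *trips = 0;
    for (unsigned i = 0; i < n; i++) {
        for (unsigned j = 0; j < n; j++) {
            if (j < i)
                sum += input[i + j];
            (*trips)++; /* counted even when the body is skipped */
        }
    }
    return sum;
}
```

Both functions return the same sum, but the padded version takes n*n trips where the original takes n*(n-1)/2, so roughly half of its iterations do no useful work.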
Altera_Forum

 

Thanks very much.

My code is more complex, so it is hard to give the inner loop the same number of iterations on every pass. So yes, it looks like an NDRange kernel is the preferred approach here.
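For readers with a similar loop nest, one further option that is not discussed in this thread is to flatten the two loops into a single loop and update the (i, j) counters by hand. The sketch below is plain C rather than OpenCL C so that it can be run on a host, and the names `sum_flattened` and `sum_nested_reference` are hypothetical. It assumes the same `input[i + j]` access pattern as the guide example and a buffer of at least 2*n - 2 elements. Whether the offline compiler pipelines such a loop well is an assumption that would need to be checked in the Loop Report.

```c
#include <stddef.h>

/* Reference: the original triangular nest from the guide example. */
unsigned sum_nested_reference(const unsigned *input, unsigned n)
{
    unsigned sum = 0;
    for (unsigned i = 0; i < n; i++)
        for (unsigned j = 0; j < i; j++)
            sum += input[i + j];
    return sum;
}

/* Flattened form: one loop with a trip count known before it starts.
 * (i, j) walks the same pairs, in the same order, as the nest above. */
unsigned sum_flattened(const unsigned *input, unsigned n)
{
    /* Number of (i, j) pairs with j < i < n. */
    unsigned long long total =
        (n == 0) ? 0 : (unsigned long long)n * (n - 1) / 2;

    unsigned sum = 0;
    unsigned i = 1, j = 0; /* first pair that satisfies j < i */

    for (unsigned long long k = 0; k < total; k++) {
        sum += input[i + j];
        if (j + 1 == i) { /* end of this row: move to the next i */
            i++;
            j = 0;
        } else {
            j++;
        }
    }
    return sum;
}
```

This form does no redundant iterations, at the cost of extra counter logic. With a more complex loop body, the counter update has to be adapted to however the inner bound depends on the outer index.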