Altera_Forum
Honored Contributor I
1,130 Views

How to deal with out-of-order loop iterations in a single work-item kernel?

Hi, 

 

Today I tried a single work-item kernel that contains a nested loop. In the Loop Report, the outer loop is reported as not pipelined due to:

loop iteration ordering: iterations may get out of order with respect to the inner loop, as the number of iterations of the inner loop may be different for different iterations of this loop.

I understand the message: in my code, the number of inner-loop iterations really does differ from one outer-loop iteration to the next. The "Out-of-Order Loop Iterations" section of the Best Practices Guide gives an example that is very similar to my code:

 

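__kernel void order( __global unsigned* restrict input,
                     __global unsigned* restrict output,
                     int N ) {
  unsigned sum = 0;
  for (unsigned i = 0; i < N; i++) {
    // The inner trip count depends on i, which is what blocks outer-loop pipelining.
    for (unsigned j = 0; j < i; j++)
      sum += input[i + j];  // index lost in the post; i + j assumed from the guide's example
  }
  output[0] = sum;          // index lost in the post; element 0 assumed
}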

 

But the guide does not mention a solution. How can I get the outer loop pipelined, or otherwise deal with this problem? Would it work if I used multiple kernels?
Altera_Forum
Honored Contributor I
25 Views

Sorry, I have only just thought of using multiple kernels. Would that solve this problem?

 

Thanks in advance.
Altera_Forum
Honored Contributor I
25 Views

You can pipeline the loop like this: 

 

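__kernel void order( __global unsigned* restrict input,
                     __global unsigned* restrict output,
                     int N ) {
  unsigned sum = 0;
  for (unsigned i = 0; i < N; i++) {
    // The inner loop now always runs N times, so its trip count no longer depends on i.
    for (unsigned j = 0; j < N; j++)
      if (j < i)               // guard keeps only the iterations the original loop executed
        sum += input[i + j];   // index lost in the post; i + j assumed from the guide's example
  }
  output[0] = sum;             // index lost in the post; element 0 assumed
}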

 

However, the inner loop now runs N times for every outer iteration (N*N inner iterations in total instead of N*(N-1)/2), so depending on N, this code could actually be slower than the original because of the redundant computation. For loops that cannot be pipelined like this, it is actually preferred to use NDRange kernels.
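To put a number on the redundant work, here is a rough host-side C sketch (not kernel code) that emulates both loop forms and counts the inner-loop iterations each one issues. The names `run_result`, `run_triangular` and `run_padded` are made up for illustration, and the `input[i + j]` index is an assumption taken from the Best Practices Guide example, since the index did not survive in the post. It only shows that the guarded version gives the same sum with more iterations; it says nothing about how the compiler will actually schedule the loops.

```c
#include <assert.h>
#include <stddef.h>

/* Sum plus the number of inner-loop iterations issued (hypothetical helper type). */
typedef struct {
    unsigned sum;
    unsigned long iterations;
} run_result;

/* Original form: the inner trip count equals i, so it varies per outer iteration. */
run_result run_triangular(const unsigned *input, size_t len, unsigned n) {
    run_result r = {0u, 0ul};
    assert(len >= 2 * (size_t)n);   /* i + j stays below 2 * n */
    for (unsigned i = 0; i < n; i++) {
        for (unsigned j = 0; j < i; j++) {
            r.iterations++;
            r.sum += input[i + j];  /* assumed index, see the note above */
        }
    }
    return r;
}

/* Padded form: the inner loop always runs n times and a guard skips the extra work. */
run_result run_padded(const unsigned *input, size_t len, unsigned n) {
    run_result r = {0u, 0ul};
    assert(len >= 2 * (size_t)n);
    for (unsigned i = 0; i < n; i++) {
        for (unsigned j = 0; j < n; j++) {
            r.iterations++;
            if (j < i)
                r.sum += input[i + j];
        }
    }
    return r;
}
```

For n = 8, `run_triangular` issues 8*7/2 = 28 inner iterations and `run_padded` issues 8*8 = 64, while both return the same sum. Whether the pipelined, padded version wins on hardware depends on N and on the initiation interval the compiler achieves, so it is worth checking the report for your real kernel.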
Altera_Forum
Honored Contributor I
25 Views

 

Thanks very much. 

My code is more complex, so it is hard to give every outer iteration the same number of inner iterations. So yes, it is better to use an NDRange kernel here.