Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

## How to deal with out-of-order loop iterations in a single work-item kernel?

Honored Contributor II

Hi,

Today I tried using a single work-item kernel that contains a nested loop. In the loop report, I found that my outer loop is not pipelined due to:

loop iteration ordering: iterations may get out of order with respect to the inner loop, as the number of iterations of the inner loop may be different for different iterations of this loop.

I understand the problem: different iterations of the outer loop really do need different numbers of inner-loop iterations. In the "Out-of-Order Loop Iterations" section of the Best Practices Guide, I found an example that is very similar to my code:

```
__kernel void order( __global unsigned* restrict input,
                     __global unsigned* restrict output,
                     int N ) {
  unsigned sum = 0;
  for (unsigned i = 0; i < N; i++) {
    // Inner trip count depends on i, so the outer loop is not pipelined
    for (unsigned j = 0; j < i; j++)
      sum += input[i + j];
  }
  output[0] = sum;
}
```

But the guide does not mention a solution. How can I pipeline the outer loop, or otherwise deal with this problem? Would it work if I used multiple kernels?
3 Replies
Honored Contributor II

Sorry, I was just thinking about using multiple kernels... Maybe that would solve this problem. Is that right?

Honored Contributor II

You can pipeline the loop like this:

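```
__kernel void order( __global unsigned* restrict input,
                     __global unsigned* restrict output,
                     int N ) {
  unsigned sum = 0;
  for (unsigned i = 0; i < N; i++) {
    // Fixed trip count: the inner loop always runs N times
    for (unsigned j = 0; j < N; j++) {
      if (j < i)  // only the first i iterations contribute to the sum
        sum += input[i + j];
    }
  }
  output[0] = sum;
}
```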

However, the inner loop now runs N times for every outer iteration, so the kernel executes N*N inner iterations in total instead of the original N*(N-1)/2. Depending on N, this code could actually be slower than the original because of the redundant iterations. For loops that cannot be pipelined this way, NDRange kernels are actually preferred.
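
To make the trade-off concrete, here is a minimal host-side C sketch (plain C, not kernel code) that runs both loop forms and counts the inner iterations. The function names `sum_triangular` and `sum_padded` and the `iters` counter are made up for illustration, and the `input[i + j]` indexing is assumed to match the kernel above.

```c
#include <assert.h>

/* Original form: the inner trip count depends on i (triangular loop nest). */
unsigned sum_triangular(const unsigned *input, unsigned n, unsigned long *iters)
{
    unsigned sum = 0;
    for (unsigned i = 0; i < n; i++) {
        for (unsigned j = 0; j < i; j++) {
            sum += input[i + j];
            (*iters)++;              /* count every inner iteration */
        }
    }
    return sum;
}

/* Padded form: the inner loop always runs n times and a predicate
 * masks out the iterations the original loop would not have executed. */
unsigned sum_padded(const unsigned *input, unsigned n, unsigned long *iters)
{
    unsigned sum = 0;
    for (unsigned i = 0; i < n; i++) {
        for (unsigned j = 0; j < n; j++) {
            if (j < i)
                sum += input[i + j];
            (*iters)++;              /* masked iterations still cost a slot */
        }
    }
    return sum;
}

/* Hypothetical helper: checks that both forms agree on one input buffer.
 * The buffer must hold at least 2 * n - 2 elements. */
void check_same_result(const unsigned *input, unsigned n)
{
    unsigned long a = 0, b = 0;
    assert(sum_triangular(input, n, &a) == sum_padded(input, n, &b));
    assert(a <= b);                  /* padding never reduces the iteration count */
}
```

For N = 8, the original form executes 28 inner iterations and the padded form 64, while both return the same sum. The padded version therefore does roughly twice the work for large N, which is the redundant computation mentioned above.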
Honored Contributor II

Thanks very much.

My code is more complex, so it is hard to give the inner loop the same number of iterations every time... Yes, it is actually preferred to use NDRange kernels here...
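
When padding the inner loop to a fixed trip count is impractical, one alternative that was not raised in this thread is to flatten the nest by hand into a single loop that tracks `i` and `j` itself. The sketch below is plain host-side C, not kernel code: `sum_flattened` is a hypothetical name, the `input[i + j]` indexing is assumed from the guide's example, and whether an offline compiler pipelines such a loop well for a real, more complex body would need to be checked in its loop report.

```c
#include <assert.h>

/* Flattened form: one loop over all N*(N-1)/2 useful iterations,
 * with i and j advanced manually instead of by two nested loops. */
unsigned sum_flattened(const unsigned *input, unsigned n, unsigned long *iters)
{
    unsigned sum = 0;
    unsigned i = 1, j = 0;   /* i = 0 has no inner iterations, so start at i = 1 */

    /* Total trip count is known before the loop starts.
     * For n == 0 the product is 0 because the first factor is 0. */
    unsigned long long total = (unsigned long long)n * (n - 1) / 2;

    for (unsigned long long k = 0; k < total; k++) {
        sum += input[i + j];
        (*iters)++;

        /* Advance the (i, j) pair exactly as the nested loops would. */
        j++;
        if (j == i) {
            j = 0;
            i++;
        }
    }
    return sum;
}

/* Hypothetical helper: the flattened loop should do no redundant work. */
void check_no_padding(const unsigned *input, unsigned n)
{
    unsigned long iters = 0;
    (void)sum_flattened(input, n, &iters);
    assert(iters == (unsigned long)((unsigned long long)n * (n - 1) / 2));
}
```

This keeps the iteration count at N*(N-1)/2, so there are no masked iterations, at the cost of a loop-carried dependency on the `i` and `j` counters.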