How to deal with out-of-order loop iterations in a single work-item kernel?

Altera_Forum · ‎03-20-2018

Hi,

Today I tried a single work-item kernel that contains a nested loop. In the loop report, I found that my outer loop is not pipelined due to:

loop iteration ordering: iterations may get out of order with respect to the inner loop, as the number of iterations of the inner loop may be different for different iterations of this loop.

I understand the problem: different iterations of the outer loop need different numbers of inner-loop iterations. In the "Out-of-Order Loop Iterations" section of the Best Practices Guide, I found an example that is very similar to my code:


__kernel void order( __global unsigned* restrict input,
                     __global unsigned* restrict output, int N ) {
    unsigned sum = 0;
    for (unsigned i = 0; i < N; i++) {
        // Inner trip count depends on i, so it differs per outer iteration.
        for (unsigned j = 0; j < i; j++)
            sum += input[i + j];
    }
    output[0] = sum;
}

But no solution is mentioned there. How can I pipeline the outer loop, or otherwise deal with this problem? Would using multiple kernels work?

Altera_Forum · ‎03-20-2018

Sorry, I just thought of using multiple kernels. Would that solve this problem?

Thanks in advance.

Altera_Forum · ‎03-20-2018

You can pipeline the loop like this:

__kernel void order( __global unsigned* restrict input,
                     __global unsigned* restrict output, int N ) {
    unsigned sum = 0;
    for (unsigned i = 0; i < N; i++) {
        // Fixed inner trip count (N); the guard skips the extra iterations.
        for (unsigned j = 0; j < N; j++)
            if (j < i) sum += input[i + j];
    }
    output[0] = sum;
}

However, the inner loop now runs N times for every outer iteration, so the nest executes N*N inner iterations instead of the original N*(N-1)/2. Depending on N, this code could actually be slower than the original because of the redundant computation. For loops that cannot be pipelined like this, NDRange kernels are actually preferred.
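To make the trade-off concrete, here is a minimal host-side sketch in plain C (not an OpenCL kernel) that models both loop nests and counts their inner iterations. The subscript `input[i + j]`, the `nest_result` struct, and the function names are assumptions made for illustration; the input array is assumed to hold at least 2*n - 1 elements.

```c
#include <assert.h>
#include <stddef.h>

/* Result of running one loop nest: the sum plus how many inner
   iterations were executed (hypothetical helper type). */
typedef struct {
    unsigned sum;
    unsigned long inner_iters;
} nest_result;

/* Original nest: the inner trip count equals i, so it varies per outer
   iteration. Reads input[0 .. 2n-3]. */
static nest_result triangular_nest(const unsigned *input, unsigned n) {
    nest_result r = {0, 0};
    for (unsigned i = 0; i < n; i++) {
        for (unsigned j = 0; j < i; j++) {
            r.sum += input[i + j];
            r.inner_iters++;
        }
    }
    return r;
}

/* Padded nest: the inner loop always runs n times and the guard
   discards the iterations the original would not have executed. */
static nest_result padded_nest(const unsigned *input, unsigned n) {
    nest_result r = {0, 0};
    for (unsigned i = 0; i < n; i++) {
        for (unsigned j = 0; j < n; j++) {
            if (j < i) r.sum += input[i + j];
            r.inner_iters++;
        }
    }
    return r;
}

/* Checks that both nests agree on the sum and that the iteration
   counts are n*(n-1)/2 and n*n respectively. */
static void check_equivalent(const unsigned *input, unsigned n) {
    nest_result a = triangular_nest(input, n);
    nest_result b = padded_nest(input, n);
    assert(a.sum == b.sum);
    assert(a.inner_iters == (unsigned long)n * (n - 1) / 2);
    assert(b.inner_iters == (unsigned long)n * n);
}
```

For n = 4, the original nest executes 6 inner iterations and the padded one 16, so roughly half of the padded iterations do no useful work once N is large. Whether pipelining gains more than that overhead costs depends on the design and has to be measured.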

Altera_Forum · ‎03-21-2018

--- Quote Start ---

You can pipeline the loop like this:

__kernel void order( __global unsigned* restrict input,
                     __global unsigned* restrict output, int N ) {
    unsigned sum = 0;
    for (unsigned i = 0; i < N; i++) {
        // Fixed inner trip count (N); the guard skips the extra iterations.
        for (unsigned j = 0; j < N; j++)
            if (j < i) sum += input[i + j];
    }
    output[0] = sum;
}

However, the inner loop now runs N times for every outer iteration, so the nest executes N*N inner iterations instead of the original N*(N-1)/2. Depending on N, this code could actually be slower than the original because of the redundant computation. For loops that cannot be pipelined like this, NDRange kernels are actually preferred.

--- Quote End ---

Thanks very much.

My code is more complex, so it is hard to give the inner loop the same number of iterations in every outer iteration. Yes, NDRange kernels do seem to be the better choice here.
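For reference, one possible shape of the NDRange approach, modeled on the host in plain C rather than written as a real kernel: each work-item takes the place of one outer-loop iteration and writes a partial sum, and the partial sums are reduced afterwards. The names `row_sum_work_item`, `launch_and_reduce`, and `partial`, the subscript `input[gid + j]`, and the separate reduction step are all assumptions for illustration. In an actual OpenCL kernel, `gid` would come from `get_global_id(0)` and the outer `for` in `launch_and_reduce` would be the NDRange launch itself.

```c
#include <stddef.h>

/* Body of one hypothetical work-item. gid stands in for
   get_global_id(0) and replaces the outer loop index i, so the
   variable inner trip count is now private to each work-item. */
static void row_sum_work_item(size_t gid, const unsigned *input,
                              unsigned *partial) {
    unsigned sum = 0;
    for (size_t j = 0; j < gid; j++)
        sum += input[gid + j];
    partial[gid] = sum;   /* one result slot per work-item */
}

/* Host-side stand-in for launching n work-items and then reducing
   their partial sums. The work-items are independent, so the order
   in which they run does not affect the result. */
static unsigned launch_and_reduce(const unsigned *input,
                                  unsigned *partial, size_t n) {
    for (size_t gid = 0; gid < n; gid++)
        row_sum_work_item(gid, input, partial);

    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += partial[i];
    return total;
}
```

Here `input` is assumed to hold at least 2*n - 1 elements and `partial` at least n. This only illustrates the restructuring; it says nothing about the performance of the resulting hardware.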