Intel® Quartus® Prime Software

pragma unroll breaks kernel

Altera_Forum
Honored Contributor II

Hi,  

 

I'm trying to figure out why my kernel breaks when I use the #pragma unroll statement on a loop. I'm using the 16.1 SDK, the device is an Arria 10 (DE5a-Net board), and I'm compiling without any extra compiler flags (i.e. no relaxed floating-point operations).

 

The kernel in question is posted below. It's part of a physics simulation and a bit long, but I'm posting the entire code, just to make sure I'm not cutting out an important part. 

 

The kernel works fine in emulation. It also works fine on the device if I comment out the #pragma unroll statement for the innermost loop. But with the statement enabled, I get only NaNs in the output. I can't see a hint of any problem in the compiler report or anywhere else. I know the kernel is not well written or optimized yet, but at this point I'm just trying to understand why the unrolling fails.

 

Any help to solve this mystery would be appreciated! 

 

Hanno 

 

 

#define N 9

__attribute__((task))
__kernel void vector_add( __global const float *restrict p_in,
                          __global const float *restrict p_mass_in,
                          __global float *restrict p_out )
{
    const float dt = 0.01;
    const long steps = 10000;
    float p_pos[3*N];
    float p_vel[3*N];
    float p_mass[N];
    for (int i=0; i<N; i++){
        p_pos[3*i+0] = p_in[3*i+0];
        p_pos[3*i+1] = p_in[3*i+1];
        p_pos[3*i+2] = p_in[3*i+2];
        p_vel[3*i+0] = p_in[3*N+3*i+0];
        p_vel[3*i+1] = p_in[3*N+3*i+1];
        p_vel[3*i+2] = p_in[3*N+3*i+2];
        p_mass[i] = p_mass_in[i];
    }
    float dt12 = dt/2.;
    for(long k=0; k<steps; k++){
        for(int i=0; i<N; i++){
            p_pos[3*i+0] += dt12*p_vel[3*i+0];
            p_pos[3*i+1] += dt12*p_vel[3*i+1];
            p_pos[3*i+2] += dt12*p_vel[3*i+2];
        }
        for(int i=0; i<N; i++){
            const float pix = p_pos[3*i+0];
            const float piy = p_pos[3*i+1];
            const float piz = p_pos[3*i+2];
            float ax = 0.;
            float ay = 0.;
            float az = 0.;
            #pragma unroll
            for(int j=0; j<N; j++){
                const float dx = p_pos[3*j+0] - pix;
                const float dy = p_pos[3*j+1] - piy;
                const float dz = p_pos[3*j+2] - piz;
                const float pre_sqrt = dx*dx + dy*dy + dz*dz;
                const float _r = sqrtf(pre_sqrt);
                const float pre_recip = p_mass[j]/(pre_sqrt * _r);
                const float prefact = ((i==j)?0.:pre_recip);
                ax -= prefact*dx;
                ay -= prefact*dy;
                az -= prefact*dz;
            }
            p_vel[3*i+0] += dt*ax;
            p_vel[3*i+1] += dt*ay;
            p_vel[3*i+2] += dt*az;
        }
        for(int i=0; i<N; i++){
            p_pos[3*i+0] += dt12*p_vel[3*i+0];
            p_pos[3*i+1] += dt12*p_vel[3*i+1];
            p_pos[3*i+2] += dt12*p_vel[3*i+2];
        }
    }
    for (int i=0; i<N; i++){
        p_out[3*i+0] = p_pos[3*i+0];
        p_out[3*i+1] = p_pos[3*i+1];
        p_out[3*i+2] = p_pos[3*i+2];
        p_out[3*N+3*i+0] = p_vel[3*i+0];
        p_out[3*N+3*i+1] = p_vel[3*i+1];
        p_out[3*N+3*i+2] = p_vel[3*i+2];
    }
}
5 Replies
Altera_Forum
Honored Contributor II

Update: I made the steps variable an argument of the kernel so that I can choose not to run the main loop, i.e.  

for(long k=0;k<steps;k++){ 

The output should just be a copy of the input if I set steps=0. However, some array elements are swapped in the output; somehow the memory loads/stores are not working as expected. The strange thing is that the issue only appears if I compile the kernel with the unrolled inner loop (which should never even execute in this case). This seems like a bug to me, but maybe I'm missing something important here.
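For reference, the change described above would look roughly like this (the parameter name `steps` simply replaces the hard-coded constant; this is a sketch, not the exact code used):

__kernel void vector_add( __global const float *restrict p_in,
                          __global const float *restrict p_mass_in,
                          __global float *restrict p_out,
                          const long steps )  /* was: const long steps = 10000; */
{
    ...
    for (long k = 0; k < steps; k++) {  /* with steps=0, the body is skipped */
    ...

With steps=0 the main loop body never runs, so any corruption in the output must come from the load/store pipeline the compiler builds around the (unrolled) loop, not from the arithmetic itself.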
Altera_Forum
Honored Contributor II

Unrolling a loop should certainly not change the output. If your output is different in emulation compared to actual FPGA execution, then it is likely a bug in the compiler (and in rare cases, some bug in the BSP). 

 

Just to make sure, can you remove all instances of "constant" from your code, let the variables be defined as standard floats, and see what happens? Assuming that doesn't fix your problem, you can try opening a service request with Altera. Of course, the issue might already have been fixed in newer versions of the compiler, but I understand that you cannot use them since Terasic has not updated their BSP yet.

 

P.S. Are you using Windows? The last time I checked, Terasic's v16.1 BSP was completely broken on Linux.
Altera_Forum
Honored Contributor II

Thanks. I'll try it out without the constant qualifier.  

 

I'm using the 16.1 BSP on Linux, which is the newest one. Why is it broken? Do you have a reference? Should I use 16.0 instead?
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

I'm using the 16.1 BSP on Linux which is the newest one. Why is it broken? Do you have a reference? And should I use the 16.0 instead? 

--- Quote End ---  

 

 

Broken as in kernels fail to compile due to some mismatch between the synthesized IP cores in the BSP and Quartus. Though I only tried it on Quartus 16.1.2; maybe it works on 16.1.0.

 

Anyway, if your kernels compile correctly, you don't need to worry about it. I just wanted to ask, since other people on the forum had previously reported compilation failures with this BSP.
Altera_Forum
Honored Contributor II

I'm compiling it with 16.1.2 (sorry, I should have been more precise). No compilation issues.

 

I've just filed a bug report. This just doesn't make any sense.  

 

Thanks for your help!