Floating point reproducibility across devices

Lukas_E_1 · 08-08-2014

Hi all,

Is there a way to get reproducible float results from kernels across all devices?

I'm running an OpenCL kernel on different devices (different CPUs and GPUs), and the computation results for floats differ.

On a system with an i7-3770 CPU, its integrated Intel HD 4000 GPU, and an AMD Capeverde GPU, all possible combinations of OpenCL platform (AMD or Intel) and device lead to bit-by-bit identical results.

On another system with an i3-4010U CPU (with integrated HD 4400 GPU), the Intel OpenCL platform on the GPU produces the same results as the first system, but the results from the CPU device differ.

The kernel is always built with the option "-cl-fp32-correctly-rounded-divide-sqrt".

Arik_N_Intel · 08-12-2014

Hi Lukas,

Can you please provide a minimal reproducer for the problem?

Thanks,

Arik

Lukas_E_1 · 09-15-2014

Hi Arik,

Sorry for the delay. Here is the kernel code where we're seeing the issue:

__kernel void warpCoord(__global float2 *restrict outputPos,
                        const float4 A,       // fx, fy, cx, cy
                        const float4 coeff_1, // k1, k2, p1, p2
                        const float4 coeff_2, // k3, k4, k5, k6
                        const float4 ir_r1, const float4 ir_r2,
                        const float4 ir_r3)
{
    int4 iSP;
    iSP.s0 = get_global_id(0);
    iSP.s1 = get_global_id(1);
    iSP.s2 = get_global_size(0); // frame width
    iSP.s3 = get_global_size(1); // frame height

    float3 pos = (float3)(iSP.x * ir_r1.x + iSP.y * ir_r1.y + ir_r1.z,
                          iSP.x * ir_r2.x + iSP.y * ir_r2.y + ir_r2.z,
                          iSP.x * ir_r3.x + iSP.y * ir_r3.y + ir_r3.z);
    pos /= pos.z;

    float x2 = pos.x * pos.x;
    float y2 = pos.y * pos.y;
    float r2 = x2 + y2;
    float _2xy = 2 * pos.x * pos.y;
    float kr = (1 + ((coeff_2.s0 * r2 + coeff_1.s1) * r2 + coeff_1.s0) * r2) /
               (1 + ((coeff_2.s3 * r2 + coeff_2.s2) * r2 + coeff_2.s1) * r2);
    float2 pos_dist = (float2)(
        A.s0 * (pos.x * kr + coeff_1.s2 * _2xy + coeff_1.s3 * (r2 + 2 * x2)) +
            A.s2,
        A.s1 * (pos.y * kr + coeff_1.s2 * (r2 + 2 * y2) + coeff_1.s3 * _2xy) +
            A.s3);

    float2 fDP = (float2)(pos_dist.x, pos_dist.y);

    outputPos[iSP.s0 + iSP.s1 * iSP.s2] = fDP;
}

The kernel arguments to reproduce this are:

A = {1, 1, 0, 0}
coeff_1 = {0, 0, 0, 0}
coeff_2 = {0, 0, 0, 0}
ir_r1 = {1, 0, 0, 0}
ir_r2 = {0, 1, 0, 0}
ir_r3 = {0, 0, 1, 0}
outputPos is a buffer of get_global_size(0) * get_global_size(1) * sizeof(cl_float2) bytes

The expected output is for each element in outputPos to contain its 2D position in the array.
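
For anyone who wants to check a device's output against that expectation, here is a minimal host-side sketch in plain C. It assumes the buffer has already been read back into host memory and that cl_float2 is laid out as two packed floats. The float2_t type and the first_mismatch helper are hypothetical names for illustration, not part of our application or of the OpenCL API. With the arguments above, every intermediate value should be exactly representable (pos becomes (x, y, 1) and kr becomes 1), so the comparison is done on the raw bit patterns rather than with a tolerance.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side mirror of cl_float2 (assumed: two packed floats). */
typedef struct { float x, y; } float2_t;

/* Return the raw bit pattern of a float so that comparisons are exact
 * (for example, -0.0f and +0.0f count as different). */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Return the index of the first element whose bits differ from its own
 * (x, y) position, or width * height if every element matches. */
static size_t first_mismatch(const float2_t *out, size_t width, size_t height)
{
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            size_t i = x + y * width; /* same indexing as the kernel */
            if (float_bits(out[i].x) != float_bits((float)x) ||
                float_bits(out[i].y) != float_bits((float)y))
                return i;
        }
    }
    return width * height;
}
```

Calling first_mismatch(result, width, height) on the data read back from each device gives the first diverging pixel, which can then be converted to coordinates with i % width and i / width.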