OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

Floating point reproducibility across devices

Lukas_E_1
Beginner

Hi all,

Is there a way to get reproducible float results from kernels across all devices?

I'm running an OpenCL kernel on different devices (different CPUs and GPUs), and the computation results for floats differ.

On a system with an i7-3770 CPU, its integrated Intel HD 4000 GPU, and an AMD Capeverde GPU, all possible combinations of OpenCL platform (AMD or Intel) and device lead to bit-by-bit identical results.
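For anyone repeating this comparison, here is a minimal sketch of how "bit-by-bit identical" can be checked on the host after reading the output buffers back. It assumes the results of two devices are already in plain float arrays. The helper names `float_bits` and `first_bit_mismatch` are hypothetical and not part of any OpenCL API. Comparing bit patterns instead of using `==` also catches differences such as 0.0f versus -0.0f.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern (memcpy avoids aliasing issues). */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Return the index of the first element whose bit pattern differs,
 * or n if both arrays are bit-by-bit identical. */
static size_t first_bit_mismatch(const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (float_bits(a[i]) != float_bits(b[i]))
            return i;
    }
    return n;
}
```

A cl_float2 buffer can be passed as a float array with twice as many elements, so the returned index divided by two gives the pixel where the devices first disagree.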

On another system with an i3-4010U CPU (with integrated HD 4400 GPU), the Intel OpenCL platform produces the same results on the GPU as the first system, but the results on the CPU differ.

The kernel is always built with the option "-cl-fp32-correctly-rounded-divide-sqrt".

Arik_N_Intel
Employee

Hi Lukas,

Can you please provide a minimal reproducer for the problem?

Thanks,

Arik

Lukas_E_1
Beginner

Hi Arik,

Sorry for the delay. Here is the kernel code where we're seeing the issue:

__kernel void warpCoord(__global float2 *restrict outputPos,
                        const float4 A,       // fx, fy, cx, cy
                        const float4 coeff_1, // k1, k2, p1, p2
                        const float4 coeff_2, // k3, k4, k5, k6
                        const float4 ir_r1, const float4 ir_r2,
                        const float4 ir_r3)
{
    int4 iSP;
    iSP.s0 = get_global_id(0);
    iSP.s1 = get_global_id(1);
    iSP.s2 = get_global_size(0); // frame width
    iSP.s3 = get_global_size(1); // frame height

    float3 pos = (float3)(iSP.x * ir_r1.x + iSP.y * ir_r1.y + ir_r1.z,
                          iSP.x * ir_r2.x + iSP.y * ir_r2.y + ir_r2.z,
                          iSP.x * ir_r3.x + iSP.y * ir_r3.y + ir_r3.z);
    pos /= pos.z;

    float x2 = pos.x * pos.x;
    float y2 = pos.y * pos.y;
    float r2 = x2 + y2;
    float _2xy = 2 * pos.x * pos.y;
    float kr = (1 + ((coeff_2.s0 * r2 + coeff_1.s1) * r2 + coeff_1.s0) * r2) /
               (1 + ((coeff_2.s3 * r2 + coeff_2.s2) * r2 + coeff_2.s1) * r2);
    float2 pos_dist = (float2)(
        A.s0 * (pos.x * kr + coeff_1.s2 * _2xy + coeff_1.s3 * (r2 + 2 * x2)) +
            A.s2,
        A.s1 * (pos.y * kr + coeff_1.s2 * (r2 + 2 * y2) + coeff_1.s3 * _2xy) +
            A.s3);

    float2 fDP = (float2)(pos_dist.x, pos_dist.y);

    outputPos[iSP.s0 + iSP.s1 * iSP.s2] = fDP;
}

The kernel arguments to reproduce this are:

  • A = {1, 1, 0, 0}
  • coeff_1 = {0, 0, 0, 0}
  • coeff_2 = {0, 0, 0, 0}
  • ir_r1 = {1, 0, 0, 0}
  • ir_r2 = {0, 1, 0, 0}
  • ir_r3 = {0, 0, 1, 0}
  • outputPos is a buffer with global_size(0) * global_size(1) * sizeof(cl_float2) bytes

 

The expected output is for each element in outputPos to contain its 2D position in the array.
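To make the expected output concrete, the following is a rough host-side sketch of the same arithmetic in plain C, which could serve as a reference when checking a device's buffer. The `vec4` struct and the function `warp_coord_ref` are hypothetical stand-ins for float4 and the kernel, and the output is stored as interleaved x, y floats. Host compilers may round intermediate values differently from an OpenCL device, so this is only meant to confirm the identity case above, where every intermediate value is exactly representable.

```c
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's float4 (components s0..s3). */
typedef struct { float s[4]; } vec4;

/* Plain C version of the warpCoord arithmetic for a w x h frame.
 * out must hold 2 * w * h floats (x, y interleaved). */
static void warp_coord_ref(float *out, int w, int h,
                           vec4 A, vec4 c1, vec4 c2,
                           vec4 r1, vec4 r2v, vec4 r3)
{
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float px = x * r1.s[0]  + y * r1.s[1]  + r1.s[2];
            float py = x * r2v.s[0] + y * r2v.s[1] + r2v.s[2];
            float pz = x * r3.s[0]  + y * r3.s[1]  + r3.s[2];
            px /= pz;                      /* pos /= pos.z */
            py /= pz;

            float x2 = px * px, y2 = py * py;
            float rr = x2 + y2;            /* r2 in the kernel */
            float two_xy = 2 * px * py;    /* _2xy in the kernel */
            float kr =
                (1 + ((c2.s[0] * rr + c1.s[1]) * rr + c1.s[0]) * rr) /
                (1 + ((c2.s[3] * rr + c2.s[2]) * rr + c2.s[1]) * rr);

            size_t i = 2 * ((size_t)x + (size_t)y * (size_t)w);
            out[i]     = A.s[0] * (px * kr + c1.s[2] * two_xy +
                                   c1.s[3] * (rr + 2 * x2)) + A.s[2];
            out[i + 1] = A.s[1] * (py * kr + c1.s[2] * (rr + 2 * y2) +
                                   c1.s[3] * two_xy) + A.s[3];
        }
    }
}
```

With the arguments listed above, kr evaluates to 1 and all distortion terms are 0, so the element at (x, y) should be exactly (x, y) for any frame size where the coordinates fit in a float.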
