cos() returns sin() values with -cl-fast-relaxed-math

Oueoue__Teybeo · 02-24-2016

Hi,

My system is

OS: Windows 7 64-bit
CPU: i7-4720HQ
iGPU: HD 4600
Graphics driver: 15.38.28.4332 and 15.36.24.64.4264
OpenCL SDK: 2016 revision

Problem: when the kernel is compiled with -cl-fast-relaxed-math, the cos() function behaves like the sin() function on the HD4600.

Minimal reproducer code:

void kernel test_bug(global float* input, global float* output) {

    float x = input[0];
    float cos_rand = cos(x);
    output[0] = cos_rand;
}

The buffer initialization (storing only 1 float):

    intel_bug_input_buffer  = cl::Buffer {context, CL_MEM_READ_ONLY, sizeof(float)};
    intel_bug_output_buffer = cl::Buffer {context, CL_MEM_WRITE_ONLY, sizeof(float)};
    intel_bug_input = new float[1];
    intel_bug_input[0] = 1;
    intel_bug_output = new float[1];
    intel_bug_output[0] = 0;

The calling host code:

    queue.enqueueWriteBuffer(intel_bug_input_buffer, CL_TRUE, 0, sizeof(float), intel_bug_input);
    cl::make_kernel<cl::Buffer&, cl::Buffer&> intel_bug(render.prog, "test_bug");
    cl::EnqueueArgs enqueueArgs1(queue, cl::NDRange(1, 1));
    intel_bug(enqueueArgs1, intel_bug_input_buffer, intel_bug_output_buffer).wait();
    queue.enqueueReadBuffer(intel_bug_output_buffer, CL_TRUE, 0, sizeof(float), intel_bug_output);
    printf("cos(%f) returned %f\n", intel_bug_input[0], intel_bug_output[0]);

Testing with different input values and different devices gives the following results:

HD4600

cos(0.000000) returned 0.000000
cos(1.000000) returned 0.841489
cos(2.000000) returned 0.909297
cos(3.140000) returned 0.001593

i7-4720HQ

cos(0.000000) returned 1.000000
cos(1.000000) returned 0.540302
cos(2.000000) returned -0.416147
cos(3.140000) returned -0.999999

Nvidia 965M

cos(0.000000) returned 1.000000
cos(1.000000) returned 0.540302
cos(2.000000) returned -0.416147
cos(3.140000) returned -0.999999

I tried using the OpenCL C native_cos() instead and looked at the generated asm with CodeBuilder, where I found something strange:

With -cl-fast-relaxed-math, cos() incorrectly returns sin() values while native_cos() returns correct values. But the generated assembly is exactly the same! Both use "math.cos" asm instructions. Can execution differ with the same assembly? One caveat: I inspected the code generated by CodeBuilder but had to build and run the kernel from my app (using the exact same kernel source), as I couldn't run it in CodeBuilder. So my app may actually have produced different code. That seems doubtful, but I can't know for sure.

Without the flag, both are correct; native_cos() still generates "math.cos" asm instructions, while cos() generates hundreds of instructions.

I can see how a cos could be mistaken for a sin in the driver, given how closely they are related, but I can't believe I'm the only one using cos() with -cl-fast-relaxed-math on this hardware/software configuration...

PS: I checked whether sin() is affected as well, but no: sin() correctly returns sin() values...

PPS: Attached is the asm generated for this kernel. It is exactly the same for all 3 configurations: [native_cos], [native_cos + -cl-fast-relaxed-math] and [cos + -cl-fast-relaxed-math].

Robert_I_Intel · 02-25-2016

Hi Teybeo,

I tried really hard to reproduce the issue on my IVB, HSW and BDW boxes running Windows 8.1 and Windows 10 without any success. Everything appears to be working properly.

I wonder what happens when you try the following kernel

kernel void test_bug(global float* input, global float* output) {
    int i = get_global_id(0);
    float x = input[i];
    float cos_rand = cos(x);
    output[i] = cos_rand;
}

with buffer sizes and nd ranges of size 1024?

I consulted our driver architect; he looked at the underlying cos code and stated that it is impossible for it to return sin values.

Oueoue__Teybeo · 02-25-2016

Hi Robert,

I tried your suggestions, but every value in the output buffer is still a sin() value.

One way to obtain these values with cos() would be to compute cos(Pi/2 - x) or cos(x + 3Pi/2), but the SPIR and assembly generated through CodeBuilder (ioc64?) look fine to me (though I'm definitely not familiar with LLVM, SPIR or HD4600 asm...).

I tried retrieving the kernel binary after building it from my app with clGetProgramInfo( ... CL_PROGRAM_BINARIES...) to verify that it matches the offline compiler output, but it is a binary blob that I can't really parse. I attached the file in case it helps (I added .txt to the name so that I could upload it).

Modified buffer setup:

    INTEL_COUNT = 1024;
    
    intel_bug_input_buffer  = cl::Buffer {context, CL_MEM_READ_ONLY, sizeof(float) * INTEL_COUNT};
    intel_bug_output_buffer = cl::Buffer {context, CL_MEM_WRITE_ONLY, sizeof(float) * INTEL_COUNT};

    intel_bug_input = new float[INTEL_COUNT];
    intel_bug_output = new float[INTEL_COUNT];
    for (int i = 0; i < INTEL_COUNT; ++i) {
        intel_bug_input[i] = 1;
        intel_bug_output[i] = 0;
    }

And modified calling code:

    queue.enqueueWriteBuffer(intel_bug_input_buffer, CL_TRUE, 0, sizeof(float) * INTEL_COUNT, intel_bug_input);
    cl::make_kernel<cl::Buffer&, cl::Buffer&> intel_bug(render.prog, "test_bug");
    cl::EnqueueArgs enqueueArgs1(queue, cl::NDRange(INTEL_COUNT));
    intel_bug(enqueueArgs1, intel_bug_input_buffer, intel_bug_output_buffer).wait();
    queue.enqueueReadBuffer(intel_bug_output_buffer, CL_TRUE, 0, sizeof(float) * INTEL_COUNT, intel_bug_output);
    for (int i = 0; i < INTEL_COUNT; i += 128) {
        printf("[%d]: cos(%f) returned %f\n", i, intel_bug_input[i], intel_bug_output[i]);
    }