Wrong code created by auto-vectorizer?

sschuberth · ‎02-07-2011

Hi,

with the current SDK version 1.1.0.10515, 64-bit, running on Vista x64 on a Core i7 920, I'm getting wrong results with one of my kernels if I do a trivial change to enable the compiler to perform auto-vectorization. Here's the original kernel code which the compiler is unable to auto-vectorize:

__constant sampler_t sampler=
CLK_NORMALIZED_COORDS_FALSE|CLK_ADDRESS_CLAMP|CLK_FILTER_LINEAR;

__kernel void backproject(
// Stripped some code here ...
__read_only image2d_t input,
__global float* output)
{
// Stripped some code here ...

float4 value=read_imagef(input,sampler,(float2)(un+0.5f,vn+0.5f));

// Comment-out the below "if" statement to enable auto-vectorization with Intel OpenCL.
if (value.x)
{
int out_x=get_global_size(0);
int out_y=get_global_size(1);
output[z*out_y*out_x+y*out_x+x]+=native_recip(wn*wn)*value.x;
}
}

The code above compiles fine, though with the note "Kernel was not vectorized", and gives the correct result. If I now remove the conditional branch by commenting out the marked "if" statement like this:

__constant sampler_t sampler=
CLK_NORMALIZED_COORDS_FALSE|CLK_ADDRESS_CLAMP|CLK_FILTER_LINEAR;

__kernel void backproject(
// Stripped some code here ...
__read_only image2d_t input,
__global float* output)
{
// Stripped some code here ...

float4 value=read_imagef(input,sampler,(float2)(un+0.5f,vn+0.5f));

// Comment-out the below "if" statement to enable auto-vectorization with Intel OpenCL.
//if (value.x)
{
int out_x=get_global_size(0);
int out_y=get_global_size(1);
output[z*out_y*out_x+y*out_x+x]+=native_recip(wn*wn)*value.x;
}
}

Then the code compiles with the note "Kernel was successfully vectorized", but it gives the wrong result.

I've looked at the Assembler generated by the latter kernel. The part where the image is read looks like

mov RBX, QWORD PTR [RCX + 16]
mov R14, QWORD PTR [RSP + 368]
mov R15, QWORD PTR [RSP + 304]
mov R12D, 17
mov RCX, R15
mov EDX, 17
movapd XMM2, XMM0
call __Z11read_imagefP10_image2d_tjU8__vector2f
movaps XMM10, XMM0
movaps XMM0, XMM7
shufps XMM0, XMM8, 17
pshufd XMM0, XMM0, 8
mov RCX, R15
mov EDX, 17
movapd XMM2, XMM0
call __Z11read_imagefP10_image2d_tjU8__vector2f
movlhps XMM10, XMM0
movaps XMM0, XMM7
unpckhps XMM0, XMM8
mov RCX, R15
mov EDX, 17
movapd XMM2, XMM0
call __Z11read_imagefP10_image2d_tjU8__vector2f
movaps XMM11, XMM0
shufps XMM7, XMM8, 51
pshufd XMM0, XMM7, 8
mov RCX, R15
mov EDX, R12D
movapd XMM2, XMM0
call __Z11read_imagefP10_image2d_tjU8__vector2f
movlhps XMM11, XMM0
shufps XMM10, XMM11, -120
mulps XMM9, XMM9
rcpps XMM0, XMM9
mulps XMM0, XMM10
add EDI, ESI
add EBX, R14D
mov RAX, QWORD PTR [RSP + 328]
imul EBX, DWORD PTR [RAX + 40]
add EBX, EDI
imul EBX, DWORD PTR [RAX + 32]

I believe the problem is that I'm uploading new data to the image object between host calls to backproject(), i.e. all calls to backproject() operate on different image data (though on the same image object). But in the vectorized version, the same image data is read four times because the kernel does not return to the host between images reads to get the image data uploaded.

I know that I can explicitly disable auto-vectorization by prefixing the kernel definition by

__kernel __attribute__((vec_type_hint(float4)))

but for compatibility to other OpenCL implementations I believe that the default should just work and only safe auto-vectorizations should be done, although I have no clue how to define "safe" as the OpenCL compiler does not know the host code.

Sion_Berkowits1 · ‎02-14-2011

Hi,
Thanks for bringing up this issue.
We would like to try and reproduce the problem you are experiencing. Would it be possible for you to post the entire kernel code, or some reduced version of the kernel which still reproduces the problem?

Thank you,
Sion

sschuberth · ‎02-14-2011

I'll try to come up with a minimal reproducing example in a few days.

Note, however, that what I initially believed to be the reason for the issue is not a problem: The loop-unrolling affects work-group-level parallelism, so work that would have been done by separate work-items within a work-group is moved into a work-item, and the number of work-items within a work-group is dynamically reduced (I believe the "*_Vectorized" version is only called if the local X work-size as passed to clEnqueueNDRangeKernel() is a multiple of 4). As one cannot modify image data in between runs of different work-items within a work-group, but only in between calls to clEnqueueNDRangeKernel(), it should work. But still, I'm getting wrong results.

Sion_Berkowits1 · ‎02-14-2011

Thanks.
Have you tried turning the Vectorization on/off with the vec_hint_type, instead of unmasking the IF statement?
It may provide more insight, as it does not modify the actually executed code

sschuberth · ‎02-14-2011

I just tried, but the generated code (vectorized vs. non-vectorized using vec_hint_type) still is to complex for me to see through ...

Sion_Berkowits1 · ‎05-01-2011

Hi sschuberth,
Are you still seeing the issue with the auto-vectorizer?
If so, can you send us a minimal reproducing example?

Thanks,
Sion

sschuberth · ‎05-09-2011

Hi Sion,

sorry for the delay. I just tried the OpenCL SDK 1.1 build 12772. For the auto-vectorized code I still get wrong results, but something has changed since build 10515, it's closer to the correct result now, but it's still wrong.

It's hard for me to create a minimal example as my program is embedded into a huge framework and depends on large input data. But I'll see what I can do.