OpenCL* for CPU

An OpenCL kernel that reads/writes textures causes access violation when jitted for CPUs that support up to SSE4

Yousuke_T_
Beginner

Hi,

I have encountered a possible code generation bug in the OpenCL runtime compiler for Intel CPUs on Windows platforms.

Please find attached an archive of source code (ocltest.zip) that reproduces the bug. It is a CMake project and you can build it with, e.g., the following commands:

$ unzip ocltest.zip
$ mkdir ocltest-build
$ cd ocltest-build/
$ cmake -G "Visual Studio 12 2013" -A "x64" ../ocltest/
$ cmake --build . --config RelWithDebInfo

Note that you need CMake, Visual Studio 2013 (or 2015), and an OpenCL SDK (Intel INDE or CUDA).

If I run the resulting executable (oclellipticpde.exe) on a PC with the following configuration

Intel Core i7 3770K @ 3.50 GHz, 8 GB RAM, Windows 10 x64

it terminates normally and we obtain the following output:

CL_PLATFORM_NAME: Intel(R) OpenCL
CL_DEVICE_NAME:        Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
CL_DEVICE_VERSION: OpenCL 1.2 (Build 57)
CL_DRIVER_VERSION: 5.0.0.57
done

However, if I run the same executable on another PC with

Dual Intel Xeon X5650 @ 2.67 GHz, 24 GB RAM, Windows 7 x64

it crashes after the following output:

CL_PLATFORM_NAME: Intel(R) OpenCL
CL_DEVICE_NAME: Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
CL_DEVICE_VERSION: OpenCL 1.2 (Build 57)
CL_DRIVER_VERSION: 5.0.0.57

I have also attached a WinDbg log for this. The faulting code is located at a rather low address that does not correspond to any loaded module, so it is likely the JIT-compiled code.

Since the kernel works on other CPUs (and also on GPUs), I suppose it is correct. I am not sure exactly what configuration causes the crash, but I suspect the CPU architecture (i.e., SSE4) matters.

Can anyone reproduce the problem, point out what is wrong with the code, or suggest a workaround?

Thank you.

Yousuke

Robert_I_Intel
Employee

Hi Yousuke,

I couldn't replicate the problem that you described on the hardware available to me. The Xeon in question is quite old and support for it has been discontinued: http://mark.intel.com/products/47922/Intel-Xeon-Processor-X5650-12M-Cache-2_66-GHz-6_40-GTs-Intel-QPI

Please make sure that you have installed the latest and greatest driver from here: https://software.intel.com/en-us/articles/opencl-drivers#win64 - if that doesn't work for you, I believe there is not much I can do. I can still file a bug, but the likelihood of it being fixed for your platform is quite small.

One other thing you might try is producing a native binary for the CPU with the offline compiler, as described here https://software.intel.com/en-us/articles/using-spir-for-fun-and-profit-with-intel-opencl-code-builder and loading that binary instead.

Yousuke_T_
Beginner

Hi Robert,

Thank you for your reply.

I have installed (actually repaired) the latest OpenCL runtime and tried again, but the problem still occurs. The driver version is the same as the one I gave in my previous post.

For the moment I am not very interested in offline compilation, which is not portable for OpenCL 1.2, so I haven't tried it yet.

Robert_I_Intel
Employee

Hi Yousuke,

Your problem is caused by a bug in the CPU backend's vectorizer on the SSE4.2 architecture. It will take us some time to fix the issue.

As a workaround you can set the following environment variable:

CL_CONFIG_CPU_VECTORIZER_MODE=1

 

This turns off vectorization. For more info on this variable, please see the following documentation: https://software.intel.com/en-us/node/540560
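
For example, on Windows you could set it in the command prompt session before launching the repro executable:

set CL_CONFIG_CPU_VECTORIZER_MODE=1
oclellipticpde.exe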

Yousuke_T_
Beginner

Hi Robert,

Thank you for your information.

The workaround CL_CONFIG_CPU_VECTORIZER_MODE=1 just worked for me. However, not surprisingly, it slows down other kernels (not shown in the repro) that do not suffer from the problem.

In a previous post, you said it was not likely that the bug would be fixed. Does this mean that the latest OpenCL CPU runtime no longer supports non-AVX CPUs and that it is not safe to install it on a system with such CPUs? I would appreciate it if you could make this point clear, because the release notes are not very clear about it.

Thank you,

Yousuke Takada

Robert_I_Intel
Employee

Hi Yousuke,

I am trying to figure out what the timeline for the fix would be. Meanwhile, one more possible workaround is to switch from images to buffers, since the problem on our end has to do with vectorization of read_imagef calls on the CPU. You would need to pass the width of the image to the kernel and calculate the index into the buffer yourself. Your buffer would be __global float4* A (I am not sure whether you also have to switch B from image to buffer; you can try leaving it as an image on the first iteration).
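
A rough, untested sketch of what the buffer version might look like, assuming M is the image width, a row-major layout, and that both A and B are switched to buffers:

#define H 1
__attribute__((reqd_work_group_size(4, 4, 1)))
__kernel void relax(__global const float4* A, __global float4* B, int N, int M, float weight)
{
 int i = get_global_id(1);
 int j = get_global_id(0);
 int idx = i * M + j;          /* row-major index; M assumed to be the image width */
 float a = A[idx].x;
 if (H <= i && i < N - H && H <= j && j < M - H) {
  float a_n = A[idx - M].x;    /* (j, i - 1) */
  float a_s = A[idx + M].x;    /* (j, i + 1) */
  float a_w = A[idx - 1].x;    /* (j - 1, i) */
  float a_e = A[idx + 1].x;    /* (j + 1, i) */
  float sum = 0.25f * (a_n + a_s + a_w + a_e);
  a += weight * (sum - a);
 }
 B[idx] = (float4)a;
}

Note that the neighbor loads are moved inside the bounds check here, so no out-of-bounds accesses occur.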

Sorry for the delay,

Robert

Robert_I_Intel
Employee

Hi Yousuke,

Quick question: do you still see the problem if you rewrite your kernel as follows:

#define H 1
__attribute__((reqd_work_group_size(4, 4, 1)))
__kernel void relax(__read_only image2d_t A, __write_only image2d_t B, int N, int M, float weight)
{
 int i = get_global_id(1);
 int j = get_global_id(0);
 float a = read_imagef(A, (int2)(j, i)).x;
 float a_n = read_imagef(A, (int2)(j, i - 1)).x;
 float a_s = read_imagef(A, (int2)(j, i + 1)).x;
 float a_w = read_imagef(A, (int2)(j - 1, i)).x;
 float a_e = read_imagef(A, (int2)(j + 1, i)).x;

 if (H <= i && i < N - H && H <= j && j < M - H) {
  float sum = 0.25f *(a_n + a_s + a_w + a_e);
  a += weight*(sum - a);
 }
 write_imagef(B, (int2)(j, i), (float4)a);
}

 

Yousuke_T_
Beginner

Hi Robert,

As you suggested, we can modify the kernel to use buffers instead of images, and if we do so, the modified kernel works fine (not shown in the attached repro, though). For this particular kernel, using buffers and possibly some local memory would typically result in higher performance.

Regarding the kernel shown in your post, it does not work as is. This is not because of the runtime code generation bug but because of the out-of-bounds read access to the image in the kernel code itself. To fix this, we have to modify the kernel to, e.g., use a proper addressing mode, as shown in the following. The fixed kernel worked without CL_CONFIG_CPU_VECTORIZER_MODE=1.

#define H 1
__attribute__((reqd_work_group_size(4, 4, 1)))
__kernel void relax(__read_only image2d_t A, __write_only image2d_t B, int N, int M, float weight)
{
 int i = get_global_id(1);
 int j = get_global_id(0);
 float a = read_imagef(A, (int2)(j, i)).x;
 const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;
 float a_n = read_imagef(A, sampler, (int2)(j, i - 1)).x;
 float a_s = read_imagef(A, sampler, (int2)(j, i + 1)).x;
 float a_w = read_imagef(A, sampler, (int2)(j - 1, i)).x;
 float a_e = read_imagef(A, sampler, (int2)(j + 1, i)).x;
 if (H <= i && i < N - H && H <= j && j < M - H) {
  float sum = 0.25f *(a_n + a_s + a_w + a_e);
  a += weight*(sum - a);
 }
 write_imagef(B, (int2)(j, i), (float4)a);
}

I have also found that the following works:

#define H 1
__attribute__((reqd_work_group_size(4, 4, 1)))
__kernel void relax (__read_only image2d_t A, __write_only image2d_t B, int N, int M, float weight)
{
	int i = get_global_id(1);
	int j = get_global_id(0);
	float a = read_imagef(A, (int2)(j, i)).x;
	if (H <= i && i < N - H && H <= j && j < M - H) {
		const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;			// this works
		//const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;	// this works too
		//const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE | CLK_FILTER_NEAREST;			// but this does not
		float a_n = read_imagef(A, sampler, (int2)(j, i - 1)).x;
		float a_s = read_imagef(A, sampler, (int2)(j, i + 1)).x;
		float a_w = read_imagef(A, sampler, (int2)(j - 1, i)).x;
		float a_e = read_imagef(A, sampler, (int2)(j + 1, i)).x;
		float sum = ((float)(1.0/4.0))*(a_n + a_s + a_w + a_e);
		a += weight*(sum - a);
	}
	write_imagef(B, (int2)(j, i), (float4)a);
}

So it seems that explicitly using an addressing mode other than CLK_ADDRESS_NONE is a possible workaround for the problem I reported.

Robert_I_Intel
Employee

Hi Yousuke,

Yes, I forgot about the clamp part. I am glad it worked. Here is another workaround suggested by the CPU runtime developer:

 

To disable the vectorizer for a specific kernel, the user should add the kernel attribute __attribute__((vec_type_hint(int4))).

The topic was discussed here: https://software.intel.com/en-us/forums/opencl/topic/281812
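
Applied to your kernel, the attribute goes on the declaration, something like this (just a sketch; the kernel body stays as before):

__attribute__((vec_type_hint(int4)))
__attribute__((reqd_work_group_size(4, 4, 1)))
__kernel void relax(__read_only image2d_t A, __write_only image2d_t B, int N, int M, float weight)
{
 /* kernel body unchanged */
}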

 

But you probably don't need this workaround in this case anymore.

Yousuke_T_
Beginner

Hi Robert,

Thank you for the further information.

I did some experiments and have confirmed that specifying __attribute__((vec_type_hint(int4))) disables auto-vectorization and that the kernel then works fine. As you said, though, I no longer need a workaround that affects auto-vectorization.

I have also taken some build logs to confirm the behavior. See below.
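
For reference, the build log can be queried from the host code with clGetProgramBuildInfo, roughly like this (error checking omitted):

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Print the build log of `program` for `device` via CL_PROGRAM_BUILD_LOG. */
static void print_build_log(cl_program program, cl_device_id device)
{
    size_t size = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &size);
    char *log = (char *)malloc(size);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, size, log, NULL);
    printf("%s\n", log);
    free(log);
}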

If we do not specify the vec_type_hint attribute, the kernel is auto-vectorized and the build log (CL_PROGRAM_BUILD_LOG) shows:

Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <relax> was successfully vectorized (4)
Done.

If we specify a vec_type_hint other than __attribute__((vec_type_hint(float))), e.g., __attribute__((vec_type_hint(int4))), __attribute__((vec_type_hint(float4))), __attribute__((vec_type_hint(float2))), etc., the kernel is not vectorized, and the log shows:

Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <relax> was not vectorized
Done.

 
