vstore4 cannot write data to global address space on Intel HD4600 GPU in OpenCL kernel

Hua_Z_ · ‎06-03-2015

The code is simple (Update: see code in #9):

//host side code can generate a cl::Buffer for param1 like this (or directly use C API):
//cl::Buffer bufferParam1(context, CL_MEM_READ_WRITE|CL_MEM_COPY_HOST_PTR, 200, new size_t[200]);
//code below is device side:
typedef struct _Class {
ulong vtable;
ulong id;
} Class;

__kernel void prepareNodes(__global ulong* param1) {
__global ulong* param2 = param1 + 100;
param1[0] = 3333;
param2[0] = 2222;
__global Class* psrc = (__global Class*) param1; //param1 is any valid __global kernel pointer parameter (length >= sizeof(Class))
__global Class* pdest = (__global Class*) param2; //param2 is any valid __global kernel pointer parameter (length >= sizeof(Class))
uint4 ui4 = vload4(0, (__global uint*) psrc);
vstore4(ui4, 0, (__global uint*) pdest);
printf("%#v4hlX vtable=%ld\n", ui4, pdest->vtable); // or use vtable=%lld if %ld does not print properly
}

The result shows on HD4600 GPU pdest is not modified from param2 to be param1. While same code works fine on AMD and Nvidia GPU.

I also found code below works on HD4600 which removes the __global address space qualifier:
Class src = {3333, 200};
Class dest = {2222, 100};
uint4 ui4 = vload4(0, (__global uint*) &src);
vstore4(ui4, 0, (__global uint*) &dest);
printf("%#v4hlX vtable=%ld\n", ui4, dest.vtable); // or use vtable=%lld if %ld does not print properly

Is it a bug? Any comments are appreciated.

Robert_I_Intel · ‎06-04-2015

Hi Hua,

Could you please specify the details of your system: OS version, Graphics Driver version, the full processor name?

Also, could you please provide a small replicator code instead of a code snippet?

One thing to note: we do not recommend the use of ulong, long and size_t types, since they could be of different sizes on the CPU and on the GPU.

Hua_Z_ · ‎06-04-2015

I have specified the environment when posting the bug but it seems the forum loses the information.

OS: Windows 10 preview
Driver Information:
Name:Intel(R) OpenCL
Version:OpenCL 1.2
Vendor:Intel(R) Corporation
Profile:FULL_PROFILE
Extensions:cl_intel_dx9_media_sharing cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_d3d11_sharing cl_khr_depth_images cl_khr_dx9_media_sharing cl_khr_gl_sharing cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_spir
Devices:
        Name:Intel(R) HD Graphics 4600
        Type:GPU
        Version:OpenCL 1.2 /10.18.15.4204
        Global/Local Memory:2,078,697,063/65,536
        Max ComputeUnits/WorkGroupSize/WorkItemDims:20/512/3
        Max WorkItemSizes:512:512:512
        BuiltInKernels:block_motion_estimate_intel;block_advanced_motion_estimate_check_intel
        Extensions:cl_intel_accelerator cl_intel_advanced_motion_estimation cl_intel_ctz cl_intel_d3d11_nv12_media_sharing cl_intel_dx9_media_sharing cl_intel_motion_estimation cl_intel_simultaneous_sharing cl_intel_subgroups cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_depth_images cl_khr_dx9_media_sharing cl_khr_gl_depth_images cl_khr_gl_event cl_khr_gl_msaa_sharing cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_gl_sharing cl_khr_icd cl_khr_image2d_from_buffer cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_spir

Processor full name: Intel(R) Core(TM) i7-4710MQ CPU @ 2.50GHz 2.50GHz
Driver version: igdumdim64 10.18.15.4204 /Win8 64 (from GPU-Z)

You can write a simple program to reproduce the different results for two similar code snippets described in my post. Since I provide the code comparison, I think you may not need the whole host/device code to reproduce the results.

Hua_Z_ · ‎06-04-2015

As for your notes, because I use OpenCL code to work on data structures from x86_64 host memory, I HAVE TO handle ulong and etc data types. The size and endian are under control, it is not the problem.

Hua_Z_ · ‎06-04-2015

I supplement some code, now it is easy to write the left.

Robert_I_Intel · ‎06-05-2015

Hi Hua,

You might try the following two things:

1. Update your driver to the latest and greatest: https://downloadcenter.intel.com/search?keyword=4th+Generation+Intel%c2%ae+Core%e2%84%a2+Processors+with+Intel%c2%ae+HD+Graphics+4600

2. Put barrier(CLK_GLOBAL_MEM_FENCE); between your global writes and global reads.

The code that you are providing is not strictly valid, since multiple work items may try to write the same location (param2).

The following code worked fine for me:

typedef struct _Class {
   ulong vtable;
   ulong id;
 }  Class;

__kernel void prepareNodes(__global ulong* param1) {
   __global ulong* param2 = param1 + 100;
   param1[get_global_id(0)] = 3333;
   param2[get_global_id(0)] = 2222;
   barrier(CLK_GLOBAL_MEM_FENCE);
   __global Class* psrc = (__global Class*) param1; //param1 is any valid __global kernel pointer parameter (length >= sizeof(Class))
   __global Class* pdest = (__global Class*) param2; //param2 is any valid __global kernel pointer parameter (length >= sizeof(Class))
   uint4 ui4 = vload4(0, (__global uint*) psrc);
   vstore4(ui4, 0, (__global uint*) pdest);
   printf("%#v4hlX vtable=%ld\n", ui4, pdest->vtable); // or use vtable=%lld if %ld does not print properly
 }

Hua_Z_ · ‎06-05-2015

Yes, what you said is true. But actually it is not the fact for my program. All items read param1, but each item uses different address param2 in my program (each item works on its own copy of param1) .

Or, you can slightly modified it as:

typedef struct _Class {
   ulong vtable;
   ulong id;
 }  Class;

__kernel void prepareNodes(__global ulong* param1) {
  if (get_global_id(0) == 0) {
     __global ulong* param2 = param1 + 100;
     param1[0] = 3333;
     param2[0] = 2222;
     __global Class* psrc = (__global Class*) param1; //param1 is any valid __global kernel pointer parameter (length >= sizeof(Class))
     __global Class* pdest = (__global Class*) param2; //param2 is any valid __global kernel pointer parameter (length >= sizeof(Class))
     uint4 ui4 = vload4(0, (__global uint*) psrc);
     vstore4(ui4, 0, (__global uint*) pdest);
     printf("%#v4hlX vtable=%ld\n", ui4, pdest->vtable); // or use vtable=%lld if %ld does not print properly
  }
 }

Robert_I_Intel · ‎06-09-2015

Hi Hua,

When you are writing to a global memory even from one work item, there is no guarantee that you can read from the same memory: you will need a barrier, so I suggest you modify your code as follows:

typedef struct _Class {
   ulong vtable;
   ulong id;
 }  Class;

__kernel void prepareNodes(__global ulong* param1) {
  if (get_global_id(0) == 0) {
     __global ulong* param2 = param1 + 100;
     param1[0] = 3333;
     param2[0] = 2222;
  }
  barrier((CLK_GLOBAL_MEM_FENCE);
  if (get_global_id(0) == 0) {
     __global Class* psrc = (__global Class*) param1; //param1 is any valid __global kernel pointer parameter (length >= sizeof(Class))
     __global Class* pdest = (__global Class*) param2; //param2 is any valid __global kernel pointer parameter (length >= sizeof(Class))
     uint4 ui4 = vload4(0, (__global uint*) psrc);
     vstore4(ui4, 0, (__global uint*) pdest);
     printf("%#v4hlX vtable=%ld\n", ui4, pdest->vtable); // or use vtable=%lld if %ld does not print properly
  }
 }

Hua_Z_ · ‎06-10-2015

I cannot understand "When you are writing to a global memory even from one work item, there is no guarantee that you can read from the same memory". The cl buffer corresponding to the global memory is set to CL_MEM_READ_WRITE.

I rewrite the test code as follows and tested it:

typedef struct _Class {
 ulong vtable;
 __global struct _Class* self;
} Class;


__kernel void prepareNodes(__global uchar* src, __global void* dest) {
 if (get_local_id(0) == 0)
  ((__global Class*) dest)->self = (__global Class*) dest;
 barrier(CLK_GLOBAL_MEM_FENCE);

// __global Class* dest_for_item = (__global Class*) (((__global Class*) dest) + get_global_id(0)); //CORRECT
 __global Class* dest_for_item = (__global Class*) (((__global Class*) dest)->self + get_global_id(0)); //WRONG
 dest_for_item->vtable = 12345;
 uint4 ui4 = vload4(0, (__global uint*) src);
 vstore4(ui4, 0, (__global uint*) dest_for_item);

 if (get_local_id(0) == 0) {
  printf("vtable=%ld\n", dest_for_item->vtable);
 }
}

host side code snippet:

 auto sizesItem = device.getInfo<CL_DEVICE_MAX_WORK_ITEM_SIZES>();

 cl::Context context(device);
 cl::Program program(context, source);
 program.build();
 typedef cl::make_kernel<cl::Buffer&, cl::Buffer&> KernelType;
 KernelType kernel(program, "prepareNodes");

 cl::CommandQueue queue(context, device);
 size_t work_size = (NUM_PARALLEL + sizesItem[0] - 1) / sizesItem[0] * sizesItem[0];
 cl::NDRange globalNDR(work_size);
 cl::NDRange localNDR(sizesItem[0]);

 char source[100] = "33";
 cl::Buffer bufferSrc(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(source), source);
 char work_buffer[8096];
 cl::Buffer bufferWork(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(work_buffer), work_buffer);

 cl::EnqueueArgs arg(queue, cl::NullRange, globalNDR, localNDR);
 kernel(arg, bufferSrc, bufferWork);

The Intel GPU prints 12345 while NVidia/AMD GPU prints right result 13107(0x3333). If you comment out the line "//WRONG" and use the line "//CORRECT", printf prints correct result 13107. The key is that dest is replaced with ((__global Class*) dest)->self.

Please try it.

Hua_Z_ · ‎06-10-2015

Robert, Could you please verify whether you can get correct result with Iris 6200 GPU OpenCL 2.0 driver?

Robert_I_Intel · ‎06-10-2015

This line is a write to global memory:

dest_for_item->vtable = 12345;

And this line is a write to global memory (the same location):

vstore4(ui4, 0, (__global uint*) dest_for_item);

There is no guarantee that first one succeeds before the second one unless you have

barrier(CLK_GLOBAL_MEM_FENCE);

in between them.

Hua_Z_ · ‎06-10-2015

Your reply is disappointed. Obviously you didn't try running the code. If you tried, you will find whether you add the memory barrier as you require or not, Intel GPU always return wrong results.

And you didn't casefully read my updates. So please tell me how to submit bugs for Intel products in a formal way, or how can I ask for other Intel staffs to reply my question? It seems you totally cannot understand what I said.

Robert_I_Intel · ‎06-10-2015

Hua,

I think the easiest thing to do for you would be to provide me the full sample that I can build and run and also let me know what parameters you are trying to run with when you see the issue. By full sample, I mean full C++ file, full OpenCL C file and either a make file or a project to build it and parameters to run it with.

Please carefully read https://software.intel.com/en-us/forums/topic/559379

You could save a lot of time and aggravation.

Hua_Z_ · ‎06-10-2015

I uploaded the VS2013 project to identify the issue. Please check it. You only need set the include and library directory to use CL/cl.hpp(https://www.khronos.org/registry/cl/api/1.2/cl.hpp) and opencl.lib.

464830

Robert_I_Intel · ‎06-11-2015

Hi Hua,

I was able to replicate the issue that you reported. Trying to figure out whether we should classify this as a bug or as unsupported behavior. We have a discrepancy of behavior on 3rd Generation Intel(R) Processors (works as you would expect) and subsequent versions. I am contacting compiler architects to see what their take on the situation is.

Hua_Z_ · ‎06-11-2015

Thank you, Robert. Could you please verify the issue with Iris 6200 GPU and OpenCL 2.0 driver? The code (struct contains pointer to itself) definitely makes sense in OpenCL 2.0 and should be supported. As for OpenCL 1.2 (HD4600 driver), I'm not sure (but AMD/NVidia supports).

Robert_I_Intel · ‎04-04-2016

This issue is resolved even on Haswell GPUs now.

Hua_Z_ · ‎04-05-2016

Okay. I will try later. Currently the driver version is "igdumdim64 20.19.15.4331 / Win8 64", does it contain the bug fix?

Robert_I_Intel · ‎04-05-2016

Not sure, since I don't have Windows 8.1 system readily available. You will have to try it.