GPU Compute Software
Ask questions about Intel® Graphics Compute software technologies, such as OpenCL* GPU driver and oneAPI Level Zero

global_work_size and power of two?

lolxdfly
Novice

I have problems with an OpenCL application if the global_work_size is not a power of two.

If the global work size is not a power of two, the program will finish without any error, but with wrong results. For example, if the global work size is 256 everything runs fine. But if the global work size is 255 or 257 the application will have wrong data in the output buffer.

Here is a test application with the same behavior:

https://gist.github.com/lolxdfly/f1e692b22f0b4e18d47f7d23d63b687a

It runs fine as listed, but if you change "size" in line 25 to, for example, 257, it produces this output:

Error at 0: 66 != 0

// ....
Error at 186: 66 != 56609

// ...
Error at 256: 66 != 0

It seems like the simple "copy"-kernel wrote random data to the out-buffer.

 

Is this a mistake on my part, or is it a bug in the driver? I remember that older hardware could handle this without problems.

 

GPU: Intel Corporation TigerLake-H GT1 [UHD Graphics] (rev 01)

Ubuntu 22.04 with kernels: 5.13.0-1020-oem, 5.15.0-27-generic and 5.17.0-1003-oem

intel-opencl-icd version: 22.14.22890-1

1 Solution
Ben_A_Intel
Employee

Hello, this is a very interesting question, thank you for writing this up and providing the reproducer!

To summarize, the test is creating two buffers with CL_MEM_USE_HOST_PTR:

 

cl::Buffer bufferIn = cl::Buffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, size * sizeof(int), in);
cl::Buffer bufferOut = cl::Buffer(context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, size * sizeof(int), out);

 

Then, after enqueuing the kernel for execution, the tester is calling clFinish before verifying the results:

 

queue.finish();

 

This will work in some cases, such as when the buffer is "zero-copy", but even with CL_MEM_USE_HOST_PTR the OpenCL runtime is free to make a copy if it chooses to.  In this case, calling clFinish alone isn't sufficient.

If I add calls to map the buffers before verifying (and then, to unmap the buffers after verifying) then the verification succeeds regardless of the buffer size and global work size.  The call to map the buffers is needed to tell the OpenCL runtime to copy data back into the host pointer if it has chosen to make a copy:

 

queue.enqueueMapBuffer(bufferIn, CL_TRUE, CL_MAP_READ, 0, size * sizeof(int));
queue.enqueueMapBuffer(bufferOut, CL_TRUE, CL_MAP_READ, 0, size * sizeof(int));

 

Note that we don't need to worry about the return value from mapping the buffer because we have CL_MEM_USE_HOST_PTR; the map is only needed to tell the OpenCL runtime to transfer the data to the host pointer if needed.

If you don't want to worry about mapping and unmapping OpenCL buffers, I would encourage you to take a look at our unified shared memory (USM) extension, and especially host USM or shared USM, which is accessible by both the host and device and does not need to be mapped.

https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_unified_shared_memory.html

I have a host USM example that is similar to your buffer copy test here:

https://github.com/bashbaug/SimpleOpenCLSamples/tree/master/samples/usm/200_hmemhelloworld

Hope this helps!

PS: Why does the global work size matter?  Actually, I don't think it does... but in the tester the global work size is also used to determine the size of the buffer, and the buffer size influences whether the OpenCL runtime is able to use the host memory in-place ("zero copy").  With a buffer of 257 integers the tester will still fail without mapping the buffer, even with a global work size of 256.  Likewise, the tester will "pass" with a buffer of 256 integers and a global work size of 255 - at least on our integrated GPUs!  Mapping the buffer will still be needed on discrete GPUs, or in any other case where the OpenCL runtime has chosen to make a copy of the host data.


4 Replies
Ben_A_Intel
Employee

(Accepted as solution; full text above.)

lolxdfly
Novice

Hi,

thanks for the info!

I knew that OpenCL can still decide to copy the buffer, but I thought this was transparent to the user and that OpenCL would copy it back after clFinish. Anyway, my application works now!

Also, thanks for the hint to USM! I will look into that.

luci
Beginner

@Ben_A_Intel wrote:

<quoted solution snipped>


Thanks for the help. I was also facing the same issue.

NoorjahanSk_Intel
Moderator

Hi,


Glad to know that your issue is resolved. If you need any additional information, please post a new question, as this thread will no longer be monitored by Intel.


Thanks & Regards,

Noorjahan.

