Solved: "intel_sub_group_block_read8" fails to get correct values if the image is created from a buffer

Shan_K_Intel · 03-02-2020

intel_sub_group_block_read8(src, coord) returns incorrect values if the src 2D image is created from a buffer (method 1), as in the following steps:
cl_mem buf_from_hostptr = clCreateBuffer( context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR, N * M * sizeof(float), src, &err );
cl_image_desc desc;
...
desc.buffer = buf_from_hostptr;
clCreateImage( context, 0, &mbr_imageFormat, &desc, NULL, &err );

If I instead create the src 2D image directly from the src array (method 2):
mi_src0 = clCreateImage( context, CL_MEM_USE_HOST_PTR, &mbr_imageFormat, &desc, src, &err );
it works correctly.

I have a test app. You can get the code with "git clone https://github.com/kangshan0910/buffer2image.git"; run "make" to build the test binary.

The test creates an 8x8 matrix. In the OpenCL kernel, each work item in the subgroup calls intel_sub_group_block_read8 to read one column of data.
The 8x8 matrix is:
0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07
1.00 1.01 1.02 1.03 1.04 1.05 1.06 1.07
2.00 2.01 2.02 2.03 2.04 2.05 2.06 2.07
3.00 3.01 3.02 3.03 3.04 3.05 3.06 3.07
4.00 4.01 4.02 4.03 4.04 4.05 4.06 4.07
5.00 5.01 5.02 5.03 5.04 5.05 5.06 5.07
6.00 6.01 6.02 6.03 6.04 6.05 6.06 6.07
7.00 7.01 7.02 7.03 7.04 7.05 7.06 7.07
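
To make the expected result concrete, here is a minimal host-side sketch in plain C of what each work item should end up holding. It is only a model of the expected output, not the built-in itself; the names fill_matrix and expected_block_read8 are hypothetical and do not come from the test app.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 8
#define COLS 8

/* Fill a row-major matrix so that element (row, col) = row + col / 100,
 * matching the 8x8 matrix printed above. */
static void fill_matrix(float *m)
{
    for (int row = 0; row < ROWS; ++row)
        for (int col = 0; col < COLS; ++col)
            m[row * COLS + col] = (float)row + (float)col / 100.0f;
}

/* Hypothetical reference model: the work item with index `lane` is
 * expected to receive column `lane`, one value per row. */
static void expected_block_read8(const float *m, int lane, float out[ROWS])
{
    assert(lane >= 0 && lane < COLS);
    for (int row = 0; row < ROWS; ++row)
        out[row] = m[row * COLS + lane];
}
```

Calling expected_block_read8 with lane 1 yields the values of column 1, from row 0 down to row 7.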

For method 1, run "./test -b". Its output is:
matrix size: 8x8
group_xy(0,0) local_xy(00,00) data=0.00,0.00,2.00,2.00,4.00,4.00,6.00,6.00
group_xy(0,0) local_xy(01,00) data=0.01,0.01,2.01,2.01,4.01,4.01,6.01,6.01
group_xy(0,0) local_xy(02,00) data=0.02,0.02,2.02,2.02,4.02,4.02,6.02,6.02
group_xy(0,0) local_xy(03,00) data=0.03,0.03,2.03,2.03,4.03,4.03,6.03,6.03
group_xy(0,0) local_xy(04,00) data=0.04,0.04,2.04,2.04,4.04,4.04,6.04,6.04
group_xy(0,0) local_xy(05,00) data=0.05,0.05,2.05,2.05,4.05,4.05,6.05,6.05
group_xy(0,0) local_xy(06,00) data=0.06,0.06,2.06,2.06,4.06,4.06,6.06,6.06
group_xy(0,0) local_xy(07,00) data=0.07,0.07,2.07,2.07,4.07,4.07,6.07,6.07
This is incorrect.

For method 2, run "./test". Its output is:
matrix size: 8x8
group_xy(0,0) local_xy(00,00) data=0.00,1.00,2.00,3.00,4.00,5.00,6.00,7.00
group_xy(0,0) local_xy(01,00) data=0.01,1.01,2.01,3.01,4.01,5.01,6.01,7.01
group_xy(0,0) local_xy(02,00) data=0.02,1.02,2.02,3.02,4.02,5.02,6.02,7.02
group_xy(0,0) local_xy(03,00) data=0.03,1.03,2.03,3.03,4.03,5.03,6.03,7.03
group_xy(0,0) local_xy(04,00) data=0.04,1.04,2.04,3.04,4.04,5.04,6.04,7.04
group_xy(0,0) local_xy(05,00) data=0.05,1.05,2.05,3.05,4.05,5.05,6.05,7.05
group_xy(0,0) local_xy(06,00) data=0.06,1.06,2.06,3.06,4.06,5.06,6.06,7.06
group_xy(0,0) local_xy(07,00) data=0.07,1.07,2.07,3.07,4.07,5.07,6.07,7.07
This is correct.

The relevant part of my clinfo output is:
Number of platforms 3
Platform Name Intel(R) OpenCL HD Graphics
Platform Vendor Intel(R) Corporation
Platform Version OpenCL 2.1
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_depth_images cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_image2d_from_buffer cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_intel_subgroups cl_intel_required_subgroup_size cl_intel_subgroups_short cl_khr_spir cl_intel_accelerator cl_intel_media_block_io cl_intel_driver_diagnostics cl_khr_priority_hints cl_khr_throttle_hints cl_khr_create_command_queue cl_khr_fp64 cl_khr_subgroups cl_khr_il_program cl_intel_spirv_device_side_avc_motion_estimation cl_intel_spirv_media_block_io cl_intel_spirv_subgroups cl_khr_spirv_no_integer_wrap_decoration cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_intel_unified_shared_memory_preview cl_intel_planar_yuv cl_intel_packed_yuv cl_intel_motion_estimation cl_intel_device_side_avc_motion_estimation cl_intel_advanced_motion_estimation cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_intel_va_api_media_sharing
Platform Host timer resolution 1ns
Platform Extensions function suffix INTEL
...
Platform Name Intel(R) OpenCL HD Graphics
Number of devices 1
Device Name Intel(R) Gen9 HD Graphics NEO
Device Vendor Intel(R) Corporation
Device Vendor ID 0x8086
Device Version OpenCL 2.1 NEO
Driver Version 20.01.15264
Device OpenCL C Version OpenCL C 2.0
Device Type GPU
...

Ben_A_Intel · 03-02-2020

Hello,

First, thank you for the excellent reproducer!

There are a few extra restrictions for the subgroup image block reads when the image is created from a buffer. See the bottom of the Intel subgroups spec:

https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_subgroups.html

Specifically:

"When reading or writing a 2D image created from a buffer with the subgroup block read and write built-ins, the image row pitch is required to be a multiple of 64-bytes, in addition to the CL_DEVICE_IMAGE_PITCH_ALIGNMENT requirements."

In the reproducer, the image row pitch is 32 bytes. Can you try an image row pitch of 64 bytes instead, either by making the image 16 pixels wide rather than 8, or by padding the image?
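
As a rough sketch of the padding option, the helpers below round the row pitch up to a multiple of 64 bytes and copy tightly packed rows into a padded staging buffer. This is plain C with no OpenCL calls; the function names are hypothetical, and the 4 bytes per pixel is an assumption inferred from the 32-byte pitch of an 8-pixel row.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Round the row size in bytes up to the next multiple of `alignment`
 * (64 for the subgroup block read restriction quoted above). */
static size_t padded_row_pitch(size_t width_px, size_t bytes_per_pixel,
                               size_t alignment)
{
    assert(alignment > 0);
    size_t row_bytes = width_px * bytes_per_pixel;
    return (row_bytes + alignment - 1) / alignment * alignment;
}

/* Copy `rows` tightly packed rows of `row_bytes` each from src into dst,
 * where consecutive dst rows start `pitch` bytes apart.
 * dst must hold rows * pitch bytes; padding bytes are zeroed. */
static void copy_rows_padded(const void *src, void *dst, size_t rows,
                             size_t row_bytes, size_t pitch)
{
    assert(pitch >= row_bytes);
    memset(dst, 0, rows * pitch);
    for (size_t r = 0; r < rows; ++r)
        memcpy((unsigned char *)dst + r * pitch,
               (const unsigned char *)src + r * row_bytes,
               row_bytes);
}
```

For the 8x8 case, padded_row_pitch(8, 4, 64) gives 64, so the padded buffer would be 8 * 64 bytes. That buffer would then be the one passed to clCreateBuffer, with desc.image_row_pitch set to the padded pitch; the CL_DEVICE_IMAGE_PITCH_ALIGNMENT requirement still has to be checked separately on the device.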
