OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU
Announcements
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.

intel_sub_group_block_read8 gets unexpected column data with large work group size

Shengquan_Y_Intel
217 Views

Hi,

I ran into a problem when using intel_sub_group_block_read8 to read an image2D. The usage is very simple: just contiguous reads of uint8 values. The image2D is 64x32 (width x height), and each pixel is a uint. Part of the image2D is printed below for illustration:

 0     1     2     3     4     5     6     7
--------------------------------------------------
0|0x000 0x001 0x002 0x003 0x004 0x005 0x006 0x007
1|0x040 0x041 0x042 0x043 0x044 0x045 0x046 0x047
2|0x080 0x081 0x082 0x083 0x084 0x085 0x086 0x087
3|0x0c0 0x0c1 0x0c2 0x0c3 0x0c4 0x0c5 0x0c6 0x0c7
4|0x100 0x101 0x102 0x103 0x104 0x105 0x106 0x107
5|0x140 0x141 0x142 0x143 0x144 0x145 0x146 0x147
6|0x180 0x181 0x182 0x183 0x184 0x185 0x186 0x187
7|0x1c0 0x1c1 0x1c2 0x1c3 0x1c4 0x1c5 0x1c6 0x1c7

8|0x200 0x201 0x202 0x203 0x204 0x205 0x206 0x207
9|0x240 0x241 0x242 0x243 0x244 0x245 0x246 0x247
10|0x280 0x281 0x282 0x283 0x284 0x285 0x286 0x287
11|0x2c0 0x2c1 0x2c2 0x2c3 0x2c4 0x2c5 0x2c6 0x2c7
12|0x300 0x301 0x302 0x303 0x304 0x305 0x306 0x307
13|0x340 0x341 0x342 0x343 0x344 0x345 0x346 0x347
14|0x380 0x381 0x382 0x383 0x384 0x385 0x386 0x387
15|0x3c0 0x3c1 0x3c2 0x3c3 0x3c4 0x3c5 0x3c6 0x3c7

For this 64x32 image2D, I set the global work size to 64x4, and each work item reads a uint8.

The problem is:

If I use a large work group size such as 64x4 or 32x4 (work group size[1] is 4), I can't read the expected column data at some locations. E.g., I expect "0x200 0x240 0x280 0x2c0 0x300 0x340 0x380 0x3c0" at byte coordinate (0,8), but I actually get "0x4 0x44 0x84 0xc4 0x104 0x144 0x184 0x1c4".

If the work group size is 64x2 or 32x2 (work group size[1] is 2), I do get "0x200 0x240 0x280 0x2c0 0x300 0x340 0x380 0x3c0" from byte coordinate (0,8).

 

1.  ./transpose -y 4 | tee wg_4.log (work group is 64x4, which gets unexpected column data at byte coordinate (0,8))

br_src = 0x4 0x44 0x84 0xc4 0x104 0x144 0x184 0x1c4
|--subgrp_size(work items)=16, subgrp_size_max=16 subgrp_num_in_work_group=16 (group_size=64,4), subgrp_id=4
|--src img_coord(0,8)/byte_coord_xy(0,8)/group_xy(0,0)/local_xy(0,1),size (64,4)/


2.  ./transpose | tee wg_2.log (work group is 64x2, which gets the expected column data at byte coordinate (0,8))

br_src = 0x200 0x240 0x280 0x2c0 0x300 0x340 0x380 0x3c0
|--subgrp_size(work items)=16, subgrp_size_max=16 subgrp_num_in_work_group=8 (group_size=64,2), subgrp_id=4
|--src img_coord(0,8)/byte_coord_xy(0,8)/group_xy(0,0)/local_xy(0,1),size (64,2)/

The sample is attached: block_read.zip. My environment: Intel(R) HD Graphics Skylake ULX GT2, driver version 16.5.56895.

Could someone take a look and point out where the problem is?

Thanks

-Austin

4 Replies
Jeffrey_M_Intel1
Employee

I can replicate what you're seeing. The next step is to set up the tools and try to understand what is going on. I hope to get back to you in a day or two with more info.

Shengquan_Y_Intel

Thanks for looking at it. I have some more information.

According to VTune, the sample kernel is compiled SIMD16.

With the patch below (which just prints sub_group_id and sub_group_local_id), I found:

For work group size 64x4, the same subgroup_id = 0 reports get_sub_group_local_id values 0/1/2/3, 0/1/2/3, ..., 0/1/2/3.

For work group size 64x2, the same subgroup_id = 0 reports get_sub_group_local_id values 0/1/2/3/4/5/6/.../12/13/14/15.

While the spec says:

uint get_sub_group_local_id( void )
Returns the unique work item ID within the current subgroup. The mapping from get_local_id to get_sub_group_local_id will be invariant for the lifetime of the work group.

 

diff -Nur block-read/transpose.cl transpose/transpose.cl

--- block-read/transpose.cl     2016-11-08 17:12:08.000000000 +0800
+++ transpose/transpose.cl      2016-11-10 16:07:34.969383908 +0800
@@ -52,12 +52,13 @@
     const int subgrp_size_max = get_max_sub_group_size();
     const int subgrp_num = get_num_sub_groups();
     const int subgrp_id = get_sub_group_id();
+    const int subgrp_local_id = get_sub_group_local_id();

     printf("br_src = 0x%x 0x%x 0x%x 0x%x 0x%x 0x%x 0x%x 0x%x\n"
-           "|--subgrp_size(work items)=%d, subgrp_size_max=%d subgrp_num_in_work_group=%d (group_size=%d,%d), subgrp_id=%d\n"
+           "|--subgrp_size(work items)=%d, subgrp_size_max=%d subgrp_num_in_work_group=%d (group_size=%d,%d), subgrp_id=%d, subgrp_local_id=%d\n"
            "|--src img_coord(%d,%d)/byte_coord_xy(%d,%d)/group_xy(%d,%d)/local_xy(%d,%d),size (%d,%d)/\n",
-           br_src.elvector.s0, br_src.elvector.s1, br_src.elvector.s2, br_src.elvector.s3,br_src.elvector.s4, br_src.elvector.s5, br_src.elvector.s6, br_src.elvector.s7,
-           subgrp_size, subgrp_size_max, subgrp_num, group_size_x, group_size_y, subgrp_id,
+           br_src.elarray[0], br_src.elarray[1], br_src.elarray[2], br_src.elarray[3],br_src.elarray[4], br_src.elarray[5], br_src.elarray[6], br_src.elarray[7],
+           subgrp_size, subgrp_size_max, subgrp_num, group_size_x, group_size_y, subgrp_id, subgrp_local_id,
            img_coord.x, img_coord.y, img_byte_coord.x, img_byte_coord.y, group_x, group_y, local_x, local_y, group_size_x, group_size_y);
 #endif
 }

The log for work group size = 64x4 (the image is 64x32 pixels; the whole NDRange contains only one work group):

We can see that for "subgrp_id=0", subgrp_local_id cycles 0/1/2/3, 0/1/2/3, ...

root@skl-austin:~/transpose# less 1.log |grep "subgrp_id=0"
|--subgrp_size(work items)=16, subgrp_size_max=16 subgrp_num_in_work_group=16 (group_size=64,4), subgrp_id=0, subgrp_local_id=0
|--subgrp_size(work items)=16, subgrp_size_max=16 subgrp_num_in_work_group=16 (group_size=64,4), subgrp_id=0, subgrp_local_id=1
|--subgrp_size(work items)=16, subgrp_size_max=16 subgrp_num_in_work_group=16 (group_size=64,4), subgrp_id=0, subgrp_local_id=2
|--subgrp_size(work items)=16, subgrp_size_max=16 subgrp_num_in_work_group=16 (group_size=64,4), subgrp_id=0, subgrp_local_id=3
|--subgrp_size(work items)=16, subgrp_size_max=16 subgrp_num_in_work_group=16 (group_size=64,4), subgrp_id=0, subgrp_local_id=0
|--subgrp_size(work items)=16, subgrp_size_max=16 subgrp_num_in_work_group=16 (group_size=64,4), subgrp_id=0, subgrp_local_id=1
|--subgrp_size(work items)=16, subgrp_size_max=16 subgrp_num_in_work_group=16 (group_size=64,4), subgrp_id=0, subgrp_local_id=2
|--subgrp_size(work items)=16, subgrp_size_max=16 subgrp_num_in_work_group=16 (group_size=64,4), subgrp_id=0, subgrp_local_id=3
|--subgrp_size(work items)=16, subgrp_size_max=16 subgrp_num_in_work_group=16 (group_size=64,4), subgrp_id=0, subgrp_local_id=0
|--subgrp_size(work items)=16, subgrp_size_max=16 subgrp_num_in_work_group=16 (group_size=64,4), subgrp_id=0, subgrp_local_id=1
|--subgrp_size(work items)=16, subgrp_size_max=16 subgrp_num_in_work_group=16 (group_size=64,4), subgrp_id=0, subgrp_local_id=2
|--subgrp_size(work items)=16, subgrp_size_max=16 subgrp_num_in_work_group=16 (group_size=64,4), subgrp_id=0, subgrp_local_id=3
|--subgrp_size(work items)=16, subgrp_size_max=16 subgrp_num_in_work_group=16 (group_size=64,4), subgrp_id=0, subgrp_local_id=0
|--subgrp_size(work items)=16, subgrp_size_max=16 subgrp_num_in_work_group=16 (group_size=64,4), subgrp_id=0, subgrp_local_id=1
|--subgrp_size(work items)=16, subgrp_size_max=16 subgrp_num_in_work_group=16 (group_size=64,4), subgrp_id=0, subgrp_local_id=2
|--subgrp_size(work items)=16, subgrp_size_max=16 subgrp_num_in_work_group=16 (group_size=64,4), subgrp_id=0, subgrp_local_id=3

 

Ben_A_Intel
Employee

For 2-dimensional work groups, there's no guarantee that work items will be assigned to subgroups left-to-right and top-to-bottom, even though this is usually the intuitive mapping.  In many cases - like the one you describe above - the OpenCL runtime may assign work items to subgroups in two-dimensional blocks, which can improve performance when sampling or writing two-dimensional image data (due to the way caches work for images), but unfortunately complicates subgroup programming.

Internally, we usually use 1-dimensional work groups for subgroup programming to avoid this issue.  If you absolutely must use a 2-dimensional work group, you can reconstruct the "intuitive" behavior as a function of the subgroup ID, subgroup size, and subgroup local ID, but in that case you should be very careful not to mix your calculated IDs with the usual local IDs / global IDs.

Hope this helps!

Shengquan_Y_Intel

Hi, Ben

This information is really helpful. With 1D work groups, I now get correct block reads and writes.

I want to wrap up what I reported above:

  • For work group size 64x4, the same subgroup_id = 0 reports get_sub_group_local_id values 0/1/2/3, 0/1/2/3, ..., 0/1/2/3.
  • For work group size 64x2, the same subgroup_id = 0 reports get_sub_group_local_id values 0/1/2/3/4/5/6/.../12/13/14/15.

I cleaned up the test tool and attached it here again, in case this turns out to be a real issue and someone wants to follow up.

Thanks

-Austin

 
