Execution Model For the Intel GPU

lucas_w_ · ‎02-01-2016

I'm confused about the execution model for work-items. There are three compiled SIMD widths (SIMD8, SIMD16, SIMD32) for the Intel GPU.

1. For SIMD-X, does that mean X work-items execute simultaneously in one hardware thread?

2. Suppose an OpenCL kernel is compiled to SIMD16 and my work-group is a "square" work-group < 8, 8, 1 >. For SIMD16, does that mean the first two rows of 8 work-items (< 0~7, 0~1, 0 >) execute simultaneously?

3. Each EU has two SIMD4 FPUs. For SIMD16, does that mean 16 work-items run on one SIMD4 FPU in 4 passes, or on two SIMD4 FPUs in 2 passes?

PS: For local memory, are there 16 banks, each 32 bits wide?

Thanks!

Robert_I_Intel · ‎02-01-2016

Hi Lucas,

1. Yes.

2. Yes. The 8 by 8 work-group will end up executing on 4 hardware threads. I am not quite sure about the mapping of individual work-items to those threads, so your code shouldn't assume a particular mapping.

3. Typically, you should assume a SIMD16 instruction will execute in 4 cycles on one SIMD4 FPU.

4. Correct. Local memory is accessed through 16 banks; 4 bytes (32 bits) can be fetched from each bank in a single access.
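
The numbers in the answers above can be checked with a little arithmetic. The following is only a rough sketch in plain C, not Intel's actual scheduling logic: the helper names are made up for illustration, and the bank formula assumes that consecutive 32-bit words map to consecutive banks, which the answer does not state.

```c
#include <assert.h>

/* Hardware threads needed for one work-group, assuming each hardware
   thread runs simd_width work-items (rounded up). Hypothetical helper. */
static unsigned hw_threads_per_group(unsigned group_size, unsigned simd_width)
{
    assert(simd_width > 0);
    return (group_size + simd_width - 1) / simd_width;
}

/* Cycles for one SIMD instruction issued on a single FPU that is
   fpu_lanes wide (rounded up). Hypothetical helper. */
static unsigned cycles_per_instruction(unsigned simd_width, unsigned fpu_lanes)
{
    assert(fpu_lanes > 0);
    return (simd_width + fpu_lanes - 1) / fpu_lanes;
}

/* Bank touched by a local-memory byte address, ASSUMING 16 banks of
   4 bytes each and consecutive 32-bit words in consecutive banks. */
static unsigned local_bank(unsigned byte_addr)
{
    return (byte_addr / 4u) % 16u;
}
```

Under these assumptions, hw_threads_per_group(8 * 8 * 1, 16) gives 4 hardware threads, cycles_per_instruction(16, 4) gives 4 cycles, and byte addresses 0 and 64 both land in bank 0, so two work-items reading those addresses in the same access would conflict.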

lucas_w_ · ‎02-02-2016

Hi Robert,

1. Following up on the third question, I have another one. For example:

// work-group < 64, 1, 1 >, assumed compiled to SIMD16
__kernel void vectorProcess(__global const uchar4 *A,
                            __global const uchar4 *B,
                            __global uchar4 *C)
{
    // each work-item adds one uchar4 element
    uint id = get_local_id(0);
    C[id] = A[id] + B[id];
}

For vector data (uchar4), how does the SIMD4 FPU execute? Does it process one work-item per cycle, needing 16 cycles for all 16 work-items? Or does it process the first component of the vector for 4 work-items at a time?

2. When an OpenCL kernel reads local memory, does the data pass through the L3 cache, whose cache line is 64 bytes?

Thanks.