OpenCL* for CPU

Execution Model For the Intel GPU

lucas_w_
Beginner

I'm confused about the execution model of work-items. There are three compiled SIMD modes (SIMD8, SIMD16, SIMD32) for Intel GPUs.

1. For SIMD-X, does that mean X work-items execute simultaneously in one hardware thread?

2. Suppose an OpenCL kernel is compiled to SIMD16 and my work-group is a "square" work-group <8, 8, 1>. For SIMD16, does that mean the first two rows of 8 work-items (local IDs <0~7, 0~1, 0>) execute simultaneously?

3. Each EU has two SIMD4 FPUs. For SIMD16, does that mean the 16 work-items run on one SIMD4 FPU in 4 passes, or on the two SIMD4 FPUs in 2 passes?

PS: For local memory, are there 16 banks, each 32 bits wide?

    Thanks!

 

2 Replies
Robert_I_Intel
Employee

Hi Lucas,

1. Yes.

2. Yes. The 8 by 8 work-group will end up executing on 4 hardware threads (64 work-items / 16 work-items per SIMD16 thread = 4). I am not sure how the individual work-items map to those threads, and your code shouldn't assume a particular mapping.

3. Typically, you should assume a SIMD16 instruction will execute in 4 cycles on one SIMD4 FPU.

4. Correct. Local memory is accessed through 16 banks, and 4 bytes (32 bits) can be fetched from each bank in a single access.
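
The arithmetic behind answers 2 to 4 can be sketched in plain C. This is only an illustration of the numbers quoted above, not a description of the actual hardware scheduler. The function names are hypothetical, and the bank mapping (consecutive 32-bit words landing in consecutive banks) is an assumption that the answers above do not confirm.

```c
#include <assert.h>

/* Hardware threads needed for one work-group: each thread runs
   simd_width work-items, so divide and round up. */
unsigned hw_threads_per_group(unsigned group_size, unsigned simd_width)
{
    assert(simd_width > 0);
    return (group_size + simd_width - 1) / simd_width;
}

/* Cycles to issue one SIMD instruction on a narrower FPU,
   e.g. a SIMD16 instruction on a SIMD4 FPU. */
unsigned issue_cycles(unsigned simd_width, unsigned fpu_lanes)
{
    assert(fpu_lanes > 0);
    return (simd_width + fpu_lanes - 1) / fpu_lanes;
}

/* ASSUMPTION: 16 banks of 4 bytes each, with consecutive 32-bit
   words mapped to consecutive banks. */
unsigned local_mem_bank(unsigned byte_offset)
{
    return (byte_offset / 4u) % 16u;
}
```

For the <8, 8, 1> work-group, hw_threads_per_group(64, 16) gives 4 and issue_cycles(16, 4) gives 4. Under the assumed mapping, byte offsets 0 and 64 fall in the same bank, so two work-items touching those addresses in one access would conflict.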

lucas_w_
Beginner

Hi Robert,
   1. Following up on the third question, I have another one. For example:
   // work-group <64, 1, 1>, assumed to be compiled as SIMD16
   // kernel pointer arguments need an address space qualifier
   __kernel void vectorProcess(__global const uchar4 *A,
                               __global const uchar4 *B,
                               __global uchar4 *C)
   {
      uint id = get_local_id(0);   // index within the work-group
      C[id] = A[id] + B[id];       // component-wise uchar4 add
   }
   For vector data (uchar4), how does the SIMD4 FPU execute this? Does it process one work-item per cycle and need 16 cycles for all 16 work-items? Or does it process 4 work-items at a time for the first component of the vector data?
   2. When an OpenCL kernel reads local memory, does the data pass through the L3 cache, whose cache line size is 64 bit?
      
   Thanks.

 

 
