I'm confused about the execution model for work-items. There are three compiled SIMD widths (SIMD8, SIMD16, SIMD32) for Intel GPUs.
1. For SIMD-X, does that mean X work-items execute simultaneously in one hardware thread?
2. Suppose an OpenCL kernel is compiled to SIMD16 and my work-group is a "square" work-group < 8, 8, 1 >. For SIMD16, does that mean the first two rows of 8 work-items (< 0~7, 0~1, 1 >) execute simultaneously?
3. Each EU has two SIMD4 FPUs. For SIMD16, does that mean the 16 work-items run on one SIMD4 FPU in 4 passes, or on both SIMD4 FPUs in 2 passes?
PS: For local memory, are there 16 banks, each 32 bits wide?
Thanks!
Hi Lucas,
1. Yes.
2. Yes. The 8 by 8 work-group will execute on 4 hardware threads (64 work-items at 16 per thread). I am not sure how the individual work-items map to those threads, and your code shouldn't assume a particular mapping.
3. Typically, you should assume a SIMD16 instruction will execute in 4 cycles on one SIMD4 FPU.
4. Correct. Local memory is accessed through 16 banks, and 4 bytes (32 bits) can be fetched from each bank in a single access.
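The arithmetic behind answers 2 to 4 can be sketched in plain C. This is only an illustration, not Intel's actual scheduling logic: the function names are made up for this example, and the bank function assumes that consecutive 32-bit words map to consecutive banks, which the reply above does not state.

```c
#include <stddef.h>

/* Hardware threads needed for one work-group:
   ceil(work-items / SIMD width). */
unsigned hw_threads_per_work_group(unsigned wg_x, unsigned wg_y,
                                   unsigned wg_z, unsigned simd_width)
{
    unsigned work_items = wg_x * wg_y * wg_z;
    return (work_items + simd_width - 1) / simd_width;
}

/* Passes through an FPU with fpu_lanes lanes for one SIMD instruction,
   following the rule of thumb "SIMD16 on a SIMD4 FPU takes 4 cycles". */
unsigned fpu_passes(unsigned simd_width, unsigned fpu_lanes)
{
    return (simd_width + fpu_lanes - 1) / fpu_lanes;
}

/* ASSUMPTION (not confirmed in the thread): 16 banks of 4 bytes each,
   interleaved word by word, so the bank is the 32-bit word index modulo 16. */
unsigned local_memory_bank(size_t byte_offset)
{
    return (unsigned)((byte_offset / 4u) % 16u);
}
```

For the < 8, 8, 1 > work-group compiled to SIMD16, hw_threads_per_work_group(8, 8, 1, 16) gives 4 and fpu_passes(16, 4) gives 4, matching the numbers in the reply. Under the interleaving assumption, byte offsets 0 and 64 would land in the same bank.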
Robert Ioffe (Intel) wrote:
Hi Lucas,
1. Yes.
2. Yes. The 8 by 8 work-group will execute on 4 hardware threads (64 work-items at 16 per thread). I am not sure how the individual work-items map to those threads, and your code shouldn't assume a particular mapping.
3. Typically, you should assume a SIMD16 instruction will execute in 4 cycles on one SIMD4 FPU.
4. Correct. Local memory is accessed through 16 banks, and 4 bytes (32 bits) can be fetched from each bank in a single access.
Hi Robert,
1. Following on from the third question, I have another question. For example:
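// work-group < 64, 1, 1 >, assume compiled as SIMD16
__kernel void vectorProcess(__global const uchar4 *A,
                            __global const uchar4 *B,
                            __global uchar4 *C)
{
    // Index within the work-group (a single work-group of 64 work-items is assumed).
    uint id = get_local_id(0);
    // Component-wise add of the four uchar components.
    C[id] = A[id] + B[id];
}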
For vector data (uchar4), how does the SIMD4 FPU execute? Does it process one work-item per cycle, needing 16 cycles for all 16 work-items? Or does it process the first vector component of 4 work-items at a time?
2. When an OpenCL kernel reads local memory, does the data pass through the L3 cache, whose cache line size is 64 bit?
Thanks.
