OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.
1687 Discussions

Confusion about Intel graphics and OpenCL implementation


Hi there,

I've been reading the Compute Architecture of Intel Processor Graphics Gen7.5 document and got confused about the SIMD FPU and kernel instance concepts. As the doc describes, each FPU is physically SIMD-4 wide, while the number of kernel instances that run in parallel is compiler-dependent and can be as high as 32. For example, suppose an EU has 4 hardware threads active, each compiled SIMD-32 and running the same instruction, a float4 add. That means 4 x 32 = 128 work items in flight. How is that much computation loaded onto just 2 FPUs? Does the add operation run serially through the FPUs? How many cycles would these 128 float4 adds cost?

Looking forward to getting some help on this. Thank you.



2 Replies

I'm still learning this myself, but this is my understanding of how this works:


"Within each EU, the primary computation units are a pair of SIMD floating point units (FPU). Although called FPUs, they support both floating point and integer computation. These units can SIMD execute up to _four_ 32-bit floating point (or integer) operations"

So each FPU can do a SIMD operation on a standard 128-bit OpenCL data type (float4, etc.):
 4 SIMD operations on 32-bit floats or integers
 8 SIMD operations on 16-bit integers (i.e. short)
 16 SIMD operations on 8-bit integers (i.e. uchar)

Since there are 2 FPUs, this translates to 8, 16, or 32 possible simultaneous operations.  However, SIMD32 isn't always possible.  It requires 8 bit data types and very small kernels/low register usage.  SIMD8, 16, or 32 is automatically chosen by the compiler.  

With 32 bit float data you will be in SIMD8 mode, 2 FPUs x 4 SIMD ops.  Your workgroups/items can of course work on larger chunks of data but only 2x4 operations will happen simultaneously (or 16 in the case of multiply-add). 



Looks like we are on the same page here. After running some tests, the results agreed with my speculation; it did work the way you described.

Thank you for your help.