OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.
This forum covers OpenCL* for CPU only. OpenCL* for GPU questions can be asked in the GPU Compute Software forum. Intel® FPGA SDK for OpenCL™ questions can be asked in the FPGA Intel® High Level Design forum.

Confusion about Intel graphics and OpenCL implementation


Hi there,

I've been reading the Compute Architecture of Intel Processor Graphics Gen7.5 document and got confused about the SIMD FPU and kernel instance concepts. As mentioned in the doc, each FPU is physically SIMD-4 wide, while the number of kernel instances that run in parallel is decided by the compiler: up to 32 instances can work in parallel even though the FPU is SIMD-4. For example, say I have 4 hardware threads active in an EU, each compiled as SIMD-32 and running the same instruction, a float4 add. That means 32 x 4 = 128 work-items are working in parallel. How is this much computation loaded onto 2 FPUs? Do the add operations run serially on the FPUs? How many cycles will these 128 float4 adds cost?

Looking forward to some help on this subject. Thank you.



2 Replies

I'm still learning this myself, but this is my understanding of how this works:


"Within each EU, the primary computation units are a pair of SIMD floating point units (FPU). Although called FPUs, they support both floating point and integer computation. These units can SIMD execute up to _four_ 32-bit floating point (or integer) operations"

So each FPU can do one SIMD operation on a standard 128-bit OpenCL data type (float4, etc.):
 4 SIMD operations on 32-bit floats or integers
 8 SIMD operations on 16-bit integers (i.e. short)
 16 SIMD operations on 8-bit integers (i.e. uchar)
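
As a rough sketch of the list above (only the arithmetic, not a description of the actual hardware datapath), the lane count is the 128-bit width divided by the element size. The name `lanes_per_fpu` and the constant `FPU_WIDTH_BITS` are made up for illustration, and the 128-bit width is an assumption taken from the "four 32-bit operations" in the quoted passage:

```python
# Assumption: one FPU operation covers 128 bits (4 lanes x 32 bits),
# as implied by the quoted passage. Names here are hypothetical.
FPU_WIDTH_BITS = 128


def lanes_per_fpu(element_bits: int) -> int:
    """Return how many elements of the given width fit in one FPU-wide operation."""
    if element_bits <= 0 or FPU_WIDTH_BITS % element_bits != 0:
        raise ValueError("element width must evenly divide the FPU width")
    return FPU_WIDTH_BITS // element_bits


for type_name, bits in [("float/int", 32), ("short", 16), ("uchar", 8)]:
    print(f"{type_name:10s} {bits:2d}-bit -> {lanes_per_fpu(bits)} lanes per FPU")
```
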

Since there are 2 FPUs, this translates to 8, 16, or 32 possible simultaneous operations. However, SIMD32 isn't always possible. It requires 8-bit data types and very small kernels with low register usage. The compiler chooses SIMD8, SIMD16, or SIMD32 automatically.

(For more info see

With 32-bit float data you will be in SIMD8 mode: 2 FPUs x 4 SIMD ops. Your work-groups and work-items can of course work on larger chunks of data, but only 2 x 4 operations will happen simultaneously (or 16 if a multiply-add is counted as two operations).
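
To put a number on the original question, here is a back-of-the-envelope estimate based on my understanding above. It assumes each of the 2 FPUs retires 4 float adds per cycle and ignores instruction latency, thread switching, and memory traffic, so treat the result as a lower bound rather than a measured figure. The function `min_issue_cycles` and its parameters are hypothetical names for illustration only:

```python
import math


def min_issue_cycles(hw_threads: int,
                     work_items_per_thread: int,
                     components_per_item: int,
                     fpus: int = 2,
                     lanes_per_fpu: int = 4) -> int:
    """Lower bound on cycles needed to push all scalar adds through the FPUs.

    Assumes every FPU lane completes one 32-bit add per cycle and that
    nothing else (latency, memory, scheduling) gets in the way.
    """
    total_scalar_ops = hw_threads * work_items_per_thread * components_per_item
    ops_per_cycle = fpus * lanes_per_fpu
    return math.ceil(total_scalar_ops / ops_per_cycle)


# Scenario from the question: 4 hardware threads, SIMD-32 each, one float4 add.
# 4 * 32 * 4 = 512 scalar adds at 8 per cycle.
print(min_issue_cycles(hw_threads=4, work_items_per_thread=32, components_per_item=4))  # 64
```

So under these assumptions the adds are serialized through the two FPUs over at least 64 cycles; they do not all happen at once.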



Looks like we are on the same page here. After running some tests, the results agreed with my speculation: it did work the way you described.

Thank you for your help.
