Confusion about Intel graphics and OpenCL implementation

Tianruo_Z_ · ‎12-27-2015

Hi there,

I've been reading the Compute Architecture of Intel Processor Graphics Gen7.5 doc and got confused about the SIMD FPU and kernel instance concepts. As mentioned in the doc, each FPU is physically SIMD-4 wide, while the number of kernel instances that run in parallel depends on the compiler and can be up to 32. How do 32 instances run in parallel when the FPU is only SIMD-4? For example, say I have 4 hardware threads active in an EU, each compiled as SIMD-32 and running the same instruction, a float4 add. That means 32x4 work-items are working in parallel. How is this much computation loaded onto 2 FPUs? Do the add operations run serially in the FPUs? How many cycles will these 128 float4 adds cost?

Looking forward to some help on this subject. Thank you.

Jeffrey_M_Intel1 · ‎12-29-2015

I'm still learning this myself, but this is my understanding of how this works:

From https://software.intel.com/sites/default/files/managed/f3/13/Compute_Architecture_of_Intel_Processor_Graphics_Gen7dot5_Aug2014.pdf:

"Within each EU, the primary computation units are a pair of SIMD floating point units (FPU). Although called FPUs, they support both floating point and integer computation. These units can SIMD execute up to _four_ 32-bit floating point (or integer) operations"

So each FPU can perform one SIMD operation on a standard 128-bit OpenCL data type (float4, etc.):
4 SIMD operations on 32-bit floats or integers
8 SIMD operations on 16-bit integers (i.e. short)
16 SIMD operations on 8-bit integers (i.e. uchar)

Since there are 2 FPUs, this translates to 8, 16, or 32 possible simultaneous operations. However, SIMD32 isn't always possible. It requires 8-bit data types and very small kernels with low register usage. The compiler chooses SIMD8, SIMD16, or SIMD32 automatically.

(For more info see https://software.intel.com/en-us/forums/opencl/topic/538075)
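
The lane counts above are just the 128-bit FPU width divided by the element size, then doubled for the two FPUs. A minimal Python sketch of that arithmetic, following the reasoning in this post; `FPU_WIDTH_BITS`, `FPUS_PER_EU`, `lanes_per_fpu`, and `lanes_per_eu` are illustrative names I made up, not anything from the Intel documentation:

```python
# Assumption: each FPU handles 128 bits per operation (4 x 32-bit lanes),
# and each EU has a pair of FPUs, as quoted from the Gen7.5 doc above.
FPU_WIDTH_BITS = 128
FPUS_PER_EU = 2


def lanes_per_fpu(element_bits: int) -> int:
    """Elements of the given width that fit in one 128-bit FPU operation."""
    return FPU_WIDTH_BITS // element_bits


def lanes_per_eu(element_bits: int) -> int:
    """Simultaneous element operations across both FPUs of one EU."""
    return FPUS_PER_EU * lanes_per_fpu(element_bits)


for name, bits in [("float/int", 32), ("short", 16), ("uchar", 8)]:
    print(f"{name:9s} {bits:2d}-bit: "
          f"{lanes_per_fpu(bits):2d} per FPU, {lanes_per_eu(bits):2d} per EU")
```

This only reproduces the counting argument. It says nothing about which SIMD width the compiler will actually pick for a given kernel.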

With 32-bit float data you will be in SIMD8 mode: 2 FPUs x 4 SIMD ops. Your work-groups and work-items can of course work on larger chunks of data, but only 2x4 operations will happen simultaneously (or 16 in the case of multiply-add, which counts as two operations per lane).
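
To put a rough number on the scenario in the question (4 hardware threads, 32 work-items each, one float4 add per work-item), here is a back-of-the-envelope Python sketch. It assumes the peak rate described above of 2 FPUs x 4 lanes = 8 float adds per cycle per EU, and it ignores instruction latency, thread arbitration, and memory access, so treat the result as a lower bound rather than a measured figure. The function name `min_cycles` and its parameters are hypothetical.

```python
import math

# Assumptions taken from the discussion above, not measured values:
FPUS_PER_EU = 2     # a pair of FPUs per EU
LANES_PER_FPU = 4   # SIMD-4 on 32-bit floats


def min_cycles(hw_threads: int, work_items_per_thread: int,
               floats_per_item: int) -> tuple[int, int]:
    """Return (scalar float adds, lower-bound cycles at peak FPU rate)."""
    scalar_ops = hw_threads * work_items_per_thread * floats_per_item
    ops_per_cycle = FPUS_PER_EU * LANES_PER_FPU  # 8 adds per cycle
    return scalar_ops, math.ceil(scalar_ops / ops_per_cycle)


# 4 threads x 32 work-items = 128 float4 adds = 512 scalar float adds
scalar_ops, cycles = min_cycles(hw_threads=4, work_items_per_thread=32,
                                floats_per_item=4)
print(scalar_ops, cycles)  # prints: 512 64
```

So under these assumptions the adds are spread over many cycles rather than all executing at once: the wider "SIMD" width of a kernel is a logical width that gets fed through the 4-wide FPUs a piece at a time.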

Tianruo_Z_ · ‎12-29-2015

Looks like we are on the same page here. After running some tests, the results agreed with my speculation: it did work the way you described.

Thank you for your help.