I am taking the reference from the intel_gen8_arch
Few sections are causing confusion in my understanding for SIMD engine concept.
5.3.2 SIMD FPUs Within each EU, the primary computation units are a pair of SIMD floating-point units (FPUs). Although called FPUs, they support both floating-point and integer computation. These units can SIMD execute up to four 32-bit floating-point (or integer) operations, or SIMD execute up to eight 16-bit integer or 16-bit floating-point operations. Each SIMD FPU can complete simultaneous add and multiply (MAD) floating-point instructions every cycle. Thus each EU is capable of 16 32-bit floating-point operations per cycle: (add + mul) x 2 FPUs x SIMD-4.
The above lines of the documents clearly states the maximum floating point operations that can be done on each Execution Unit.
First Question: I think it is referring to per hardware thread of Execution unit than the whole execution unit. Am I right here?
In section 5.3.5 it mentions On Gen8 compute architecture, most SPMD programming models employ this style code generation and EU processor execution. Effectively, each SPMD kernel instance appears to execute serially and independently within its own SIMD lane. In actuality, each thread executes a SIMD-Width number of kernel instances concurrently. Thus for a SIMD-16 compile of a compute kernel, it is possible for SIMD-16 x 7 threads = 112 kernel instances to be executing concurrently on a single EU. Similarly, for a SIMD-32 compile of a compute kernel, 32 x 7 threads = 224 kernel instances could be executing concurrently on a single EU.
Now this section illustration seems contradicting with the section 5.3.2.
Specifically, 1) Since it says each HW thread of EU has 2, SIMD-4 units then how SIMD-32 works. How are we reaching to calculation of 224 on 7 threads. DO we combine some hardware threads?
Also, How we compile the kernel in SIMD-16 or SIMD-32 mode?