I have a question about the number of vector instructions that can be performed simultaneously.
- On a KNC, there are 61 cores.
- Since each core supports 4 threads, a total of 244 threads can hold single-precision floating-point numbers in VPU registers.
- The VPU is 512 bits wide, so 16 single-precision floating-point numbers can be processed per vector instruction on each thread.
Now the question is: how many threads can perform vector instructions on 16 floats simultaneously? 244 or 61?
If the VPU is a kind of physical core, perhaps only 61 vector instructions can be performed at once. Which number is correct?
Each KNC thread can issue a VPU instruction at best on alternate clock cycles, so 2 or 3 threads per core might be effective. One or more cores is usually busy with MPSS. You will need to benchmark your own code, as a performance peak at 118 threads is far from universal.
On KNC there is one fully-pipelined vector functional unit per physical core. Use of this functional unit (and all the other functional units) is shared by the threads. So the KNC chip has a total of 61 vector units and therefore a total of 61 vector instructions can be issued every cycle. As Tim P. noted, this requires a minimum of 122 threads.
For 64-bit floating-point operands, the 512-bit vector width corresponds to 8 elements per cycle. The VFMA instructions perform a multiply-add on each element (2 FP operations), so 16 FP operations per cycle per core.
The peak theoretical performance for 64-bit floating-point values is therefore: 61 cores * 1.1 GHz * 16 FP ops/cycle/core = 1073.6 GFLOPS
Note that KNL has two vector functional units per core, so the peak theoretical performance is much higher.
>>I have a question about the number of vector instructions that can be performed simultaneously.
That is an unspecific question, and as a result you typically get specific answers. John's answer is for highest estimated flops, but not for the highest number of concurrent instructions. If you consider "instruction" as time from instruction fetch to last cycle involved with the instruction, then (presumption on my part) instructions including memory references will typically experience instruction stalls while waiting for memory/data. During this stall, the instruction is still in the concurrency category, yet the VPU is available for use by other threads sharing the core. This will (should) have the effect of increasing the instruction concurrency (without increasing GFLOPS).
For KNC there are two instruction pipes, one can issue vector instructions and the other can issue all other instruction types. So the answer of "1 per core per cycle" is still correct as the maximum rate for vector instructions on KNC.
The phrase "performed simultaneously" from the original question is ambiguous, since most vector instructions on KNC are pipelined with (most commonly) six cycle latency. To simplify my answer I just addressed the question of how many vector instructions could be *issued* in a single cycle. (The answer is the same for retirement.) Because of the 6-cycle latency, issuing one vector instruction per cycle results in (after 6 cycles) 6 vector instructions that are "in the process of being executed simultaneously" in each of the 61 cores. Because of the unusual issue limitation of the KNC core, if there are 6 instructions in the vector pipeline, they must come from at least 2 different logical processors, but for the most part the pipeline does not care how many threads are issuing instructions.
IMHO this philosophical question is best answered with a benchmark structured in a manner that is consistent with the reader's concept of concurrency:
concurrent register to register copy
concurrent register to register integer
concurrent register to register floating point add
concurrent register to register floating point multiply
concurrent register to register floating point fused multiply and add
concurrent register to register floating point divide (of various precisions)
concurrent register to register floating point square root (of various precisions)
and the above with one memory operand (including load and store)
then permute with doubles
In addition to the above, are you going to factor in (extent of) concurrency of the same hardware thread having multiple instructions in flight?
The pipeline introduces ambiguities into the answer to this question.