You are absolutely right Core i7 can retire 4 instructutions/cycle. Regarding your calculations:
3x2.8GHz = 8.4Gflops/sec
you need to multiply this number with SIMD Width ie. 4 FP (32bit) per cycles. Secondly, you are also missing number of cores in the system as COREi-7 is not a single core machine as far as i know.
Assuming theoreticaly you are doing three 32 bit floating point instructions on three different execution engines then GFLOPS are:
3x2.8ghz x 4 (SIMD WIDTH) x 4 (number of cores) = 134GFLOPS
My above caluclations (previous reply) were based on 2 execution engines.
H Tim, so we can say then that there are indeed 4 32-b FP ALUs per FP capable port (or their equivalent). One could execute the same OP on quadtruple operands in a vector/pipelined fashion which means 1 result (vs 4) produced / clock.
When the manual says that Nehalem can retire 4 micro-ops / cycle, this includes two 4-way SP or two 2-way DP SIMD ones, right?
Can we use any of the tools (eg, VT) to assess how successfully a compiler packs operands to use the SIMD arithmetic ops in code?
thanks ... Michael
thanks for the pointer. I like the definite answers. You can understand that due to my ignorance in how the SIMD FPs are implemented within the Nehalem core I could only see the vectorized or the true SIMD approach and docs are not definite in this respect.
I am "new" to Intel64 ISA and its microarchitectures (I did ASM back in late 80s ... :)
BTW, is there any document which discusses the implementation of microarch in detail outside the standard Intel Docs?
thanks for the definite answer for this.
Everything started when the question was posed to me "what is the theoretical MAX FP performance of Nehalem cores?" I thought is should asymptotically be bounded by sustained rate of FP operation retirement in the core but then it wasn't clear to me if up to 4 SP FPs / cycle / issue port (SIMD) or 1 SP FPs / cycle / issue port (vector with pipelined ALUs) .
Thanks again. I will come back to maybe torture other answers out of the forum (maybe some uncore implementation stuff, L3 to core to QPI connection capbilities, etc. ;;)