On another note, are there separate floating-point execution units supporting the various SSE or MMX instructions? Or are they the same execution units used for the regular FP instructions? As far as I can tell, Nehalem cores have 5 execution units, 2 of which can carry out double-precision FP calculations, plus a 3rd unit doing FP shuffles. Is that correct?
What is the theoretical maximum FLOPS performance of a Nehalem core running at 2.8 GHz (no turbo)?
Thanks for any info on these......
Each core has its own execution units as you can find out from that article. Also, you may want to read more in Software
Intel 64 and IA-32 Architectures Optimization Reference Manual at http://www.intel.com/products/processor/manuals/
Regarding FLOPS: 2.8 GHz with 4 cores can support 90 GFLOPS.
I was wondering what the rate of floating-point ops / sec would be for a Nehalem core. From their h/w docs it appears that each core has 6 execution units, of which only 2 can carry out FP arithmetic ops (a 3rd one can do FP shuffles and the other 3 are for memory access). For the FLOPS (FP ops / sec), I cannot see how this can be > 3 / cycle.
If you take a look at:
http://www.intel.com/products/processor/manuals/ the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (p. 2.21, section 2.2 INTEL MICROARCHITECTURE (NEHALEM)) or the presentations at IDF Aug. 2008, you can see that all micro-ops are routed via 6 ports to the execution units, of which 2 (or 3 with abuse of logic) can deliver FP results. Quoting p. 2.23:
" The scheduler (or reservation station) can dispatch up to six micro-ops in one cycle
through six issue ports (five issue ports are shown in Figure 2-5; store operation
involves separate ports for store address and store data but is depicted as one in the
The out-of-order engine has many execution units that are arranged in three execution
clusters shown in Figure 2-5. It can retire four micro-ops in one cycle, same as
See then p. 2.26, where the FP units can indeed retire 1 op / cycle ("latency = 1"), but with 3 FP units the best you can do is to retire 3 FP ops / cycle, or 3 x clock Hz. In the 2.8 GHz case I can see
3x2.8GHz = 8.4Gflops/sec
ONLY if there are MORE units than those shown which can carry out FP arithmetic can I justify a number > 8.4 Gflops/sec / core.
That is why I posed the question of whether the various SIMD or XMM instructions utilize different FP units from those in the main core pipelines.
thanks for replying to my comment.....
You are absolutely right, Core i7 can retire 4 instructions/cycle. Regarding your calculation:
3x2.8GHz = 8.4Gflops/sec
you need to multiply this number by the SIMD width, i.e. 4 FP (32-bit) operations per cycle. Secondly, you are also missing the number of cores in the system, as Core i7 is not a single-core machine as far as I know.
Assuming that, theoretically, you are doing three 32-bit floating-point instructions on three different execution engines, the GFLOPS are:
3x2.8ghz x 4 (SIMD WIDTH) x 4 (number of cores) = 134GFLOPS
My calculations above (previous reply) were based on 2 execution engines.
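To make the arithmetic in this thread easy to re-check, here is a minimal Python sketch of the same multiplication. The function name peak_gflops and its parameter names are my own, not anything from Intel's documentation, and the result is only the paper peak discussed above (FP units x clock x SIMD width x cores), not something a real code would sustain.

```python
def peak_gflops(fp_units, clock_ghz, simd_width, cores):
    """Paper peak in GFLOPS: FP-capable units x clock (GHz) x SIMD lanes x cores.

    All four inputs are assumptions supplied by the caller; the function
    only multiplies them, exactly as done by hand in the thread.
    """
    return fp_units * clock_ghz * simd_width * cores


# Scalar view from earlier in the thread: 3 units, 1 result each, 1 core.
scalar_one_core = peak_gflops(3, 2.8, 1, 1)

# Packed single precision (4 lanes), 4 cores, with 2 or 3 FP units.
two_units_four_cores = peak_gflops(2, 2.8, 4, 4)
three_units_four_cores = peak_gflops(3, 2.8, 4, 4)

print(f"scalar, 1 core:          {scalar_one_core:.1f} GFLOPS")
print(f"2 units, SIMD, 4 cores:  {two_units_four_cores:.1f} GFLOPS")
print(f"3 units, SIMD, 4 cores:  {three_units_four_cores:.1f} GFLOPS")
```

The 2-unit case is where the roughly 90 GFLOPS figure comes from, and the 3-unit case gives the 134 GFLOPS figure.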
I am trying to understand the statement "you need to multiply this number with SIMD Width ie. 4 FP (32bit) per cycles."
Can you explain this a little more to me or point to documentation that explains it?
This is how I see the 4 SP ops / FP unit (ai, bi, ci are 32-bit SP numbers):
c1= a1 FPa b1
c2= a2 FPa b2
c3= a3 FPa b3
c4= a4 FPa b4
where FPa is an operation which is applied in SIMD fashion to 4 different pairs of operands but
this implies 4 separate FP ALU units capable of delivering 4 (pipelined) results every clock.
For this to work, the 2 FP units (via Port 0 and Port 1) must really be 8 separate SP ALUs? Can I also assume that these eight units can write back 8 operands / clock cycle?
So the 2 FP ops retiring / cycle <=> 8 data results produced / cycle?
If this is indeed the case, it is nowhere stated clearly that there are really 4 FP ALUs per FP unit. Or I just missed it ....
thanks for the reply ...
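The c1..c4 picture above can be modelled in a few lines of Python. This is only a toy model of what one packed instruction does to its four lanes, not a description of how the hardware is built; packed_op, FP_PORTS and LANES are names made up for this sketch, and the values 2 and 4 are the assumptions being discussed in this thread.

```python
import operator


def packed_op(op, a, b):
    """Apply one operation lane by lane, as a single packed instruction would."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of lanes")
    return tuple(op(x, y) for x, y in zip(a, b))


# Four single-precision operand pairs packed into two "registers".
a = (1.0, 2.0, 3.0, 4.0)
b = (10.0, 20.0, 30.0, 40.0)

# One packed add stands in for c1..c4 = a1..a4 FPa b1..b4.
c = packed_op(operator.add, a, b)

# Assumptions from the thread: 2 FP-capable ports, 4 SP lanes per packed op.
FP_PORTS = 2
LANES = 4
results_per_cycle = FP_PORTS * LANES

print(c)
print(results_per_cycle)
```

Under those assumptions, 2 packed FP ops per cycle correspond to 8 single-precision results per cycle.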
Hi Tim, so we can say then that there are indeed 4 32-bit FP ALUs per FP-capable port (or their equivalent). Alternatively, one could execute the same op on quadruple operands in a vector/pipelined fashion, which would mean 1 result (vs 4) produced / clock.
When the manual says that Nehalem can retire 4 micro-ops / cycle, this includes two 4-way SP or two 2-way DP SIMD ones, right?
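The "4-way SP or 2-way DP" wording follows from the 128-bit width of the XMM registers. A small Python sketch of that relationship is below; the helper names are invented for illustration, and the figure of 2 packed FP micro-ops per cycle is the assumption from this thread (one per FP-capable port), not a number taken from the manual.

```python
XMM_BITS = 128  # width of an SSE XMM register


def lanes(element_bits):
    """Number of elements of the given size that fit in one XMM register."""
    return XMM_BITS // element_bits


def flops_per_cycle(packed_fp_uops_per_cycle, element_bits):
    """FP results per cycle if every packed micro-op fills all of its lanes."""
    return packed_fp_uops_per_cycle * lanes(element_bits)


# Assumption from the thread: 2 packed FP micro-ops per cycle per core.
sp_per_cycle = flops_per_cycle(2, 32)  # single precision, 4 lanes each
dp_per_cycle = flops_per_cycle(2, 64)  # double precision, 2 lanes each

clock_ghz = 2.8
print(f"SP: {sp_per_cycle} FLOPs/cycle, {sp_per_cycle * clock_ghz:.1f} GFLOPS per core")
print(f"DP: {dp_per_cycle} FLOPs/cycle, {dp_per_cycle * clock_ghz:.1f} GFLOPS per core")
```

With those inputs the per-core paper peak in double precision is half the single-precision one, since each register holds half as many elements.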
Can we use any of the tools (eg, VT) to assess how successfully a compiler packs operands to use the SIMD arithmetic ops in code?
thanks ... Michael
thanks for the pointer. I like the definite answers. You can understand that due to my ignorance in how the SIMD FPs are implemented within the Nehalem core I could only see the vectorized or the true SIMD approach and docs are not definite in this respect.
I am "new" to Intel64 ISA and its microarchitectures (I did ASM back in late 80s ... :)
BTW, is there any document which discusses the implementation of the microarchitecture in detail outside the standard Intel docs?
thanks for the definite answer for this.
Everything started when the question was posed to me: "what is the theoretical MAX FP performance of Nehalem cores?" I thought it should asymptotically be bounded by the sustained rate of FP operation retirement in the core, but then it wasn't clear to me whether that rate is up to 4 SP FP ops / cycle / issue port (SIMD) or 1 SP FP op / cycle / issue port (vector with pipelined ALUs).
Thanks again. I will come back to maybe torture other answers out of the forum (maybe some uncore implementation stuff, L3 to core to QPI connection capabilities, etc. ;;)
Here is one old albeit good reference: Inside Nehalem: Intel's Future Processor and System