Has Fused Multiply-Add already been implemented on Nehalem?

HPC-TAMU · ‎05-20-2010

Has Fused Multiply-Add already been implemented on Nehalem? We have deployed a cluster of 2592 Nehalem cores (2-socket X5560 per node).

On another note, are there separate floating-point execution units supporting the various SSE or MMX instructions? Or are they the same execution units used for the regular FP instructions? As far as I can tell, Nehalem cores have 5 execution units, 2 of which can carry out double-precision FP calculations, plus a 3rd unit doing FP shuffles, correct?

What is the "theoretical" maximum FLOPS performance of a Nehalem core running at 2.8 GHz (no turbo)?

Thanks for any info on these......

Michael

Brijender_B_Intel · ‎05-20-2010

Nehalem, or Core i7 (http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)), has SSE4.1 and SSE4.2 implemented. It does not have Fused Multiply-Add.
Each core has its own execution units, as you can find out from that article. Also, you may want to read more in the Intel 64 and IA-32 Architectures Optimization Reference Manual at http://www.intel.com/products/processor/manuals/

Regarding FLOPS: 2.8 GHz with 4 cores can support 90 GFLOPS.

HPC-TAMU · ‎05-21-2010

Thanks for the reply ....

I was wondering what the rate of floating-point ops / sec would be for a Nehalem core. From the h/w docs it appears that each core has 6 execution units, of which only 2 can carry out FP arithmetic ops (the 3rd one can do FP shuffles and the other 3 are for memory access). For the FLOPS (FP ops / sec) I cannot see how this can be > 3 / cycle.

If you take a look at the "Intel 64 and IA-32 Architectures Optimization Reference Manual" at http://www.intel.com/products/processor/manuals/ (p. 2.21, section 2.2 INTEL MICROARCHITECTURE (NEHALEM)) or the presentations at IDF Aug. 2008, you can see that all micro-ops are routed via 6 ports to the execution units. 2 of them (or 3 with abuse of logic) can deliver FP results. Quoting p. 2.23:

"The scheduler (or reservation station) can dispatch up to six micro-ops in one cycle through six issue ports (five issue ports are shown in Figure 2-5; store operation involves separate ports for store address and store data but is depicted as one in the diagram).

The out-of-order engine has many execution units that are arranged in three execution clusters shown in Figure 2-5. It can retire four micro-ops in one cycle, same as its predecessor."

See then p. 2.26, where the FP units can indeed retire 1 op / cycle ("latency =1"), but with 3 FP units the best you can do is to retire 3 FP ops / cycle, or 3 x clock Hz. In the 2.8 GHz case I can see
3x2.8GHz = 8.4Gflops/sec

ONLY if there are MORE units than those shown which can carry out FP arithmetic can I justify a number > 8.4 Gflops/sec per core.

That is why I asked whether the various SIMD or XMM instructions utilize different FP units from those in the main core pipelines.

thanks for replying to my comment.....

Michael

Brijender_B_Intel · ‎05-21-2010

You are absolutely right, Core i7 can retire 4 instructions/cycle. Regarding your calculations:

3x2.8GHz = 8.4Gflops/sec

you need to multiply this number by the SIMD width, i.e. 4 FP (32-bit) operations per cycle. Secondly, you are also missing the number of cores in the system, as Core i7 is not a single-core machine as far as I know.
Assuming theoretically that you are doing three 32-bit floating-point instructions on three different execution engines, the GFLOPS are:
3 x 2.8 GHz x 4 (SIMD width) x 4 (number of cores) = 134 GFLOPS
My calculations above (previous reply) were based on 2 execution engines.
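The arithmetic in this reply can be reproduced with a few lines of Python. This is only a sketch of the formula as stated in the post (FP units x clock x SIMD width x cores); the function name and parameters are made up for illustration, and the result is an upper bound that assumes every FP unit issues one packed operation on every cycle.

```python
def peak_gflops(fp_units, clock_ghz, simd_width, cores):
    """Upper bound in GFLOPS: one SIMD op per FP unit per cycle on every core."""
    return fp_units * clock_ghz * simd_width * cores

# Figures taken from the posts above: 2.8 GHz, 4-wide single precision, 4 cores.
three_units = peak_gflops(3, 2.8, 4, 4)  # the 134 GFLOPS estimate
two_units = peak_gflops(2, 2.8, 4, 4)    # the earlier estimate of about 90 GFLOPS
per_core = peak_gflops(2, 2.8, 4, 1)     # same assumptions, one core only

print(f"{three_units:.1f} {two_units:.1f} {per_core:.1f}")  # 134.4 89.6 22.4
```

Setting `cores=1` gives the per-core figure that the original question was after.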

HPC-TAMU · ‎05-21-2010

I was trying to estimate the FLOPS / core first.

I am trying to understand the statement "you need to multiply this number with SIMD Width ie. 4 FP (32bit) per cycles."

Can you explain this a little more to me or point to documentation that explains it?

This is how I see the 4 SP ops / FP unit (ai, bi, ci are 32-bit SP numbers):

c1= a1 FPa b1
c2= a2 FPa b2
c3= a3 FPa b3
c4= a4 FPa b4

where FPa is an operation which is applied in SIMD fashion to 4 different pairs of operands but
this implies 4 separate FP ALU units capable of delivering 4 (pipelined) results every clock.
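The lane-by-lane picture above can be modelled in Python. This is a toy model only: `packed_op` is a hypothetical helper, and Python evaluates the lanes one after another, whereas the question here is whether the hardware computes them side by side in a single instruction.

```python
import operator

def packed_op(op, a, b):
    """Apply op lane by lane, modelling one packed instruction on a 128-bit register."""
    assert len(a) == len(b), "both operands must have the same number of lanes"
    return [op(x, y) for x, y in zip(a, b)]

# Four 32-bit single-precision lanes fill one 128-bit SSE register.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

c = packed_op(operator.add, a, b)  # models one packed add: four results
print(c)  # [11.0, 22.0, 33.0, 44.0]
```

One call stands for one instruction, and it yields four results, which is the "SIMD width" factor in the peak estimate.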

For this to work, does this mean that the 2 FP units (via Port 0 and Port 1) are really 8 separate SP ALUs? Can I also assume that these eight units can write back 8 operands / clock cycle?

So the 2 FP ops retiring / cycle <=> 8 data results produced / cycle?

If this is indeed the case, it is nowhere stated clearly that there are really 4 FP ALUs per FP unit. Or I just missed it ....

thanks for the reply ...

Michael

TimP · ‎05-21-2010

The FP units actually operate on 2 64-bit or 4 32-bit operands in parallel. The peak rate of 4 Flops/cycle (e.g. for Top 500 rating) would be achieved by retiring a pair of double precision adds and a pair of multiplies on each clock cycle, using the 2 FP units. Certain CPU models (among Intel models, generally laptop models) did split an SSE parallel operand, and (in the case of the Intel laptops) start on one half a cycle before the other, and some models could start a multiply only every other cycle, so there are several possible variants on this calculation.
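As a rough check of the numbers this implies, here is a small Python sketch. It assumes 128-bit SSE registers and the two FP units described above (one packed add plus one packed multiply per cycle); the names are illustrative, and the 2.8 GHz clock and 2592-core count come from the original question.

```python
REGISTER_BITS = 128  # SSE register width

def flops_per_cycle(element_bits, fp_units=2):
    """Packed lanes per register times the number of FP units issuing per cycle."""
    lanes = REGISTER_BITS // element_bits
    return lanes * fp_units

def core_peak_gflops(clock_ghz, element_bits, cores=1):
    """Theoretical peak in GFLOPS; real code rarely sustains this."""
    return clock_ghz * flops_per_cycle(element_bits) * cores

print(flops_per_cycle(64), flops_per_cycle(32))  # 4 8
print(f"{core_peak_gflops(2.8, 64):.1f}")        # 11.2 (double precision, one core)
print(f"{core_peak_gflops(2.8, 32):.1f}")        # 22.4 (single precision, one core)
print(f"{core_peak_gflops(2.8, 64, 2592):.1f}")  # 29030.4 (double precision, 2592 cores)
```

The peak is only reached when adds and multiplies are evenly balanced, as noted above.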

drMikeT · ‎05-24-2010

Quoting tim18

The FP units actually operate on 2 64-bit or 4 32-bit operands in parallel. The peak rate of 4 Flops/cycle (e.g. for Top 500 rating) would be achieved by retiring a pair of double precision adds and a pair of multiplies on each clock cycle, using the 2 FP units. Certain CPU models (among Intel models, generally laptop models) did split an SSE parallel operand, and (in the case of the Intel laptops) start on one half a cycle before the other, and some models could start a multiply only every other cycle, so there are several possible variants on this calculation.

Hi Tim, so we can say then that there are indeed 4 32-bit FP ALUs per FP-capable port (or their equivalent). The alternative would be to execute the same op on quadruple operands in a vector/pipelined fashion, which means 1 result (vs 4) produced / clock.

When the manual says that Nehalem can retire 4 micro-ops / cycle, this includes two 4-way SP or two 2-way DP SIMD ones, right?

Can we use any of the tools (eg, VT) to assess how successfully a compiler packs operands to use the SIMD arithmetic ops in code?

thanks ... Michael

Max_L · ‎05-24-2010

Hi, you may also want to check my older response explaining the SIMD execution units implementation and the peak performance numbers for a few last generations of Intel processors: http://software.intel.com/en-us/forums/showpost.php?p=60696

thanks,
-Max

TimP · ‎05-24-2010

Yes, that's correct about the peak rate of micro-op retirement including 2 SIMD, in the case where there are equal numbers of multiply and add instructions. The SDE tool might be of interest, in that it can produce a count of the number of executions of each instruction and each category of instruction. Unfortunately, the public version (the only one I've seen) doesn't report the minimum number of clock cycles associated with execution.
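To illustrate what can be done with per-instruction counts, here is a small Python sketch that estimates how much of the FP arithmetic is packed. It does not parse SDE output: the input is just a list of mnemonics obtained by whatever means (the sample below is made up), and only the basic SSE add/sub/mul/div mnemonics are covered.

```python
from collections import Counter

# Basic SSE arithmetic mnemonics: "p" = packed, "s" = scalar; trailing s/d = SP/DP.
PACKED = {"addps", "subps", "mulps", "divps", "addpd", "subpd", "mulpd", "divpd"}
SCALAR = {"addss", "subss", "mulss", "divss", "addsd", "subsd", "mulsd", "divsd"}

def packing_ratio(mnemonics):
    """Fraction of FP arithmetic instructions that are packed (0.0 if none are found)."""
    counts = Counter(m.lower() for m in mnemonics)
    packed = sum(n for m, n in counts.items() if m in PACKED)
    scalar = sum(n for m, n in counts.items() if m in SCALAR)
    total = packed + scalar
    return packed / total if total else 0.0

# Hypothetical instruction trace; non-arithmetic instructions are ignored.
sample = ["movaps", "mulps", "addps", "addss", "mulps", "ret"]
print(packing_ratio(sample))  # 0.75
```

A ratio close to 1.0 suggests the compiler vectorized most of the arithmetic; it says nothing about how many cycles the code takes.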

HPC-TAMU · ‎05-26-2010

Quoting Max Locktyukhin (Intel)

Hi, you may also want to check my older response explaining the SIMD execution units implementation and the peak performance numbers for a few last generations of Intel processors: http://software.intel.com/en-us/forums/showpost.php?p=60696

thanks,
-Max

Hi Max,

thanks for the pointer. I like the definite answers. You can understand that due to my ignorance in how the SIMD FPs are implemented within the Nehalem core I could only see the vectorized or the true SIMD approach and docs are not definite in this respect.

I am "new" to Intel64 ISA and its microarchitectures (I did ASM back in late 80s ... :)

BTW, is there any document which discusses the implementation of microarch in detail outside the standard Intel Docs?

cheers ....

Michael

HPC-TAMU · ‎05-26-2010

Quoting tim18

Yes, that's correct about the peak rate of micro-op retirement including 2 SIMD, in the case where there are equal numbers of multiply and add instructions. The SDE tool might be of interest, in that it can produce a count of the number of executions of each instruction and each category of instruction. Unfortunately, the public version (the only one I've seen) doesn't report the minimum number of clock cycles associated with execution.

Hi Tim,

thanks for the definite answer for this.

Everything started when the question was posed to me: "what is the theoretical MAX FP performance of Nehalem cores?" I thought it should asymptotically be bounded by the sustained rate of FP operation retirement in the core, but then it wasn't clear to me whether that was up to 4 SP FP ops / cycle / issue port (SIMD) or 1 SP FP op / cycle / issue port (vector with pipelined ALUs).

Thanks again. I will come back to maybe torture other answers out of the forum (maybe some uncore implementation stuff, L3 to core to QPI connection capabilities, etc. ;;)

bronxzv · ‎05-26-2010

>BTW, is there any document which discusses the implementation of microarch in detail outside the standard Intel Docs?

here is one old albeit good reference : Inside Nehalem: Intel's Future Processor and System

bronxzv · ‎05-26-2010

>maybe some uncore implementation stuff

here is another old albeit good reference about QPI : The Common System Interface: Intel's Future Interconnect

capens__nicolas · ‎07-06-2011

Another excellent source of x86 ISA and microarchitectures information is Agner Fog's manuals.