On another note, are there **separate** floating-point execution units supporting the various SSE or MMX instructions? Or are they the same execution units used for the regular FP instructions? As far as I can tell, Nehalem cores have 5 execution units, 2 of which can carry out double-precision FP calculations, plus a 3rd unit doing FP shuffles, correct?

What is the "theoretical" maximum FLOPS performance of a Nehalem core running at 2.8 GHz (no turbo)?

Thanks for any info on these......

Michael

Each core has its own execution units, as you can find out from that article. Also, you may want to read more in the Intel 64 and IA-32 Architectures Optimization Reference Manual at http://www.intel.com/products/processor/manuals/

Regarding FLOPS: 2.8 GHz with 4 cores can support about 90 GFLOPS.

I was wondering what the rate of floating-point ops/sec would be for a Nehalem core. From the h/w docs it appears that each core has 6 execution units, of which only 2 can carry out FP arithmetic ops (a 3rd one can do FP shuffles and the other 3 are for memory access). For the FLOPS (**FP** ops/sec) I cannot see how this can be > 3 per cycle.

If you take a look at the "Intel 64 and IA-32 Architectures Optimization Reference Manual" at http://www.intel.com/products/processor/manuals/ (p. 2.21, section 2.2, "Intel Microarchitecture (Nehalem)") or at the presentations from IDF Aug. 2008, you can see that all micro-ops are routed via 6 ports to the execution units. Only 2 of these (or 3, stretching the logic) can deliver FP results. Quoting p. 2.23:

"The scheduler (or reservation station) can dispatch up to **six micro-ops in one cycle** through six issue ports (five issue ports are shown in Figure 2-5; store operation involves separate ports for store address and store data but is depicted as one in the diagram).

The out-of-order engine has many execution units that are arranged in three execution clusters shown in Figure 2-5. **It can retire four micro-ops in one cycle**, same as its predecessor."

Then see p. 2.26, where the FP units can indeed retire 1 op/cycle ("latency = 1"), but with 3 FP units the best you can do is retire **3 FP ops/cycle**, or 3 x clock Hz. In the 2.8 GHz case I can see **3 x 2.8 GHz = 8.4 GFLOPS**.

ONLY if there are **MORE** units than those shown which can carry out FP arithmetic can I justify a number > 8.4 GFLOPS per core.

**That is why I posed the question of whether the various SIMD or XMM instructions utilize different FP units from those in the main core pipelines.**

thanks for replying to my comment.....

Michael

You are absolutely right, Core i7 can retire 4 instructions/cycle. Regarding your calculation, *3 x 2.8 GHz = 8.4 GFLOPS*: you need to multiply this number by the SIMD width, i.e. four 32-bit FP values per cycle. Secondly, you are also missing the number of cores in the system, as Core i7 is not a single-core machine as far as I know.

Assuming, theoretically, that you are doing three 32-bit floating-point instructions on three different execution engines, the GFLOPS are:

3 x 2.8 GHz x 4 (SIMD width) x 4 (number of cores) = 134 GFLOPS

My calculations above (previous reply) were based on 2 execution engines.
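To make the arithmetic in this reply easy to re-check, here is a minimal Python sketch of the same product. The function name `peak_gflops` and its parameter names are made up for illustration, and the formula is only the simple upper bound used in this thread (clock x FP units x SIMD lanes x cores); it does not model what the hardware can actually sustain.

```python
def peak_gflops(clock_ghz, fp_units, simd_lanes, cores=1):
    """Upper-bound GFLOPS as used in this thread (hypothetical helper).

    clock_ghz  -- core clock in GHz (2.8 in the thread, no turbo)
    fp_units   -- FP-capable execution units assumed busy every cycle
    simd_lanes -- operands per packed instruction (4 for 32-bit SP, 2 for 64-bit DP)
    cores      -- number of cores counted
    """
    if min(clock_ghz, fp_units, simd_lanes, cores) <= 0:
        raise ValueError("all arguments must be positive")
    return clock_ghz * fp_units * simd_lanes * cores


if __name__ == "__main__":
    # Three units, 4-wide SP, four cores: the 134 GFLOPS figure above.
    print(f"{peak_gflops(2.8, 3, 4, 4):.1f} GFLOPS")
    # Two units instead of three: the earlier figure of about 90 GFLOPS.
    print(f"{peak_gflops(2.8, 2, 4, 4):.1f} GFLOPS")
    # Scalar view from the original question: 3 ops/cycle on one core.
    print(f"{peak_gflops(2.8, 3, 1, 1):.1f} GFLOPS")
```

The only thing that changes between the 8.4, roughly 90, and 134 figures is which units, lanes, and cores are counted.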

I am trying to understand the "you need to multiply this number by the SIMD width, i.e. four 32-bit FP values per cycle" part.

Can you explain this a little more to me or point to documentation that explains it?

This is how I see the 4 SP ops per FP UNIT (ai, bi, ci are 32-bit SP numbers):

c1 = a1 FPa b1

c2 = a2 FPa b2

c3 = a3 FPa b3

c4 = a4 FPa b4

where FPa is an operation which is applied in SIMD fashion to 4 different pairs of operands, but this implies **4 separate FP ALU** units capable of delivering 4 (pipelined) results every clock.

For this to work, the 2 FP units (via Port 0 and Port 1) must really be 8 separate SP ALUs? Can I also assume that these eight units can write back 8 operands per clock cycle?

So the 2 FP ops retiring per cycle <=> 8 data results produced per cycle?

If this is indeed the case, it is nowhere stated clearly that there are really 4 FP ALUs per FP unit. Or I just missed it ....

thanks for the reply ...

Michael
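The lane-by-lane picture in the post above can be written out as a small Python model. This is only a conceptual sketch: it assumes one packed instruction applies the same operation to every lane at once, and the names `packed_op` and `results_per_cycle` are hypothetical, not taken from any Intel document or tool.

```python
import operator


def packed_op(a, b, op):
    """Model one packed SIMD instruction: apply `op` to each pair of lanes."""
    if len(a) != len(b):
        raise ValueError("both operands must have the same number of lanes")
    return [op(x, y) for x, y in zip(a, b)]


def results_per_cycle(fp_ports, lanes):
    """Data results per cycle if every FP port issues one packed op per cycle."""
    return fp_ports * lanes


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]      # a1..a4, four 32-bit SP lanes
    b = [10.0, 20.0, 30.0, 40.0]  # b1..b4
    c = packed_op(a, b, operator.add)  # one instruction, four results c1..c4
    print(c)
    # Two FP-capable ports with four SP lanes each, as asked in the post.
    print(results_per_cycle(fp_ports=2, lanes=4))
```

Under that assumption, two packed SP operations per cycle correspond to eight data results per cycle, which is the equivalence the post asks about.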

*The FP units actually operate on 2 64-bit or 4 32-bit operands in parallel. The peak rate of 4 Flops/cycle (e.g. for Top 500 rating) would be achieved by retiring a pair of double precision adds and a pair of multiplies on each clock cycle, using the 2 FP units. Certain CPU models (among Intel models, generally laptop models) did split an SSE parallel operand, and (in the case of the Intel laptops) start on one half a cycle before the other, and some models could start a multiply only every other cycle, so there are several possible variants on this calculation.*

Hi Tim, so we can say then that there are indeed **4** 32-bit FP ALUs per FP-capable port (or their equivalent). The alternative would be to execute the same op on quadruple operands in a vector/pipelined fashion, which would mean 1 result (vs. 4) produced per clock.

When the manual says that Nehalem can retire 4 micro-ops per cycle, this includes two 4-way SP or two 2-way DP SIMD ones, right?

Can we use any of the tools (e.g., VT) to assess how successfully a compiler packs operands to use the SIMD arithmetic ops in code?

thanks ... Michael

*Hi, you may also want to check my older response explaining the SIMD execution units implementation and the peak performance numbers for a few last generations of Intel processors: http://software.intel.com/en-us/forums/showpost.php?p=60696*

thanks,

-Max

Hi Max,

thanks for the pointer. I like the definite answers. You can understand that, due to my ignorance of how the SIMD FP operations are implemented within the Nehalem core, I could not tell whether it was the vectorized (pipelined) approach or the true SIMD approach, and the docs are not definite in this respect.

I am "new" to Intel64 ISA and its microarchitectures (I did ASM back in late 80s ... :)

BTW, is there any document which discusses the implementation of microarch in detail outside the standard Intel Docs?

cheers ....

Michael

*Yes, that's correct about the peak rate of micro-op retirement including 2 SIMD, in the case where there are equal numbers of multiply and add instructions. The SDE tool might be of interest, in that it can produce a count of the number of executions of each instruction and each category of instruction. Unfortunately, the public version (the only one I've seen) doesn't report the minimum number of clock cycles associated with execution.*

Hi Tim,

thanks for the definite answer for this.

Everything started when the question was posed to me: "what is the theoretical MAX FP performance of Nehalem cores?" I thought it should asymptotically be bounded by the sustained rate of FP operation retirement in the core, but then it wasn't clear to me whether that is up to 4 SP FP ops per cycle per issue port (SIMD) or 1 SP FP op per cycle per issue port (vector with pipelined ALUs).

Thanks again. I will come back to maybe torture other answers out of the forum (maybe some uncore implementation stuff, L3 to core to QPI connection capabilities, etc. ;;)

Here is one old, albeit good, reference: "Inside Nehalem: Intel's Future Processor and System"

>maybe some uncore implementation stuff

Here is another old, albeit good, reference about QPI: "The Common System Interface: Intel's Future Interconnect"
