SSE execution units in Core Duo

s_gautam · 06-18-2008

I have read in various places that all Intel processors before Core 2 Duo (including Core Duo) have 64-bit floating-point execution units. (I am not talking about the x87 FPU.) Because of this, SSE instructions with 128-bit operands are split in two, with 64 bits handled at a time.

Regarding this, I have the following questions:

a. Is this true?

b. Assuming it is true, doesn't that mean instructions like addpd have no speed advantage over addsd (since addpd is split into two anyway)?

Regards
Gautam
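
To make question (b) concrete, here is a toy counting model in Python. It is only a sketch of the premise stated above (a packed 128-bit operation is executed as two 64-bit operations), not a description of any real processor; the function name `count_work` and its parameters are made up for illustration.

```python
def count_work(n_doubles, packed, split_128bit=True):
    """Return (instructions, unit_ops) needed to add n_doubles pairs of doubles.

    packed       -- True for addpd-style (2 doubles per instruction),
                    False for addsd-style (1 double per instruction)
    split_128bit -- assumption from the question: a packed instruction is
                    executed as two 64-bit operations
    """
    width = 2 if packed else 1
    instructions = -(-n_doubles // width)  # ceiling division
    ops_per_instruction = 2 if (packed and split_128bit) else 1
    return instructions, instructions * ops_per_instruction


print(count_work(1000, packed=False))  # (1000, 1000)
print(count_work(1000, packed=True))   # (500, 1000)
```

Under this premise the execution unit performs the same number of 64-bit operations either way; what changes is the instruction count, which is halved for the packed form. Whether that difference turns into a measurable speedup depends on factors this model ignores.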

maa1 · 09-04-2009

Hi, Max!

>> please check Optimization Reference Manual at pages 12-19 - 12-26 for Atom

Pages 12-10 to 12-11 say:

FP Multiplier --- Throughput
Scalar double (mulsd) --- 2
Packed single (mulps) --- 2
Packed double (mulpd) --- 9

Pages 12-19 to 12-26 say:

FP Multiplier --- Throughput
Scalar double (mulsd) --- 1
Packed single (mulps) --- 1
Packed double (mulpd) --- 8

Which is correct?

Also, the "Ports" column lists "Both" for the addpd/mulpd instructions.
Does that mean these instructions cannot run simultaneously, so that peak performance for packed DP in this case = 2*(1/(5+8)) = 0.15 flops/cycle (thirteen times slower than scalar DP)?

maa1 · 09-08-2009

Why is there no "Edit" button in this forum? The correct formula in the previous post is: "...peak performance for packed DP in this case = 2*(2/(5+8)) = 0.31 flop/cycle?"

And how do I correctly calculate DP performance for the PIII?
From the manual: FADD throughput = 1, FMUL throughput = 2 (FADD and FMUL are both on Port 0), so
1*(2/(1+2)) = 0.667 flop/cycle? Is that right?
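
The two estimates above follow the same pattern, which can be written out as a small Python sketch. It assumes the model used in these posts: an add and a multiply share the same port(s) and cannot overlap, so one add/multiply pair takes the sum of the two throughput figures. That model is an assumption, not something confirmed by the manual, and the function name `peak_flops_per_cycle` is made up for illustration. The value 5 for addpd is taken from the formula in the post.

```python
def peak_flops_per_cycle(width, add_cycles, mul_cycles):
    """Peak flops/cycle when one add and one multiply must run back to back.

    width      -- elements per instruction (1 for scalar, 2 for packed double)
    add_cycles -- throughput figure for the add, in cycles per instruction
    mul_cycles -- throughput figure for the multiply, in cycles per instruction
    """
    flops_per_pair = 2 * width               # one add + one multiply, each on `width` elements
    cycles_per_pair = add_cycles + mul_cycles  # assumed: no overlap between the two
    return flops_per_pair / cycles_per_pair


atom_packed_dp = peak_flops_per_cycle(2, 5, 8)  # addpd = 5, mulpd = 8
piii_scalar_dp = peak_flops_per_cycle(1, 1, 2)  # FADD = 1, FMUL = 2

print(f"{atom_packed_dp:.3f}")  # 0.308
print(f"{piii_scalar_dp:.3f}")  # 0.667
```

If the add and the multiply can in fact overlap, this formula underestimates the peak, so the result is only as good as the no-overlap assumption.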
