Intel® ISA Extensions

SSE execution units in Core Duo

s_gautam
Beginner
1,599 Views
I have read in various places that all Intel processors before Core 2 Duo (including Core Duo) have 64-bit floating-point execution units (I am not talking about the x87 FPU). Because of this, SSE instructions with 128-bit operands are split into two parts, with 64 bits handled at a time.

Regarding this, I have the following questions:

a. Is this true?

b. Assuming it is true, doesn't that mean there is no speed advantage for instructions like addpd over addsd, since the addpd instruction is split into two anyway?

Regards
Gautam
23 Replies
maa1
Beginner
188 Views

Hi, Max!

>> please check Optimization Reference Manual at pages 12-19 - 12-26 for Atom

On pages 12-10 to 12-11 it says:

FP Multiplier --- Throughput
Scalar double (mulsd) --- 2
Packed single (mulps) --- 2
Packed double (mulpd) --- 9

But on pages 12-19 to 12-26:

FP Multiplier --- Throughput
Scalar double (mulsd) --- 1
Packed single (mulps) --- 1
Packed double (mulpd) --- 8

Which is correct?

Also, the "Ports" column for the addpd/mulpd instructions says "Both".
Does that mean these instructions cannot run simultaneously, so that peak performance for packed DP in this case = 2*(1/(5+8)) = 0.15 flops/cycle (thirteen times slower than scalar DP)?


maa1
Beginner
188 Views

Why is there no "Edit" button in this forum? The correct formula in my previous post is: "...peak performance for packed DP in this case = 2*(2/(5+8)) = 0.31 flop/cycle?"


And how do I correctly calculate DP performance for the PIII?
From the manual: FADD throughput = 1, FMUL throughput = 2 (FADD and FMUL both on Port 0)
= 1*(2/(1+2)) = 0.667 flop/cycle? Is that right?
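
For reference, the arithmetic in both estimates can be checked with a small Python sketch. It assumes, as the formulas above do, that one add and one multiply cannot overlap and so issue back to back, and that each figure is a reciprocal throughput in cycles per instruction. The function name and parameters are made up for illustration, and the values 5 and 8 are simply the ones used in the Atom formula above. This only reproduces the calculation; it does not confirm that the serialization assumption holds on either processor.

```python
def peak_flops_per_cycle(add_cycles, mul_cycles, lanes):
    """Estimate peak flop/cycle for an alternating add/multiply stream.

    Assumption (taken from the formulas in this post): the add and the
    multiply share issue resources, so one add + one multiply together
    cost add_cycles + mul_cycles cycles.

    add_cycles, mul_cycles: cycles per instruction (reciprocal throughput)
    lanes: results per instruction (1 for scalar, 2 for packed double)
    """
    flops_per_pair = 2 * lanes  # one add and one multiply, `lanes` results each
    return flops_per_pair / (add_cycles + mul_cycles)


# Atom, packed double: 2*(2/(5+8))
atom_packed_dp = peak_flops_per_cycle(add_cycles=5, mul_cycles=8, lanes=2)

# PIII, x87 double: 1*(2/(1+2))
piii_dp = peak_flops_per_cycle(add_cycles=1, mul_cycles=2, lanes=1)

print(f"Atom packed DP: {atom_packed_dp:.2f} flop/cycle")  # 0.31
print(f"PIII DP:        {piii_dp:.3f} flop/cycle")         # 0.667
```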
