mul instruction latency

tthsqe · ‎01-09-2010

Does the multiply instruction really have a latency of 10-15 cycles on the newer Intel processors? I think the Opteron can take the 128-bit product of two qwords in 5 cycles... why the big gap?

tthsqe · ‎01-10-2010

If I'm not mistaken, the Intel processors can do a 32-bit multiply in 4 cycles. Given four of these multipliers and a 128-bit adder, wouldn't it then be easy to do a 64-bit multiply in 4+2 cycles?

      x2 . x1
  *   y2 . y1
  -------------------
  [x2*y2].[x1*y1]
+     [x1*y2]
+     [x2*y1]
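
For anyone who wants to check the arithmetic in the diagram above, here is a minimal C sketch of the same decomposition: four 32x32->64 partial products, with the two cross terms added in at a 32-bit offset. The type `u128_parts` and the function `mul_64x64_schoolbook` are names made up for this illustration. It only shows that the sums come out right; it says nothing about how the hardware multiplier is actually built or how many cycles it takes.

```c
#include <stdint.h>

/* 128-bit product held as two 64-bit halves (hypothetical helper type). */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_parts;

/* Multiply two 64-bit values using four 32x32->64 partial products. */
static u128_parts mul_64x64_schoolbook(uint64_t x, uint64_t y)
{
    /* Split each operand into 32-bit halves: x = x2.x1, y = y2.y1. */
    uint64_t x1 = x & 0xFFFFFFFFu, x2 = x >> 32;
    uint64_t y1 = y & 0xFFFFFFFFu, y2 = y >> 32;

    uint64_t p11 = x1 * y1;   /* weight 2^0  */
    uint64_t p12 = x1 * y2;   /* weight 2^32 */
    uint64_t p21 = x2 * y1;   /* weight 2^32 */
    uint64_t p22 = x2 * y2;   /* weight 2^64 */

    /* Column at bits 32..63: three 32-bit values, so the sum fits in 64 bits. */
    uint64_t mid = (p11 >> 32) + (p12 & 0xFFFFFFFFu) + (p21 & 0xFFFFFFFFu);

    u128_parts r;
    r.lo = (p11 & 0xFFFFFFFFu) | (mid << 32);
    /* High half: x2*y2, the upper halves of the cross terms, and the carry out of mid. */
    r.hi = p22 + (p12 >> 32) + (p21 >> 32) + (mid >> 32);
    return r;
}
```

Calling `mul_64x64_schoolbook(a, b)` returns the low qword in `.lo` and the high qword in `.hi`, which corresponds to RAX and RDX after a 64-bit MUL.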

capens__nicolas · ‎01-18-2010

Are you sure you're looking at the numbers for the Core architecture(s)? On NetBurst (Pentium 4) mul took 10+ cycles, but on Core it should be around 4 I believe.

The optimization guide with the processor model numbers can be quite confusing...

Max_L · ‎02-05-2010

Hello, I understand you are speaking of the MUL r64 instruction (with RAX as the implicit operand), which produces a 128-bit result in RDX:RAX. Its latency is only 3 cycles for the low 64-bit (RAX) part of the result (the same as for most other integer multiplies, both scalar and SIMD) and 7 cycles for the high 64-bit (RDX) part. This instruction is used in long-precision integer arithmetic, where the latency of the high 64-bit part of the result can be hidden. What is your usage of it?

Thank you,

-Max
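
As a rough illustration of the long-precision use case mentioned above, here is a minimal C sketch that multiplies a multi-limb integer by a single 64-bit word. It assumes a compiler that provides `unsigned __int128` (a GCC/Clang extension, not standard C), and `mul_limbs_by_word` is a name made up for this example. The multiplies for different limbs do not depend on each other; only the carry additions form a chain, which is why the longer latency of the high half can in principle overlap with other work. Whether a particular compiler and CPU actually achieve that overlap is not something this sketch demonstrates.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Computes a[0..n-1] * b, limbs stored least significant first.
 * Writes the low n limbs of the product to out and returns the
 * final carry limb (the most significant limb of the result).
 * Requires unsigned __int128 (GCC/Clang extension).
 */
static uint64_t mul_limbs_by_word(uint64_t *out, const uint64_t *a,
                                  size_t n, uint64_t b)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* 64x64->128 product plus incoming carry; cannot overflow 128 bits. */
        unsigned __int128 p = (unsigned __int128)a[i] * b + carry;
        out[i] = (uint64_t)p;          /* low half, as in RAX  */
        carry  = (uint64_t)(p >> 64);  /* high half, as in RDX */
    }
    return carry;
}
```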