Does the multiply instruction really have a latency of 10-15 cycles on the newer Intel processors? I think the Opteron can take the 128-bit product of two qwords in 5 cycles... why the big gap?
If I'm not mistaken, the Intel processors can do a 32-bit multiply in 4 cycles. Given four of these multipliers and a 128-bit adder, wouldn't it then be easy to do a 64-bit multiply in 4+2 cycles?
x2 . x1
* y2 . y1
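For reference, the decomposition sketched above (splitting each qword into 32-bit halves x2.x1 and y2.y1) can be written out in C roughly as follows. This is only an illustration of the arithmetic, not a claim about how any multiplier is wired in hardware; the struct `u128_parts` and the function `mul_64x64_schoolbook` are names made up for this sketch.

```c
#include <stdint.h>

/* Hypothetical type: a 128-bit value as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128_parts;

/* Hypothetical helper: 64x64 -> 128-bit product built from
   four 32x32 -> 64-bit partial products (schoolbook method). */
static u128_parts mul_64x64_schoolbook(uint64_t x, uint64_t y)
{
    uint64_t x1 = x & 0xFFFFFFFFu, x2 = x >> 32;  /* x = x2 . x1 */
    uint64_t y1 = y & 0xFFFFFFFFu, y2 = y >> 32;  /* y = y2 . y1 */

    /* The four partial products are independent of each other. */
    uint64_t p11 = x1 * y1;   /* weight 2^0  */
    uint64_t p12 = x1 * y2;   /* weight 2^32 */
    uint64_t p21 = x2 * y1;   /* weight 2^32 */
    uint64_t p22 = x2 * y2;   /* weight 2^64 */

    /* Middle column: high half of p11 plus the low halves of both
       cross products. The sum needs at most 34 bits, so it cannot
       overflow a uint64_t. */
    uint64_t mid = (p11 >> 32) + (p12 & 0xFFFFFFFFu) + (p21 & 0xFFFFFFFFu);

    u128_parts r;
    r.lo = (mid << 32) | (p11 & 0xFFFFFFFFu);
    r.hi = p22 + (p12 >> 32) + (p21 >> 32) + (mid >> 32);
    return r;
}
```

The four multiplies can run in parallel, which is the point of the question; the additions that combine them are the part that would have to fit in the extra cycles.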
Are you sure you're looking at the numbers for the Core architecture(s)? On NetBurst (Pentium 4) mul took 10+ cycles, but on Core it should be around 4, I believe.
The optimization guide with the processor model numbers can be quite confusing...
Hello, I understand you are speaking of the MUL r64 instruction, which multiplies by RAX and produces a 128-bit result in RDX:RAX. It has only 3-cycle latency for the low 64-bit (RAX) part of the result (the same as for most other integer multiplies, scalar and SIMD) and 7-cycle latency for the high 64-bit (RDX) part of the result. This instruction is used in long-precision integer arithmetic, where the latency of the high 64-bit part of the result can be hidden. What is your usage of it?
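To illustrate the long-precision case, here is a minimal sketch of multiplying an n-limb number by a single 64-bit word. It assumes a compiler that supports the `unsigned __int128` extension (GCC and Clang on x86-64 do), and the function name `mul_limbs_by_word` is invented for this example. Each iteration's multiply depends only on `a[i]` and `b`, not on the previous iteration, so an out-of-order core is free to start later multiplies while earlier high halves are still in flight; only the short add/carry chain is serial.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: r[0..n-1] = low limbs of a[0..n-1] * b,
   return value = the final carry-out limb.
   Assumes the unsigned __int128 compiler extension is available. */
static uint64_t mul_limbs_by_word(uint64_t *r, const uint64_t *a,
                                  size_t n, uint64_t b)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Full 64x64 -> 128-bit product; independent of earlier iterations. */
        unsigned __int128 p = (unsigned __int128)a[i] * b;
        uint64_t lo = (uint64_t)p;          /* low half  (RAX for MUL r64) */
        uint64_t hi = (uint64_t)(p >> 64);  /* high half (RDX for MUL r64) */

        lo += carry;            /* add the carry limb from the previous step */
        hi += (lo < carry);     /* propagate wrap-around; hi cannot overflow */

        r[i] = lo;
        carry = hi;             /* the high half is only consumed next step */
    }
    return carry;
}
```

Whether the high-half latency is fully hidden in practice depends on the surrounding code and the specific processor, so this should be measured rather than assumed.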