Intel® ISA Extensions

mul instruction latency

tthsqe
Beginner
1,838 Views

Does the multiply instruction really have a latency of 10-15 cycles on the newer Intel processors? I think the Opteron can take the 128-bit product of two qwords in 5 cycles... why the big gap?

3 Replies
tthsqe
Beginner
Quoting - tthsqe

Does the multiply instruction really have a latency of 10-15 cycles on the newer Intel processors? I think the Opteron can take the 128-bit product of two qwords in 5 cycles... why the big gap?


If I'm not mistaken, the Intel processors can do a 32-bit multiply in 4 cycles. Given four of these multipliers and a 128-bit adder, wouldn't it then be easy to do a 64-bit multiply in 4+2 cycles?


          x2 . x1
        * y2 . y1
-------------------
[x2*y2] . [x1*y1]
     + [x1*y2]
     + [x2*y1]
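
The decomposition in the diagram above can be sketched in C as follows. This is only a software illustration of the arithmetic (four 32x32 partial products plus carry handling), not a description of how any particular processor implements its multiplier. The type `u128_parts` and the function `mul64_via_32` are hypothetical names introduced for this example.

```c
#include <stdint.h>

/* 128-bit result held as two 64-bit halves (hypothetical helper type). */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_parts;

/* Full 64x64 -> 128-bit unsigned product built from four 32x32 -> 64-bit
 * partial products, following x = x2.x1 and y = y2.y1 in the diagram. */
u128_parts mul64_via_32(uint64_t x, uint64_t y)
{
    uint64_t x1 = x & 0xFFFFFFFFu, x2 = x >> 32;
    uint64_t y1 = y & 0xFFFFFFFFu, y2 = y >> 32;

    uint64_t ll = x1 * y1;   /* [x1*y1], weight 2^0  */
    uint64_t lh = x1 * y2;   /* [x1*y2], weight 2^32 */
    uint64_t hl = x2 * y1;   /* [x2*y1], weight 2^32 */
    uint64_t hh = x2 * y2;   /* [x2*y2], weight 2^64 */

    /* Sum of everything that lands in bits 32..63. The three terms are
     * each below 2^32, so this cannot overflow 64 bits. */
    uint64_t mid = (ll >> 32) + (lh & 0xFFFFFFFFu) + (hl & 0xFFFFFFFFu);

    u128_parts r;
    r.lo = (mid << 32) | (ll & 0xFFFFFFFFu);
    /* High half: top product, upper halves of the cross terms, and the
     * carry out of the middle column. */
    r.hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    return r;
}
```

Note that the two cross terms overlap both halves of the result, so the final sum needs carry propagation across the middle column rather than a single plain addition.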

capens__nicolas
New Contributor I
Quoting tthsqe

Does the multiply instruction really have a latency of 10-15 cycles on the newer Intel processors? I think the Opteron can take the 128-bit product of two qwords in 5 cycles... why the big gap?

Are you sure you're looking at the numbers for the Core architecture(s)? On NetBurst (Pentium 4) mul took 10+ cycles, but on Core it should be around 4 I believe.

The optimization guide with the processor model numbers can be quite confusing...

Max_L
Employee

Hello, I understand you are speaking of the MUL r64, RAX instruction producing a 128-bit result in RDX:RAX. It has only 3-cycle latency for the low 64-bit (RAX) part of the result (same as for most of the other integer multiplies, scalar and SIMD ones) and 7-cycle latency for the high 64-bit (RDX) part of the result. This instruction is used in long-precision integer arithmetic, where the latency of the high 64-bit part of the result can be hidden. What is your usage of it?

Thank you,

-Max
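
As an illustration of the long-precision use case mentioned above, here is a minimal sketch of multiplying a multi-word integer by a single 64-bit word. It assumes a compiler that provides the `unsigned __int128` extension (GCC and Clang on x86-64 do; it is not standard C). The function name `mul_limbs_by_word` and the little-endian limb layout are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply the n-limb number a[] (least significant limb first) by the
 * 64-bit word w. Writes the low n limbs to out[] and returns the final
 * carry, i.e. limb n of the product.
 * Assumes the compiler supports unsigned __int128 (GCC/Clang extension). */
uint64_t mul_limbs_by_word(uint64_t *out, const uint64_t *a, size_t n,
                           uint64_t w)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Full 128-bit product plus incoming carry; this cannot overflow
         * because (2^64-1)^2 + (2^64-1) < 2^128. */
        unsigned __int128 p = (unsigned __int128)a[i] * w + carry;
        out[i] = (uint64_t)p;          /* low 64 bits of the sum    */
        carry  = (uint64_t)(p >> 64);  /* high 64 bits, carried on  */
    }
    return carry;
}
```

In this loop each product `a[i] * w` is independent of the previous iteration; only the additions form a dependency chain through `carry`. That independence is what lets the multiplies of successive limbs overlap, so the longer latency of the high half need not be paid in full on every step.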
