Latency of a General purpose MOV instruction on Intel CPUs

SergeyKostrov · ‎05-19-2013

Hi everybody,

I'd like to hear from Intel engineers that Latency of a General purpose MOV instruction on any Intel CPUs is 1 clock cycle. For example, I've completed a set of tests on an Intel(R) Pentium(R) 4 CPU 1.60GHz, and my numbers are as follows:

[ Intel C++ compiler - DEBUG ]
...
Overhead of Assignment: 1.091 clock cycles
...

[ Intel C++ compiler - RELEASE ]
...
Overhead of Assignment: 1.191 clock cycles
...

The C code with the assignment looks like:

unsigned __int64 uiClockCycles = __rdtsc();

The value returned from the RDTSC instruction is assigned to the uiClockCycles variable with two General purpose MOV instructions, and it means that 2 clock cycles will actually be spent.
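A minimal sketch of how a back-to-back RDTSC measurement of this kind might be set up is shown below. It is not the test code behind the numbers above: the function name `min_rdtsc_delta` and the sample count are assumptions made for illustration, and the `__rdtsc()` intrinsic is taken from `<intrin.h>` on MSVC or `<x86intrin.h>` on GCC/Clang. It keeps the smallest difference between two consecutive reads, so it bounds the cost of one RDTSC plus the assignment rather than isolating the MOV instructions.

```c
#include <assert.h>
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>      /* __rdtsc on MSVC */
#else
#include <x86intrin.h>   /* __rdtsc on GCC and Clang (x86 targets only) */
#endif

/* Hypothetical helper: smallest observed difference, in TSC ticks,
   between two consecutive RDTSC reads over `samples` attempts. */
uint64_t min_rdtsc_delta(int samples)
{
    uint64_t best = UINT64_MAX;
    int i;

    assert(samples > 0);
    for (i = 0; i < samples; ++i) {
        uint64_t t0 = __rdtsc();   /* first read, assigned to a 64-bit variable */
        uint64_t t1 = __rdtsc();   /* second read immediately afterwards */
        uint64_t d = t1 - t0;
        if (d < best)
            best = d;              /* keep the minimum to filter out interrupts and similar noise */
    }
    return best;
}
```

Calling `min_rdtsc_delta(100000)` returns a tick count, not a MOV latency. On processors with an invariant TSC the ticks are counted at a fixed reference rate, so they do not necessarily equal core clock cycles.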

Thanks in advance.

Bernard · ‎05-20-2013

I think that the two MOV instructions are used to load the high and low parts of the RDTSC value.

SergeyKostrov · ‎05-20-2013

>>>>...and it means, that 2 clock cycles will be actually spent.
>>
>>...I think that two mov instructions are used to load high and low part...

I know this because the value returned from the RDTSC instruction is saved in the EDX and EAX registers, and two MOV instructions are needed to load it into a 64-bit variable. I simply wanted to confirm that a General purpose MOV instruction is always executed in 1 clock cycle on any Intel CPU.

Bernard · ‎05-20-2013

How large a loop counter was needed to measure the latency of the MOV instruction precisely? And how many such measurements did you average?

SergeyKostrov · ‎05-20-2013

Here is a new update.

>>...I'd like to hear from Intel engineers that Latency of a General purpose MOV instruction on any Intel CPUs is 1 clock cycle...

Is that true? I just completed another set of tests and I couldn't get a 1 clock cycle Latency for the MOV instruction on an Ivy Bridge system with an Intel Core i7-3840QM ( 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 ). Here are the test results:

[ Intel C++ compiler ]
...
Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.372 clock cycles
Final RDTSC Overhead Value: 23.628 clock cycles
...

[ Microsoft C++ compiler ]
...
Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.381 clock cycles
Final RDTSC Overhead Value: 23.619 clock cycles
...

Note: '...Overhead of Assignment...' means Latency of the MOV instruction and, as you can see, on the Ivy Bridge system it is less than 1 clock cycle. These values, 0.372 and 0.381 clock cycles, are very consistent ( the same from test to test! ) for the Intel and Microsoft C++ compilers.

Bernard · ‎05-20-2013

On the latest architectures, memory moves are executed by two ports, 2 and 3, in parallel, but I do not know whether this can explain such a low latency.

bronxzv · ‎05-20-2013

Sergey Kostrov wrote:
I'd like to hear from Intel engineers that Latency of a General purpose MOV instruction on any Intel CPUs is 1 clock cycle.

you can find this information for specific implementations in the optimization manual [1], in appendix C.3 "Latency and Throughput". IIRC latency for MOV is 1 clock for all processors. Now, it looks like you are more after reciprocal throughput (since you issue two independent MOVs in your example): rcp throughput is documented as 0.33 for Sandy Bridge/Ivy Bridge, for example (i.e. there are 3 ports available for GPR to GPR moves), but it may be only 0.5 for older processors

[1] : Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026, April 2012
available here: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
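The difference between latency and reciprocal throughput described above can be illustrated with a deliberately simplified model. This is only a sketch: the function names are hypothetical, and the model ignores everything except the two per-instruction figures (no front-end limits, no move elimination, no other instructions competing for ports).

```c
#include <assert.h>

/* Cycles for n MOVs where each one consumes the result of the previous one:
   the chain is bound by latency. */
double dependent_chain_cycles(int n, double latency)
{
    assert(n >= 0 && latency >= 0.0);
    return n * latency;
}

/* Approximate steady-state cycles for n mutually independent MOVs:
   the sequence is bound by reciprocal throughput. */
double independent_cycles(int n, double rcp_throughput)
{
    assert(n >= 0 && rcp_throughput >= 0.0);
    return n * rcp_throughput;
}
```

With the figures quoted above (latency 1, reciprocal throughput 0.33), two dependent moves model as 2 cycles, while two independent moves model as roughly 0.66 cycles.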

Bernard · ‎05-20-2013

@bronxzv

you were faster with your answer about the reciprocal throughput:) I wanted to write exactly the same answer:)

Btw. afaik there are only two ports which are executing load/store instructions.

bronxzv · ‎05-21-2013

iliyapolak wrote:
Btw. afaik there are only two ports which are executing load/store instructions.

load from memory isn't involved in the example at hand, 0.33 is for register to register moves (also for 64-bit MMX and 128-bit XMM registers), the store to memory is not on the critical path in the example at hand (as it's usual for stores)

Bernard · ‎05-21-2013

thanks for correcting my error.

SergeyKostrov · ‎05-21-2013

>>...load from memory isn't involved in the example at hand, 0.33 is for register to register moves...

The question was about the Latency ( for any Intel CPU / unfortunately Intel® 64 and IA-32 Architectures Optimization Reference Manual doesn't list all microarchitectures ) and Not about the Throughput. However, I see that my current test perfectly measured the Throughput of a General purpose MOV instruction on the Ivy Bridge system. Here is a verification for 32-bit and 64-bit codes:

[ Intel C++ compiler - RELEASE - 32-bit ]
...
Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.372 clock cycles
Final RDTSC Overhead Value: 23.628 clock cycles
...

[ Intel C++ compiler - RELEASE - 64-bit ]
...
Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.369 clock cycles
Final RDTSC Overhead Value: 23.631 clock cycles
...

Note: '...Min Overhead of Assignment...' needs to be changed to '...Min Throughput of Assignment...'

SergeyKostrov · ‎05-21-2013

>>...[1] : Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026, April 2012...

I have that Manual and I saw the numbers for MOV instruction. Thanks.

Any comments from Intel engineers?

bronxzv · ‎05-21-2013

Sergey Kostrov wrote:
>>...[1] : Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026, April 2012...

I have that Manual and I saw the numbers for MOV instruction. Thanks.

Any comments from Intel engineers?

as you can see on page C-31 of the optimization manual (written by Intel engineers), the latency was 0.5 for Pentium 4 with the double pumped "Fireball" ALU (signature = 0F_2H), so the answer to your question is clearly no: it isn't 1 clock cycle for all Intel CPUs

SergeyKostrov · ‎05-21-2013

Guys, please pause for a moment and let's wait for a comment from Intel engineers. OK?

perfwise · ‎05-23-2013

Sergey,

I have a suite of 3-4K tests which tell me all the instruction latencies, more precise than anything found on the internet. I get 1 clk on SB/IB for mov. I also monitor the number eliminated via move elimination, and it appears they can eliminate only 1 move per dispatched set of ops, I believe. More food for thought on this, but it's probably 1 clk.

Perfwise

SergeyKostrov · ‎05-23-2013

Here are two more quotes I just found in Intel Manuals:

Intel(R) 64 and IA-32 Architectures Optimization Reference Manual
Order Number: 248966-026
April 2012
C.3.1 Latency and Throughput with Register Operands
...
Processor instruction timing data is implementation specific; it can vary between model encodings within the same family encoding...
...
On Page 738

SergeyKostrov · ‎05-23-2013

Latency: 0F_3H - 1, 0F_2H - 0.5
Throughput: 0F_3H - 0.5, 0F_2H - 0.5

Notes:
0F_3H - Intel Xeon Processor, Intel Xeon Processor MP, Intel Pentium 4, Pentium D processors
0F_2H - Intel Xeon Processor, Intel Xeon Processor MP, Intel Pentium 4 processors

Intel(R) 64 and IA-32 Architectures Software Developer’s Manual
Volume 3 (3A, 3B & 3C): System Programming Guide
Order Number: 325384-044US
August 2012
CHAPTER 35 MODEL-SPECIFIC REGISTERS (MSRS)
...
Table 35-1. CPUID Signature Values of DisplayFamily_DisplayModel
...
On Page 1151

Bernard · ‎05-23-2013

>>>0F_2H - 0.5>>>

So on this model encoding processor latency is 0,5 cycle.

bronxzv · ‎05-23-2013

iliyapolak wrote:

>>>0F_2H - 0.5>>>

So on this model encoding processor latency is 0,5 cycle.

0F_2H (family 15, model 2) is for the P4 Northwood core [2] with its double pumped ALU, AFAIK ALU latencies were the same in the original P4 Willamette [1] with CPUID signature = 0F_1H (family 15, model 1)

with the P4 Prescott [3] (0F_3H, i.e. family 15, model 3) the double pumped "Fireball" ALU was replaced by a regular ALU at core clock, thus latencies increased

[1] http://www.cpu-world.com/CPUs/Pentium_4/TYPE-Desktop%20Pentium%204%20Willamette.html
[2] http://www.cpu-world.com/CPUs/Pentium_4/TYPE-Desktop%20Pentium%204%20Northwood.html
[3] http://www.cpu-world.com/CPUs/Pentium_4/TYPE-Desktop%20Pentium%204%20Prescott.html
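For readers who want to map a raw CPUID value to the family/model notation used above, here is a minimal sketch of the decoding, assuming the value returned in EAX by CPUID leaf 01H is already available. The struct and function names are hypothetical. The formatter pads both fields to two hex digits, which is an illustrative choice, so the signature written as 0F_2H in this thread prints as 0F_02H.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the decoded signature. */
struct cpu_signature {
    unsigned display_family;
    unsigned display_model;
};

/* Decode the value that CPUID leaf 01H returns in EAX. */
struct cpu_signature decode_cpuid_signature(uint32_t eax)
{
    unsigned family     = (eax >> 8)  & 0xF;
    unsigned model      = (eax >> 4)  & 0xF;
    unsigned ext_model  = (eax >> 16) & 0xF;
    unsigned ext_family = (eax >> 20) & 0xFF;
    struct cpu_signature s;

    /* The extended family is added only when the base family is 0FH. */
    s.display_family = (family == 0xF) ? family + ext_family : family;
    /* The extended model is used only for base families 06H and 0FH. */
    s.display_model  = (family == 0x6 || family == 0xF)
                     ? (ext_model << 4) + model : model;
    return s;
}

/* Format as DisplayFamily_DisplayModel, for example "06_3AH". */
void format_signature(struct cpu_signature s, char *out, size_t out_size)
{
    assert(out != NULL && out_size >= 8);
    snprintf(out, out_size, "%02X_%02XH", s.display_family, s.display_model);
}
```

As an illustration, an EAX value of 0x00000F29 decodes to family 15, model 2, and 0x000306A9 decodes to family 6, model 0x3A. Both input values are examples chosen for this sketch, not values taken from the thread.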

Bernard · ‎05-23-2013

Yes it makes sense when double-pumped ALU is taken into account.

Thanks for interesting links.

Btw. it is interesting how the designers of the double pumped ALU were able to double the clock of this unit. I think that the main reason was the low transistor count needed to implement the ALU, and thus lower heat dissipation.

SergeyKostrov · ‎05-26-2013

Feature Request: Please consider adding numbers for Latency and Throughput for all instructions to

Intel® 64 and IA-32 Architectures Software Developer’s Manual ( A )
Volume 2 ( 2A, 2B & 2C ): Instruction Set Reference, A-Z

instead of

Intel® 64 and IA-32 Architectures Optimization Reference Manual ( B )

For example, this is how it would be nice to have ( on page 72 in A ):

...
AAA - ASCII Adjust After Addition

Opcode  Instruction  Op/En  64-bit Mode  Compat/Leg Mode  Description
37      AAA          NP     Invalid      Valid            ASCII adjust AL after addition.

Latency     n1
Throughput  n2

Where n1 and n2 are some numbers.

In that case information about Latency and Throughput for all instructions is consolidated and there is No need to look or search in another Intel Manual(s). Information for different Intel Microarchitectures ( if there are differences ) also could be added in the same way.

Thanks in advance.