Valued Contributor II
66 Views

Latency of a General purpose MOV instruction on Intel CPUs

Hi everybody,

I'd like Intel engineers to confirm that the latency of a general-purpose MOV instruction is 1 clock cycle on any Intel CPU. For example, I've completed a set of tests on an Intel(R) Pentium(R) 4 CPU 1.60GHz, and my numbers are as follows:

[ Intel C++ compiler - DEBUG ]
...
Overhead of Assignment: 1.091 clock cycles
...

[ Intel C++ compiler - RELEASE ]
...
Overhead of Assignment: 1.191 clock cycles
...

The C code with the assignment looks like:

unsigned __int64 uiClockCycles = __rdtsc();

The value returned by the RDTSC instruction is assigned to the uiClockCycles variable with two general-purpose MOV instructions, which means that 2 clock cycles are actually spent.

Thanks in advance.

0 Kudos
22 Replies
Highlighted
Black Belt
63 Views

I think the two MOV instructions are used to load the high and low parts of the RDTSC value.

0 Kudos
Highlighted
Valued Contributor II
63 Views

>>>>...and it means, that 2 clock cycles will be actually spent.
>>
>>...I think that two mov instructions are used to load high and low part...

I know this, because the value returned by the RDTSC instruction is placed in the EDX and EAX registers, and two MOV instructions are needed to store it in a 64-bit variable. I simply wanted to confirm that a general-purpose MOV instruction always executes in 1 clock cycle on any Intel CPU.
0 Kudos
Highlighted
Black Belt
63 Views

How large a loop counter was needed to measure the latency of the MOV instruction precisely? And how many such measurements did you average?

0 Kudos
Highlighted
Valued Contributor II
63 Views

Here is a new update.

>>...I'd like to hear from Intel engineers that Latency of a General purpose MOV instruction on any Intel CPUs is 1 clock cycle...

Is that true? I just completed another set of tests and couldn't get a 1-clock-cycle latency for the MOV instruction on an Ivy Bridge system with an Intel Core i7-3840QM ( 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 ). Here are the test results:

[ Intel C++ compiler ]
...
Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.372 clock cycles
Final RDTSC Overhead Value: 23.628 clock cycles
...

[ Microsoft C++ compiler ]
...
Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.381 clock cycles
Final RDTSC Overhead Value: 23.619 clock cycles
...

Note: '...Overhead of Assignment...' means the latency of the MOV instruction, and as you can see, on the Ivy Bridge system it is less than 1 clock cycle. The values 0.372 and 0.381 clock cycles are very consistent ( the same from test to test! ) for the Intel and Microsoft C++ compilers.
0 Kudos
Highlighted
Black Belt
63 Views

On the latest architecture, memory moves are executed by two ports (Ports 2 and 3) in parallel, but I do not know whether this can explain such a low latency.

0 Kudos
Highlighted
New Contributor II
63 Views

Sergey Kostrov wrote:
I'd like to hear from Intel engineers that Latency of a General purpose MOV instruction on any Intel CPUs is 1 clock cycle.

You can find this information for specific implementations in the optimization manual [1], in Appendix C.3, Latency and Throughput. IIRC the latency of MOV is 1 clock for all processors. It looks like you are really after the reciprocal throughput, though, since you issue two independent MOVs in your example. The reciprocal throughput is documented as 0.33 for Sandy Bridge/Ivy Bridge, for example (i.e. there are 3 ports available for GPR-to-GPR moves), but it may be only 0.5 for older processors.

[1] : Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026, April 2012
available here: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimizati...

0 Kudos
Highlighted
Black Belt
63 Views

@bronxzv

You were faster with your answer about the reciprocal throughput :) I wanted to write exactly the same answer :)

Btw., afaik there are only two ports that execute load/store instructions.

0 Kudos
Highlighted
New Contributor II
63 Views

iliyapolak wrote:
Btw. afaik there are only two ports which are executing load/store instructions.

A load from memory isn't involved in the example at hand; 0.33 is for register-to-register moves (also for 64-bit MMX and 128-bit XMM registers). The store to memory is not on the critical path in the example at hand (as is usual for stores).

0 Kudos
Highlighted
Black Belt
63 Views

thanks for correcting my error.

0 Kudos
Highlighted
Valued Contributor II
63 Views

>>...load from memory isn't involved in the example at hand, 0.33 is for register to register moves...

The question was about the latency ( for any Intel CPU / unfortunately the Intel® 64 and IA-32 Architectures Optimization Reference Manual doesn't list all microarchitectures ) and not about the throughput. However, I see that my current test perfectly measured the throughput of a general-purpose MOV instruction on the Ivy Bridge system. Here is a verification for 32-bit and 64-bit code:

[ Intel C++ compiler - RELEASE - 32-bit ]
...
Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.372 clock cycles
Final RDTSC Overhead Value: 23.628 clock cycles
...

[ Intel C++ compiler - RELEASE - 64-bit ]
...
Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.369 clock cycles
Final RDTSC Overhead Value: 23.631 clock cycles
...

Note: '...Min Overhead of Assignment...' needs to be changed to '...Min Throughput of Assignment...'
0 Kudos
Highlighted
Valued Contributor II
63 Views

>>...[1] : Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026, April 2012...

I have that Manual and I saw the numbers for MOV instruction. Thanks.

Any comments from Intel engineers?
0 Kudos
Highlighted
New Contributor II
63 Views

Sergey Kostrov wrote:
>>...[1] : Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026, April 2012...

I have that Manual and I saw the numbers for MOV instruction. Thanks.

Any comments from Intel engineers?

As you can see on page C-31 of the optimization manual (written by Intel engineers), the latency was 0.5 for the Pentium 4 with the double-pumped "Fireball" ALU (signature = 0F_2H), so the answer to your question is clearly no: it isn't 1 clock cycle for all Intel CPUs.

0 Kudos
Highlighted
Valued Contributor II
63 Views

Guys, please pause for a moment and let's wait for a comment from Intel engineers. OK?
0 Kudos
Highlighted
Beginner
63 Views

Sergey,

   I have a suite of 3-4K tests which tell me all the instruction latencies, more precisely than anything found on the internet. I get 1 clk on SB/IB for MOV. I also monitor the number of moves eliminated via move elimination, and it appears they can eliminate only 1 move per dispatched set of ops, I believe. More food for thought on this, but it's probably 1 clk.

Perfwise

0 Kudos
Highlighted
Valued Contributor II
63 Views

Here are two more quotes I just found in Intel Manuals:

Intel(R) 64 and IA-32 Architectures Optimization Reference Manual
Order Number: 248966-026
April 2012

C.3.1 Latency and Throughput with Register Operands
...
Processor instruction timing data is implementation specific; it can vary between model encodings within the same family encoding...
...
On Page 738
0 Kudos
Highlighted
Valued Contributor II
63 Views

Latency:
0F_3H - 1
0F_2H - 0.5

Throughput:
0F_3H - 0.5
0F_2H - 0.5

Notes:
0F_3H - Intel Xeon Processor, Intel Xeon Processor MP, Intel Pentium 4, Pentium D processors
0F_2H - Intel Xeon Processor, Intel Xeon Processor MP, Intel Pentium 4 processors

Intel(R) 64 and IA-32 Architectures Software Developer’s Manual
Volume 3 (3A, 3B & 3C): System Programming Guide
Order Number: 325384-044US
August 2012

CHAPTER 35 MODEL-SPECIFIC REGISTERS (MSRS)
...
Table 35-1. CPUID Signature Values of DisplayFamily_DisplayModel
...
On Page 1151
0 Kudos
Highlighted
Black Belt
63 Views

>>>0F_2H - 0.5>>>

So on processors with this model encoding, the latency is 0.5 cycle.

0 Kudos
Highlighted
New Contributor II
63 Views

iliyapolak wrote:

>>>0F_2H - 0.5>>>

So on this model encoding processor latency is 0,5 cycle.

0F_2H (family 15, model 2) is the P4 Northwood core [2] with its double-pumped ALU. AFAIK ALU latencies were the same in the original P4 Willamette [1], with CPUID signature = 0F_1H (family 15, model 1).

With the P4 Prescott [3] (0F_3H, i.e. family 15, model 3), the double-pumped "Fireball" ALU was replaced by a regular ALU running at core clock, so latencies increased.

[1] http://www.cpu-world.com/CPUs/Pentium_4/TYPE-Desktop%20Pentium%204%20Willamette.html
[2] http://www.cpu-world.com/CPUs/Pentium_4/TYPE-Desktop%20Pentium%204%20Northwood.html
[3] http://www.cpu-world.com/CPUs/Pentium_4/TYPE-Desktop%20Pentium%204%20Prescott.html

 

0 Kudos
Highlighted
Black Belt
63 Views

Yes, it makes sense when the double-pumped ALU is taken into account.

Thanks for the interesting links.

Btw., it is interesting how the designers of the double-pumped ALU were able to double the clock of this unit. I think the main reason was the low transistor count needed to implement the ALU, and thus lower heat dissipation.

0 Kudos