topic @bronxzv in Intel® ISA Extensions

Latency of a General purpose MOV instruction on Intel CPUs

SergeyKostrov — Mon, 20 May 2013 04:03:26 GMT

Hi everybody,

I'd like to hear from Intel engineers whether the latency of a general-purpose MOV instruction is 1 clock cycle on every Intel CPU. For example, I've completed a set of tests on an Intel(R) Pentium(R) 4 CPU 1.60GHz, and my numbers are as follows:

[ Intel C++ compiler - DEBUG ]
...
Overhead of Assignment: 1.091 clock cycles
...

[ Intel C++ compiler - RELEASE ]
...
Overhead of Assignment: 1.191 clock cycles
...

The C code with the assignment looks like:

unsigned __int64 uiClockCycles = __rdtsc();

The value returned by the RDTSC instruction is assigned to the uiClockCycles variable with two general-purpose MOV instructions, which means that 2 clock cycles will actually be spent.

Thanks in advance.
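
The test harness behind these numbers is not shown in the thread. As a rough sketch only, one common way to estimate the cost of reading the time-stamp counter is to take two back-to-back readings and keep the smallest difference over many iterations. The helper name min_rdtsc_delta and its iterations parameter are assumptions for illustration, and the header selection assumes an x86 target built with MSVC, GCC, or Clang.

```c
#include <stdint.h>

#if defined(_MSC_VER)
#include <intrin.h>      /* __rdtsc on MSVC */
#else
#include <x86intrin.h>   /* __rdtsc on GCC/Clang (x86 targets only) */
#endif

/* Hypothetical helper, not the original poster's code: returns the smallest
   difference seen between two back-to-back RDTSC readings.
   Returns UINT64_MAX when iterations <= 0 (nothing was measured). */
uint64_t min_rdtsc_delta(int iterations)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < iterations; ++i) {
        uint64_t start = __rdtsc();   /* first reading, stored to a variable */
        uint64_t stop  = __rdtsc();   /* second reading, taken right after */
        uint64_t delta = stop - start;
        if (delta < best)
            best = delta;             /* keep the minimum to filter out noise */
    }
    return best;
}
```

A call such as min_rdtsc_delta(1000000) would give the smallest delta in TSC ticks. RDTSC is not a serializing instruction, so a figure obtained this way reflects overlapped execution rather than the latency of any single MOV.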

Bernard — Mon, 20 May 2013 16:51:18 GMT

I think the two MOV instructions are used to load the high and low parts of the RDTSC value.

SergeyKostrov — Mon, 20 May 2013 17:38:02 GMT

>>>>...and it means, that 2 clock cycles will be actually spent.
>>
>>...I think that two mov instructions are used to load high and low part...

I know this because the value returned by the RDTSC instruction is saved in the EDX and EAX registers, and two MOV instructions are needed to load it into a 64-bit variable. I simply wanted to confirm that a general-purpose MOV instruction always executes in 1 clock cycle on any Intel CPU.
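
To make the EDX:EAX point concrete, here is a minimal sketch of how the two 32-bit halves form one 64-bit value. The function name combine_edx_eax is hypothetical; in 32-bit code a compiler would typically store each half with its own MOV, which is where the two MOV instructions come from.

```c
#include <stdint.h>

/* Hypothetical helper: RDTSC leaves the high 32 bits of the counter in EDX
   and the low 32 bits in EAX; this joins them into one 64-bit value. */
uint64_t combine_edx_eax(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | (uint64_t)eax;
}
```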

Bernard — Mon, 20 May 2013 18:41:00 GMT

How large a loop counter was needed to measure the latency of the MOV instruction precisely? And how many such measurements did you average?

SergeyKostrov — Tue, 21 May 2013 00:57:03 GMT

Here is a new update.

>>...I'd like to hear from Intel engineers that Latency of a General purpose MOV instruction on any Intel CPUs is 1 clock cycle...

Is that true? I just completed another set of tests and I couldn't get 1 clock cycle latency for the MOV instruction on an Ivy Bridge system with an Intel Core i7-3840QM ( 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 ).

Here are the test results:

[ Intel C++ compiler ]
...
Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.372 clock cycles
Final RDTSC Overhead Value: 23.628 clock cycles
...

[ Microsoft C++ compiler ]
...
Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.381 clock cycles
Final RDTSC Overhead Value: 23.619 clock cycles
...

Note: '...Overhead of Assignment...' means the latency of the MOV instruction and, as you can see, on the Ivy Bridge system it is less than 1 clock cycle. These values, 0.372 and 0.381 clock cycles, are very consistent ( the same from test to test! ) for the Intel and Microsoft C++ compilers.

Bernard — Tue, 21 May 2013 04:55:04 GMT

On the latest architectures memory moves are executed by two ports, 2 and 3, in parallel, but I do not know whether this can explain such a low latency.

bronxzv — Tue, 21 May 2013 05:59:00 GMT

Sergey Kostrov wrote:
I'd like to hear from Intel engineers that Latency of a General purpose MOV instruction on any Intel CPUs is 1 clock cycle.

You can find this information for specific implementations in the optimization manual [1], in appendix C.3 Latency and Throughput. IIRC latency for MOV is 1 clock for all processors. It looks like you are more after reciprocal throughput, though, since you issue two independent MOVs in your example. Reciprocal throughput is documented as 0.33 for Sandy Bridge/Ivy Bridge, for example (i.e. there are 3 ports available for GPR-to-GPR moves), but may be only 0.5 for older processors.

[1] : Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026, April 2012
available here: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
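
The difference between latency and reciprocal throughput can be illustrated with a deliberately simplified model. This is a back-of-the-envelope sketch, not a simulation of any real pipeline: the function names are hypothetical, and it assumes every port can start one MOV per cycle and that nothing else competes for those ports.

```c
#include <stdint.h>

/* Simplified model: n MOVs where each one depends on the previous result.
   They cannot overlap, so the total is n times the latency. */
uint64_t cycles_dependent(uint64_t n, uint64_t latency)
{
    return n * latency;
}

/* Simplified model: n independent MOVs spread over `ports` execution ports.
   Up to `ports` MOVs start per cycle; the last group finishes `latency`
   cycles after it starts. Returns 0 for n == 0 or ports == 0. */
uint64_t cycles_independent(uint64_t n, uint64_t latency, uint64_t ports)
{
    if (n == 0 || ports == 0)
        return 0;
    uint64_t issue_cycles = (n + ports - 1) / ports;  /* ceil(n / ports) */
    return issue_cycles - 1 + latency;
}
```

With latency 1 and 3 ports, 300 dependent MOVs take 300 cycles in this model (1 cycle each), while 300 independent MOVs take 100 cycles, or about 0.33 cycles per MOV, which matches the documented reciprocal throughput quoted above.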

Bernard — Tue, 21 May 2013 06:41:18 GMT

@bronxzv

you were faster with your answer about the reciprocal throughput:) I wanted to write exactly the same answer:)

Btw. afaik there are only two ports which are executing load/store instructions.

bronxzv — Tue, 21 May 2013 07:02:00 GMT

iliyapolak wrote:
Btw. afaik there are only two ports which are executing load/store instructions.

load from memory isn't involved in the example at hand; 0.33 is for register to register moves (also for 64-bit MMX and 128-bit XMM registers). The store to memory is not on the critical path in the example at hand (as is usual for stores).

Bernard — Tue, 21 May 2013 07:09:48 GMT

thanks for correcting my error.

SergeyKostrov — Tue, 21 May 2013 12:46:00 GMT

>>...load from memory isn't involved in the example at hand, 0.33 is for register to register moves...

The question was about the latency ( for any Intel CPU / unfortunately the Intel® 64 and IA-32 Architectures Optimization Reference Manual doesn't list all microarchitectures ) and not about the throughput. However, I see that my current test perfectly measured the throughput of a general-purpose MOV instruction on the Ivy Bridge system.

Here is a verification for 32-bit and 64-bit code:

[ Intel C++ compiler - RELEASE - 32-bit ]
...
Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.372 clock cycles
Final RDTSC Overhead Value: 23.628 clock cycles
...

[ Intel C++ compiler - RELEASE - 64-bit ]
...
Test-Case 1.3 - Overhead of Assignment of a Value from RDTSC instruction
Min Overhead of Assignment: 0.369 clock cycles
Final RDTSC Overhead Value: 23.631 clock cycles
...

Note: '...Min Overhead of Assignment...' needs to be changed to '...Min Throughput of Assignment...'

SergeyKostrov — Tue, 21 May 2013 12:51:29 GMT

>>...[1] : Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026, April 2012...

I have that Manual and I saw the numbers for MOV instruction. Thanks.

Any comments from Intel engineers?

bronxzv — Tue, 21 May 2013 13:27:00 GMT

Sergey Kostrov wrote:
>>...[1] : Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026, April 2012...

I have that Manual and I saw the numbers for MOV instruction. Thanks.

Any comments from Intel engineers?

As you can see on page C-31 of the optimization manual (written by Intel engineers), the latency was 0.5 for the Pentium 4 with the double-pumped "Fireball" ALU (signature = 0F_2H), so the answer to your question is clearly no: it isn't 1 clock cycle for all Intel CPUs.

SergeyKostrov — Tue, 21 May 2013 13:39:00 GMT

Guys, please pause for a moment and let's wait for a comment from Intel engineers. OK?

perfwise — Thu, 23 May 2013 11:51:02 GMT

Sergey,

I have a suite of 3-4K tests which tell me all the instruction latencies, more precisely than anything found on the internet. I get 1 clk on SB/IB for mov. I also monitor the number eliminated via move elimination, and it appears they can eliminate only 1 move per dispatched set of ops, I believe. More food for thought on this, but it's probably 1 clk.

Perfwise

SergeyKostrov — Thu, 23 May 2013 13:00:07 GMT

Here are two more quotes I just found in Intel Manuals:

Intel(R) 64 and IA-32 Architectures Optimization Reference Manual
Order Number: 248966-026
April 2012

C.3.1 Latency and Throughput with Register Operands
...
Processor instruction timing data is implementation specific; it can vary between model encodings within the same family encoding...
...
On Page 738

SergeyKostrov — Thu, 23 May 2013 13:02:07 GMT

Latency:
0F_3H - 1
0F_2H - 0.5

Throughput:
0F_3H - 0.5
0F_2H - 0.5

Notes:
0F_3H - Intel Xeon Processor, Intel Xeon Processor MP, Intel Pentium 4, Pentium D processors
0F_2H - Intel Xeon Processor, Intel Xeon Processor MP, Intel Pentium 4 processors

Intel(R) 64 and IA-32 Architectures Software Developer’s Manual
Volume 3 (3A, 3B & 3C): System Programming Guide
Order Number: 325384-044US
August 2012

CHAPTER 35 MODEL-SPECIFIC REGISTERS (MSRS)
...
Table 35-1. CPUID Signature Values of DisplayFamily_DisplayModel
...
On Page 1151

Bernard — Thu, 23 May 2013 16:30:12 GMT

>>>0F_2H - 0.5>>>

So on processors with this model encoding the latency is 0.5 cycle.

bronxzv — Thu, 23 May 2013 17:58:00 GMT

iliyapolak wrote:

>>>0F_2H - 0.5>>>

So on this model encoding processor latency is 0,5 cycle.

0F_2H (family 15, model 2) is for the P4 Northwood core [2] with its double pumped ALU, AFAIK ALU latencies were the same in the original P4 Willamette [1] with CPUID signature = 0F_1H (family 15, model 1)

With the P4 Prescott [3] (0F_3H, i.e. family 15, model 3) the double-pumped "Fireball" ALU was replaced by a regular ALU running at core clock, thus latencies increased.

[1] http://www.cpu-world.com/CPUs/Pentium_4/TYPE-Desktop%20Pentium%204%20Willamette.html
[2] http://www.cpu-world.com/CPUs/Pentium_4/TYPE-Desktop%20Pentium%204%20Northwood.html
[3] http://www.cpu-world.com/CPUs/Pentium_4/TYPE-Desktop%20Pentium%204%20Prescott.html
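
For readers who want to map a raw CPUID value to the DisplayFamily_DisplayModel notation used above, the following sketch decodes the EAX value returned by CPUID leaf 1 using the family and model rules from the Intel SDM. The struct and function names are hypothetical, and the sample values mentioned below are illustrative inputs rather than readings taken from the processors discussed in the thread.

```c
#include <stdint.h>

/* Hypothetical container for the decoded signature. */
typedef struct {
    uint32_t family;  /* DisplayFamily */
    uint32_t model;   /* DisplayModel  */
} display_sig;

/* Decodes EAX from CPUID leaf 1:
   bits 7:4 model, 11:8 family, 19:16 extended model, 27:20 extended family.
   The extended family is added only when family == 0xF; the extended model
   is used only when family is 0x6 or 0xF. */
display_sig decode_signature(uint32_t eax)
{
    uint32_t model      = (eax >> 4)  & 0xF;
    uint32_t family     = (eax >> 8)  & 0xF;
    uint32_t ext_model  = (eax >> 16) & 0xF;
    uint32_t ext_family = (eax >> 20) & 0xFF;

    display_sig s;
    s.family = (family == 0xF) ? family + ext_family : family;
    s.model  = (family == 0x6 || family == 0xF) ? (ext_model << 4) + model
                                                : model;
    return s;
}
```

An input of 0x00000F29 decodes to family 0x0F, model 0x02, i.e. the 0F_2H encoding, while 0x000306A9 decodes to family 0x06, model 0x3A.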

Bernard — Fri, 24 May 2013 04:34:00 GMT

Yes, it makes sense when the double-pumped ALU is taken into account.

Thanks for the interesting links.

Btw, it is interesting how the designers of the double-pumped ALU were able to double the clock of this unit. I think the main reason was the low transistor count needed to implement the ALU, and thus lower heat dissipation.