Hi everybody,
I'd like Intel engineers to confirm that the latency of a general-purpose MOV instruction is 1 clock cycle on all Intel CPUs. For example, I've completed a set of tests on an Intel(R) Pentium(R) 4 CPU 1.60GHz, and my numbers are as follows:
[ Intel C++ compiler - DEBUG ]
...
Overhead of Assignment: 1.091 clock cycles
...
[ Intel C++ compiler - RELEASE ]
...
Overhead of Assignment: 1.191 clock cycles
...
The C code with the assignment looks like:
unsigned __int64 uiClockCycles = __rdtsc();
The value returned by the RDTSC instruction is assigned to the uiClockCycles variable with two general-purpose MOV instructions, which means that 2 clock cycles will actually be spent.
Thanks in advance.
I think the two MOV instructions are used to load the high and low parts of the RDTSC value.
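To illustrate the point about the high and low parts: RDTSC returns its 64-bit result split across EDX (high 32 bits) and EAX (low 32 bits), so a 32-bit build has to move two halves. A minimal sketch of how the halves fit together; the function name combine_tsc_halves is hypothetical and only for illustration:

```c
#include <stdint.h>

/* Joins the two 32-bit halves that RDTSC leaves in EDX (high) and EAX (low).
   combine_tsc_halves is a hypothetical helper name, not a compiler intrinsic. */
uint64_t combine_tsc_halves(uint32_t high, uint32_t low) {
    /* Widen before shifting so the high half is not shifted out of a 32-bit value. */
    return ((uint64_t)high << 32) | low;
}
```

For example, combine_tsc_halves(1, 2) gives 0x100000002. In a 64-bit build the compiler may combine the halves in registers instead of storing them separately, so the number of MOVs can differ.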
How large a loop count did you need to measure the latency of the MOV instruction precisely? And how many such measurements did you average?
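For reference, here is one possible way to average such a measurement over many iterations. This is only a sketch under assumptions, not the harness used for the numbers above: the names counter_fn, tsc_counter, fake_counter and average_overhead are all hypothetical, and the loop times a pair of back-to-back counter reads (including the assignment), which does not isolate a single MOV.

```c
#include <assert.h>
#include <stdint.h>

/* Any function that returns a monotonically increasing tick count. */
typedef uint64_t (*counter_fn)(void);

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>
/* Reads the time-stamp counter (GCC/Clang; MSVC declares __rdtsc in <intrin.h>). */
uint64_t tsc_counter(void) { return __rdtsc(); }
#endif

/* Deterministic stand-in that advances 3 ticks per call.
   It exists only to check the averaging arithmetic without real hardware. */
uint64_t fake_counter(void) {
    static uint64_t ticks = 0;
    ticks += 3;
    return ticks;
}

/* Average tick difference between two back-to-back counter reads. */
double average_overhead(counter_fn read_counter, unsigned iterations) {
    assert(iterations > 0);
    uint64_t total = 0;
    for (unsigned i = 0; i < iterations; ++i) {
        uint64_t start = read_counter();
        uint64_t stop = read_counter();
        total += stop - start;
    }
    return (double)total / iterations;
}
```

On x86 one would call average_overhead(tsc_counter, N) with a large N. Out-of-order execution, interrupts and frequency scaling all disturb RDTSC-based timings, so the average is an estimate at best.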
On the latest architecture, memory moves are executed by two ports, Port 2 and Port 3, in parallel, but I do not know whether this can explain such a low latency.
Sergey Kostrov wrote:
I'd like to hear from Intel engineers that Latency of a General purpose MOV instruction on any Intel CPUs is 1 clock cycle.
You can find this information for specific implementations in the optimization manual [1], in Appendix C.3, Latency and Throughput. IIRC, the latency of MOV is 1 clock for all processors. However, it looks like you are really after reciprocal throughput, since you issue two independent MOVs in your example. Reciprocal throughput is documented as 0.33 for Sandy Bridge/Ivy Bridge, for example (i.e. there are 3 ports available for GPR-to-GPR moves), but it may be only 0.5 for older processors.
[1] : Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026, April 2012
available here: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
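The difference between latency and reciprocal throughput can be sketched with a deliberately simplified model: MOVs that depend on each other pay the full latency one after another, while independent MOVs are limited only by how many can issue per cycle. The function names below are hypothetical, and the model ignores everything else in the pipeline (decode, retirement, other instructions competing for ports).

```c
#include <assert.h>

/* Rough cycle estimate for n MOVs where each one waits for the previous result. */
double dependent_chain_cycles(unsigned n, double latency) {
    assert(latency > 0.0);
    return n * latency;
}

/* Rough cycle estimate for n independent MOVs limited only by issue rate. */
double independent_cycles(unsigned n, double rcp_throughput) {
    assert(rcp_throughput > 0.0);
    return n * rcp_throughput;
}
```

With a latency of 1 clock, two dependent MOVs come out at 2 cycles, while two independent MOVs at a reciprocal throughput of 0.5 come out at 1 cycle, which is closer to the numbers measured at the top of the thread.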
@bronxzv
You were faster with your answer about the reciprocal throughput :) I wanted to write exactly the same answer :)
Btw, AFAIK there are only two ports that execute load/store instructions.
iliyapolak wrote:
Btw. afaik there are only two ports which are executing load/store instructions.
A load from memory isn't involved in the example at hand; 0.33 is for register-to-register moves (also for 64-bit MMX and 128-bit XMM registers). The store to memory is not on the critical path in the example at hand (as is usual for stores).
Thanks for correcting my error.
Sergey Kostrov wrote:
>>...[1] : Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026, April 2012...
I have that Manual and I saw the numbers for MOV instruction. Thanks.
Any comments from Intel engineers?
As you can see on page C-31 of the optimization manual (written by Intel engineers), the latency was 0.5 for the Pentium 4 with the double-pumped "Fireball" ALU (signature = 0F_2H), so the answer to your question is clearly no: it isn't 1 clock cycle for all Intel CPUs.
Sergey,
I have a suite of 3-4K tests which tell me all the instruction latencies, more precise than anything found on the internet. I get 1 clk on SB/IB for MOV. I also monitor the number of moves eliminated via move elimination, and it appears they can eliminate only 1 move per dispatched set of ops, I believe. More food for thought on this, but it's probably 1 clk.
Perfwise
>>>0F_2H - 0.5>>>
So on a processor with this model encoding, the latency is 0.5 cycle.
iliyapolak wrote:
>>>0F_2H - 0.5>>>
So on a processor with this model encoding, the latency is 0.5 cycle.
0F_2H (family 15, model 2) is the P4 Northwood core [2] with its double-pumped ALU. AFAIK, ALU latencies were the same in the original P4 Willamette [1], with CPUID signature = 0F_1H (family 15, model 1).
With the P4 Prescott [3] (0F_3H, i.e. family 15, model 3), the double-pumped "Fireball" ALU was replaced by a regular ALU running at core clock, so latencies increased.
[1] http://www.cpu-world.com/CPUs/Pentium_4/TYPE-Desktop%20Pentium%204%20Willamette.html
[2] http://www.cpu-world.com/CPUs/Pentium_4/TYPE-Desktop%20Pentium%204%20Northwood.html
[3] http://www.cpu-world.com/CPUs/Pentium_4/TYPE-Desktop%20Pentium%204%20Prescott.html
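The family/model notation used above (0F_2H and so on) can be derived from the value CPUID leaf 1 returns in EAX. A minimal sketch of that decoding follows; the struct and function names are hypothetical, and the sample value 0x0F27 mentioned below is only an illustrative signature chosen to match family 15, model 2.

```c
#include <stdint.h>

/* Hypothetical container for the decoded fields of a CPUID signature. */
typedef struct {
    unsigned family;
    unsigned model;
    unsigned stepping;
} cpu_signature;

/* Decodes the EAX value of CPUID leaf 1 into display family, model and stepping. */
cpu_signature decode_signature(uint32_t eax) {
    cpu_signature s;
    unsigned base_family = (eax >> 8) & 0xF;
    unsigned base_model = (eax >> 4) & 0xF;
    unsigned ext_family = (eax >> 20) & 0xFF;
    unsigned ext_model = (eax >> 16) & 0xF;

    s.stepping = eax & 0xF;
    /* The extended family is added only when the base family is 0xF. */
    s.family = (base_family == 0xF) ? base_family + ext_family : base_family;
    /* The extended model supplies the upper 4 bits for base family 6 or 0xF. */
    s.model = (base_family == 0x6 || base_family == 0xF)
                  ? (ext_model << 4) | base_model
                  : base_model;
    return s;
}
```

With this sketch, decode_signature(0x0F27) yields family 15, model 2, i.e. the 0F_2H notation used for Northwood in the manual's tables.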
Yes, it makes sense when the double-pumped ALU is taken into account.
Thanks for the interesting links.
Btw, it is interesting how the designers of the double-pumped ALU were able to double the clock of this unit. I think the main reason was the low transistor count needed to implement the ALU, and thus lower heat dissipation.