How many iterations does your loop have?
I was not able to find the latency of the rdtscp instruction, but I did find that the latency of rdtsc is 24 cycles. Judging by the large variation in the time measurements of your small loop, there is still a possibility of context switches or ISR routines polluting your results.
>>>I measure the time separately in every iteration (and that is what the asm code is for).>>>
If I understood it correctly, you perform the measurement with a time resolution of 1/1e+5. I think that in such a case the total latency of the measured code per iteration can pollute the results, mainly because some uops can retire before the retirement of the rdtscp instruction.
Btw did you try to perform your test this way:

for 0 to 99
    start_clock = rdtscp
    for 0 to 99999
        // loop your code
    stop_clock = rdtscp
There is a quite in-depth paper on this subject at http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-...
The paper develops a measurement methodology which (in its final version) has reliable minimum clockticks with very little variance.
Can you read this article and see if it doesn't answer all of your questions?
It is a bit tricky to determine the "latency" of the RDTSC and RDTSCP instructions on recent processors for a couple of reasons:
On my Xeon E5-2680 (Sandy Bridge EP) systems I typically see a minimum delta for consecutive RDTSC invocations that corresponds to the same elapsed time as "base multiplier + 1" *core* cycles. Examples:
RDTSCP always measures a bit slower than RDTSC. This might be due to the ordering rules, or it might be due to the need to load MSR_TSC_AUX into %ecx. The results are only clean at the nominal 2.7 GHz frequency, where RDTSCP takes 4 extra cycles when I only store the low-order 32 bits of the TSC to memory. Storing the other 32-bit TSC register takes 4 more cycles, and combining the results to perform a single 64-bit store takes 4 more cycles. The difference in reference cycles gets bigger at lower core frequencies, and is a bit smaller at elevated (Turbo) frequencies.
Xeon Phi has a very fast RDTSC -- minimum granularities are 6-10 cycles (depending on how you choose to store the output). Note that this is *not* a "constant-rate" TSC -- it increments at the instantaneous core clock rate (which is obviously going to be a lot easier to implement!) The Xeon Phi SE10P Coprocessors at TACC are configured for fixed-frequency operation, so this does not cause a problem here.
After looking at the results some more, it looks like the RDTSC instruction probably takes (approximately) a fixed number of core cycles -- it is just an accident that the number happens to be close to the multiplier ratio on my systems.
An algorithm that might work would be:
There are not enough cycles to do this with the full integer divide apparatus (which can take over 90 cycles for 64-bit operands), but there are lots of tricks that can be done to reduce the complexity given the small range of sizes of the CurrentMultiplier and ReferenceMultiplier (all plausible values can fit in 6 bits). The answer does not need to be exact -- it needs to guarantee monotonicity and it should probably be within a few cycles of the correct answer (otherwise computed differences will have scatter).
If, for example, an exact algorithm for calculating the RDTSC value looks like the one in your post #9, I wonder how CPU designers could use integer division at the micro-op level, even for the small range of values of CurrentMultiplier and ReferenceMultiplier, and still achieve an RDTSC instruction latency of 24 cycles?
I admit that I don't know any fast divide algorithms that could use the limited size of the denominator to rescale the core clock frequency to the reference clock frequency in the ~20 cycles or so available, but I am not an expert in these kinds of tricks, so I was just guessing that it might be possible.
There are lots of other possible implementation approaches for RDTSC. We can use performance characteristics to make general comments about whether various algorithms seem plausible or not, but I don't expect that a definitive answer is possible from this sort of external "black box" evaluation.
My main point was that the choice to define the TSC in terms of reference cycles rather than core cycles means that an extremely high-speed implementation (e.g., <10 cycles) is harder.
There are a couple of units in each processor that run at a fixed frequency that could be used to implement a counter that increments at a fixed rate -- e.g., the Power Control Unit (PCU) in Xeon E5-26xx runs at 800 MHz and the DDR clock is not changed once the system has completed DRAM configuration. It would be fairly easy to build a free-running incrementer in the PCU that would add 27 increments every 10 ns (8 clock cycles).
For example it could add [3,3,4,3,3,4,3,4] to the counter in each PCU cycle. A core that wanted to read the TSC would issue a special uncached load transaction to get this value from the "global" TSC in the PCU and then add its private TSC offset to get the final value. If you skipped the L1 & L2 cache lookups and sent the load straight to the uncore ring, the latency should be at least ~6 cycles lower than the L3 latency, so the 28 core cycle latency that I am observing seems quite reasonable. This mechanism is also consistent with my observation that TSC deltas gathered in a single experiment never include consecutive values -- if the TSC deltas are different, they are different by at least 2 cycles.
But this is still just a guess -- there are many other possible implementation mechanisms.
None of this explains why the original poster is seeing repeated values of 25, 30, and 55 cycles -- or why some of my experiments showed 31, 34, and 63 cycles. Noting that the largest values in both sets of observations are a bit over 2x the smallest values might suggest something about the implementation, but there are probably too many possible implementations for this to be definitive. In my tests I only accumulate latencies in bins of up to 99 cycles, so I don't have any data on larger values. It is hard to think of mechanisms that would result in a 148 cycle latency, for example. Anything disturbing the processor would likely require a lot more than the extra ~100 cycles, and it is hard to think of mechanisms that might slow down an uncore access by ~100 cycles. On the other hand there are a lot of mechanisms in these machines that no one outside the appropriate subset of the engineering team knows anything about. As long as they are rare they don't seriously impact average performance, and users just need to be aware that very little in these boxes is absolutely guaranteed to have performance repeatability.
When I use rdtsc, I store the floating-point inverse of the clock rate once in a static variable (assuming a fixed rate) so as not to incur the latency of a divide when scaling each measurement. I do get occasional outliers where a time interval suffers a large extra delay, but the results are much more consistent on my old MIC than on my corei7-4 laptop.
As hinted above, the TSC can't be expected to measure intervals shorter than one core clock period, hence the bunching around values related to the clock multiplier. I thought "invariant" TSC meant that the clock multiplier used to produce the TSC is the nominal one built into the CPU, even if you have enabled dynamic variation of the actual clock multiplier (on the models which support that). As long as the core clock rate is fixed, TSC counts should be related linearly to elapsed wall clock time.
The "invariant" (or "constant rate") TSC does increment at the rate of the nominal CPU core clock. The minimum delta between values returned by RDTSC shows some confusing behavior, but the fact that it is confusing is not much of a surprise after you spend some time thinking about how one might implement such a behavior given a dynamically varying core clock. One could imagine an implementation that showed much lower overhead when running at the nominal CPU frequency (just take the free-running cycle count and add it to the base TSC value for the current timing epoch), but it is probably less confusing to have the latency be the same (in core cycles) at any frequency.
Most of the descriptions of the TSC refer to the external reference clock (100 MHz on recent processors) and the nominal CPU core clock multiplier (e.g., 27 on the Xeon E5-2680). There is a performance counter event (Event 3CH, Umask 01H) that counts external reference clock cycles on most Intel processors but which counts external reference clock cycles times the nominal CPU core clock multiplier (i.e., the same rate as the TSC) on some processors (notably Nehalem & Westmere). Chapter 19 of Vol 3 of the SW developer's manual has been updated to list the actual behavior of each processor model. On the systems where this counts at the scaled rate (like the TSC) I have not checked to see if user-mode RDPMC on this event shows the same characteristics as we have been discussing for the RDTSC and RDTSCP instructions.
>>>It is hard to think of mechanisms that would result in a 148 cycle latency, for example. Anything disturbing the processor would likely require a lot more than the extra ~100 cycles, and it is hard to think of mechanisms that might slow down an uncore access by ~100 cycles. On the other hand there are a lot of mechanisms in these machines that no one outside the appropriate subset of the engineering team knows anything about. As long as they are rare they don't seriously impact average performance, and users just need to be aware that very little in these boxes is absolutely guaranteed to have performance repeatability.>>>
I completely agree with you.
By looking at the original poster's results, I was thinking about the possibility that the variation in the RDTSC measurements could be due to parallel execution of the uop(s) RDTSC is converted into and the uops representing the measured machine code. For example, if the measured code is broken down into uop(s) which use ALU components like the adder unit, and if the RDTSC implementation at the same time needs to increment the counter by adding 27 increments, there would be some kind of stall where one uop must wait for the completion of the other. It still does not explain why there is such variation when the same code is measured across N loop iterations.
I've attached my rdtscp timing code based on the paper above. See the uploaded file rdtsc_mod2.c. The code runs as a kernel module; the Makefile is uploaded as Makefile.txt. It disables preemption and interrupts, but it still gets interrupted by something. It runs 1000 outer loops and 100,000 inner loops, calculates the time in cycles, accumulates the times in an array for each test, and then calculates variances. The minimum time for rdtscp (that is, the latency) can still vary wildly, from 36 to 150 cycles. To control the variation in the minimum time of rdtscp, I have to fix the CPU frequency. I uploaded my set_speed.sh script as set_speed.sh.txt. Also, although the script lists lots of frequencies as available, only for some of them could I set the min and max frequency to the same value.
On my Ivy Bridge laptop, once I controlled the frequency, I found that the rdtscp latency was:
120 cycles at 1.0 GHz
75 cycles at 1.6 GHz
60 cycles at 1.9 GHz
36-37 cycles at 2.0+ GHz
The code demonstrates how to get a reliable minimum latency for rdtscp (or any other function inserted at the 'call the function to measure here' comment in rdtsc_mod2.c). The output shows how many times (for each outer loop) you didn't get the minimum latency. It also shows the variance.
To compile the module just do 'make' (after renaming Makefile.txt to Makefile).
To run the test:
sudo insmod rdtsc_mod2.ko
# wait for cmd prompt to return... a minute or so... takes more time at lower freqs
dmesg > tmp.txt
Search backwards from the bottom of the file for 'hello'. This marks the top of the last time you ran.
I've attached an output file for the case where I've set the cpu frequency to 2.0 Ghz.