Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Precision of rdtscp

Jochen_H_
Beginner
2,017 Views
I'm trying to measure execution times with hardware performance counters on a Haswell 4770 processor. In my case it is just a few instructions, so I thought it is best to test the method itself. Here is what I do (in gcc extended assembler):

    "rdtscp             # Timestamp in EDX:EAX and IA32_TSC_AUX \n\t"\
    "rdtscp             # Timestamp in EDX:EAX and IA32_TSC_AUX \n\t"\
    "shl $32, %%rdx     # Shift the upper bits left.            \n\t"\
    "or %%rdx, %%rax    # 'Or' in the lower bits.               \n\t"\
    "mov %%rax, %%r12   # Save the value here                   \n\t"\
    "rdtscp             # Timestamp in EDX:EAX and IA32_TSC_AUX \n\t"\
    "shl $32, %%rdx     # Shift the upper bits left.            \n\t"\
    "or %%rdx, %%rax    # 'Or' in the lower bits.               \n\t"\
    "sub %%r12, %%rax   # Compute the time that passed          \n\t"\

I do this in a loop and would have expected the same deterministic delta (left in %%rax) for every execution. However, I see values from 20 to 188247 cycles, with the majority in the range 20 to 135. The highest values are probably due to context switches (they occur with very low likelihood). Putting this code into a transaction (so that no interrupt etc. could disturb it; only counting measurements from successful transactions) did not improve the situation either: then I see 20 to 148 cycles.

Is my assumption wrong that a serializing instruction plus only accessing registers should give the same value for every execution, if I exclude context switches, interrupts, etc.? What could cause the variability in the values? Could I use some other method?
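For reference, a self-contained wrapper around this sequence might look like the sketch below; the function name, the output constraint, and the clobber list are my additions and not part of the original post:

    #include <stdint.h>

    /* Minimal sketch (assumptions noted above): returns the measured
     * rdtscp-to-rdtscp delta in reference cycles. */
    static inline uint64_t measure_overhead(void)
    {
        uint64_t delta;
        __asm__ volatile(
            "rdtscp            \n\t"   /* timestamp in EDX:EAX, IA32_TSC_AUX in ECX */
            "rdtscp            \n\t"
            "shl $32, %%rdx    \n\t"   /* shift the upper bits left  */
            "or %%rdx, %%rax   \n\t"   /* 'or' in the lower bits     */
            "mov %%rax, %%r12  \n\t"   /* save the first timestamp   */
            "rdtscp            \n\t"
            "shl $32, %%rdx    \n\t"
            "or %%rdx, %%rax   \n\t"
            "sub %%r12, %%rax  \n\t"   /* elapsed reference cycles   */
            : "=a" (delta)
            :
            : "rcx", "rdx", "r12", "memory");
        return delta;
    }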
0 Kudos
16 Replies
Bernard
Valued Contributor I
2,017 Views

How many iterations does your loop have?

0 Kudos
Jochen_H_
Beginner
2,017 Views

I run my benchmark 100 times with 100,000 iterations each; however, I measure the time separately in every iteration (and that is what the asm code is for).

0 Kudos
Bernard
Valued Contributor I
2,017 Views

iliyapolak wrote:

How many iterations does your loop have?

I was not able to find the latency of the rdtscp instruction, but I did find the latency of rdtsc, which is 24 cycles. Judging by the large variation in the time measurements of your small loop, there is still a possibility of context switches or ISR routines polluting your results.

0 Kudos
Bernard
Valued Contributor I
2,017 Views

>>>I measure the time separately in every iteration (and that is what the asm code is for).>>>

If I understood it correctly, you perform a separate measurement for each of the 100,000 iterations. I think that in such a case the total latency of the measured code in each iteration can pollute the results, mainly because some uops can retire before the retirement of the rdtscp instruction.

Btw did you try to perform your test this way:

    for 0 to 99
        start_clock = rdtscp
        for 0 to 99999
            // loop your code
        end for
        stop_clock = rdtscp
    end for
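In C, the same structure might look roughly like this (a minimal sketch using the __rdtscp intrinsic; the loop counts come from the thread, everything else is an assumption):

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtscp intrinsic */

    /* Minimal sketch: time the whole inner loop once and report the average
     * cost per iteration, instead of timing each iteration individually. */
    int main(void)
    {
        unsigned int aux;
        for (int run = 0; run < 100; run++) {
            uint64_t start_clock = __rdtscp(&aux);
            for (volatile int i = 0; i < 100000; i++) {
                /* code under test goes here */
            }
            uint64_t stop_clock = __rdtscp(&aux);
            printf("run %d: %.2f cycles/iteration\n",
                   run, (double)(stop_clock - start_clock) / 100000.0);
        }
        return 0;
    }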

0 Kudos
Jochen_H_
Beginner
2,017 Views
As for the possibility of polluting the measurement by context switching, interrupts, etc.: this is true, but only for the non-transactional case. If I use rdtscp in a transaction, the transaction will be aborted on any such event, so I would not count those timestamps. If I'm not completely wrong, the transaction guarantees that those instructions are executed without interruption or delay (as no cache or memory is accessed).

I'm particularly interested in the typical and worst-case behaviour of the instructions I want to measure. So measuring my code in a loop will not work, as it would only give me average execution times, and it would not show the distribution of runtimes either. I want to measure each execution individually and was just surprised that the measurement of the overhead itself is not deterministic. With no cache- or memory-accessing code, how can I have such a wide range of runtimes, 20-148?

I also tried rdtsc instead of rdtscp: there I see, without transactions, values from 22 to 59981, with everything >110 being very unusual. With the measurement inside a transaction, I see 22-57 cycles, with peaks at 25, 30 and 55 cycles.
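For illustration, a transactional measurement of the kind described here might be sketched as follows (using the RTM intrinsics; the function name and structure are assumptions, not the poster's actual code):

    #include <stdint.h>
    #include <immintrin.h>   /* _xbegin/_xend (compile with -mrtm) */
    #include <x86intrin.h>   /* __rdtscp */

    /* Minimal sketch: keep only deltas from transactions that commit, so any
     * interrupt or context switch aborts the transaction and the sample is
     * simply discarded. */
    static inline int timed_in_transaction(uint64_t *delta)
    {
        unsigned int aux;
        if (_xbegin() == _XBEGIN_STARTED) {
            uint64_t t0 = __rdtscp(&aux);
            /* code under test would go here */
            uint64_t t1 = __rdtscp(&aux);
            _xend();
            *delta = t1 - t0;
            return 1;   /* transaction committed: sample is valid */
        }
        return 0;       /* transaction aborted: drop this sample  */
    }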
0 Kudos
Patrick_F_Intel1
Employee
2,017 Views

Hello Jochen,

There is a quite in-depth paper on this subject at http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf.

The paper develops a measurement methodology which (in its final version) has reliable minimum clockticks with very little variance.

Can you read this article and see if it doesn't answer all of your questions?

Pat

 

0 Kudos
McCalpinJohn
Honored Contributor III
2,017 Views

It is a bit tricky to determine the "latency" of the RDTSC and RDTSCP instructions on recent processors for a couple of reasons:

  • The TSC increments at the rate of the "nominal" processor frequency, not the instantaneous processor frequency.   This makes implementation trickier.  
    • Intel could either take the easy way out and simply increment the counter by the base ratio once every tick of the 100 MHz reference clock, or
    • they could use that as a start and then modify the value by a fraction of the base ratio using the number of core clocks elapsed since the last update (along with the ratio of the current core clock to the base core clock).  This latter approach is likely to be very ugly and hard to do fast.
    • The results below don't really match either of these ideas -- the implementation must be more complex?
  • Consecutive calls to the TSC should not return the same value.  So even if the value is available instantaneously, subsequent calls would need to be delayed until the value increments again.  This may be only once per 10 ns reference cycle (with the TSC jumping by the base ratio each time), but a more complex internal algorithm could provide intermediate values.
  • Storing the results (or the deltas) can be slow unless you pre-fill the output array in the L1 cache and limit the number of updates to that L1-containable range.
  • Be careful of the overhead of the shifts and subtracts and loops.
    • I originally unrolled my loops to minimize overheads, but then realized that different iterations showed different overheads -- so I switched back to a single iteration per loop so I would always have the same (very small) overhead for compare and branch.
    • I was amused to see that the gcc compiler removed my "shift, xor, store 64 bits" code and replaced it with two 32-bit stores of the original output registers of the RDTSC instruction (the two alternatives are sketched below).
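For concreteness, the two storage strategies might be written like this (a hypothetical illustration; array names and sizes are assumptions):

    #include <stdint.h>

    #define NSAMPLES 1024
    static uint64_t tsc64[NSAMPLES];        /* one 64-bit slot per sample  */
    static uint32_t tsc32[2 * NSAMPLES];    /* two 32-bit slots per sample */

    /* Combine EDX:EAX into a single 64-bit store (shift/or, one store). */
    static void store_combined(int i, uint32_t lo, uint32_t hi)
    {
        tsc64[i] = ((uint64_t)hi << 32) | lo;
    }

    /* Store the two 32-bit halves separately, which is what the compiler
     * reduced the shift/or/store code to. */
    static void store_split(int i, uint32_t lo, uint32_t hi)
    {
        tsc32[2 * i]     = lo;
        tsc32[2 * i + 1] = hi;
    }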

On my Xeon E5-2680 (Sandy Bridge EP) systems I typically see a minimum delta for consecutive RDTSC invocations that corresponds to the same elapsed time as "base multiplier + 1" *core* cycles.   Examples:

  • When I am running at exactly the nominal 2.7 GHz (core ratio = 27), I see a minimum RDTSC delta of 28.  This delta occurred 99 times out of 100 in each of the 5 tests that I ran.
  • At the lowest supported frequency of 1.2 GHz, the minimum RDTSC delta was 63, which also corresponds to 28 core clock cycles.  This delta also occurred 99 times in each of the 5 tests I ran.
  • At some of the intermediate frequencies, the distributions included multiple frequent values, suggesting that the hardware is doing "intermediate" updates to the TSC (i.e., not just incrementing it by 27 every 10 ns).  For example at 2.4 GHz (where 28 core cycles is 31.5 reference cycles), the five tests showed deltas of:
    • 63 (99 times)
    • 31 (82 times), 34 (17 times)
    • 63 (99 times)
    • 31 (82 times), 34 (17 times)
    • 31 (82 times), 34 (17 times)
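The conversion used in these examples is simply a rescaling from TSC ticks (which advance at the nominal rate) to core cycles; a small sketch, with the function name my own:

    #include <stdint.h>

    /* TSC ticks advance at the nominal frequency, so a measured delta maps to
     * core cycles as delta * core_GHz / nominal_GHz.  For example,
     * 63 * 1.2 / 2.7 = 28 core cycles, and 28 core cycles at 2.4 GHz is
     * 28 * 2.7 / 2.4 = 31.5 TSC ticks. */
    static double tsc_to_core_cycles(uint64_t tsc_delta,
                                     double core_ghz, double nominal_ghz)
    {
        return (double)tsc_delta * core_ghz / nominal_ghz;
    }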

RDTSCP always measures a bit slower than RDTSC.  This might be due to the ordering rules or it might be due to the need to load IA32_TSC_AUX into %ecx.  The results are only clean at the nominal 2.7 GHz frequency, where RDTSCP takes 4 extra cycles when I only store the low-order 32 bits of the TSC to memory.  Storing the other 32-bit TSC register takes 4 more cycles, and combining the results to perform a single 64-bit store takes 4 more cycles.  The difference in reference cycles gets bigger at lower core frequencies, and is a bit smaller at elevated (Turbo) frequencies.

Xeon Phi has a very fast RDTSC -- minimum granularities are 6-10 cycles (depending on how you choose to store the output).  Note that this is *not* a "constant-rate" TSC -- it increments at the instantaneous core clock rate (which is obviously going to be a lot easier to implement!)  The Xeon Phi SE10P Coprocessors at TACC are configured for fixed-frequency operation, so this does not cause a problem here.

0 Kudos
McCalpinJohn
Honored Contributor III
2,017 Views

After looking at the results some more, it looks like the RDTSC instruction probably takes (approximately) a fixed number of core cycles -- it is just an accident that the number happens to be close to the multiplier ratio on my systems.

An algorithm that might work would be:

  1. Save the "old" TSC (TSC_PREV) in an internal (hidden) register at each p-state (frequency) change
  2. Start a free-running core clock counter (CoreCLK) at each p-state (frequency) change
  3. Compute the "delta TSC" as: Delta_TSC = (CoreCLK * CurrentMultiplier) / ReferenceMultiplier
  4. Compute and return the "new" TSC as "TSC_PREV + Delta_TSC"

There are not enough cycles to do this with the full integer divide apparatus (which can take over 90 cycles for 64-bit operands), but there are lots of tricks that can be done to reduce the complexity given the small range of sizes of the CurrentMultiplier and ReferenceMultiplier (all plausible values can fit in 6 bits).   The answer does not need to be exact -- it needs to guarantee monotonicity and it should probably be within a few cycles of the correct answer (otherwise computed differences will have scatter).  
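As a rough illustration of steps 1-4, one might model the update in C as below; the names, and my reading of which multiplier scales which (nominal ratio over current ratio), are assumptions, not a description of the actual hardware:

    #include <stdint.h>

    /* Hypothetical model of the TSC update sketched above.  core_clk is the
     * free-running core-cycle count since the last p-state change, tsc_prev
     * the TSC saved at that change; nominal_ratio and current_ratio are the
     * core-clock multipliers relative to the 100 MHz reference clock (e.g.
     * 27 and 12).  Real hardware would have to replace the divide with
     * something much cheaper. */
    static uint64_t model_rdtsc(uint64_t tsc_prev, uint64_t core_clk,
                                uint32_t nominal_ratio, uint32_t current_ratio)
    {
        uint64_t delta_tsc = (core_clk * nominal_ratio) / current_ratio;
        return tsc_prev + delta_tsc;
    }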

0 Kudos
Bernard
Valued Contributor I
2,017 Views

@John

If, for example, the exact algorithm for calculating the RDTSC value looks like the one in your post #9, I wonder how CPU designers could use integer division at the micro-op level, even for the small range of values of CurrentMultiplier and ReferenceMultiplier, and still achieve an RDTSC instruction latency of 24 cycles?

0 Kudos
McCalpinJohn
Honored Contributor III
2,017 Views

I admit that I don't know any fast divide algorithms that could use the limited size of the denominator to rescale the core clock frequency to a reference clock frequency in the ~20 cycles or so available, but I am not an expert in these kinds of tricks, so I was just guessing that it might be possible.

There are lots of other possible implementation approaches for RDTSC.  We can use performance characteristics to make general comments about whether various algorithms seem plausible or not, but I don't expect that a definitive answer is possible from this sort of external "black box" evaluation.

My main point was that the choice to define the TSC in terms of reference cycles rather than core cycles means that an extremely high-speed implementation (e.g., <10 cycles) is harder.   

There are a couple of units in each processor that run at a fixed frequency that could be used to implement a counter that increments at a fixed rate -- e.g., the Power Control Unit (PCU) in Xeon E5-26xx runs at 800 MHz and the DDR clock is not changed once the system has completed DRAM configuration.   It would be fairly easy to build a free-running incrementer in the PCU that would add 27 increments every 10 ns (8 clock cycles). 

For example it could add [3,3,4,3,3,4,3,4] to the counter in each PCU cycle.  A core that wanted to read the TSC would issue a special uncached load transaction to get this value from the "global" TSC in the PCU and then add its private TSC offset to get the final value.  If you skipped the L1 & L2 cache lookups and sent the load straight to the uncore ring, the latency should be at least ~6 cycles lower than the L3 latency, so the 28 core cycle latency that I am observing seems quite reasonable.   This mechanism is also consistent with my observation that TSC deltas gathered in a single experiment never include consecutive values -- if the TSC deltas are different, they are different by at least 2 cycles.
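A toy model of that incrementer, with names of my own choosing, just to show that the pattern works out to 27 ticks per 10 ns:

    #include <stdint.h>

    /* Hypothetical model of the PCU-based incrementer described above: at
     * 800 MHz, adding this pattern once per PCU cycle accumulates
     * 3+3+4+3+3+4+3+4 = 27 TSC ticks every 8 cycles (10 ns), i.e. 2.7 GHz. */
    static const uint32_t increments[8] = { 3, 3, 4, 3, 3, 4, 3, 4 };

    static uint64_t global_tsc;                   /* "global" TSC kept in the PCU */

    static void pcu_tick(uint64_t pcu_cycle)
    {
        global_tsc += increments[pcu_cycle & 7];  /* one add per 800 MHz cycle */
    }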

But this is still just a guess -- there are many other possible implementation mechanisms.

None of this explains why the original poster is seeing repeated values of 25, 30, and 55 cycles -- or why some of my experiments showed 31, 34, and 63 cycles.   Noting that the largest values in both sets of observations are a bit over 2x the smallest values might suggest something about the implementation, but there are probably too many possible implementations for this to be definitive.    In my tests I only accumulate latencies in bins of up to 99 cycles, so I don't have any data on larger values.   It is hard to think of mechanisms that would result in a 148 cycle latency, for example.  Anything disturbing the processor would likely require a lot more than the extra ~100 cycles, and it is hard to think of mechanisms that might slow down an uncore access by ~100 cycles.   On the other hand there are a lot of mechanisms in these machines that no one outside the appropriate subset of the engineering team knows anything about.   As long as they are rare they don't seriously impact average performance, and users just need to be aware that very little in these boxes is absolutely guaranteed to have performance repeatability.

0 Kudos
TimP
Honored Contributor III
2,017 Views

When I use rdtsc I store the floating-point inverse of the clock rate once in a static variable (assuming a fixed rate) so as not to incur the latency of a divide when scaling each measurement.   I do get occasional outliers where a time interval suffers a large extra delay, but the results are much more consistent on my old MIC than on my corei7-4 laptop.
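A minimal sketch of that scaling trick (the names and the example rate are mine, not TimP's code):

    #include <stdint.h>

    /* Compute the reciprocal of the (fixed) TSC rate once, then convert each
     * delta with a multiply instead of a divide. */
    static double tsc_period;                /* seconds per TSC tick, set once */

    static void init_tsc_period(double tsc_hz)
    {
        tsc_period = 1.0 / tsc_hz;           /* e.g. tsc_hz = 2.7e9 */
    }

    static double tsc_delta_to_seconds(uint64_t delta)
    {
        return (double)delta * tsc_period;
    }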

As hinted above, tsc can't be expected to resolve intervals shorter than its update granularity, hence the bunching around values related to the clock multiplier.  I thought "invariant" tsc meant that the clock multiplier used to produce tsc is the nominal one built into the CPU, even if you have enabled dynamic variation of the actual clock multiplier (on the models which support that).  As long as the core clock rate is fixed, tsc counts should be related linearly to elapsed wall-clock time.

0 Kudos
McCalpinJohn
Honored Contributor III
2,017 Views

The "invariant" (or "constant rate") TSC does increment at the rate of the nominal CPU core clock.  The minimum delta between values returned by RDTSC shows some confusing behavior, but the fact that it is confusing is not much of a surprise after you spend some time thinking about how one might implement such a behavior given a dynamically varying core clock.  One could imagine an implementation that showed much lower overhead when running at the nominal CPU frequency (just take the free-running cycle count and add it to the base TSC value for the current timing epoch), but it is probably less confusing to have the latency be the same (in core cycles) at any frequency.

Most of the descriptions of the TSC refer to the external reference clock (100 MHz on recent processors) and the nominal CPU core clock multiplier (e.g., 27 on the Xeon E5-2680).   There is a performance counter event (Event 3CH, Umask 01H) that counts external reference clock cycles on most Intel processors but which counts external reference clock cycles times the nominal CPU core clock multiplier (i.e., the same rate as the TSC) on some processors (notably Nehalem & Westmere).  Chapter 19 of Vol 3 of the SW developer's manual has been updated to list the actual behavior of each processor model.   On the systems where this counts at the scaled rate (like the TSC) I have not checked to see if user-mode RDPMC on this event shows the same characteristics as we have been discussing for the RDTSC and RDTSCP instructions.

0 Kudos
Bernard
Valued Contributor I
2,017 Views

>>>It is hard to think of mechanisms that would result in a 148 cycle latency, for example.  Anything disturbing the processor would likely require a lot more than the extra ~100 cycles, and it is hard to think of mechanisms that might slow down an uncore access by ~100 cycles.   On the other hand there are a lot of mechanisms in these machines that no one outside the appropriate subset of the engineering team knows anything about.   As long as they are rare they don't seriously impact average performance, and users just need to be aware that very little in these boxes is absolutely guaranteed to have performance repeatability.>>>

I completely agree with you.

By looking at the original poster's results, I was thinking about the possibility that the varying RDTSC measurements could be due to parallel execution of the uop(s) RDTSC is converted into and the uops representing the measured machine code. For example, if the measured code is broken down into uops which use ALU components like the adder unit, and the RDTSC implementation at the same time needs to increment the counter by adding 27 increments, there could be some kind of stall where one uop must wait for the completion of the other. It still does not explain why there is such variation when the same code is measured over N loop iterations.

0 Kudos
Patrick_F_Intel1
Employee
2,017 Views

I've attached my rdtscp timing code, based on the paper above; see the uploaded file rdtsc_mod2.c. The code runs as a kernel module; the Makefile is uploaded as Makefile.txt. It disables preemption and interrupts, but it still gets interrupted by something.

It runs 1000 outer loops and 100,000 inner loops, calculates the time in cycles, accumulates the times in an array for each test, and then calculates variances. The minimum time for rdtscp (that is, the latency) can still vary wildly, from 36 to 150 cycles.

To control the variation in the minimum time of rdtscp, I have to fix the CPU frequency. I uploaded my set_speed.sh script as set_speed.sh.txt. Although the script lists lots of frequencies as available, only for some of them could I set the min and max frequency to the same value.
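A rough outline of the loop structure described above (names, counts, and details are assumptions on my part, not the actual contents of rdtsc_mod2.c):

    #include <linux/kernel.h>
    #include <linux/types.h>
    #include <linux/preempt.h>
    #include <linux/irqflags.h>

    static inline u64 rdtscp64(void)
    {
        u32 lo, hi, aux;
        asm volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
        return ((u64)hi << 32) | lo;
    }

    static void run_test(void)
    {
        unsigned long flags;
        int i, j;

        for (i = 0; i < 1000; i++) {            /* outer loops */
            u64 min_delta = ~0ULL;

            preempt_disable();
            local_irq_save(flags);
            for (j = 0; j < 100000; j++) {      /* inner loops */
                u64 t0 = rdtscp64();
                /* 'call the function to measure here' */
                u64 t1 = rdtscp64();
                if (t1 - t0 < min_delta)
                    min_delta = t1 - t0;
            }
            local_irq_restore(flags);
            preempt_enable();

            pr_info("outer %d: min %llu cycles\n", i, min_delta);
        }
    }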

On my ivybridge laptop, once I controlled the frequency, I found that the rdtscp latency was:

120 cycles at 1.0 Ghz

75 cycles at 1.6 Ghz

60 cycles at 1.9 Ghz

36-37 cycles at 2.0+ Ghz

The code demonstrates how to get a reliable minimum latency for rdtscp (or for any other function introduced at the 'call the function to measure here' comment in rdtsc_mod2.c). The output shows how many times (for each outer loop) you didn't get the minimum latency. It also shows the variance.

To compile the module just do 'make' (after renaming Makefile.txt to Makefile).

To run the test:

insmod ./rdtsc_mod2.ko

#wait for cmd prompt to return...a minute or so... takes more time at lower freqs

rmmod ./rdtsc_mod2.ko

dmesg > tmp.txt

Search backwards from the bottom of the file for 'hello'. This marks the top of the last time you ran.

I've attached an output file for the case where I've set the cpu frequency to 2.0 Ghz.

Pat

0 Kudos
Bernard
Valued Contributor I
2,017 Views

>>>120 cycles at 1.0 Ghz

75 cycles at 1.6 Ghz

60 cycles at 1.9 Ghz

36-37 cycles at 2.0+ Ghz>>>

What is the latency when you measure some code?

 

0 Kudos
McCalpinJohn
Honored Contributor III
2,017 Views

FYI, all my tests were done on systems with fixed core frequencies so I would not see any RDTSCP timing effects due to frequency changes in my results.

0 Kudos
Reply