Intel® ISA Extensions

Intrinsic functions _rdtsc and _rdtscp

morca
Beginner
5,374 Views
Hello,

There is an intrinsic _rdtsc according to [1]. My questions are:

1- What is the unit of the output? It is an unsigned number. Is it nanoseconds? Clock cycles? ...
2- Why is there a form _rdtscp [2] that takes an address as an argument? I don't understand that. I want to get the timestamp. What is the purpose of supplying an address for that?

[1] https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=4067,602,4255&text=rdt
[2] https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=4067,602,4255,4256&text=rdt
12 Replies
James_C_Intel2
Employee

Since these intrinsics are a thin veneer over the underlying machine instructions, you need to consult the Intel® 64 and IA-32 Architectures Software Developer Manuals. You will find the description of these instructions in Volume 2.

McCalpinJohn
Honored Contributor III

The descriptions of the instructions are important, but they won't tell you what the operating system decides to put in the IA32_TSC_AUX register.

On Linux systems, the low-order 12 bits (bits 11:0) of the IA32_TSC_AUX register are set to the logical processor number, while the next 12 bits (bits 23:12) are set to the socket number.   The hardware guarantees that the TSC and IA32_TSC_AUX register are read atomically, so that if the TSCs are not synchronized, you know which logical processor number you were running on when you executed the instruction.   This is also a very easy way to check to see if the scheduler is moving a process (or thread) without requiring interaction with the OS.
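A minimal sketch of the decoding described above, assuming the Linux layout of IA32_TSC_AUX (other operating systems may store something else entirely). The type `tsc_aux_fields` and the helper names are made up for illustration; `__rdtscp` is the intrinsic from `<x86intrin.h>` as provided by GCC and Clang.

```c
#include <stdint.h>

/* Assumed Linux layout of IA32_TSC_AUX, per the post above:
   bits 11:0  = logical processor number,
   bits 23:12 = socket number. */
typedef struct {
    unsigned cpu;
    unsigned socket;
} tsc_aux_fields;

/* Split a raw IA32_TSC_AUX value into its two fields. */
static inline tsc_aux_fields decode_tsc_aux(uint32_t aux)
{
    tsc_aux_fields f;
    f.cpu = aux & 0xFFFu;
    f.socket = (aux >> 12) & 0xFFFu;
    return f;
}

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>

/* Read the TSC and, from the same instruction, the place it was read. */
static inline uint64_t read_tsc_and_location(tsc_aux_fields *where)
{
    unsigned int aux;
    uint64_t tsc = __rdtscp(&aux);  /* aux receives IA32_TSC_AUX */
    *where = decode_tsc_aux(aux);
    return tsc;
}
#endif
```

Calling `read_tsc_and_location` at the start and end of a region and comparing the two `cpu` fields is one way to notice that the scheduler moved the thread in between.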

morca
Beginner

Thanks for the replies. I want to know how exactly the TSC register is updated. At every processor cycle? If so, with all power savings disabled and a CPU frequency of 3.2 GHz, each cycle is 0.3125 ns. By calling __rdtsc() twice and taking the difference, we can measure time. For example, if the difference is 100, then the region of interest took 31.25 ns.

Am I right?

I have seen some topics discussing this, but the exact answer is still not clear.
TimP
Honored Contributor III

The TSC updates every bus cycle (once per multiplier number of CPU cycles).

morca
Beginner

So, is the bus cycle value available somewhere? Where can I find the multiplier value?
McCalpinJohn
Honored Contributor III

The TSC increments at the rate of the reference clock (i.e., the nominal processor frequency), independent of the actual core frequency.

From my measurements, it is not obvious that from the point of view of the core, the TSC is ever updated, except on demand.

The overhead of the RDTSC and RDTSCP instructions is high enough that it does not appear to be possible to understand exactly how it is updated, but it does not appear to update once per bus clock.    It looks like it interpolates between increments of the reference clock, so that the values are always increasing, but by variable amounts.   On a Xeon Platinum 8160 running at 3.7 GHz (nominal 2.1 GHz), repeated calls to RDTSC have a minimum TSC delta of 12, an average TSC delta of a bit over 14, and a maximum delta of 16.  These values increase if the core frequency is lower (e.g., the minimum delta is 14 cycles at 3.5 GHz, and almost 50 cycles at 1.0 GHz), suggesting that the operation takes 20-24 core clocks.   RDTSCP shows a minimum increment of 18 (TSC) cycles between consecutive calls on the same system (also at 3.7 GHz).
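The kind of measurement described above can be sketched roughly as follows. This is not the code behind the quoted numbers (that is in the repository linked further down); the names `delta_stats`, `consecutive_delta_stats` and `sample_tsc` are hypothetical, and storing each sample to memory adds some overhead of its own.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t min;
    uint64_t max;
    double mean;
} delta_stats;

/* Minimum, maximum and mean of the differences between consecutive
   samples. With fewer than two samples, min stays at UINT64_MAX. */
static inline delta_stats consecutive_delta_stats(const uint64_t *samples, size_t n)
{
    delta_stats s = { UINT64_MAX, 0, 0.0 };
    double sum = 0.0;
    for (size_t i = 1; i < n; i++) {
        uint64_t d = samples[i] - samples[i - 1];
        if (d < s.min) s.min = d;
        if (d > s.max) s.max = d;
        sum += (double)d;
    }
    if (n > 1) s.mean = sum / (double)(n - 1);
    return s;
}

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>

/* Fill the buffer with back-to-back TSC reads. */
static inline void sample_tsc(uint64_t *samples, size_t n)
{
    for (size_t i = 0; i < n; i++)
        samples[i] = __rdtsc();
}
#endif
```

Running `sample_tsc` on a pinned thread and passing the buffer to `consecutive_delta_stats` gives deltas in TSC ticks, which is the unit the minimum, average and maximum above are quoted in.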

Determining the frequency of the reference clock is a bit of a pain in user space.  It is trivial to read the value from bits 15:8 of MSR 0xCE (MSR_PLATFORM_INFO), but this must be done in the kernel.   There is a convoluted procedure to obtain the nominal frequency from the "Brand ID String" provided by the CPUID instruction.
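As a rough illustration of the MSR route, the sketch below only does the bit extraction and the multiplication; it assumes the raw 64-bit value of MSR 0xCE has already been obtained with suitable privileges. The helper names are hypothetical, and the bus clock is a parameter because its value (100 MHz on the processors discussed in this thread) is an assumption that does not hold for every generation.

```c
#include <stdint.h>

/* Bits 15:8 of MSR_PLATFORM_INFO (MSR 0xCE) hold the nominal
   frequency multiplier. */
static inline unsigned nominal_ratio_from_platform_info(uint64_t msr_value)
{
    return (unsigned)((msr_value >> 8) & 0xFFu);
}

/* Nominal (TSC) frequency in Hz. bus_clock_hz is an assumption the
   caller must supply, e.g. 100e6 for a 100 MHz reference clock. */
static inline double nominal_hz_from_ratio(unsigned ratio, double bus_clock_hz)
{
    return (double)ratio * bus_clock_hz;
}
```

For example, a multiplier field of 21 with a 100 MHz bus clock gives 2.1 GHz, matching the Xeon Platinum 8160 mentioned above.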

There are codes and notes at https://github.com/jdmccalpin/low-overhead-timers that may make some of this more clear... (or not)....

morca
Beginner

I read this paragraph from section 17.15 in Volume 3 of the developers manual.

The specific processor configuration determines the behavior. Constant TSC behavior ensures that the duration of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core changes frequency. This is the architectural behavior moving forward.

So, that means when the processor is constantly running at 2 GHz, the TSC is incremented every 0.5 ns, a.k.a. the clock period. However, on real systems where the frequency changes, it depends on the time epochs during which the CPU is clocked at a specific frequency.

So, there is no statement about bus cycle or reference clock or ... I think our terminology is not the same. What exactly do you mean by core/reference frequency? Do you mean that the maximum frequency written on the CPU box is the nominal frequency?
James_C_Intel2
Employee

morca wrote: "Thanks for the replies. I want to know how exactly the TSC register is updated. At every processor cycle? If so, with all power savings disabled and a CPU frequency of 3.2 GHz, each cycle is 0.3125 ns. By calling __rdtsc() twice and taking the difference, we can measure time. For example, if the difference is 100, then the region of interest took 31.25 ns."

Please read the fine manual. It has a lot of information about the instructions, and three and a half pages of description in Volume 3B. Here's a slice from the description of rdtsc:

"The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever the processor is reset. See “Time Stamp Counter” in Chapter 17 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B, for specific details of the time stamp counter behavior."
morca
Beginner

"The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever the processor is reset."

Yes, I read that. So that confirms that the interval between TSC updates is 1/freq, where freq is the current value of the frequency.
McCalpinJohn
Honored Contributor III

NO!!!

The TSC increments at the rate of the *nominal* clock frequency, not the *current* CPU frequency!

It is not at all clear *how* or *when* the TSC is updated, since the minimum overhead to read the TSC is at least 20 core clock cycles (on SKX -- it requires more cycles than this on earlier processors).   Many implementations that would be consistent with the observed behavior are plausible.
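To make the distinction concrete, here is a minimal sketch of the conversion from a TSC difference to nanoseconds. The function name is hypothetical; the point is only that the divisor must be the nominal (TSC) frequency, not whatever frequency the core happens to be running at.

```c
#include <stdint.h>

/* Convert a TSC tick count to nanoseconds.
   nominal_hz must be the nominal (TSC) frequency of the processor,
   not the current core frequency. */
static inline double tsc_ticks_to_ns(uint64_t ticks, double nominal_hz)
{
    return (double)ticks * 1e9 / nominal_hz;
}
```

On a part with a 2.1 GHz nominal frequency, a difference of 100 ticks is about 47.6 ns whether the core is running at 1.0 GHz or 3.7 GHz. The 31.25 ns figure from the earlier question is only right if 3.2 GHz is the nominal frequency of that processor.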

The fixed-function counter "reference cycles not halted" is easier to understand.  On processors before Skylake, the counter is incremented every 10 ns by an amount equal to the nominal frequency multiplier.  E.g., on a 2.1 GHz Sandy Bridge or Haswell, this counter increments by 21 every 10 ns (while the processor is not halted).   This means that any value read from this counter is always an integral multiple of 21 (something which is *not* true of TSC values).   The corresponding programmable counter (Event 0x3C, Umask 0x01) increments by 1 every 10 ns, so you need to remember to multiply values by the nominal clock multiplier (21 in this example) to get numbers in the same units as the TSC uses.

For Skylake Xeon it is a bit different -- on a 2.1 GHz Xeon Platinum 8160, the fixed-function "reference cycles not halted" counter increments by 84 every 40 ns (while the processor is not halted).  This is problematic -- when running at 3.7 GHz, I can read the fixed-function "reference cycles not halted" 6 or 7 times and obtain the same value each time (always exactly divisible by 84) before the counter eventually increments by 84. The corresponding programmable counter (Event 0x3C, Umask 0x01) increments by 1 every 40 ns, so you need to remember to multiply the values by (in this example) 84 to get numbers in the same units as the TSC uses.
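The scaling in the last two paragraphs can be written down as a small helper. This is a sketch with a hypothetical name; it assumes the TSC advances by the nominal multiplier every 10 ns (a 100 MHz reference clock), and the 10 ns and 40 ns tick periods are the ones reported above for those specific processors, not general constants.

```c
#include <stdint.h>

/* Convert a count from the programmable "reference cycles" event
   (Event 0x3C, Umask 0x01) into the units the TSC uses.
   nominal_ratio:  nominal frequency multiplier (21 for 2.1 GHz).
   tick_period_ns: how often the event increments by 1; reported above
                   as 10 before Skylake and 40 on Skylake Xeon.
   Assumes tick_period_ns is a multiple of 10. */
static inline uint64_t ref_events_to_tsc_units(uint64_t events,
                                               unsigned nominal_ratio,
                                               unsigned tick_period_ns)
{
    return events * nominal_ratio * (tick_period_ns / 10u);
}
```

With a multiplier of 21, this gives a scale factor of 21 for a 10 ns tick and 84 for a 40 ns tick, the same factors quoted in the two examples.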

morca
Beginner

OK. The explanation now raises a question: how reliable is reading the TSC, and who benefits from it?

In other words, when should one use the TSC, and when not?

McCalpinJohn
Honored Contributor III

The RDTSC and RDTSCP instructions provide low-overhead access to a counter that increments at a fixed frequency.   They are very useful for timing relatively short pieces of code, especially in cases where you cannot be sure that the process being measured is pinned to a single logical processor (a requirement for the RDPMC instruction to be useful), and/or where you cannot be sure that the core is active for the entire measurement interval (a requirement for the fixed-function cycle counters to be useful).

I use these instructions pretty much every day....

A typical use case involves reading the TSC and all three of the fixed-function counters.  From these I can compute lots of interesting things:

  • (Elapsed Reference Cycles Not Halted) / (Elapsed TSC cycles) = fraction of the time the core was active
  • (Elapsed Actual Cycles Not Halted) / (Elapsed Reference Cycles Not Halted) * Base Frequency = average frequency while not halted
  • (Elapsed Instructions Retired) / (Elapsed Actual Cycles Not Halted) = average instructions per cycle while not halted
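The three ratios in the list above could be computed along these lines. This is only a sketch: the struct and function names are hypothetical, the inputs are end-minus-start differences of the four counters, and reading the fixed-function counters themselves (for example with RDPMC) is not shown.

```c
#include <stdint.h>

/* Differences (end minus start) of the four counters over one interval. */
typedef struct {
    uint64_t tsc;              /* elapsed TSC cycles */
    uint64_t ref_not_halted;   /* elapsed reference cycles not halted */
    uint64_t core_not_halted;  /* elapsed actual cycles not halted */
    uint64_t instructions;     /* elapsed instructions retired */
} counter_deltas;

typedef struct {
    double active_fraction;  /* fraction of the time the core was active */
    double avg_hz;           /* average frequency while not halted */
    double ipc;              /* instructions per cycle while not halted */
} derived_metrics;

/* Assumes all three denominators are nonzero. base_hz is the nominal
   (base) frequency of the processor in Hz. */
static inline derived_metrics derive_metrics(counter_deltas d, double base_hz)
{
    derived_metrics m;
    m.active_fraction = (double)d.ref_not_halted / (double)d.tsc;
    m.avg_hz = (double)d.core_not_halted / (double)d.ref_not_halted * base_hz;
    m.ipc = (double)d.instructions / (double)d.core_not_halted;
    return m;
}
```

As a made-up example, deltas of 2100 TSC cycles, 1050 reference cycles, 1850 core cycles and 3700 instructions on a 2.1 GHz part work out to a core that was active half the time, averaged 3.7 GHz while active, and retired 2 instructions per cycle.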