
morca · ‎08-13-2018

Hello,

There is an intrinsic _rdtsc according to [1]. My questions are:

1- What is the unit of the output? It is an unsigned number. Is it nanoseconds? Clock cycles? ...
2- Why is there a form _rdtscp [2] that takes an address as an argument? I don't understand that. I want to get the timestamp. What is the purpose of supplying an address for that?

[1] https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=4067,602,4255&text=rdt
[2] https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=4067,602,4255,4256&text=rdt

James_C_Intel2 · ‎08-14-2018

Since these intrinsics are a thin veneer over the underlying machine instructions, you need to consult the Intel® 64 and IA-32 Architectures Software Developer Manuals. You will find the description of these instructions in Volume 2.

McCalpinJohn · ‎08-14-2018

The descriptions of the instructions are important, but they won't tell you what the operating system decides to put in the IA32_TSC_AUX register.

On Linux systems, the low-order 12 bits (bits 11:0) of the IA32_TSC_AUX register are set to the logical processor number, while the next 12 bits (bits 23:12) are set to the socket number. The hardware guarantees that the TSC and IA32_TSC_AUX register are read atomically, so that if the TSCs are not synchronized, you know which logical processor number you were running on when you executed the instruction. This is also a very easy way to check to see if the scheduler is moving a process (or thread) without requiring interaction with the OS.
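To make the role of the address argument concrete, here is a minimal sketch in C, assuming GCC or Clang on x86 (where `<x86intrin.h>` provides `__rdtscp`). The names `tsc_sample`, `decode_tsc_aux`, and `read_tsc_sample` are made up for this example, and the bit layout is the Linux convention described above, not something the hardware defines.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* __rdtscp on GCC/Clang; MSVC uses <intrin.h> instead */
#endif

/* Hypothetical helper type: one TSC reading plus the decoded IA32_TSC_AUX fields. */
typedef struct {
    uint64_t tsc;
    unsigned cpu;      /* bits 11:0  of IA32_TSC_AUX (Linux convention) */
    unsigned socket;   /* bits 23:12 of IA32_TSC_AUX (Linux convention) */
} tsc_sample;

/* Split a raw IA32_TSC_AUX value into the two 12-bit fields described above. */
static inline void decode_tsc_aux(uint32_t aux, unsigned *cpu, unsigned *socket)
{
    *cpu    = aux & 0xFFFu;
    *socket = (aux >> 12) & 0xFFFu;
}

#if defined(__x86_64__) || defined(__i386__)
/* The address passed to __rdtscp receives IA32_TSC_AUX, read together with the TSC. */
static inline tsc_sample read_tsc_sample(void)
{
    unsigned int aux = 0;
    tsc_sample s;
    s.tsc = __rdtscp(&aux);
    decode_tsc_aux(aux, &s.cpu, &s.socket);
    return s;
}
#endif
```

Calling `read_tsc_sample()` at the start and end of a region and comparing the two `cpu` fields is one way to notice that the scheduler moved the thread in between.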

morca · ‎08-25-2018

Thanks for the replies. I want to know how exactly the TSC register is updated. At every processor cycle? So, if all power savings are disabled and the CPU frequency is 3.2 GHz, then each cycle will be 0.312 ns. By calling __rdtsc() two times and finding the difference, we are able to measure the time. For example, if the diff value is 100, then the region of interest will be 31.2 ns.

Am I right?

I have seen some topics discussing that. However, the exact answer is not clear yet.

TimP · ‎08-26-2018

The TSC updates every bus cycle (once per multiplier number of CPU cycles).

morca · ‎08-27-2018

So, is the value of the bus cycle available? Where can I find the multiplier value?

McCalpinJohn · ‎08-27-2018

The TSC increments at the rate of the reference clock (i.e., the nominal processor frequency), independent of the actual core frequency.

From my measurements, it is not obvious that from the point of view of the core, the TSC is ever updated, except on demand.

The overhead of the RDTSC and RDTSCP instructions is high enough that it does not appear to be possible to understand exactly how it is updated, but it does not appear to update once per bus clock. It looks like it interpolates between increments of the reference clock, so that the values are always increasing, but by variable amounts. On a Xeon Platinum 8160 running at 3.7 GHz (nominal 2.1 GHz), repeated calls to RDTSC have a minimum TSC delta of 12, an average TSC delta of a bit over 14, and a maximum delta of 16. These values increase if the core frequency is lower (e.g., the minimum delta is 14 cycles at 3.5 GHz, and almost 50 cycles at 1.0 GHz), suggesting that the operation takes 20-24 core clocks. RDTSCP shows a minimum increment of 18 (TSC) cycles between consecutive calls on the same system (also at 3.7 GHz).
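A measurement of this kind could look roughly like the following C sketch. It is not the code behind the numbers above; `delta_stats`, `summarize_deltas`, and `collect_tsc` are names invented here, and the sketch assumes GCC or Clang on x86 for `__rdtsc`.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* __rdtsc on GCC/Clang */
#endif

/* Hypothetical summary of the differences between consecutive TSC readings. */
typedef struct {
    uint64_t min;
    uint64_t max;
    double   avg;
} delta_stats;

/* Summarize samples[i] - samples[i-1] for i = 1..n-1; n must be at least 2. */
static inline delta_stats summarize_deltas(const uint64_t *samples, size_t n)
{
    delta_stats s = { UINT64_MAX, 0, 0.0 };
    uint64_t total = 0;
    for (size_t i = 1; i < n; i++) {
        uint64_t d = samples[i] - samples[i - 1];
        if (d < s.min) s.min = d;
        if (d > s.max) s.max = d;
        total += d;
    }
    s.avg = (double)total / (double)(n - 1);
    return s;
}

#if defined(__x86_64__) || defined(__i386__)
/* Fill the buffer with back-to-back TSC readings, doing nothing else in between. */
static inline void collect_tsc(uint64_t *samples, size_t n)
{
    for (size_t i = 0; i < n; i++)
        samples[i] = __rdtsc();
}
#endif
```

On a real system the maximum will occasionally be much larger than the typical delta because of interrupts, so it usually makes sense to look at the minimum and the distribution rather than the raw maximum.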

Determining the frequency of the reference clock is a bit of a pain in user space. It is trivial to read the value from bits 15:8 of MSR 0xCE (MSR_PLATFORM_INFO), but this must be done in the kernel. There is a convoluted procedure to obtain the nominal frequency from the "Brand ID String" provided by the CPUID instruction.
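As a hedged illustration of both routes, the C sketch below only decodes values that have already been obtained; it does not read the MSR or execute CPUID. The function names are invented for this example, the 100 MHz bus clock passed to `nominal_hz_from_ratio` is an assumption that has to be checked for the processor in question, and the brand-string parser simply assumes the string ends in something like "@ 2.10GHz", which not every processor provides.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Extract bits 15:8 of an MSR_PLATFORM_INFO (0xCE) value that was read elsewhere. */
static inline unsigned platform_info_ratio(uint64_t msr_value)
{
    return (unsigned)((msr_value >> 8) & 0xFFu);
}

/* Nominal frequency = ratio * bus clock. The bus clock (e.g. 100e6) is an assumption. */
static inline double nominal_hz_from_ratio(unsigned ratio, double bclk_hz)
{
    return (double)ratio * bclk_hz;
}

/* Parse the number that precedes "GHz" or "MHz" in a CPUID brand string.
 * Returns the frequency in Hz, or 0.0 if no such suffix is found. */
static inline double brand_string_hz(const char *brand)
{
    static const struct { const char *suffix; double scale; } units[] = {
        { "GHz", 1e9 }, { "MHz", 1e6 },
    };
    for (size_t i = 0; i < sizeof units / sizeof units[0]; i++) {
        const char *end = strstr(brand, units[i].suffix);
        if (end == NULL)
            continue;
        /* Walk back over the digits and decimal point in front of the suffix. */
        const char *start = end;
        while (start > brand &&
               ((start[-1] >= '0' && start[-1] <= '9') || start[-1] == '.'))
            start--;
        if (start == end)
            continue;
        return strtod(start, NULL) * units[i].scale;
    }
    return 0.0;
}
```

For example, a ratio field of 21 with an assumed 100 MHz bus clock gives 2.1 GHz, which should agree with a brand string ending in "@ 2.10GHz".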

There are codes and notes at https://github.com/jdmccalpin/low-overhead-timers that may make some of this more clear... (or not)....

morca · ‎08-28-2018

I read this paragraph from section 17.15 in Volume 3 of the developers manual.

The specific processor configuration determines the behavior. Constant TSC behavior ensures that the duration of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core changes frequency. This is the architectural behavior moving forward.

So that means that when the processor is constantly running at 2 GHz, the TSC is incremented every 0.5 ns, i.e., the clock period. However, on real systems where the frequency changes, it depends on how long the CPU is clocked at each specific frequency.

So, there is no statement about the bus cycle or the reference clock or ... I think our terminology is not the same. What exactly do you mean by core/reference frequency? Do you mean that the maximum frequency written on the CPU box is the nominal frequency?

James_C_Intel2 · ‎08-28-2018

Thanks for the replies. I want to know how exactly the TSC register is updated. At every processor cycle? So, if all power savings are disabled and the CPU frequency is 3.2 GHz, then each cycle will be 0.312 ns. By calling __rdtsc() two times and finding the difference, we are able to measure the time. For example, if the diff value is 100, then the region of interest will be 31.2 ns.

Please read the fine manual. It has a lot of information about the instructions and three and a half pages of description in Volume 3B. Here's a slice from the description of rdtsc:

The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever
the processor is reset. See “Time Stamp Counter” in Chapter 17 of the Intel® 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3B, for specific details of the time stamp counter behavior.

morca · ‎08-28-2018

The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever
the processor is reset.

Yes I read that. So that confirms that the time slice to update the TSC is 1/freq with the current value of frequency.

McCalpinJohn · ‎08-28-2018

NO!!!

The TSC increments at the rate of the *nominal* clock frequency, not the *current* CPU frequency!
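In code, that means a TSC difference is converted to time using the nominal frequency only. The following is a minimal C sketch; `tsc_ticks_to_ns` and `tsc_elapsed_ns` are names invented here, and the nominal frequency has to be supplied by the caller.

```c
#include <stdint.h>

/* Convert a TSC tick count to nanoseconds using the nominal (not current) frequency. */
static inline double tsc_ticks_to_ns(uint64_t ticks, double nominal_hz)
{
    return (double)ticks * 1e9 / nominal_hz;
}

/* Elapsed time between two TSC readings taken with __rdtsc() or __rdtscp(). */
static inline double tsc_elapsed_ns(uint64_t start, uint64_t end, double nominal_hz)
{
    return tsc_ticks_to_ns(end - start, nominal_hz);
}
```

With a nominal frequency of 2.1 GHz, a TSC delta of 2100 corresponds to 1000 ns whether the core happened to be running at 1.0 GHz or 3.7 GHz during the interval.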

It is not at all clear *how* or *when* the TSC is updated, since the minimum overhead to read the TSC is at least 20 core clock cycles (on SKX -- it requires more cycles than this on earlier processors). Many implementations that would be consistent with the observed behavior are plausible.

The fixed-function counter "reference cycles not halted" is easier to understand. On processors before Skylake, the counter is incremented every 10 ns by an amount equal to the nominal frequency multiplier. E.g., on a 2.1 GHz Sandy Bridge or Haswell, this counter increments by 21 every 10 ns (while the processor is not halted). This means that any value read from this counter is always an integral multiple of 21 (something which is *not* true of TSC values). The corresponding programmable counter (Event 0x3C, Umask 0x01) increments by 1 every 10 ns, so you need to remember to multiply values by the nominal clock multiplier (21 in this example) to get numbers in the same units as the TSC uses.

For Skylake Xeon it is a bit different -- on a 2.1 GHz Xeon Platinum 8160, the fixed-function "reference cycles not halted" counter increments by 84 every 40 ns (while the processor is not halted). This is problematic -- when running at 3.7 GHz, I can read the fixed-function "reference cycles not halted" 6 or 7 times and obtain the same value each time (always exactly divisible by 84) before the counter eventually increments by 84. The corresponding programmable counter (Event 0x3C, Umask 0x01) increments by 1 every 40 ns, so you need to remember to multiply the values by (in this example) 84 to get numbers in the same units as the TSC uses.
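The two multipliers above are consistent with nominal frequency times tick period: 2.1 GHz x 10 ns = 21 and 2.1 GHz x 40 ns = 84. A small C sketch of that arithmetic follows; the helper names are invented for this example, and the tick period (10 ns or 40 ns) is an input you have to know for your processor generation.

```c
#include <stdint.h>

/* Increment per tick of the reference counter:
 * nominal frequency (MHz) * tick period (ns) / 1000. */
static inline uint64_t ref_multiplier(uint64_t nominal_mhz, uint64_t tick_ns)
{
    return nominal_mhz * tick_ns / 1000u;
}

/* Scale a count from the programmable event (Event 0x3C, Umask 0x01),
 * which increments by 1 per tick, into the same units as the TSC. */
static inline uint64_t ref_event_to_tsc_units(uint64_t event_count,
                                              uint64_t nominal_mhz,
                                              uint64_t tick_ns)
{
    return event_count * ref_multiplier(nominal_mhz, tick_ns);
}
```

For the 2.1 GHz examples, `ref_multiplier(2100, 10)` gives 21 and `ref_multiplier(2100, 40)` gives 84.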

morca · ‎08-28-2018

OK. The explanation now raises the question of how reliable reading the TSC is, and who benefits from it.

In other words, when should the TSC be used, and when not?

McCalpinJohn · ‎08-28-2018

The RDTSC and RDTSCP instructions provide low-overhead access to a counter that increments at a fixed frequency. They are very useful for timing relatively short pieces of code, especially in cases where you cannot be sure that the process being measured is pinned to a single logical processor (a requirement for the RDPMC instruction to be useful), and/or where you cannot be sure that the core is active for the entire measurement interval (a requirement for the fixed-function cycle counters to be useful).

I use these instructions pretty much every day....

A typical use case involves reading the TSC and all three of the fixed-function counters. From these I can compute lots of interesting things:

(Elapsed Reference Cycles Not Halted) / (Elapsed TSC cycles) = fraction of the time the core was active
(Elapsed Actual Cycles Not Halted) / (Elapsed Reference Cycles Not Halted) * Base Frequency = average frequency while not halted
(Elapsed Instructions Retired) / (Elapsed Actual Cycles Not Halted) = average instructions per cycle while not halted
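The three ratios above can be written out as a short C sketch. The struct and function names are invented for this example; the inputs are assumed to be end-minus-start differences of the TSC and the three fixed-function counters, collected by whatever means is available.

```c
#include <stdint.h>

/* Hypothetical container for counter differences over one measured interval. */
typedef struct {
    uint64_t tsc;           /* elapsed TSC cycles */
    uint64_t ref_cycles;    /* elapsed reference cycles not halted */
    uint64_t core_cycles;   /* elapsed actual cycles not halted */
    uint64_t instructions;  /* elapsed instructions retired */
} counter_deltas;

/* Fraction of the interval during which the core was active. */
static inline double active_fraction(const counter_deltas *d)
{
    return (double)d->ref_cycles / (double)d->tsc;
}

/* Average core frequency (Hz) while not halted, given the base frequency. */
static inline double avg_frequency_hz(const counter_deltas *d, double base_hz)
{
    return (double)d->core_cycles / (double)d->ref_cycles * base_hz;
}

/* Average instructions per cycle while not halted. */
static inline double avg_ipc(const counter_deltas *d)
{
    return (double)d->instructions / (double)d->core_cycles;
}
```

As a made-up example, deltas of 2,100,000 TSC cycles, 1,050,000 reference cycles, 1,850,000 core cycles, and 3,700,000 instructions on a 2.1 GHz part would mean the core was active half the time, ran at about 3.7 GHz while active, and retired 2 instructions per cycle.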

Intrinsic functions _rdtsc and _rdtscp