Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

RDTSCP & cache misses

George_P_2
Beginner

Hi all,

I would like to benchmark some parts of my application using RDTSCP counters. My only concern with RDTSCP is whether it takes L1/L2/LLC misses into account. The plan is to use RDTSCP to benchmark how long it takes to service requests/tasks. I pin my threads, and interrupts are handled on a separate core.

Thanks in advance,

George

12 Replies
George_P_2
Beginner

Forgot to mention that the software runs on Ivy Bridge.

Patrick_F_Intel1
Employee

Hello George,

I'm not sure what you mean by 'does RDTSCP take into account L1/L2/L3 misses'. The rdtsc and rdtscp instructions are independent of misses. The instructions just return, for rdtsc, the TimeStampCounter (TSC) and, for rdtscp, the TSC plus MSR_TSC_AUX (an MSR whose value indicates which CPU you were running on when you executed the rdtscp instruction).

So rdtscp reads two MSRs (which adds overhead), and it also waits for all instructions issued prior to it to finish before it completes. So there is more overhead with rdtscp compared to rdtsc.

If you've pinned your threads to a particular cpu, I would use rdtsc rather than rdtscp if you are trying to time sections of code.

Are you asking if rdtscp will wait for the previously issued loads and stores to complete before the rdtscp returns? I believe the answer is yes... but it may not accomplish what you are trying to do.

From the SDM volume 2 description of rdtscp:

The RDTSCP instruction waits until all previous instructions have been executed before reading the counter.
However, subsequent instructions may begin execution before the read operation is performed.

So the rdtscp will wait for already started misses to complete but, while the rdtscp is waiting, subsequent instructions may start (perhaps issuing other misses).

What are you trying to accomplish? If you are trying to write a cache/memory latency checker, there are examples of how to do this around.

Pat

George_P_2
Beginner

Hi Pat,

Thanks for your feedback.

I am aware that rdtscp is slower than rdtsc. I want to use it because it avoids reordering.

uint64_t start = getRDTSCP();
operationA();
operationB();
uint64_t end = getRDTSCP();
uint64_t latency = end - start;

Out-of-order execution can happen inside opA/opB, but the 'end' reading doesn't move ahead of them....
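For reference, getRDTSCP() above is just a thin wrapper; a minimal sketch of how it could be implemented with the __rdtscp intrinsic from <x86intrin.h> (assuming GCC or Clang on x86-64; the helper name is simply the one used in the snippet):

#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp intrinsic (GCC/Clang) */

/* Read the TSC via rdtscp; the aux value (IA32_TSC_AUX) is read but discarded here. */
static inline uint64_t getRDTSCP(void)
{
    unsigned int aux;
    return __rdtscp(&aux);
}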

I am trying to measure the latency of callbacks/tasks in my system. The callbacks are not just functions doing calculations, i.e. not purely CPU bound; they also read from memory-mapped files/shared memory and do socket writes/reads, etc., so there will definitely be a few L2 and potentially LLC misses.

One option is to use clock_gettime/gettimeofday and extract the microseconds. The other, I suppose, is to use something like rdtscp, but I am not sure what happens if my code has a cache miss and the CPU pipeline is empty or dependent on the missing data, so the CPU can't execute any instruction until the missing data arrives in the L1 cache. Does rdtscp give reliable results?

Thanks,

George

McCalpinJohn
Honored Contributor III

Unless you are trying to time extremely small sections of code (less than a few hundred cycles), I have not had any trouble with the reordering allowed by RDTSC or the weaker reordering allowed by RDTSCP.  

Overheads for each of these are in the range of 30-40 cycles, depending on the processor and the exact code used to save the counter values.  The lowest overhead is obtained with inline assembler that only saves the low-order 32 bits of the TSC.  If you are saving a sequence of TSC values in an array, it also helps to make sure that the array is in the L1 cache in a modified state before use.
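For illustration, a minimal sketch of that kind of low-overhead reader (an assumption about what such inline assembler might look like, not the exact code referred to above), keeping only the low-order 32 bits of the TSC:

#include <stdint.h>

/* rdtsc returns the low 32 bits of the TSC in EAX and the high 32 bits in EDX.
   Saving only EAX avoids the shift/or needed to assemble the full 64-bit value. */
static inline uint32_t rdtsc_lo32(void)
{
    uint32_t lo;
    __asm__ __volatile__("rdtsc" : "=a"(lo) : : "edx");
    return lo;
}

Note that a 32-bit TSC value wraps after a second or two at typical TSC rates, so this is only usable for timing short intervals.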
George_P_2
Beginner

Hi John,

The section of code I am trying to measure, depending on the type of task, can take from a few hundred cycles to tens of thousands of cycles.

The majority of cases will be on the thousands side. As I mentioned in my previous post, I am not interested in micro-benchmarking but in creating a small benchmark class for latency statistics, to measure how long it takes a particular request/task to complete in microseconds.

Thanks,

George

McCalpinJohn
Honored Contributor III

I think that RDTSCP is likely the best approach to this measurement scenario.   The overhead is low, and the TSC increments all the time in recent processors.  The primary difficulty is programmatically determining the rate at which the TSC increments.

It is unfortunate that there is no user-mode hardware instruction to obtain either the reference clock rate (or period) or the base multiplier used by the TSC.    For recent systems (at least those that are not overclocked), the reference clock rate is 100 MHz (so the reference clock period is 10 ns).  

The base multiplier is typically obtained using something horrific like:

expr `grep "^cpu MHz" /proc/cpuinfo | head -1 | awk '{printf "%.0f\n", $4}'` / 100

The TSC multiplier is available in bits 15:8 of MSR_PLATFORM_INFO (0xCE), but reading that typically requires root access, which is inconvenient (at best -- impossible at worst).     This would be something useful to put into the user-mode CPUID instruction (hint, hint).
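Where root access is available, one way to read those bits on Linux is through the msr driver -- a sketch, assuming the msr module is loaded and /dev/cpu/0/msr is readable:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* MSR_PLATFORM_INFO is MSR 0xCE; bits 15:8 hold the Maximum Non-Turbo Ratio. */
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t val;
    if (pread(fd, &val, sizeof(val), 0xCE) != sizeof(val)) {
        perror("pread"); return 1;
    }
    printf("Maximum Non-Turbo Ratio: %u\n", (unsigned)((val >> 8) & 0xFF));
    return 0;
}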

George_P_2
Beginner

Thanks John.

My main concern was whether the TSC keeps increasing no matter what.  Power saving is disabled; we run at full speed all the time, in the C0 state.

I assume that is the case for Ivy Bridge...

McCalpinJohn
Honored Contributor III

From Section 17.14.1 of Volume 3 of the Intel SW Developer's Manual:

The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor’s support for invariant TSC is indicated by CPUID.80000007H:EDX[8].

The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward.

On Linux systems this feature is reported as the combination of "constant_tsc" and "nonstop_tsc" in the "flags" field from /proc/cpuinfo.
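A minimal sketch of that CPUID check, using the <cpuid.h> helper shipped with GCC/Clang:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    /* CPUID.80000007H:EDX[8] = invariant TSC; __get_cpuid returns 0 if the leaf is unsupported. */
    if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx) && (edx & (1u << 8)))
        printf("invariant TSC supported\n");
    else
        printf("invariant TSC not supported\n");
    return 0;
}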

It has been a while since an Intel processor (except for the Xeon Phi) did not support the "invariant TSC" feature.

Travis_D_
New Contributor II

McCalpin, John wrote:

I think that RDTSCP is likely the best approach to this measurement scenario.   The overhead is low, the TSC increments all the time in recent processors.  The primary difficulty is programmatically determining the rate at which the TSC increments.

It is unfortunate that there is no user-mode hardware instruction to obtain either the reference clock rate (or period) or the base multiplier used by the TSC.    For recent systems (at least those that are not overclocked), the reference clock rate is 100 MHz (so the reference clock period is 10 ns).  

The base multiplier is typically obtained using something horrific like:

expr `grep "^cpu MHz" /proc/cpuinfo | head -1 | awk '{printf "%.0f\n", $4}'` / 100

The TSC multiplier is available in bits 15:8 of MSR_PLATFORM_INFO (0xCE), but reading that typically requires root access, which is inconvenient (at best -- impossible at worst).     This would be something useful to put into the user-mode CPUID instruction (hint, hint).

One reasonable approach is to do a one-time calibration by comparing tsc readings against a system clock such as clock_gettime(). Of course, you should throw out outliers and so on, but I have found this approach can be very accurate. Depending on exactly what clock you calibrate against, in some cases this can give you a better result than reading the theoretically correct value from the MSR, since true clock frequency might drift as much as 1% due to temperature fluctuations or even intentional frequency modulation, but a good calibration can detect this.
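A sketch of that kind of one-time calibration, assuming CLOCK_MONOTONIC_RAW as the reference clock and the __rdtsc intrinsic (a real version would repeat the measurement and discard outliers, as described above):

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <x86intrin.h>

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC_RAW, &t0);
    uint64_t tsc0 = __rdtsc();

    usleep(100000);                       /* ~100 ms calibration window */

    clock_gettime(CLOCK_MONOTONIC_RAW, &t1);
    uint64_t tsc1 = __rdtsc();

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("TSC frequency ~= %.1f MHz\n", (tsc1 - tsc0) / ns * 1e3);
    return 0;
}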

Another complication is that on Skylake client (but not Xeon aka SKX - see Dr. McCalpin's answer below), the TSC reference frequency is no longer based exactly on the nominal frequency (e.g., 100 MHz x 26 = 2600 MHz on an i7-6700HQ) but rather on a multiple of the 24 MHz crystal clock, which often doesn't give the same result (on that CPU you get 2592 MHz instead). So if you use the nominal frequency you'll often have an additional source of error. There is some more discussion in this answer and the comments.

McCalpinJohn
Honored Contributor III

Waking up a very old thread here with information on features that only apply to very new processors...

As is often the case, the best solution depends on what it is you are trying to accomplish.  On my Skylake Xeon processors (Xeon Platinum 8160), the Always Running Clock is 25 MHz (documented in 18.7.3 of the December 2017 revision of Volume 3 of the SWDM), and the CPUID-based ratio is 168/2=84.  25 MHz * 84 matches the 2.1 GHz that one would expect from the label, from the CPUID Brand String, and from the MSR_PLATFORM_INFO bits 15:8 (Maximum Non-Turbo Ratio) of 21 (decimal).
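For reference, the CPUID-based ratio mentioned above is presumably what CPUID leaf 0x15 reports (EBX/EAX = TSC-to-crystal-clock ratio, ECX = nominal crystal frequency in Hz when the processor enumerates it); a sketch of reading it:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    /* Leaf 0x15: EAX = ratio denominator, EBX = ratio numerator,
       ECX = nominal crystal clock in Hz (0 if not enumerated). */
    if (!__get_cpuid(0x15, &eax, &ebx, &ecx, &edx) || ebx == 0) {
        printf("TSC/crystal-clock ratio not enumerated\n");
        return 1;
    }
    printf("ratio = %u/%u", ebx, eax);
    if (ecx)
        printf(", TSC = %.1f MHz", (double)ecx * ebx / eax / 1e6);
    printf("\n");
    return 0;
}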

The programmable performance counter event CPU_CLK_THREAD_UNHALTED.REF_XCLK (Event 0x3c, Umask 0x01) increments by 84 when the 25 MHz Always Running Clock increments.  Previous processors have incremented the counter by the 100 MHz-based ratio when the 100 MHz clock increments (e.g., an increment of 26 once every 10 ns on a 2.6 GHz Xeon E5-2690 v3).   Although this is similar behavior, there is one critical difference -- on SKX, the overhead for reading the performance counter is much smaller than 84 cycles, so it is possible for repeated calls to this counter to return exactly the same value -- not a desirable property in a clock!   On a Xeon E5-2680 v4 (2.4 GHz), the deltas between repeated calls are typically 72 (3 "ticks" of the 100 MHz clock) or 96 (4 "ticks").   On the Xeon Platinum 8160, 64 consecutive executions of RDPMC programmed to CPU_CLK_THREAD_UNHALTED.REF_XCLK returned 23 deltas of 84 and 40 deltas of zero.

I can easily understand why the legacy p-state controls on SKX are configured for compatibility with the 100 MHz clock used in previous systems, but it is harder to understand why the new HWP controls are also programmed as if the base clock were 100 MHz.

In any case, I am very glad that the SKX systems use a 25 MHz clock instead of a 24 MHz clock -- it is hard enough to make sense of these systems already....

Travis_D_
New Contributor II

McCalpin, John wrote:

Although this is a similar behavior, there is one critical difference -- on SKX, the overhead for reading the performance counter is much smaller than 84 cycles, so it is possible for repeated calls to this counter to return exactly the same value -- not a desirable property in a clock!

Curious - you mention rdpmc of CPU_CLK_THREAD_UNHALTED.REF_XCLK, but does the same issue also apply to rdtsc and rdtscp? I think they are based on the same underlying clock, but perhaps rdtsc/rdtscp also add in a frequency-adjusted cycle count since the last ARC tick, and so are apparently more precise and won't return identical results on consecutive calls.

In any case, I am very glad that the SKX systems use a 25 MHz clock instead of a 24 MHz clock -- it is hard enough to make sense of these systems already....

For sure. It's too bad SKL uses the 24 MHz clock, since it is not a divisor of most popular nominal chip frequencies (2400, 3000, 3600 and 4200 MHz being notable exceptions), so the rdtsc (ARC) clock is always going to be slightly off with respect to things based on the 100 MHz BCLK or nominal frequencies. Not all tools are prepared to handle this, and so can sometimes give weird results like negative C-state residency.

I updated my comment above to clarify that the 24 MHz issue only applies to SKL, not SKX apparently (I'm curious what pattern future client and server parts will follow).

McCalpinJohn
Honored Contributor III

The programmable event CPU_CLK_THREAD_UNHALTED.REF_XCLK shows almost the same pattern as the fixed-function clock, except that it increments by 1 instead of by the 84 that I see on the Xeon Platinum 8160.

The number of times a single value is repeated before a change depends on the frequency of the processor and the extent of the extra code around the RDPMC instruction (e.g., concatenating the two 32-bit output registers and storing the results).   At high frequencies and low overhead (e.g., inline assembly followed by a save of the low-order 32-bits only), both the fixed-function and programmable reference cycle counters may return the same value up to 7-8 times before incrementing.
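For illustration, a minimal sketch of an RDPMC read that keeps only the low-order 32 bits (an assumption about the surrounding code; the counter must already be programmed and user-mode RDPMC must be permitted, e.g. CR4.PCE set):

#include <stdint.h>

/* Read performance counter 'idx' with rdpmc, keeping only EAX (low 32 bits).
   Programmable counters use their index directly; fixed-function counters
   use index | (1 << 30). */
static inline uint32_t rdpmc_lo32(uint32_t idx)
{
    uint32_t lo;
    __asm__ __volatile__("rdpmc" : "=a"(lo) : "c"(idx) : "edx");
    return lo;
}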
