Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

How do I read performance counters with the RDPMC instruction?

Sergey_S_Intel2
Employee

Modern CPUs have quite a lot of performance counters. How can I read them? I know of many performance monitoring and profiling programs and libraries (PAPI, VTune, Linux perf, etc.), but all of these methods require additional computation time (they are intrusive).

Also, modern Intel CPUs support the RDPMC instruction, and Linux currently allows this instruction to be executed at user level.

I would like to understand how to use GCC intrinsics (or inline assembly) to get the CPU cycles and instructions executed, in order to profile a function in my C code.

I understand that I have to pin program execution to a particular CPU core. Let's assume the CPU is Haswell.

I would appreciate a small example of RDPMC usage.

For example, the code might look like this:

#include <stdio.h>

void foo(void);   /* the function being profiled */

long long get_cycles() {
    unsigned int a = 0, d = 0;
    int ecx = (1 << 30) + 1;   // Which counter does this select?
    __asm __volatile("rdpmc" : "=a"(a), "=d"(d) : "c"(ecx));
    return ((long long)a) | (((long long)d) << 32);
}

int main(int argc, char* argv[])
{
    long long start, finish;
    int i;

    start = get_cycles();
    for (i = 0; i < 1000; i++) {
        foo();
    }
    finish = get_cycles();
    printf("Average cycles per call: %f\n", ((double)(finish - start)) / 1000.0);
    return 0;
}

What must the ecx variable in get_cycles() contain to select CPU cycles and instructions executed?

Thank you

1 Solution
McCalpinJohn
Honored Contributor III

The performance counters are complicated largely because the hardware is complicated, and secondarily because Intel does not want to expose microarchitectural implementation details without good reason.  (Patent trolls can be quite creative at re-interpreting the patents that they own to claim that a big company is violating the patents -- but they need to have some idea of how the processor is implemented to make these claims.)

Some aspects of hardware performance counters probably need to be restricted to elevated privilege levels.  For example, configuring the hardware performance counters to generate interrupts has the potential to severely impact system performance and usability.  On the other hand, most of what the performance counters do is perfectly safe -- the vendors do a very good job of ensuring that programming random bits into the performance counter control registers is "safe" -- you may not be able to interpret the results, but the processor runs just fine.

I prefer that the hardware performance counters remain as low-level features, and not as registers that get saved and restored on context switches.  (It would be hard to use the counters to measure things like context switch overhead if they were swapped in and out.)  But leaving the counters as "raw" low-level features means that they cannot easily be shared, and it means that they provide a potentially high-bandwidth covert channel between processes.

In the high performance computing world where I work, systems are seldom time-shared, so we don't really need to worry about either sharing the counters or about covert channels.  In the production environment a job is assigned a set of nodes and no other user is allowed access to those nodes for the duration of the job.  The nodes are still shared between the OS (and all its subsidiary processes) and the user (and all the auxiliary processes that the user might cause to be started), but since this is the standard mode of operation, dealing with this sharing is part of the performance puzzle that we are trying to understand.

To use the hardware performance counters manually, a variety of tools are needed:

  1. For the hardware performance counters in the processor cores, I build the "rdmsr" and "wrmsr" command-line tools from "msr-tools-1.2". 
    1. I use a script to configure the global configuration registers and the PERFEVTSEL* MSRs for the programmable core counters.
    2. For whole-program measurements, I read the counters using the "rdmsr" program before and after the execution (taking care that the run is short enough that the counters can't be incremented more than 2^48 times during the run).  You can also use "perf stat" for these sorts of measurements.
    3. For interval measurements inside the code, I program the counters using the script, then use the RDPMC instruction to read them at the desired locations in the code.
  2. For the "uncore" counters there are three different interfaces used, depending on the processor model:
    1. Some "uncore" counters use MSRs and can be configured using "wrmsr" as above.   Unfortunately these can only be read from inside the kernel (since the RDMSR instruction can only be executed at ring 0).   If the program is being run by root (or is owned by root and has the setuid bit set), then the program can open the /dev/cpu/*/msr device files and read or write the counters using pread() or pwrite() calls.   These are kernel calls so they cost a few thousand cycles each, but there is nothing that can be done about this.   (One thing that could help is to build a kernel module that could return multiple MSR values with a single call.)
    2. Some "uncore" counters are in "PCI configuration space".  The root user can read/write these counters using the "setpci" command-line program.  As with the MSR-based counters, a root user can open the device driver files (in /proc/bus/pci, I think) and read/write the counters using pread() and pwrite() commands (limited to 32-bit transactions).
    3. Some processors include "uncore" counters in a different range of memory-mapped IO space.  Working with these is an advanced topic....

Here is a fairly typical script that I use to set up the counters (edited for clarity):

#!/bin/bash

export NCORES=`grep -c '^processor' /proc/cpuinfo`
echo "Number of cores is $NCORES"
export MAXCORE=`expr $NCORES - 1`

# Enable all counters in IA32_PERF_GLOBAL_CTRL
#   bits 34:32 enable the three fixed-function counters
#   bits 7:0 enable the eight programmable counters
echo "Checking IA32_PERF_GLOBAL_CTRL on all cores"
echo "  (should be 00000007000000ff)"
for core in `seq 0 $MAXCORE`
do
	echo -n "$core "
	~/bin/rdmsr -p $core -x -0 0x38f
	~/bin/wrmsr -p $core 0x38f 0x00000007000000ff
done

# Core Performance Counter Event Select MSRs
#   Counter	 MSR
#	   0    0x186
#	   1    0x187
#	   2    0x188
#	   3    0x189
#	   4    0x18a
#	   5    0x18b
#	   6    0x18c
#	   7    0x18d

# Dump all performance counter event select registers on all cores
if [ 0 == 1 ]
then
	echo "Printing out all performance counter event select registers"
	echo "MSR    CORE    CurrentValue"
	for PMC_MSR in 186 187 188 189 18a 18b 18c 18d
	do
		for CORE in `seq 0 $MAXCORE`
		do
			echo -n "$PMC_MSR $CORE "
			~/bin/rdmsr -p $CORE -x -0 0x${PMC_MSR}
		done
	done
fi

# Counter 0 Uops Dispatched on Port 0		0x004301a1
# Counter 1 Uops Dispatched on Port 1		0x004302a1
# Counter 2 Uops Dispatched on Port 2		0x004304a1
# Counter 3 Uops Dispatched on Port 3		0x004308a1
# Counter 4 actual core cycles unhalted		0x0043003c
# Counter 5 Uops Dispatched on Port 5		0x004320a1
# Counter 6 cycles with no uops delivered from front end to
#   back end & there is no back end stall	0x0143019c
# Counter 7 Uops issued from RAT to RS		0x0043010e

echo "Programming counters 0,1,2,3"
for core in `seq 0 $MAXCORE`
do
	~/bin/wrmsr -p $core 0x186 0x004301a1
	~/bin/wrmsr -p $core 0x187 0x004302a1
	~/bin/wrmsr -p $core 0x188 0x004304a1
	~/bin/wrmsr -p $core 0x189 0x004308a1
	~/bin/wrmsr -p $core 0x18a 0x0043003c
	~/bin/wrmsr -p $core 0x18b 0x004320a1
	~/bin/wrmsr -p $core 0x18c 0x0143019c
	~/bin/wrmsr -p $core 0x18d 0x0043010e
done

33 Replies
Kumar_C_
Beginner
>> The fixed-function counters are independent of each other, so as long as they are enabled you can read any or all of them.  Again, I don't know how to do this with the perf_events interface.
 
I suppose the alternate interface you use is rdmsr/wrmsr; if so, I can't use it, because I do not have root privileges to access the MSRs.
 
Let me describe exactly what I am looking for. I am benchmarking MPI collectives with the aim of understanding whether there is variation across runs. For example, the following are the timings reported for 3 consecutive runs of a 1-process MPI_Barrier, measured with RDTSCP:

MPI_Barrier 1 1 0.0000105019
MPI_Barrier 2 1 0.0000945573
MPI_Barrier 3 1 0.0000133098
My aim is to understand why the high value of 0.00009455 occurred.

The corresponding PMU counter values (instructions and reference cycles, measured using ECX values (1 << 30) and (1 << 30) + 2):

cycles: 325,  inst: 196
cycles: 2288, inst: 201
cycles: 416,  inst: 201
I can correlate the higher RDTSCP value with the higher cycle count (2288); however, there is no corresponding increase in the instructions reported.

 

This puzzles me; I do not understand what is happening.

I used perf_events with the options

pe.exclude_kernel = 0;
pe.exclude_hv = 0;
pe.exclude_idle = 0;

so, ideally, it should account for all kernel events.

 

However, the perf_event_open man page says:

"PERF_COUNT_HW_INSTRUCTIONS
    Retired instructions. Be careful, these can be affected by various issues, most notably hardware interrupt counts."

 

Does that mean that this counter does not account for hardware interrupts?

 

If perf_events is not the right interface, can you please suggest which interface I should use?

 

My goal is to identify the source of event that is causing the 2288 cycle latency.

 

Thank you!

McCalpinJohn
Honored Contributor III

Applications running on multiple cores are going to show significant variability on any system, and it is not always possible to understand why specific cases ran slowly.  

I recommend getting a thorough understanding of the statistics before deciding whether the slow result is worth paying attention to.   Good statistics require at least several hundred measurements. 

Your two "fast" MPI barrier measurements are fairly close to the lower limit of what is theoretically possible on a 2-socket system.  The "slow" result is not very slow -- if any of the cores involved in the barrier take a timer interrupt, it is likely to take at least 2000 cycles.   This can be expected to happen once every millisecond with typical OS configurations.

 

Kumar_C_
Beginner

>> Your two "fast" MPI barrier measurements are fairly close to the lower limit of what is theoretically possible on a 2-socket system.  The "slow" result is not very slow -- if any of the cores involved in the barrier take a timer interrupt, it is likely to take at least 2000 cycles.   This can be expected to happen once every millisecond with typical OS configurations.

The results I provided before are for a 1-process MPI run. So, assuming no process migration (I am binding the process to a core), maybe the "slow" result is due to a timer interrupt on that single core. Yes, I did run the experiment for 5000 iterations, and just as a sample I provided the timings for 3 iterations where a "slow" timing occurred. I measured the number of instructions and the cycles (reference cycles) for each iteration, and the statistics are below:

 

1-process MPI_Barrier

Freq.   #instructions
   23   197
    7   198
  969   202
    1   203

Freq.   #cycles
    3   273
  116   286
  557   299
  147   312
   53   325
   50   338
   16   351
    9   364
   12   377
   10   390
    5   403
    4   416
    1   429
    3   442
    4   455
    2   468
    1   494
    2   533
    1   611
    1   689
    1   845
    1   910
    1   2288

 

If a timer interrupt is indeed the reason, it would be nice to have a way to verify this. Even if we assume that the 2288 cycles are due to a timer interrupt, it is not clear why there is such variation. Thank you!

McCalpinJohn
Honored Contributor III

I am not sure what an MPI_Barrier() call is supposed to do if there is only one MPI task? 

The "number of instructions" range is very tight -- that is good news.

The "number of cycles" range is also fairly tight -- 92.3% of the results are in the range of 286 to 338 cycles, and >80% of the results are within 5% of 300 cycles, which is not bad at all. 

The sum of all of the times is about 313,000 cycles, which would be about 0.125 milliseconds with a 2.5 GHz TSC clock.   For this aggregate execution time you would not expect any timer interrupts with the standard 1 millisecond timer period, but seeing one is not particularly surprising.  Seeing only 1 slow (>2000 cycle) iteration out of the 1000 results shown is consistent with a random system interrupt.

The small (26 of 1000) number of results in the range of 500-1000 cycles are a bit harder to understand -- they look too fast to include an OS interrupt, but this can be very difficult to analyze in detail.   They are not a major contributor to the average latency -- excluding all of the results that took 400 cycles or more only reduces the average time by 2.2% (from 313 cycles to 306 cycles).   

 

Kumar_C_
Beginner

The run was on an Intel KNL core, which has a 1.3 GHz TSC clock.

The small set of 26 out of 1000 is what puzzles me most; actually, I am interested in understanding what is causing them. Similar variability was observed even with a simple matrix-matrix multiplication code.

 

McCalpinJohn
Honored Contributor III

If the slow iterations are repeatable (in a statistical sense), then performance counters are the primary tool to work with.  

I would start by trying to rule out OS interference, then I would look for differences in cache behavior between the "typical" and "slow" cases.  It is not at all clear that KNL has enough performance counter events in the core+L1+L2 to be useful, and the limit of 2 counters per logical processor will make it very hard to be sure that the counts you are getting from different iterations are slow for the same reason(s).

I considered trying to use 4 threads bound to the 4 logical processors of one physical core to get access to all 8 counters, but Volume 1 of the Xeon Phi x200 Performance Monitoring Reference Manual (332972-001, section 1.2.1) says that "AnyThread" support is limited to the three architectural performance monitoring events provided by the fixed-function counters, so this approach is (apparently) useless for cache monitoring.

I have gotten some good results using the core's offcore response counters, but am still a bit confused about their scope -- some of the results I have seen are consistent with the counts being per-core, but most of the results I have seen are consistent with the counts being per-tile.

For data and/or coherence traffic outside the tile there are tons of counters, but I have not yet had a chance to evaluate them in any systematic fashion.

Zirak
Beginner

I am having difficulties running RDPMC. Could anyone provide code showing how to enable CR4.PCE so that RDPMC can be used, please?

Zirak
Beginner

I would appreciate any sample code showing how to serialize RDPMC.

 

McCalpinJohn
Honored Contributor III

Setting the CR4.PCE bit requires kernel code, which is going to be different for each operating system.  In Linux operating systems starting with kernel 3.4, this bit is set using the "set_in_cr4()" function in the kernel by code in $KERNEL/source/arch/x86/kernel/cpu/perf_event.c

static int
x86_pmu_notifier(struct notifier_block *self, unsigned long action, void *hcpu)
{
    // ... //

    case CPU_STARTING:
        if (x86_pmu.attr_rdpmc)
            set_in_cr4(X86_CR4_PCE);
    // ... //
}

"Serializing" RDPMC can mean a number of different things, depending on whether you want to enforce ordering on all instructions or on a class of instructions (such as memory references).  A common technique to force full serialization is to execute the CPUID instruction before the RDPMC instruction.  (You will need to copy the target PMC number back into %ecx after the CPUID instruction.)  If I recall correctly, the CPUID instruction is the only user-mode instruction that fully serializes a core, but the overhead is quite high -- somewhere in the range of 100 to 300 cycles on most processors.

If you only want to order the execution of the RDPMC instruction with respect to certain instructions, then you can create a false dependency using the inputs and outputs.  For example, if you want to read the performance counter after computing a value (or loading from memory), you can use the result as part of the computation of the value in %ecx.   The processor recognizes many idioms for zeroing registers, so you need to make this at least a multi-step process.  For example, if you have computed a value (or loaded a value) into %r8, then you can ensure that the RDPMC instruction will not execute until after this value is computed by doing something like:

  1. Do whatever you need to do to compute the value in %r8.
  2. Add the desired PMC number to the value in %r8 and save the sum in %r9.
  3. Subtract the value in %r8 from the value in %r9 and save the result in %ecx.
  4. Execute RDPMC.

This only adds about 2 cycles and 2 instructions between the computation of the value in %r8 and the execution of the RDPMC instruction. 

Note that this will not prevent other instructions following the RDPMC instruction from being executed early --- either while %r8 is being computed, during the add/subtract cycles, or concurrently with the RDPMC instruction.  It may be impossible to prevent this from happening without adding a fully serializing instruction like CPUID after the RDPMC.

Similar tricks can be used to force specific instructions to execute after the RDPMC -- just create a dependency between the output of the RDPMC (either %eax or %edx) and an input of the subsequent instruction(s).  In most cases you will need to perform something like the add/subtract trick to prevent the specific value in the RDPMC output from changing the results of your program.   Not all instructions have inputs (e.g., RDTSC), and it may not be practical to force a dependency on *all* subsequent instructions, but this approach can be a useful start.

My comments at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/697093#comment-1886115 and the graph attached to that post may be helpful in understanding some of the issues.  That discussion was about RDTSC and RDTSCP, but many of the same ordering and overlap issues apply to RDPMC.

Peter_Cordes
Beginner

__builtin_ia32_rdpmc (aka __rdpmc() in x86intrin.h) being treated as a pure function is finally fixed in GCC 6.5, 7.4+, 8.3+, and 9.x.

The nightly build of GCC pre-9.0 trunk on Godbolt shows that it now works properly: https://godbolt.org/z/FUgmGe

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87550

McCalpinJohn
Honored Contributor III

I should note that some of my comments above about ordering are incorrect.  Although LFENCE was originally defined as a "load fence", it is now architecturally defined as an execution fence as well, and its overhead is much lower than the traditional approach of using CPUID.

This is discussed in the comments starting at https://software.intel.com/en-us/forums/intel-isa-extensions/topic/785240#comment-1926549, and has also been included in my updated notes on timing short code sections http://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/

jaehyuk
Beginner

Dear Dr. Bandwidth,

I would like to use RDPMC to read L1 cache misses.

My Coffee Lake machine provides the MEM_LOAD_RETIRED.L1_HIT event, and by reading the performance counter for this event before and after an access, I think I can tell whether the access hit or missed in the L1. I would like to use RDPMC directly in my code, but I don't know how to set the ECX register for the MEM_LOAD_RETIRED.L1_HIT event. Could you help me configure ECX for this event?

McCalpinJohn
Honored Contributor III

Enabling and programming the core performance counters requires setting values in a number of MSRs.  The infrastructure is described in Chapter 18 of Volume 3 of the Intel Architectures SW Developer's Manual ("SWDM"), while Chapter 19 contains the tables of the specific "events" that each processor model can monitor.

On Linux systems, I use the "rdmsr" and "wrmsr" programs from msr-tools to read and write the MSRs.   This will require root privileges on any sane system.  It requires becoming very familiar with Chapter 18 of Vol3 of the SWDM, as well as exercising perfect control over process and thread placement during the measurement period.

A much better approach for getting started is using something like the Linux "perf stat" command for whole-program measurements, or (if you need to read the counters from inside the program) the PAPI or LIKWID libraries.   
