[PCM] Intel CPU performance counters to get real core utilization % with HT

Francesco_G_ · 10-14-2015

Hi all,
this is my first post on the Intel forum, but I have read more or less all the threads regarding PCM and core utilization with performance counters (PCs).
I still have doubts and questions about how to use PCM effectively to get the real core utilization % when HT is active.

First (stupid?) question: I know every OS treats the logical cores as if they were physical (I have a dual-core i7 with HT, so the OS shows 4 cores in total), and we all know that the utilization % of these cores is not actually an accurate metric. So why does PCM still show 4 cores? I would expect it to show aggregate data for the two physical cores: e.g., the IPC should be calculated per core, not per thread. Am I wrong?

Second question: I have seen that IBM Power processors include a specific register for that purpose, the PURR. The whole point of it is to provide a correct utilization figure that correlates with application throughput. Also, Solaris is able to provide some internal CPU utilization figures with the 'pgstat' command. However, there doesn't seem to be anything out there that does the same on Intel x86 CPUs. Do you think that by using a combination of PCs I can achieve something similar to the PURR?

Third question: when I run a CPU-intensive benchmark (from low to heavy load) I get these three values:
- the PHYSICAL CORE IPC stays around 0.8~0.9 for both low and heavy benchmark load, which corresponds to ~23% core utilization for cores in the active state.
- Instructions per nominal CPU cycle spans from ~0.1 to ~0.9, which corresponds to a core utilization that grows over the time interval (from ~2% to ~23%).
- C0 core residency goes from ~8% up to ~99.8%.
Is C0 core residency the best metric for evaluating global CPU utilization, since it represents the real time the CPU (whole, not single cores?) is doing useful work? If so, I wonder why my IPC is always much lower than the maximum IPC I could get (i.e., 4), even when my benchmark stresses the CPU to its maximum.

Thanks a lot for your time and replies.
Francesco

McCalpinJohn · 10-14-2015

Most of the Intel hardware performance counters in the core can be configured to count for the logical processor only or for the core as a whole. This is typically controlled by bit 21 of the IA32_PERFEVTSEL* MSRs.
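To make the role of bit 21 concrete, here is a minimal Python sketch that builds an IA32_PERFEVTSELx value with and without the AnyThread bit. The helper name perfevtsel is made up for illustration; the bit positions follow the register layout in the Intel SDM, and the event used is the architectural "Instruction Retired" event (event select 0xC0, umask 0x00). It only computes the values and does not program any MSR.

```python
# Bit positions in IA32_PERFEVTSELx, as laid out in the Intel SDM.
USR = 1 << 16         # count while running at CPL > 0
OS = 1 << 17          # count while running at CPL 0
ANY_THREAD = 1 << 21  # count for all logical processors sharing the core
EN = 1 << 22          # enable the counter


def perfevtsel(event, umask, any_thread=False):
    """Build an IA32_PERFEVTSELx value for a user+kernel count (hypothetical helper)."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8) | USR | OS | EN
    if any_thread:
        value |= ANY_THREAD
    return value


# Architectural "Instruction Retired" event: event select 0xC0, umask 0x00.
per_thread = perfevtsel(0xC0, 0x00)
per_core = perfevtsel(0xC0, 0x00, any_thread=True)
print(hex(per_thread))  # 0x4300c0
print(hex(per_core))    # 0x6300c0
```

The two values differ only in bit 21, which is the whole difference between counting for one logical processor and counting for the core.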

I don't know how PCM exposes this functionality, but it is certainly there in the hardware.

One caveat is that on some processors there are bugs that show up when you are running with HyperThreading enabled, or that show up when you try to count for the logical processor only. Some of these are mentioned in the comments in the tables of Chapter 19 of Volume 3 of the Intel Architectures Software Developer's Manual, while others are listed in the processor "specification update" documents.

Francesco_G_ · 10-15-2015

For which OS have you compiled the (maybe) latest Intel PCM source code? And using which tools?

I'm working on Ubuntu 15.04; I compiled the latest PCM version using the Makefile.

Most of the Intel hardware performance counters in the core can be configured to count for the logical processor only or for the core as a whole. This is typically controlled by bit 21 of the IA32_PERFEVTSEL* MSRs.

I don't know how PCM exposes this functionality, but it is certainly there in the hardware.

Otherwise, how can I use perf to get logical-processor PCs? And how can I combine them to obtain metrics such as PURR or SPURR?

Thanks

McCalpinJohn · 10-15-2015

I can't comment on the PURR metric, since I don't know what it is.... (I don't recall this from my days on the IBM POWER4/5/6 design teams, but I have forgotten a lot of details in the last decade.)

It is important to understand the difference between the various metrics used for "utilization". From the OS perspective, a core is "busy" if a process is running on it. In most cases this is highly correlated with C0 state, since a core must be in C0 to be operational, and the OS typically puts each core into C1 state fairly quickly if there are no processes ready to run on that core. (I don't know how this works on HyperThreaded systems; presumably the core can't really go into C1 unless both logical processors are idle, but I don't know how the OS manages this or accounts for it.)

The OS definition counts a core as "busy" whether the process that is executing on it is executing 4 instructions per cycle or one instruction every 200 cycles. The flip side of this is that the aggregated "core utilization" counts every Logical Processor as a core, so running one thread per core will only show 50% "utilization", even if the thread running on that core is using all of the available resources in each cycle.
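A small Python sketch of that 50% effect, using a made-up dual-core machine with HyperThreading and one busy thread per core. The function names and the busy/idle layout are assumptions for illustration only, and the second function counts which cores are occupied, not how much of their resources are used.

```python
def os_utilization(busy_logical, total_logical):
    """OS-style utilization: fraction of logical processors with a running task."""
    return busy_logical / total_logical


def cores_in_use(busy_by_core):
    """Fraction of physical cores with at least one busy logical processor."""
    active = sum(1 for threads in busy_by_core if any(threads))
    return active / len(busy_by_core)


# Hypothetical dual-core machine with HT: one busy thread on each core.
busy_by_core = [(True, False), (True, False)]
busy_logical = sum(sum(threads) for threads in busy_by_core)
total_logical = sum(len(threads) for threads in busy_by_core)

print(os_utilization(busy_logical, total_logical))  # 0.5
print(cores_in_use(busy_by_core))                   # 1.0
```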

So "utilization" may not be a helpful metric. In some ways IPC is better, but this metric requires caution as well.

As a side note, "CPU intensive" is not a particularly well-defined term.

1. In some contexts "CPU intensive" might mean that the processor is "busy" (in the OS sense) and not either completely idle or waiting for IO (typically disk or network IO).
2. In other contexts "CPU intensive" means that (in addition to attribute #1) the application has a very low fraction of stall cycles due to memory accesses. (This may not be easy to evaluate in practice, but in principle it means that the performance is very similar to the performance that would be obtained if all memory references hit in the L1 Data Cache and all instruction fetches hit in the Instruction Cache.)
3. In other contexts "CPU intensive" means that (in addition to attributes #1 and #2), the application has a sustained instruction retirement rate that is "close to" the limits of the hardware. This requires that the distribution of instructions matches the available hardware resources (i.e., number of arithmetic units, number of load/store units, etc) at a fine enough granularity (and with sufficient independence of operands) that the hardware can keep "most" of the functional units busy on each cycle.

The first definition of "CPU intensive" is very commonly used in the worlds of databases, web servers, etc, where many workloads are "IO intensive".

The second definition of "CPU intensive" is very common in High Performance Computing.

The third definition is not particularly common, but shows up in discussions related to energy use, power dissipation, and the impact of power/thermal concerns on frequency, performance, battery life, etc.

IPC is a tricky metric to work with because it takes a lot of analysis to know whether all of the instructions are actually necessary for execution of the program. Suppose you have a "CPU-intensive" code for which the performance is limited by a loop that does ADD instructions.

As a very high-level example (i.e., turning this into a working example is left as an exercise for the reader, but see https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/506515 for a real example), the compiler might produce a loop with four instructions:

LOOP:
       LOAD (Pointer), input register1;
       ADD input register1, input register2, output register;
       Decrement Pointer (and set condition code);
       Branch if not zero to LOOP:

The compiler might also choose to unroll this loop to do 8 ADDs with memory operands and a total of 10 instructions:

LOOP:
       ADD (Pointer), input register, output register;
       ADD (Pointer+1), input register, output register;
       ADD (Pointer+2), input register, output register;
       ADD (Pointer+3), input register, output register;
       ADD (Pointer+4), input register, output register;
       ADD (Pointer+5), input register, output register;
       ADD (Pointer+6), input register, output register;
       ADD (Pointer+7), input register, output register;
       Decrement Pointer  by 8 (and set condition code);
       Branch if not zero to LOOP:

These loops might execute in exactly the same number of cycles because the pointer decrement and branch can typically be done concurrently with the ADD operation. But the IPC will be quite different due to the different instruction counts -- if the performance is 1 cycle per add, the first version will have an IPC of 4, while the second version (at exactly the same performance) will have an IPC of 1.25.
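The arithmetic behind those two IPC figures can be checked with a few lines of Python. This is only a sketch of the idealized case described above (run time set entirely by one ADD per cycle); the function name is made up.

```python
def loop_ipc(instructions_per_iteration, adds_per_iteration, cycles_per_add=1):
    """IPC of a loop whose run time is set entirely by its ADD throughput."""
    cycles = adds_per_iteration * cycles_per_add
    return instructions_per_iteration / cycles


rolled = loop_ipc(instructions_per_iteration=4, adds_per_iteration=1)
unrolled = loop_ipc(instructions_per_iteration=10, adds_per_iteration=8)
print(rolled, unrolled)  # 4.0 1.25
```

Both versions retire one ADD per cycle, so the useful work per cycle is identical even though the IPC differs by more than a factor of three.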

Note that in both cases these might be considered (category #2) CPU-intensive programs. Execution time is completely limited by the ability of the processor to execute the ADD instructions, with negligible stalls for IO or for memory references.

If the performance is the same, then there is no reason to prefer the first version over the second, and therefore no reason to desire higher IPC without understanding exactly what instructions are required to get the job done.

Of course if the binary code is fixed, then improving the IPC is the same as improving time-to-solution, which is a good thing. In that case (as in many others) it is preferable to focus on the time-to-solution rather than on the IPC. To make any sense of the IPC you need to know exactly what the assembly code is doing, how the assembly code is changing as you make changes to the compiler options or source code, how each assembly code version maps to the hardware, and how fast each should be able to execute. Sometimes this enables me to rearrange the source code or change the compiler options to generate much faster execution, but often it is simply an exercise in learning that my understanding of how the hardware works is wrong.

Francesco_G_ · 10-16-2015

Thanks for the explanations about IPC; now I understand that it should be considered only if I really know what the application code does at the assembly level.

My goal is to understand at a high level how the utilization of a modern CPU with HT varies with increasing workload. This is because I have already found that the relationship is no longer linear: I got non-linear growth after ~50% of workload. Since I cannot rely on the classical CPU utilization %, I have been considering IPC or instructions per nominal CPU cycle (and the related utilizations), or other metrics that PCM exposes (like C0 core residency).

Also, I thought I could use PCs to derive new metrics: UnHalted Core Cycles, UnHalted Reference Cycles and Instruction Retired seemed to be the most promising ones.
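As a sketch of what could be derived from those three counters plus the TSC, the Python below computes a few ratio metrics from counter deltas over one sampling interval. The formulas are the conventional ones, not something confirmed in this thread as PCM's exact definitions, and they assume the reference-cycle counter ticks at the TSC rate while the logical processor is not halted. The function name and the sample numbers are made up.

```python
def derived_metrics(instructions, core_cycles, ref_cycles, tsc_cycles):
    """Derive ratio metrics from counter deltas taken over one interval."""
    return {
        "ipc": instructions / core_cycles,
        "instr_per_nominal_cycle": instructions / tsc_cycles,
        # Assumes ref_cycles advances at the TSC rate only while unhalted.
        "c0_residency": ref_cycles / tsc_cycles,
        "active_freq_ratio": core_cycles / ref_cycles,
    }


# Made-up deltas for a lightly loaded logical processor.
sample = derived_metrics(
    instructions=90, core_cycles=100, ref_cycles=100, tsc_cycles=1000
)
for name, value in sample.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions, instructions per nominal cycle is the product of IPC, active frequency ratio and C0 residency, which is why it can grow with load while IPC stays flat.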

You surely know more than I do about how to use these PCs or PCM to get THE metric that can be used to draw a linear Utilization-Throughput diagram (as PURR seems to achieve) OR to correctly estimate the real utilization and headroom of physical cores. Do you have any suggestions?

For reference, this is the PURR definition: PURR provides an actual count of physical processing time units that a hardware thread has used. The hardware increments the PURR based on how each hardware thread is using the resources of the processor core. The PURR counts in proportion to the real time clock (timebase).
PURR patent: http://www.google.com/patents/US8230440

What's more, I have found other metrics in the esxtop tool: PCPU UTIL(%), CORE UTIL(%) and PCPU USED(%). Maybe these could be a way to obtain what I'm looking for, but I don't know which PCs to use to get them.

Thanks a lot for your time!

McCalpinJohn · 10-16-2015

To cianfa72: As I tried to make clear, this example is an illustration of principles, not a working piece of code that can be analyzed.

The example discussed at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/506515 is based on an actual working code and the corresponding performance counter values.

shang__xiaowei · 09-07-2019

Hi Dr. John and Francesco,

Thanks for your helpful information.

I have the same question that Francesco asked. Have you found a practical way to measure the retired instructions of a logical core (i.e., hardware thread or hyper-thread) using the PCM tool or other tools? I know the hardware support is available (i.e., bit 21 of IA32_PERFEVTSELx), but I don't know whether there are any easy-to-use tools (e.g., perf or PCM) that expose APIs to get the retired instructions of a logical core directly.

Thanks much for your help in advance.

McCalpinJohn · 09-08-2019

It is certainly possible to get per-Logical-Processor counts from "perf stat" using the "-a -A" options (with suitable permissions). I use "perf stat -a -A sleep 10" as a quick check of background activity on otherwise idle systems. If you want to map from per-Logical-Processor counts to process counts, the processes need to be bound so that they cannot be moved from the Logical Processor that you expect them to be running on....
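One way to do that binding from inside a program is sketched below in Python, using the standard library's os.sched_setaffinity (available on Linux; the sketch does nothing on platforms that lack it). The function name is made up, and the choice of the lowest-numbered allowed logical processor is arbitrary.

```python
import os


def pin_to_logical_processor(cpu):
    """Bind the calling process to one logical processor (Linux only)."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # platform without this call; nothing is changed
    os.sched_setaffinity(0, {cpu})  # pid 0 means the calling process
    return os.sched_getaffinity(0)


pinned = None
# Pick a logical processor this process is already allowed to run on.
allowed = os.sched_getaffinity(0) if hasattr(os, "sched_getaffinity") else set()
if allowed:
    pinned = pin_to_logical_processor(min(allowed))
print(pinned)
```

From the shell, the same effect can be had by launching the program under "taskset -c" with the desired logical processor number.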

shang__xiaowei · 09-08-2019

Hi Dr. John,

Thanks much for your explanations. I still have two questions.

1, if "perf stat -a -A ..." is used, the '-A' option makes perf set bit 21 of IA32_PERFEVTSELx so that the performance counters count for a specified logical processor. Am I right?

2, if bit 21 of IA32_PERFEVTSELx is set for a specified logical processor (i.e., the '-A' option of perf is used), are all the performance counters (e.g., branch misses, cache misses, TLB misses, etc.) counted separately for that logical processor?

Thank you very much.

McCalpinJohn · 09-12-2019

Each logical processor has its own copy of each of the IA32_PERFEVTSEL* registers, so when measuring each logical processor independently bit 21 is *not* set. One easy way to figure out what is set is to simply look at the registers -- something like:

perf stat -a -A print_core_counters.sh

where the "print_core_counters.sh" script is something like:

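#!/bin/bash
# Dump IA32_PERFEVTSEL0..3 (MSRs 0x186-0x189) on every logical processor.
# -a: read on all processors; -c: print each value as a C-style hex constant.
rdmsr -a -c 0x186
rdmsr -a -c 0x187
rdmsr -a -c 0x188
rdmsr -a -c 0x189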

and "rdmsr" is compiled from rdmsr.c in the msr-tools package.
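To read the values that script prints, a short Python sketch like the following can split an IA32_PERFEVTSELx value into the fields that matter here. The function name is made up, the bit positions follow the register layout in the Intel SDM, and the sample value 0x5300c0 is hypothetical, not output from a real run.

```python
def decode_perfevtsel(value):
    """Split an IA32_PERFEVTSELx value into the fields discussed here."""
    return {
        "event": value & 0xFF,
        "umask": (value >> 8) & 0xFF,
        "usr": bool(value & (1 << 16)),
        "os": bool(value & (1 << 17)),
        "any_thread": bool(value & (1 << 21)),
        "enabled": bool(value & (1 << 22)),
    }


# Hypothetical line of rdmsr output, not taken from a real run.
fields = decode_perfevtsel(int("0x5300c0", 16))
print(fields)
```

If any_thread comes back False for every enabled counter, the counts are being kept per logical processor.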

By default, "perf stat" measures a combination of hardware and software counters chosen by the developer. You can select your own events on the command line, using either named events (from "perf list") or raw hardware events.

shang__xiaowei · 09-12-2019

Thanks for your answers. So, by default, bit 21 of IA32_PERFEVTSELx is 0, which means the hardware counters are returned for the logical processor, right?

McCalpinJohn · 09-13-2019

There are no defaults for the hardware -- the software programs whatever it wants. The "perf stat" program normally tracks counts by process, but with the "-a -A" options it tracks by logical processor. You can request that "perf stat" set bit 21 by using "raw" events, but I don't know of any other cases where "perf stat" uses the AnyThread feature....

shang__xiaowei · 09-14-2019

Got it. Thank you very much ;-).