hardware performance counters

NPund · ‎05-03-2018

Hi all,

I am new to development and am working with hardware performance counters (HPCs). I have some very basic questions to help my understanding.

1. Are HPCs specific to each core, or are they shared among different cores?

2. Where are they located?

3. If they are specific to each core, what type of data does perf provide?

Thank You

Thomas_W_Intel · ‎05-04-2018

Welcome to the exciting world of hardware performance counters. There are counters that count events specific to a core (e.g., the number of instructions retired, or the number of mispredicted branches), and there are "uncore" counters that count events outside the cores, such as data transfers in the memory controller. Depending on what they count, the counters are located in the core or in the uncore, respectively. The supported events depend on the processor architecture that you are using. You can find a list of events at https://download.01.org/perfmon/index/
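
If you are on Linux, one way to see the core/uncore split on your own machine is to look at the performance monitoring units (PMUs) that the kernel exposes under /sys/bus/event_source/devices. The sketch below is only a rough illustration. It assumes the naming commonly seen on Intel systems with recent kernels ("cpu" or "cpu_*" for the core PMU, "uncore_*" for uncore units); the function names classify_pmus and list_pmus are made up for this example, and other platforms may name things differently.

```python
import os

# Directory where Linux exposes the PMUs known to the perf subsystem.
SYSFS_PMU_DIR = "/sys/bus/event_source/devices"


def classify_pmus(names):
    """Split PMU names into core, uncore, and other groups.

    The grouping relies on a naming convention (an assumption, not a rule):
    "cpu" / "cpu_*" for core PMUs and "uncore_*" for uncore PMUs.
    """
    groups = {"core": [], "uncore": [], "other": []}
    for name in sorted(names):
        if name == "cpu" or name.startswith("cpu_"):
            groups["core"].append(name)
        elif name.startswith("uncore_"):
            groups["uncore"].append(name)
        else:
            groups["other"].append(name)
    return groups


def list_pmus(path=SYSFS_PMU_DIR):
    """Return the PMU names found in sysfs, or an empty list if unavailable."""
    try:
        return os.listdir(path)
    except OSError:
        return []


if __name__ == "__main__":
    for group, names in classify_pmus(list_pmus()).items():
        print(group, names)
```

Entries that land in "other" (for example "software" or "tracepoint") are not hardware counters in the sense discussed here.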

McCalpinJohn · ‎05-04-2018

As Thomas mentioned, there are hardware performance counters in a number of different units of the processor, differentiated by the mechanism(s) used for accessing the counters.

The phrase "performance counters" most commonly refers to the performance counters in the cores. The infrastructure for these counters is described in Chapter 18 of Volume 3 of the Intel Architectures Software Developer's Manual (Intel document 325384 -- the most recent revision is 066, published in March 2018). The events that can be counted are mostly different on different processor models -- Chapter 19 of the same document briefly describes the events for each model. When HyperThreading is enabled, these counters typically only increment for events pertaining to the specific Logical Processor that is reading the counter. Exceptions are noted in the event descriptions in Chapter 19.

The details vary by processor model, but the core performance counters typically include events related to

- Overall instruction fetch/issue/dispatch(execution)/retirement
  - Some processors have additional events related to "stalls" at various points in the pipeline
- Branch execution, conditional branches taken/not taken, and mispredicted conditional branches
- Load/store operations, including
  - Translation Lookaside Buffer accesses/misses, and Page Table Walks
  - Accesses to the L1 Data Cache, including hits, misses, and fills
  - Accesses to the L2 Cache
  - Accesses to the L3 Cache (where applicable)
- Floating-point operations, which can be counted by SIMD width and by operand width (Broadwell and newer cores)
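
Raw counts from these categories are usually combined into ratios before they tell you much. As a minimal sketch (not tied to any particular processor), the snippet below derives instructions per cycle, a branch misprediction rate, and an L1 data cache miss rate from a dictionary of counts. The dictionary keys are hypothetical labels chosen for this example, not real event names, and the sample numbers are made up.

```python
def ratio(numerator, denominator):
    """Divide two counts, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def derived_metrics(counts):
    """Compute a few common ratios from raw counter values.

    `counts` maps hypothetical labels (not real event names) to raw counts.
    """
    return {
        "ipc": ratio(counts["instructions"], counts["cycles"]),
        "branch_mispredict_rate": ratio(counts["branch_misses"], counts["branches"]),
        "l1d_miss_rate": ratio(counts["l1d_misses"], counts["l1d_loads"]),
    }


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    sample = {
        "instructions": 2_000_000,
        "cycles": 1_000_000,
        "branches": 400_000,
        "branch_misses": 5_000,
        "l1d_loads": 600_000,
        "l1d_misses": 30_000,
    }
    for name, value in derived_metrics(sample).items():
        print(f"{name}: {value:.4f}")
```

Which real events map onto each label depends on the processor model, so the event lists for your specific part still need to be consulted.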

Important caveats:

- Event descriptions are almost always terse, and it can require significant research (or familiarity with the microarchitecture) to make sense of the descriptions.
- Not all events give correct answers, and it can require significant research to decide if a specific event is accurate enough for your intended use.
- Not all events give useful answers for performance analysis and tuning. (Performance counter events are often included by the design team to measure certain application characteristics that may have an influence on future processor designs, or to aid with the tuning of heuristics used in various dynamically adaptive hardware mechanisms.)
- Accessing the counters brings up additional issues.
  - When enabled, these counters are counting events that happen on a specific Logical Processor.
  - If you want to limit counting to specific processes, or have counting that is able to "follow" a process that is migrated across logical processors, then additional software infrastructure is required to save/restore counters on context switches.
  - This "virtualization" is implemented by the "perf events" subsystem in Linux, for example. It makes the counters easier to use, but sometimes harder to understand.
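
To make the "follow a process" idea more concrete, here is a toy model of per-task counter virtualization. It is not how the Linux kernel implements it: instead of saving and restoring hardware counter state, this sketch simply snapshots a free-running counter when a task is scheduled in and adds the difference to the task's total when it is scheduled out. All class and function names (LogicalProcessor, Task, switch_in, switch_out) are hypothetical.

```python
class LogicalProcessor:
    """Toy stand-in for a logical processor with one free-running counter."""

    def __init__(self):
        self.counter = 0

    def run(self, events):
        """Simulate execution that causes `events` counter increments."""
        self.counter += events


class Task:
    """A process whose events we want to track across migrations."""

    def __init__(self, name):
        self.name = name
        self.total = 0       # events attributed to this task so far
        self._cpu = None     # logical processor the task is running on
        self._start = None   # counter value when the task was scheduled in


def switch_in(task, cpu):
    """Record the counter value at the moment the task starts running."""
    task._cpu = cpu
    task._start = cpu.counter


def switch_out(task):
    """Attribute the events seen since switch_in to the task."""
    task.total += task._cpu.counter - task._start
    task._cpu = None
    task._start = None


if __name__ == "__main__":
    cpu0, cpu1 = LogicalProcessor(), LogicalProcessor()
    a, b = Task("a"), Task("b")

    switch_in(a, cpu0); cpu0.run(100); switch_out(a)
    switch_in(b, cpu0); cpu0.run(40); switch_out(b)
    switch_in(a, cpu1); cpu1.run(25); switch_out(a)  # task a migrated to cpu1

    print(a.total, b.total)            # 125 40
    print(cpu0.counter, cpu1.counter)  # 140 25
```

The per-processor counters (140 and 25) mix events from both tasks, while the per-task totals (125 and 40) follow each task across logical processors. That extra bookkeeping layer is the convenience, and also part of why virtualized results can be harder to reason about.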