VTune Events -- what does "counts at the hardware thread level" mean?

Michael_C_13 · ‎03-09-2017

Hello Intel Forum Gurus,

I am using VTune on Knights Corner. The event documentation (https://software.intel.com/en-us/node/589941) describes many events as being "counted at the hardware thread level." Does this mean they are summed over all active hardware threads (in other words, the value reported is the total count for all active threads on the device)? Or does it mean that such events are counted for a single hardware thread on each core, then summed over cores (which is how INSTRUCTIONS_EXECUTED is calculated, I believe)?

For example, EXEC_STAGE_CYCLES is described as "counts at the hardware thread level." For a toy highly-compute-bound application, I noticed that when I run with 4 OpenMP threads per core, the EXEC_STAGE_CYCLES event is almost exactly equal to CPU_CLK_UNHALTED/4, which suggests to me that EXEC_STAGE_CYCLES is counted for a single hardware thread per core, then summed over cores.

Thanks in advance for your help,

Michael

McCalpinJohn · ‎03-10-2017

In this case "counted at the hardware thread level" means that the counts are limited to the current hardware thread (which is equivalent to a "logical processor").

Most of the documents refer to this property as having "thread scope", rather than "core scope". The latter includes counts from the activity of any hardware thread running on that physical core. For slightly obscure reasons, Intel has deliberately chosen to use a different nomenclature for hardware multithreading on the Knights Corner processors, but for Xeon Phi x200 (Knights Landing) the standard nomenclature seems to have returned.

VTune probably has a variety of ways to deal with aggregating the counts across different physical cores, but in this case the underlying counts are specific to each logical processor.

The performance monitor events on most Intel products support an "anythread" bit, which causes events which normally have "thread scope" to have "core scope" instead. This bit is supported on KNC (bit 21 of the IA32_PerfEvtSel0/1 MSRs). With some hacking, you can program different events in the programmable counters of the four threads on each core to get 8 counters per core -- with each counter getting counts from any of the four hardware threads.
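As a rough illustration of what setting the AnyThread bit looks like, the sketch below builds a PerfEvtSel-style control value with and without bit 21. Only the position of the AnyThread bit on KNC comes from the post above. The other field positions are the architectural IA32_PERFEVTSELx ones and are assumed, not confirmed, to apply to KNC, and the event code is a hypothetical placeholder. The sketch only computes the values; writing them to an MSR needs a driver or msr-tools style access and is not shown.

```python
# Sketch: compose a PerfEvtSel-style value, with and without AnyThread.
# Assumption: apart from bit 21 (stated in the post above), the field
# positions follow the architectural IA32_PERFEVTSELx layout.

ANYTHREAD_BIT = 21  # per the post above (IA32_PerfEvtSel0/1 on KNC)
USR_BIT = 16        # count in user mode (architectural layout, assumed)
OS_BIT = 17         # count in kernel mode (architectural layout, assumed)
ENABLE_BIT = 22     # counter enable (architectural layout, assumed)

def perfevtsel(event, umask=0, usr=True, os=True, anythread=False, enable=True):
    """Return the integer control value for the given settings."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8)
    if usr:
        value |= 1 << USR_BIT
    if os:
        value |= 1 << OS_BIT
    if anythread:
        value |= 1 << ANYTHREAD_BIT
    if enable:
        value |= 1 << ENABLE_BIT
    return value

HYPOTHETICAL_EVENT = 0x2A  # placeholder event code, for illustration only

thread_scope = perfevtsel(HYPOTHETICAL_EVENT)
core_scope = perfevtsel(HYPOTHETICAL_EVENT, anythread=True)
print(hex(thread_scope), hex(core_scope))
```

The two values differ only in bit 21, which is the whole difference between "thread scope" and "core scope" for an event that supports the bit.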

This "AnyThread" feature is only supported for a very small number of events on the Xeon Phi x200 (Knights Landing) processor, and the comments in Section 18.2.3.1 of Volume 3 of the Intel Architectures SW Developer's Manual on the limitations of the "AnyThread" approach could be interpreted as a hint that the "AnyThread" approach may eventually be deprecated. Sections 18.6 & 18.7 of V3 of the SWDM note that the AnyThread bit is ignored on Silvermont & Goldmont processors -- that limitation may be the source of the restriction on Xeon Phi x200 (KNL).

Many subtleties arise with HyperThreading enabled. For example, the event CPU_CLK_UNHALTED.CORE with thread context will increment whenever the logical processor is not halted. Any combination of logical processors can be active at the same time, so (for example), on a 3 GHz core with 2 threads, each CPU_CLK_UNHALTED.CORE counter can increment up to 3 billion times per second, and the sum can be up to 6 billion increments per second. It is not possible to tell from these events how often both threads are active, or both threads are idle. The "AnyThread" bit is helpful here -- it can be used to directly count all the cycles in which at least one thread is active, which indirectly tells how many cycles neither thread is active. (If you don't know the frequency, additional complexities arise....)
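To make the counting argument above concrete, here is a toy simulation of one core with two hardware threads. It is only a sketch: the activity probabilities are arbitrary, hypothetical values, and no real counters are read. It shows that the two thread-scope counts alone do not reveal the overlap, while adding an AnyThread-style count gives the "neither active" cycles directly and, for the two-thread case, the "both active" cycles by inclusion-exclusion.

```python
import random

def simulate(cycles=100_000, p_active=(0.6, 0.3), seed=0):
    """Toy model of one core with two hardware threads.

    Each cycle, thread t is unhalted with probability p_active[t]
    (an arbitrary, hypothetical activity pattern).
    """
    rng = random.Random(seed)
    per_thread = [0, 0]   # thread-scope unhalted-cycle counts
    any_thread = 0        # what an AnyThread-style count would see
    both = 0              # ground truth, not visible to thread-scope counters
    for _ in range(cycles):
        active = [rng.random() < p for p in p_active]
        for t, is_active in enumerate(active):
            per_thread[t] += is_active
        any_thread += any(active)
        both += all(active)
    return per_thread, any_thread, both

cycles = 100_000
per_thread, any_thread, both = simulate(cycles)

# Derived only from quantities a tool could measure:
neither_active = cycles - any_thread          # needs the elapsed cycle count
both_active = sum(per_thread) - any_thread    # inclusion-exclusion, 2 threads

print(per_thread, any_thread, neither_active, both_active)
```

Note that `neither_active` depends on knowing the total number of elapsed cycles, which is the frequency caveat mentioned at the end of the paragraph above.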

Michael_C_13 · ‎03-13-2017

Hi John,

Thank you for your thorough reply, but I'm not trying to write a driver or anything so complex. Also I don't believe KNC has hyperthreading, which hopefully simplifies things. I just want to understand what the hardware events mean, and the metrics computed from those events that are reported in the General Exploration viewpoint.

When I switch to the "Core/ H/W Context / Function / Call Stack" grouping in the VTune GUI, it seems to indicate that all the events recorded by a General Exploration analysis are counted individually for each hardware thread (see events.png attached below; CPU_CLK_UNHALTED, EXEC_STAGE_CYCLES, and INSTRUCTIONS_EXECUTED are shown as examples). The event count for each core is reported by VTune as the sum of the counts for that core's hardware threads, and the event count for the entire application is reported as the sum of the counts for all threads on all cores (not pictured, but I can confirm this by switching to the "Function / Call Stack" grouping and adding up the counts for all functions). This means that my intuition about INSTRUCTIONS_EXECUTED in my first post was wrong (INSTRUCTIONS_EXECUTED is not recorded for a single hardware thread on each core then summed over cores; rather, it is counted individually for each thread and summed over all threads on all cores). Whatever, it still seems straightforward enough.

The challenge arises when I try to reconcile this behavior with the description of the INSTRUCTIONS_EXECUTED and CPU_CLK_UNHALTED events and CPI metric given here (https://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding).

First of all, that document says

This event (CPU_CLK_UNHALTED) is counted at the core level – for a particular sample, all the threads running on the same core will have the same value.

In the attached screenshot, that is clearly not the case. Different hardware threads have different values of CPU_CLK_UNHALTED.

Secondly, the document as well as VTune's internal mouseover text boxes say that Average CPI Per Thread is given by CPU_CLK_UNHALTED/INSTRUCTIONS_EXECUTED, and that Average CPI Per Core is then given by CPI Per Thread/Num Threads per Core. This does not make sense to me if INSTRUCTIONS_EXECUTED is a sum over all threads. Intuitively, I expect that

CPI Per Thread = Avg active cycles per core/avg instructions executed by an individual thread.

and that

CPI Per Core = Avg active cycles per core/avg number of instructions executed by all threads on that core
    = Avg active cycles per core/( num threads per core*avg instructions per thread ).

However, if INSTRUCTIONS_EXECUTED is a sum over all threads on all cores (which appears to be the case in the VTune GUI) then:

CPI Per Thread (as alleged by document) = CPU_CLK_UNHALTED/INSTRUCTIONS_EXECUTED
    = Total cycles executed by all cores/( num cores*num threads per core*avg instructions per thread )
    = Avg cycles executed per core*num cores/( num cores*num threads per core*avg instructions per thread )
    = Avg cycles executed per core/( num threads per core*avg instructions per thread ).

This result is what I intuitively expected to represent CPI Per Core, not CPI Per Thread.
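The arithmetic above can be checked with made-up numbers. The sketch below uses purely illustrative values (60 cores, 4 threads per core, and hypothetical cycle and instruction counts; none of these are measurements) and evaluates CPU_CLK_UNHALTED/INSTRUCTIONS_EXECUTED under two different assumptions about how the cycle count is aggregated. Case A is the assumption used in the derivation above (cycles summed once per core). Case B assumes instead that every hardware thread reports roughly its core's cycle count and that these are summed over all threads, as the per-thread GUI view described earlier might suggest. Which case matches VTune's actual behavior is not established here.

```python
# Illustrative values only; none of these are measurements.
num_cores = 60
threads_per_core = 4
cycles_per_core = 1_000_000_000    # avg active cycles per core
instr_per_thread = 250_000_000     # avg instructions executed per thread

# The "intuitive" definitions from the post above.
cpi_thread_expected = cycles_per_core / instr_per_thread
cpi_core_expected = cycles_per_core / (threads_per_core * instr_per_thread)

# INSTRUCTIONS_EXECUTED summed over all threads on all cores.
instructions_total = num_cores * threads_per_core * instr_per_thread

# Case A: CPU_CLK_UNHALTED summed once per core.
cycles_case_a = num_cores * cycles_per_core
ratio_a = cycles_case_a / instructions_total

# Case B: CPU_CLK_UNHALTED summed over every hardware thread, with each
# thread reporting approximately its core's cycle count.
cycles_case_b = num_cores * threads_per_core * cycles_per_core
ratio_b = cycles_case_b / instructions_total

print("expected CPI per thread:", cpi_thread_expected)
print("expected CPI per core:  ", cpi_core_expected)
print("ratio, case A:", ratio_a)
print("ratio, case B:", ratio_b)
print("ratio B / threads per core:", ratio_b / threads_per_core)
```

Under case A the ratio of totals equals the intuitive CPI per core, which is the discrepancy described above. Under case B the same ratio equals the intuitive CPI per thread, and dividing by the number of threads per core gives the per-core value, which would be consistent with the documented formulas.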

I feel like either the documentation is wrong, or I am misunderstanding the meanings of CPU_CLK_UNHALTED and INSTRUCTIONS_EXECUTED. Sorry for being so nitpicky, but this seems fundamental to understanding VTune on KNC.

Regards,
Michael

McCalpinJohn · ‎03-13-2017

There are a couple of difficult issues here....

- KNC has something very much like "HyperThreading", but Intel does not call it by that name. There are a couple of reasons for that, but one important factor is that the hardware thread support cannot be "turned off" on KNC -- the only mode of operation supported is the mode that supports four thread contexts ("logical processors") per physical core.
- On most Intel processors, CPU_CLK_UNHALTED is a thread-specific value -- i.e., you can "halt" individual logical processors associated with a physical core and the CPU_CLK_UNHALTED event will stop incrementing on those logical processors.
- The link that you point to says that CPU_CLK_UNHALTED has "core-scope" on KNC. This means that it increments (by one) whenever *any* thread is active on that physical core. You can get this effect on other processors by setting the AnyThread bit in the performance counter control register.
- The actual instantaneous value of the CPU_CLK_UNHALTED counter should be the same on all logical processors of a given core, but VTune uses a "sampling" methodology to assign performance to bits of code, and this sampling is not guaranteed to sample the logical processors equally. In particular, if a logical processor is not being used, you would not expect to get any samples associated with that logical processor.

I try to avoid sampling-based measurement approaches, and I am not a huge fan of multi-threading when doing performance analysis, so I find this confusing as well. For my codes it always sufficed to try 1, 2, 3, 4 threads per core and simply choose the version that gave the shortest execution time. Trying to understand CPI or IPC when you are changing the number of threads per physical core is bound to be confusing.