I'm trying to understand how to program the performance counters when HyperThreading (HTT) is enabled ...
Way back on the NetBurst (P4) processors, to separate events by Logical Processor (LP), I would program the same event on two different counters, set the ActiveThread bits in both CCCRs to 11b (any thread), and set the T0_OS/USR bits in one ESCR (to count on LP0 only) and the T1_OS/USR bits in the other ESCR (to count on LP1 only). That gave me LP-specific counts.
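To make the P4 scheme concrete, here is a minimal sketch of the thread-qualifier bits as I remember them from the NetBurst documentation. The bit positions are my recollection rather than something stated in this post, and the macro and function names are made up for illustration, so please check them against the manual before relying on them.

```c
#include <stdint.h>

/* ESCR thread-qualifier bits (assumed layout, from my reading of the
 * NetBurst docs): one USR/OS pair per logical processor. */
#define ESCR_T1_USR (1u << 0)
#define ESCR_T1_OS  (1u << 1)
#define ESCR_T0_USR (1u << 2)
#define ESCR_T0_OS  (1u << 3)

/* CCCR ActiveThread field, bits 17:16; 11b means "count for any thread". */
#define CCCR_ACTIVE_THREAD_ANY (3u << 16)

/* Hypothetical helper: qualifier bits that restrict an ESCR to one LP,
 * counting at both user and OS privilege levels. */
static uint32_t escr_thread_bits(int lp)
{
    return lp == 0 ? (ESCR_T0_OS | ESCR_T0_USR)
                   : (ESCR_T1_OS | ESCR_T1_USR);
}
```

One ESCR gets escr_thread_bits(0), the other gets escr_thread_bits(1), and both CCCRs carry CCCR_ACTIVE_THREAD_ANY.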
As I read the docs for the Nehalem processors, it appears that the 4 performance counters are also shared by all the LPs in the core. The PerfEvtSelX registers have a new ANY bit (bit 21). The docs (Software Developer's Manual Vol 3B, chapter 18, page 18-54) state:
"When set to 1, it enables counting the associated event conditions (including matching the thread's CPL with the OS/USR setting of IA32_PERFEVTSELx) occurring across all logical processors sharing a processor core. When bit 21 is 0, the counter only increments the associated event conditions (including matching the thread's CPL with the OS/USR setting of IA32_PERFEVTSELx) occurring in the logical processor which programmed the IA32_PERFEVTSELx MSR."
Does that mean that the hardware "remembers" which LP programmed the counter(s)? The controls don't seem to allow one to specify which LP in the core is the one that should increment the counter.
If, for example, I run with affinity to the processor which is LP0 in a core and program PerfEvtSel0 to count INST_RETIRED.ANY_P with bit 21 cleared (i.e. ANY=0), then run with affinity to the processor which is LP1 on the same core and program PerfEvtSel1 to also count INST_RETIRED.ANY_P, also with bit 21 cleared:
- Since the controls for counter 0 were programmed by code running on LP0, will counter 0 only count instructions retired by LP0?
- Same for counter 1: will it only count instructions retired by LP1?
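For reference, this is the value I would write in that scenario. It is only a sketch of the encoding under my reading of the architectural IA32_PERFEVTSELx layout (event select in bits 7:0, unit mask in bits 15:8, USR=16, OS=17, ANY=21, EN=22) and assumes INST_RETIRED.ANY_P is event C0H with unit mask 00H. The function name perfevtsel_encode is made up for illustration, and the actual WRMSR has to be done from ring 0 on the LP in question.

```c
#include <stdint.h>

/* IA32_PERFEVTSELx control bits (architectural perfmon layout). */
#define EVTSEL_USR (1ull << 16)  /* count at CPL > 0 */
#define EVTSEL_OS  (1ull << 17)  /* count at CPL 0 */
#define EVTSEL_ANY (1ull << 21)  /* count for all LPs sharing the core */
#define EVTSEL_EN  (1ull << 22)  /* enable the counter */

/* INST_RETIRED.ANY_P: event select C0H, unit mask 00H (assumed). */
#define INST_RETIRED_ANY_P_EVENT 0xC0u
#define INST_RETIRED_ANY_P_UMASK 0x00u

/* Hypothetical helper: build a PerfEvtSel value that counts at all
 * privilege levels, with the ANY bit set or cleared as requested. */
static uint64_t perfevtsel_encode(uint8_t event, uint8_t umask, int any_thread)
{
    uint64_t v = (uint64_t)event
               | ((uint64_t)umask << 8)
               | EVTSEL_USR | EVTSEL_OS | EVTSEL_EN;
    if (any_thread)
        v |= EVTSEL_ANY;
    return v;
}
```

With ANY=0 this gives 0x4300C0, which is what I would write to PerfEvtSel0 from LP0 and to PerfEvtSel1 from LP1; with ANY=1 it gives 0x6300C0.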
I guess the general question is: how should the counters be programmed to separate events by LPs?
The question also applies to the Fixed-Function counters. There seem to be only 3 of them, shared by the LPs in the core, so with the Fixed-Function counters it does not seem possible, for example, to count INST_RETIRED.ANY (on Fixed-Function counter 0) and separate the count by LP. It seems that you can only choose between counting for both LPs (the whole core) and counting for the LP which programmed the Fixed-Function counter.
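The same choice shows up in the Fixed-Function control register. Below is a minimal sketch of the per-counter field as I understand IA32_FIXED_CTR_CTRL: 4 bits per counter, with OS enable in bit 0, USR enable in bit 1, AnyThread in bit 2 and PMI in bit 3 of each field. The helper name fixed_ctr_ctrl_field is made up for illustration.

```c
#include <stdint.h>

/* Bits within one 4-bit field of IA32_FIXED_CTR_CTRL. */
#define FIXED_EN_OS  0x1ull  /* count at CPL 0 */
#define FIXED_EN_USR 0x2ull  /* count at CPL > 0 */
#define FIXED_ANY    0x4ull  /* count for all LPs sharing the core */
#define FIXED_PMI    0x8ull  /* interrupt on overflow (unused here) */

/* Hypothetical helper: control field for fixed counter `counter`
 * (0, 1 or 2), counting at all privilege levels, shifted into place. */
static uint64_t fixed_ctr_ctrl_field(unsigned counter, int any_thread)
{
    uint64_t field = FIXED_EN_OS | FIXED_EN_USR;
    if (any_thread)
        field |= FIXED_ANY;
    return field << (4 * counter);
}
```

For Fixed-Function counter 0 (INST_RETIRED.ANY) this yields 0x3 for "this LP only" and 0x7 for "whole core"; I see no field that names a specific LP.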