Solved: Hi Dr. Bandwidth,

Min_X_ · ‎12-01-2016

Hi,

1. Are the events listed in Table 19-10 of the IA-32 manual (http://courses.cs.washington.edu/courses/cse451/15au/readings/ia32-3.pdf) associated with each core of the processor?

2. Clearing the performance counter for some event is doable from Ring 0 by simply clearing the corresponding PMC, right?

3. As for the CYCLE_ACTIVITY event, with event number A3H, what do CYCLE_ACTIVITY.STALLS_L2_PENDING and CYCLE_ACTIVITY.CYCLES_LDM_PENDING measure?

4. How can I get zero MEM_LOAD_UOPS_RETIRED.L1_MISS and zero MEM_LOAD_UOPS_RETIRED.L2_MISS but nonzero CYCLE_ACTIVITY.STALLS_L2_PENDING? And how can CYCLE_ACTIVITY.STALLS_L1D_PENDING be zero when CYCLE_ACTIVITY.STALLS_L2_PENDING is not?

Thanks.

Min

McCalpinJohn · ‎12-02-2016

That version of the manual (055) is slightly out of date. The current version (060) is available at http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html
1. In the newer manual, the table for 3rd Generation Core i7 events is Table 19-11.
2. These performance counter events (unless otherwise specified in the description) are unique to each "logical processor". If HyperThreading is disabled, then this is the same as physical cores. Section 18.2.3 discusses the extensions for HyperThreaded systems, allowing the event to be programmed to count in either "per-logical-processor" or "per-physical-core" mode.
In Ring 0 you can clear the counts by using WRMSR to write a value of zero to the MSR address of the IA32_PMC* register (0xC1, 0xC2, etc).
1. If you need to write a value that is larger than 1<<31, see Section 18.2.5 "Full-Width Writes to Performance Counter Registers".
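As a rough illustration of the clearing step, here is a minimal Python sketch. It assumes a Linux system with the msr kernel module loaded, which exposes each logical processor's MSRs as /dev/cpu/<n>/msr and uses the file offset as the MSR address; root privileges are normally required. The helper names write_msr and clear_pmc are hypothetical, not part of any Intel tool.

```python
import os
import struct

IA32_PMC0 = 0xC1  # MSR address of the first general-purpose counter (0xC2 is the next)

def write_msr(msr_path, msr_addr, value):
    """Write one 64-bit value to an MSR through a Linux msr-style device file."""
    fd = os.open(msr_path, os.O_WRONLY)
    try:
        # The msr driver treats the file offset as the MSR address
        # and transfers 8 bytes per access.
        os.pwrite(fd, struct.pack("<Q", value), msr_addr)
    finally:
        os.close(fd)

def clear_pmc(cpu, counter, dev_template="/dev/cpu/{}/msr"):
    """Zero general-purpose counter `counter` on logical processor `cpu`."""
    write_msr(dev_template.format(cpu), IA32_PMC0 + counter, 0)
```

For example, clear_pmc(0, 1) would zero IA32_PMC1 (0xC2) on logical processor 0. Writing zero avoids the sign-extension issue that Section 18.2.5 covers for large values.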
CYCLE_ACTIVITY.* is unusual...
1. It is a little tricky to understand because it is set up to enable a logical AND of conditions, rather than the usual logical OR of conditions specified by the Umask values.
2. It helps to look at the function of each Umask bit individually, then figure out how they are used when combined.
  1. Bit 0: CYCLES_L2_PENDING increments in any cycle in which a demand load is currently pending resolution of an L2 miss.
  2. Bit 1: CYCLES_LDM_PENDING increments in any cycle in which a demand load is currently pending resolution of a memory access.
    1. This wording is strange -- note that the same Umask is described as counting L3 misses on Skylake platforms.
    2. But Skylake has another Umask that is described as incrementing for pending memory loads.
    3. Perhaps the difference relates to getting the data from memory vs getting the data from another L3? The documentation does not appear to be sufficient to disambiguate, so directed testing may be needed.
  3. Bit 2: CYCLES_NO_EXECUTE increments in "cycles of dispatch stalls".
    1. This wording is a bit ambiguous -- there are several ways to define "stalls".
    2. The use of the term "dispatch" suggests that this is related to the transfer of uops from the Reservation Station to the execution ports -- i.e., as measured by the UOPS_DISPATCHED_PORT.* (Event 0xA1) sub-events.
    3. "Dispatch stall" would therefore mean "cycles in which no uops are issued to any of the execution ports".
    4. This is probably the best place to define "stalls", but I think it is possible for uops to be issued to an execution port and then rejected (for retry later), and these "non-productive" dispatches might prevent this counter event from incrementing. Since I would consider a "non-productive" dispatch to be equivalent to "no dispatch", this scenario would result in undercounting of "stalls". It is not clear whether there is enough public information to develop directed tests related to this hypothesis.
  4. Bit 3: CYCLES_L1D_PENDING increments in any cycle in which a demand load is currently pending resolution of an L1 Data Cache miss.
3. CYCLE_STALLS_L2_PENDING is Umask 0x05, which combines CYCLES_L2_PENDING and CYCLES_NO_EXECUTE.
  1. For most events, this combination of two Umask bits would represent a logical OR -- either L2 pending or no execute.
  2. Table 19-11 is missing an important piece of the event description that is included in Table 19-3 and which can be found in the database files used for these events by VTune (Intel Amplifier XE). That extra piece of information is the "CMASK" (Counter Mask -- see Section 18.2).
  3. Reviewing the CMASK values in Table 19-3 provides the information needed to infer how these counters achieve the logical AND function.
    1. Bit 0: CYCLES_L2_PENDING increments by one in any cycle when the condition is true.
    2. Bit 1: CYCLES_LDM_PENDING increments by two in any cycle when the condition is true.
    3. Bit 2: CYCLES_NO_EXECUTE increments by four in any cycle when the condition is true.
    4. Bit 3: CYCLES_L1D_PENDING increments by eight in any cycle when the condition is true.
  4. CYCLE_STALLS_L2_PENDING sets bits 0 and 2, giving four possibilities:
    1. Both conditions are false: no increment
    2. CYCLES_L2_PENDING is true, but CYCLES_NO_EXECUTE is false: increment by one
    3. CYCLES_L2_PENDING is false, but CYCLES_NO_EXECUTE is true: increment by four
    4. Both conditions are true: increment by five
  5. The last possibility is the one we want (the logical AND), so setting the CMASK to five will cause the performance counter to increment (by one) only when the underlying event increment is five or more.
4. CYCLE_STALLS_LDM_PENDING is analogous, using a CMASK of 6 (decimal), and CYCLE_STALLS_L1D_PENDING is also analogous, using a CMASK of 12 (decimal).
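The Umask/CMASK reasoning above can be sketched as a small Python model. The per-bit weights are the ones inferred in this post, not a documented hardware specification, and the function names are hypothetical. The perfevtsel helper packs the architectural IA32_PERFEVTSELx fields (event select in bits 7:0, Umask in 15:8, USR bit 16, OS bit 17, EN bit 22, CMASK in 31:24).

```python
# Per-cycle increment contributed by each CYCLE_ACTIVITY Umask bit,
# as inferred in the post (an inference, not a documented guarantee).
WEIGHTS = {"L2_PENDING": 1, "LDM_PENDING": 2, "NO_EXECUTE": 4, "L1D_PENDING": 8}

def raw_increment(umask, active):
    """Sum the weights of the selected conditions that are true this cycle."""
    return sum(w for name, w in WEIGHTS.items() if (umask & w) and name in active)

def counter_increment(umask, cmask, active):
    """With a nonzero CMASK the counter adds 1 only when raw increment >= CMASK."""
    raw = raw_increment(umask, active)
    if cmask == 0:
        return raw
    return 1 if raw >= cmask else 0

def perfevtsel(event, umask, cmask, usr=True, os_=True):
    """Pack an IA32_PERFEVTSELx value with the enable bit (22) set."""
    return (event | (umask << 8) | (int(usr) << 16) | (int(os_) << 17)
            | (1 << 22) | (cmask << 24))

# STALLS_L2_PENDING: Umask 0x05 with CMASK 5 counts only cycles where both hold.
both = {"L2_PENDING", "NO_EXECUTE"}
print(counter_increment(0x05, 5, both))            # 1
print(counter_increment(0x05, 5, {"NO_EXECUTE"}))  # 0
print(hex(perfevtsel(0xA3, 0x05, 5)))              # 0x54305a3
```

With CMASK left at 0, the same Umask would add 1, 4, or 5 per cycle depending on which conditions hold, which is why the raw count is hard to interpret without the threshold.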
I don't have any specific information on the inconsistencies that you have observed, but they are not surprising. There are many bugs in the performance counters and not all of them are documented....
1. You will need to re-check to see if the anomalies are still there when the CMASK values are programmed correctly.
2. In "Desktop 3rd Generation Intel Core Processor Family: Specification Update" (document 326766, revision 022, April 2016), erratum BV98 states that several performance monitoring events (including MEM_LOAD_UOPS_RETIRED.*) can either fail to increment or can increment spuriously when operating with HyperThreading enabled.

Min_X_ · ‎12-02-2016

Hi Dr. Bandwidth,

Your clarification is extremely helpful. I never expected any individual to provide so much insightful information in one response ;-)

As for the question on clearing the PMC, to clarify: once the PMC registers are cleared, the counting will start from zero when I bind the PMC to some event later on, right?

One potential mistake in your explanation of CYCLE_STALLS_LDM_PENDING: since its Umask is 06H (CYCLES_LDM_PENDING AND CYCLES_NO_EXECUTE), we should set its Cmask to 06H to count it. It seems that the Cmask of each of these CYCLE_STALLS_XXX_PENDING events should be set to the same value as its Umask.

Thanks.

Min

McCalpinJohn · ‎12-05-2016

Thanks for the catch -- I have updated my answer to show the correct CMASK of 6 (decimal) for CYCLE_STALLS_LDM_PENDING and added CYCLE_STALLS_L1D_PENDING (which uses a CMASK value of 12).

Once the PMC registers are cleared, they won't change unless the counter is enabled or another process writes a new value. Unfortunately there is not really any way to prevent other processes from using the counters, so I very seldom clear them -- I just leave them in "free-running" mode and take differences.
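The "free-running" approach amounts to reading the counter before and after the region of interest and subtracting. A minimal Python sketch follows; it assumes a 48-bit counter width (the actual width is reported by CPUID leaf 0x0A and should be checked on the target system), and counter_delta is a hypothetical helper name.

```python
COUNTER_WIDTH_BITS = 48  # assumption; CPUID leaf 0x0A reports the real width

def counter_delta(start, end, width=COUNTER_WIDTH_BITS):
    """Events counted between two reads of a free-running counter.

    Masking to the counter width keeps the result correct if the
    counter wrapped once between the two reads.
    """
    return (end - start) & ((1 << width) - 1)
```

The two reads would come from RDPMC or RDMSR on the same logical processor. The subtraction is only meaningful if nothing else reprogrammed or rewrote the counter in between.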

Questions on 3rd Gen i7 Performance Counters