DEMAND_CODE_RD PMC

Min_X_ · ‎03-01-2017

Hi,

I am confused by the following PMCs on Ivy Bridge:

0260: OFFCORE_REQUESTS_OUTSTANDING.DEMAND_CODE_RD

02b0: OFFCORE_REQUESTS.DEMAND_CODE_RD

2024: L2_RQSTS.CODE_RD_MISS

Are there any connections among these three PMCs?

Thanks

Min

McCalpinJohn · ‎03-01-2017

The Event 0x60, Umask 0x02 OFFCORE_REQUESTS_OUTSTANDING.DEMAND_CODE_RD increments by the number of outstanding DEMAND_CODE_RD transactions in each cycle. Events like this are called "occupancy" events, because they increment by the number of transactions "occupying" a queue or buffer every cycle. For example, if there are 4 outstanding DEMAND_CODE_RD transactions in the current cycle, the counter will increment by 4. If none of the transactions is completed in the next cycle, it will again increment by 4. This continues until one of the DEMAND_CODE_RD requests is returned to the core, leaving 3 outstanding, and the counter will increment by 3. This continues until another DEMAND_CODE_RD request is returned to the core, leaving 2 outstanding, and the counter will increment by 2. Etc, etc, etc.

Setting the CMASK field (bits 31:24) in the PERFEVTSEL register will change the event to incrementing by one if the unmodified number of increments is greater than or equal to the value in the CMASK field. So setting CMASK to 1 will cause the counter to increment by one in any cycle in which there is at least one outstanding DEMAND_CODE_RD transaction.
Taking the counts with and without the CMASK field set to 1 allows you to compute the average number of outstanding offcore DEMAND_CODE_RD transactions (during periods in which at least one DEMAND_CODE_RD is outstanding).
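As a minimal sketch of the two counter configurations described above, the following Python builds the PERFEVTSEL values from the architectural bit layout (event in bits 7:0, Umask in bits 15:8, USR bit 16, OS bit 17, enable bit 22, CMASK in bits 31:24) and then computes the average occupancy. The function names and the counter readings are hypothetical and only for illustration; how the values get written to the MSRs (or passed to a tool) is left out.

```python
def perfevtsel(event, umask, cmask=0, usr=True, os=True, enable=True):
    """Build an IA32_PERFEVTSELx value from its architectural bit fields."""
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF and 0 <= cmask <= 0xFF):
        raise ValueError("event, umask and cmask are 8-bit fields")
    value = event | (umask << 8) | (cmask << 24)
    value |= int(usr) << 16      # count in ring 3
    value |= int(os) << 17       # count in ring 0
    value |= int(enable) << 22   # enable the counter
    return value


def average_outstanding(occupancy_count, busy_cycles):
    """Average outstanding requests over cycles with at least one outstanding."""
    if busy_cycles == 0:
        return 0.0
    return occupancy_count / busy_cycles


# OFFCORE_REQUESTS_OUTSTANDING.DEMAND_CODE_RD: raw occupancy, then CMASK=1.
occupancy_sel = perfevtsel(0x60, 0x02)
busy_cycles_sel = perfevtsel(0x60, 0x02, cmask=1)
print(hex(occupancy_sel))    # 0x430260
print(hex(busy_cycles_sel))  # 0x1430260

# Hypothetical readings taken over the same interval.
print(average_outstanding(occupancy_count=1200, busy_cycles=400))  # 3.0
```

Both counters have to cover the same interval for the ratio to be meaningful.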

The Event 0xB0, Umask 0x02 OFFCORE_REQUESTS.DEMAND_CODE_RD measures the total number of DEMAND_CODE_RD transactions sent offcore (i.e., beyond the L2).

Dividing the OFFCORE_REQUESTS_OUTSTANDING.DEMAND_CODE_RD (occupancy) value by the number of transactions from OFFCORE_REQUESTS.DEMAND_CODE_RD will give the average number of cycles that each transaction was outstanding.
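The division above can be sketched in a few lines of Python. The counter readings are made-up numbers, and the function name is hypothetical; both counts are assumed to come from the same measurement interval.

```python
def average_latency_cycles(occupancy_count, request_count):
    """Average cycles each DEMAND_CODE_RD was outstanding (occupancy / requests)."""
    if request_count == 0:
        return 0.0
    return occupancy_count / request_count


# Hypothetical counter readings over the same interval.
occupancy = 90_000  # OFFCORE_REQUESTS_OUTSTANDING.DEMAND_CODE_RD (no CMASK)
requests = 500      # OFFCORE_REQUESTS.DEMAND_CODE_RD
print(average_latency_cycles(occupancy, requests))  # 180.0
```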

The Event 0x24, Umask 0x20 L2_RQSTS.CODE_RD_MISS should be similar to Event 0xB0, Umask 0x02 OFFCORE_REQUESTS.DEMAND_CODE_RD, but they may not be identical.

- The DEMAND_CODE_RD events include "demand" in the name, which suggests that they won't count prefetches from the L1 instruction cache that miss in the L2. The L2_RQSTS.CODE_RD_MISS event might include hardware prefetches of instructions. I am not aware of any reliable, detailed descriptions of how Intel processors handle prefetching of code into the L1 Instruction Cache, so it may be difficult to come to a definite conclusion.
- The 0x24 L2_RQSTS.* family of events might include requests that are rejected and retried. On some processors, some of the sub-events of event 0x24 say that transactions that are rejected are not counted, but since the documentation is not uniform across Umasks or across processors, it is reasonable to be suspicious.
- The 0xF0 L2_TRANS.* family of events appears to differ from the 0x24 family in that the 0xF0 events appear to be intended to include rejects, but again the documentation is inconsistent across products and Umasks, so it is hard to be sure.

Min_X_ · ‎03-01-2017

Hi John,

As for OFFCORE_REQUESTS_OUTSTANDING.DEMAND_CODE_RD, I am curious whether it's possible to inject some instructions, e.g. nop, so that the number of outstanding requests drops to zero.

Thanks.

Min

McCalpinJohn · ‎03-02-2017

I don't understand the question...

I have a lot of experience in creating code that produces desired behavior for data caches, but almost no experience in creating code intended to produce specific behaviors with regard to instruction caches.

One approach that is commonly used to generate cache misses is to set up strided accesses that overflow the cache associativity. Most Intel processors have a 32KiB, 8-way associative L1 Data Cache, and the L1 Instruction Cache on Ivy Bridge has the same size and associativity, so every 4KiB region maps to each cache congruence class exactly once. As an example, setting up a code that jumps between instructions mapped to the first cache line of 9 different 4KiB pages should overflow that congruence class and miss the L1 Instruction Cache on every fetch. There are lots of things that can complicate this in practice. The interaction between the L1 Instruction Cache, the decoded Icache, and the micro-op queue is not at all clear, and I suspect it would take a lot of work to understand these interactions.
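The congruence-class arithmetic behind this can be checked with a short Python sketch. It assumes 64-byte cache lines and a simple modulo index on the address; the base address is a hypothetical page-aligned value chosen only for illustration.

```python
CACHE_BYTES = 32 * 1024
WAYS = 8
LINE_BYTES = 64                                  # assumed cache line size
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)    # 64 congruence classes
WAY_BYTES = CACHE_BYTES // WAYS                  # 4096 bytes per way


def set_index(address):
    """Congruence class selected by an address (line number modulo set count)."""
    return (address // LINE_BYTES) % NUM_SETS


# First cache line of 9 different 4KiB pages (hypothetical base address).
base = 0x400000
targets = [base + i * WAY_BYTES for i in range(WAYS + 1)]
sets_touched = {set_index(a) for a in targets}

print(NUM_SETS, WAY_BYTES)   # 64 4096
print(sets_touched)          # {0}
print(len(targets) > WAYS)   # True: 9 lines compete for 8 ways
```

Since all 9 targets land in one congruence class that holds only 8 lines, cycling through them keeps evicting a line that will be needed again, at least under an LRU-like replacement policy.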

Another approach is to simply generate a huge "text" (code) region (much bigger than the L2 cache), and run through it repeatedly. If the code accesses are contiguous, the instructions are likely to be prefetched, but I don't know much about the infrastructure for prefetching instructions. The L2 hardware prefetchers can be disabled (as described at https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors), which should limit instruction prefetching to L1-based mechanisms. One could try running through the code in non-consecutive order to confuse prefetchers. If you do this using unconditional branches, then you can avoid the further complications of needing to understand the branch prediction mechanisms.
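One way to pick a non-consecutive order is to generate a jump table in which every code block branches to exactly one other block and the whole chain forms a single cycle. The Python below is only a sketch of that ordering step, using Sattolo's algorithm; the function names are hypothetical, and generating the actual code blocks and unconditional branches is not shown.

```python
import random


def single_cycle_order(num_blocks, seed=0):
    """Return next_block, where block k jumps to next_block[k].

    Sattolo's algorithm yields a permutation consisting of one cycle,
    so following the jumps visits every block before repeating.
    """
    rng = random.Random(seed)
    next_block = list(range(num_blocks))
    for i in range(num_blocks - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        next_block[i], next_block[j] = next_block[j], next_block[i]
    return next_block


def walk(next_block, start=0):
    """Follow the jump chain from start until it returns to start."""
    visited = [start]
    current = next_block[start]
    while current != start:
        visited.append(current)
        current = next_block[current]
    return visited


chain = single_cycle_order(16)
print(len(walk(chain)))  # 16
```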

None of these topics are easy. Fortunately, the existing instruction caching, prefetching, and branch prediction mechanisms in recent Intel processors are extremely effective for most codes. I have only run across a few applications that have significant performance limitations due to instruction cache behavior, and none recently. The standard examples from 15 years ago when I worked at IBM were huge database infrastructures and full-scale RTL-level processor simulators.

Min_X_ · ‎03-03-2017

Basically, I intended to create a barrier on instruction execution so that no earlier instructions would affect the performance measurement of my target code execution.

Does CPUID provide this kind of guarantee: all instructions issued before it would be finished by the end of CPUID?

Thanks.

Min

McCalpinJohn · ‎03-03-2017

Yes, CPUID does guarantee that all prior instructions in program order finish before the CPUID instruction, and that no instructions after the CPUID instruction (in program order) start until the CPUID instruction is complete.

The penalty for this guarantee is that the CPUID instruction takes 100-200 (or more?) cycles to execute.

Min_X_ · ‎03-04-2017

Thanks, John.

Another question is whether the HW_INTERRUPT event is supported on Ivy Bridge. If so, is it correct to use event 0xCB with Umask 0x01 to count the number of hardware interrupts on a specific core during execution?

Min

McCalpinJohn · ‎03-06-2017

The HW_INTERRUPT events are only documented in Volume 3 of the SW Developer's Manual for the Xeon Phi (first generation == KNC) and Skylake processors. Sometimes additional events show up in the VTune database files, but in this case there is no difference. A third place to look is in the files at https://download.01.org/perfmon/, but again I see no evidence of the 0xCB HW_INTERRUPT event in any other processor generations.

In Linux, the operating system tracks interrupts by logical processor. I don't know exactly how this is done, but the counts are exposed in the /proc/interrupts interface. This is useful for coarse-grain monitoring.
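For coarse-grain monitoring along these lines, a rough Python sketch can sum the per-CPU columns of /proc/interrupts before and after the code of interest. The parsing assumes the usual layout (a header row of CPU names followed by rows of per-CPU counts); the function names and the sample text are hypothetical.

```python
def interrupts_per_cpu(text):
    """Sum the per-CPU columns of /proc/interrupts-style text."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()
    totals = dict.fromkeys(cpus, 0)
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if label.strip() in ("ERR", "MIS"):
            continue  # these rows carry a single global count
        counts = rest.split()[:len(cpus)]
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue  # skip rows that do not have one count per CPU
        for cpu, count in zip(cpus, counts):
            totals[cpu] += int(count)
    return totals


def interrupt_delta(before, after):
    """Per-CPU difference between two interrupts_per_cpu() snapshots."""
    return {cpu: after[cpu] - before.get(cpu, 0) for cpu in after}


SAMPLE = """
           CPU0       CPU1
  0:         22          0   IO-APIC   2-edge      timer
LOC:       1000       2000   Local timer interrupts
ERR:          0
"""
print(interrupts_per_cpu(SAMPLE))  # {'CPU0': 1022, 'CPU1': 2000}
```

On a real system the text would come from reading /proc/interrupts once before and once after the measured region, then passing both snapshots to interrupt_delta.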

An alternate approach is to count transitions between user and kernel space. On most recent Intel processors (including Ivy Bridge), the event CPL_CYCLES.RING0 (Event 0x5C, Umask 0x01) counts unhalted core cycles when the logical processor is executing in ring 0 (kernel) mode. If you set the EdgeDetect bit (bit 18) of the corresponding PERFEVTSEL register, the event will instead increment once each time the logical processor transitions from user mode (ring 3) into kernel mode (ring 0), since edge detection counts deasserted-to-asserted transitions of the underlying condition. I don't understand enough of the subtleties of interrupts and exceptions on the Intel architecture to know what differences there might be between these transitions and the number of interrupts counted by the 0xCB event on Skylake, but I suspect that the values are quite similar.
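A small Python sketch of the two PERFEVTSEL values involved, built from the architectural bit positions (USR bit 16, OS bit 17, EdgeDetect bit 18, enable bit 22). The constant names are hypothetical. Whether a nonzero CMASK is also needed alongside EdgeDetect on a given processor is not settled here and should be checked against Intel's event lists.

```python
EVENT_CPL_CYCLES = 0x5C
UMASK_RING0 = 0x01

USR = 1 << 16     # count while in ring 3
OS = 1 << 17      # count while in ring 0
EDGE = 1 << 18    # count deasserted-to-asserted transitions
ENABLE = 1 << 22  # enable the counter

# Cycles spent in ring 0.
ring0_cycles_sel = EVENT_CPL_CYCLES | (UMASK_RING0 << 8) | USR | OS | ENABLE
# Same event with EdgeDetect set: counts entries into ring 0.
ring0_entries_sel = ring0_cycles_sel | EDGE

print(hex(ring0_cycles_sel))   # 0x43015c
print(hex(ring0_entries_sel))  # 0x47015c
```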