I am confused by the following PMCs on Ivy Bridge: OFFCORE_REQUESTS_OUTSTANDING.DEMAND_CODE_RD (Event 0x60, Umask 0x02), OFFCORE_REQUESTS.DEMAND_CODE_RD (Event 0xB0, Umask 0x02), and L2_RQSTS.CODE_READ_MISS (Event 0x24, Umask 0x20).
Are there any connections among these three PMCs?
The Event 0x60, Umask 0x02 OFFCORE_REQUESTS_OUTSTANDING.DEMAND_CODE_RD increments by the number of outstanding DEMAND_CODE_RD transactions in each cycle. Events like this are called "occupancy" events, because they increment by the number of transactions "occupying" a queue or buffer every cycle. For example, if there are 4 outstanding DEMAND_CODE_RD transactions in the current cycle, the counter will increment by 4. If none of the transactions completes in the next cycle, it will again increment by 4. This continues until one of the DEMAND_CODE_RD requests is returned to the core, leaving 3 outstanding, after which the counter increments by 3 per cycle. When another DEMAND_CODE_RD request is returned to the core, leaving 2 outstanding, the counter increments by 2 per cycle, and so on.
The Event 0xB0, Umask 0x02 OFFCORE_REQUESTS.DEMAND_CODE_RD measures the total number of DEMAND_CODE_RD transactions sent offcore (i.e., beyond the L2).
The Event 0x24, Umask 0x20 L2_RQSTS.CODE_READ_MISS should be similar to Event 0xB0, Umask 0x02 OFFCORE_REQUESTS.DEMAND_CODE_RD, but they may not be identical.
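One standard connection among the three (this is the usual occupancy-event arithmetic, not something the SDM spells out for these particular events): dividing the occupancy count by elapsed cycles gives the average number of demand code reads in flight, and dividing it by the request count estimates the average offcore latency of those reads. A minimal sketch in C, with placeholder values standing in for counts read over the same interval through whatever PMC interface you use:

    #include <stdio.h>

    int main(void)
    {
        /* Placeholder raw counts for one measurement interval --
           substitute values read from the actual counters. */
        unsigned long long outstanding = 123456789ULL; /* OFFCORE_REQUESTS_OUTSTANDING.DEMAND_CODE_RD */
        unsigned long long requests    = 1234567ULL;   /* OFFCORE_REQUESTS.DEMAND_CODE_RD */
        unsigned long long cycles      = 987654321ULL; /* unhalted core cycles */

        /* Occupancy-event arithmetic (a form of Little's Law):
           occupancy/cycles   = average requests in flight per cycle,
           occupancy/requests = average cycles each request stays in flight. */
        printf("avg outstanding code reads per cycle: %.3f\n",
               (double)outstanding / (double)cycles);
        printf("approx avg offcore code read latency: %.1f cycles\n",
               (double)outstanding / (double)requests);
        return 0;
    }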
As for OFFCORE_REQUESTS_OUTSTANDING.DEMAND_CODE_RD, I am curious whether it is possible to inject some instructions, e.g. NOPs, so that the number of outstanding requests drops to zero.
I don't understand the question...
I have a lot of experience in creating code that produces desired behavior for data caches, but almost no experience in creating code intended to produce specific behaviors with regard to instruction caches.
One approach that is commonly used to generate cache misses is to set up strided accesses that overflow the cache associativity. Most Intel processors have 32KiB, 8-way associative L1 caches (the instruction and data caches have the same geometry), so every 4KiB region maps onto the full set of cache congruence classes exactly once. As an example, setting up code that jumps between instructions mapped to the first cache line of 9 different 4KiB pages should result in flushing the cache and missing the L1 Instruction Cache on every fetch. There are lots of things that can complicate this in practice. The interaction between the L1 Instruction Cache, the decoded ICache, and the micro-op queue is not at all clear, and I suspect it would take a lot of work to understand these interactions.
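To make the 9-page experiment concrete, here is a minimal sketch, assuming x86-64 Linux and a kernel that permits an anonymous writable+executable mapping (hardened configurations may refuse this). The opcodes, page count, and iteration count are only meant to illustrate the idea:

    /* Each of 9 consecutive 4KiB pages holds a single 5-byte "jmp rel32"
       to the start of the next page; the last page holds a "ret".  All 9
       targets map to the same congruence class of an 8-way 32KiB L1I, so
       calling the chain repeatedly should miss on every fetch. */
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    #define NPAGES 9
    #define PAGE   4096

    int main(void)
    {
        uint8_t *buf = mmap(NULL, NPAGES * PAGE,
                            PROT_READ | PROT_WRITE | PROT_EXEC,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        for (int i = 0; i < NPAGES - 1; i++) {
            uint8_t *p = buf + i * PAGE;
            int32_t rel = PAGE - 5;      /* next page, relative to the end of the jmp */
            p[0] = 0xE9;                 /* jmp rel32 */
            memcpy(p + 1, &rel, sizeof(rel));
        }
        buf[(NPAGES - 1) * PAGE] = 0xC3; /* ret */

        void (*chain)(void) = (void (*)(void))buf;
        for (long iter = 0; iter < 100000000L; iter++)
            chain();                     /* count L1I misses around this loop */

        return 0;
    }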
Another approach is to simply generate a huge "text" (code) region (much bigger than the L2 cache), and run through it repeatedly. If the code accesses are contiguous, the instructions are likely to be prefetched, but I don't know much about the infrastructure for prefetching instructions. The L2 hardware prefetchers can be disabled (as described at https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processo...), which should limit instruction prefetching to L1-based mechanisms. One could try running through the code in non-consecutive order to confuse the prefetchers. If you do this using unconditional branches, then you can avoid the further complications of needing to understand the branch prediction mechanisms.
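For reference, the article linked above documents MSR 0x1A4, where setting bit 0 disables the L2 streamer and setting bit 1 disables the L2 adjacent-line prefetcher. The quick route is "wrmsr -p 0 0x1a4 0x3" from msr-tools; the C sketch below does the same through the Linux msr driver while preserving the other bits (this assumes root privileges and a loaded msr module, touches only cpu0, and would need to be repeated for each logical processor you run on):

    /* Minimal sketch: disable the two L2 prefetchers on cpu0 by setting
       bits 0 and 1 of MSR 0x1A4, preserving the remaining bits. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        uint64_t val;
        int fd = open("/dev/cpu/0/msr", O_RDWR);
        if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

        if (pread(fd, &val, sizeof(val), 0x1A4) != sizeof(val)) {
            perror("read MSR 0x1A4"); return 1;
        }
        val |= 0x3;   /* bit 0: L2 streamer, bit 1: L2 adjacent-line prefetcher */
        if (pwrite(fd, &val, sizeof(val), 0x1A4) != sizeof(val)) {
            perror("write MSR 0x1A4"); return 1;
        }
        close(fd);
        printf("MSR 0x1A4 on cpu0 is now 0x%llx\n", (unsigned long long)val);
        return 0;
    }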
None of these topics are easy. Fortunately, the existing instruction caching, prefetching, and branch prediction mechanisms in recent Intel processors are extremely effective for most codes. I have only run across a few applications that have significant performance limitations due to instruction cache behavior, and none recently. The standard examples from 15 years ago when I worked at IBM were huge database infrastructures and full-scale RTL-level processor simulators.
Basically, I intended to create a barrier on instruction execution so that no previously issued instructions would affect the performance measurement of my target code.
Does CPUID provide this kind of guarantee: that all instructions issued before it will have finished by the time CPUID completes?
Yes, CPUID does guarantee that all prior instructions in program order finish before the CPUID instruction, and that no instructions after the CPUID instruction (in program order) start until the CPUID instruction is complete.
The penalty for this guarantee is that the CPUID instruction takes 100-200 (or more?) cycles to execute.
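As a usage sketch, this serializing property is what the classic timing recipe relies on: bracket the region of interest with CPUID so that stray earlier instructions cannot overlap into the measurement. A minimal version with GCC inline assembly (the helper name is made up; note that the CPUID overhead itself lands inside the measurement, so calibrate with an empty region first):

    #include <stdint.h>
    #include <stdio.h>

    /* CPUID drains all earlier instructions, then RDTSC samples the TSC.
       CPUID clobbers eax/ebx/ecx/edx; RDTSC returns the TSC in edx:eax. */
    static inline uint64_t fenced_rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("cpuid\n\t"
                             "rdtsc"
                             : "=a"(lo), "=d"(hi)
                             : "a"(0)
                             : "ebx", "ecx");
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        uint64_t start = fenced_rdtsc();
        /* ... code under test ... */
        uint64_t stop = fenced_rdtsc();
        printf("elapsed TSC ticks (fence overhead included): %llu\n",
               (unsigned long long)(stop - start));
        return 0;
    }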
Another question is whether the HW_INTERRUPT event is supported on Ivy Bridge. If so, is it correct to use Event 0xCB with Umask 0x01 to count the number of hardware interrupts on a specific core during the execution?
The HW_INTERRUPT events are only documented in Volume 3 of the SW Developer's Manual for the Xeon Phi (first generation == KNC) and Skylake processors. Sometimes additional events show up in the VTune database files, but in this case there is no difference. A third place to look is in the files at https://download.01.org/perfmon/, but again I see no evidence of the 0xCB HW_INTERRUPT event in any other processor generations.
In Linux, the operating system tracks interrupts by logical processor. I don't know exactly how this is done, but the counts are exposed in the /proc/interrupts interface. This is useful for coarse-grain monitoring.
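A minimal sketch of that coarse-grain approach: snapshot /proc/interrupts before and after the region of interest and diff the two dumps offline (the helper name is made up; each numeric column in the file is one logical processor):

    #include <stdio.h>
    #include <stdlib.h>

    /* Dump the current contents of /proc/interrupts under a tag, so two
       dumps taken around a region of interest can be compared. */
    static void snapshot_interrupts(const char *tag)
    {
        char line[4096];
        FILE *f = fopen("/proc/interrupts", "r");
        if (!f) { perror("fopen /proc/interrupts"); exit(1); }
        printf("=== %s ===\n", tag);
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);
        fclose(f);
    }

    int main(void)
    {
        snapshot_interrupts("before");
        /* ... region of interest ... */
        snapshot_interrupts("after");
        return 0;
    }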
An alternate approach is to count transitions between user and kernel space. On most recent Intel processors (including Ivy Bridge), the event CPL_CYCLES.RING0 (Event 0x5C, Umask 0x01) counts unhalted core cycles when the logical processor is executing in ring 0 (kernel) mode. If you set the EdgeDetect bit (bit 18) of the corresponding PERFEVTSEL register, the event will increment whenever the logical processor transitions between user mode (ring 3) and kernel mode (ring 0). I am sure that I don't understand enough of the subtleties of interrupts and exceptions on the Intel architecture to know what differences there might be between these transitions and the number of interrupts counted by the 0xCB event on Skylake, but I suspect that the values are quite similar.
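As a usage sketch, Linux perf exposes this configuration through the raw-event interface, so no direct MSR programming is needed. The encoding below follows the standard PERFEVTSEL layout (event in bits 7:0, umask in bits 15:8, EdgeDetect at bit 18); with edge detect the counter increments once per transition into ring 0, so the total approximates the number of kernel entries (interrupts, exceptions, and system calls combined):

    #include <linux/perf_event.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_RAW;
        attr.size = sizeof(attr);
        /* CPL_CYCLES.RING0 with EdgeDetect:
           event 0x5C, umask 0x01 in bits 15:8, edge at bit 18 */
        attr.config   = 0x5C | (0x01 << 8) | (1ULL << 18);
        attr.disabled = 1;

        int fd = syscall(SYS_perf_event_open, &attr,
                         0 /* this thread */, -1 /* any cpu */, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... region of interest ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count;
        if (read(fd, &count, sizeof(count)) == sizeof(count))
            printf("transitions into ring 0: %llu\n",
                   (unsigned long long)count);
        close(fd);
        return 0;
    }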