Refer to Intel 64 and IA-32 Architectures Optimization Reference Manual, http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimizati...
IQ = instruction queue, see Figure 2-1, page 40 (labeled 'Instr Queue' in the figure)
PMH = Page Miss Handler. See page 73.
MITE = Legacy decode pipeline. This is the top of figure 2-1 from '32K L1 Instruction Cache' to 'Decoders'. See also the figure in section B.3.7.1 (page 683), where it is called the Legacy decode pipeline.
DSB is Decode Stream Buffer. This is also the 'Decoded Icache' in the section B.3.7.1 figure and '1.5K uOP Cache' in figure 2-1.
PBS... I assume you mean PEBS. PEBS is Precise Event Based Sampling. See section B.2.3.
LBR = Last Branch Record (or Register). See section B.2.3.4
For L2_TRANS.L2_FILL, this will take more digging, but I expect the fills can be due to either code (I$) misses or data (D$) misses.
For L2_TRANS.L2_WB, this should be entirely D$ (dcache). WB (writebacks) only occur when a cacheline in L2 is modified and needs to be evicted to memory. Code is usually read-only (unless you have self-modifying code), so code should not be getting 'written back' to memory.
Hope this helps,
MEM_UOPS_RETIRED.ALL_LOADS (PMC 0xD0 umask=0x81)
MEM_UOPS_RETIRED.ALL_STORES (PMC 0xD0 umask=0x82)
MEM_LOAD_UOPS_RETIRED.L1_HIT (PMC 0xD1 umask=0x1)
L1D.REPLACEMENT (PMC 0x51 umask=0x1)
%loads = 100.0*MEM_UOPS_RETIRED.ALL_LOADS/(MEM_UOPS_RETIRED.ALL_LOADS + MEM_UOPS_RETIRED.ALL_STORES)
%stores = 100.0*MEM_UOPS_RETIRED.ALL_STORES/(MEM_UOPS_RETIRED.ALL_LOADS + MEM_UOPS_RETIRED.ALL_STORES)
%L1D_load_hit = 100.0*MEM_LOAD_UOPS_RETIRED.L1_HIT/MEM_UOPS_RETIRED.ALL_LOADS
%L1D_load_miss = 100.0*(MEM_UOPS_RETIRED.ALL_LOADS - MEM_LOAD_UOPS_RETIRED.L1_HIT)/MEM_UOPS_RETIRED.ALL_LOADS
%L1D_store_hit = min(100.0, 100.0*(MEM_UOPS_RETIRED.ALL_LOADS + MEM_UOPS_RETIRED.ALL_STORES - MEM_LOAD_UOPS_RETIRED.L1_HIT - L1D.REPLACEMENT)/MEM_UOPS_RETIRED.ALL_STORES)
%L1D_store_miss = 100.0*zeroifneg(-MEM_UOPS_RETIRED.ALL_LOADS + MEM_LOAD_UOPS_RETIRED.L1_HIT + L1D.REPLACEMENT)/MEM_UOPS_RETIRED.ALL_STORES
The last two equations are capped to keep them between 0 and 100; I've seen them go 'out of bounds' by up to 5%.
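To make the arithmetic concrete, here is a small Python sketch of the equations above. The raw counter values are made up purely for illustration; only the formulas (including the cap and the zeroifneg clamp) follow the definitions above.

```python
def zeroifneg(x):
    # Clamp negative values to zero, as used in the store-miss equation.
    return max(0.0, x)

# Hypothetical raw counter readings (illustrative values only).
all_loads  = 1_000_000  # MEM_UOPS_RETIRED.ALL_LOADS
all_stores =   400_000  # MEM_UOPS_RETIRED.ALL_STORES
l1_hit     =   950_000  # MEM_LOAD_UOPS_RETIRED.L1_HIT
l1d_repl   =    80_000  # L1D.REPLACEMENT

pct_loads  = 100.0 * all_loads  / (all_loads + all_stores)
pct_stores = 100.0 * all_stores / (all_loads + all_stores)
pct_l1d_load_hit  = 100.0 * l1_hit / all_loads
pct_l1d_load_miss = 100.0 * (all_loads - l1_hit) / all_loads
# Store hit/miss are inferred indirectly, hence the cap and the clamp.
pct_l1d_store_hit  = min(100.0, 100.0 * (all_loads + all_stores
                                         - l1_hit - l1d_repl) / all_stores)
pct_l1d_store_miss = 100.0 * zeroifneg(-all_loads + l1_hit
                                       + l1d_repl) / all_stores
```

With these sample numbers the store hit/miss percentages sum to 100, but as noted above, with real counter data the indirect store equations can drift out of bounds by a few percent.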
Hope this helps,
This lets us break down the stalls in the pipeline.
For instance, for a memory latency test (array size 40 MB), the pipeline should be stalled on the backend waiting for memory. The breakdown shows:
For a memory latency test (array size 4 KB), the pipeline is still mostly stalled waiting on loads. The latency program is just a big unrolled loop of nothing but dependent linked-list loads.
If I do a memory read bandwidth test and shorten the array size to fit in L1D (down to 4 KB), then I get the result below. For the read bandwidth test, I just touch each 64-byte cache line. The out-of-order pipeline is able to figure out the next load, so lots of loads are underway at the same time.
If I do a memory read bandwidth test with an array size of 40 MB, I get the results below. Now the prefetchers can work effectively and bring the data into L1D quickly enough that we still retire (relatively) a lot of uops (compared to the memory latency test, where we were 98% BE_bound).
cycles_DELIVER.1UOPS = IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE - IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE
cycles_DELIVER.2UOPS = IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE - IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE
cycles_DELIVER.3UOPS = IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE - IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE
cycles_DELIVER.4UOPS = CPU_CLK_UNHALTED.THREAD - IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE
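The differencing above can be sketched in a few lines of Python. The counter values are hypothetical; the point is that each CYCLES_LE_n counter is cumulative (cycles delivering at most n uops), so subtracting adjacent counters yields cycles delivering exactly n uops, and the buckets plus the zero-delivery cycles cover every cycle.

```python
# Hypothetical raw counter readings (illustrative values only).
clk      = 1_000_000  # CPU_CLK_UNHALTED.THREAD
cyc_0    =   200_000  # IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE
cyc_le_1 =   350_000  # IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE
cyc_le_2 =   500_000  # IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE
cyc_le_3 =   700_000  # IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE

# Difference adjacent cumulative counters to get exact-delivery cycles.
cycles_deliver_1 = cyc_le_1 - cyc_0      # cycles delivering exactly 1 uop
cycles_deliver_2 = cyc_le_2 - cyc_le_1   # cycles delivering exactly 2 uops
cycles_deliver_3 = cyc_le_3 - cyc_le_2   # cycles delivering exactly 3 uops
cycles_deliver_4 = clk - cyc_le_3        # cycles with full 4-uop delivery
```

A quick sanity check on real data: the zero bucket plus the four exact buckets should add back up to CPU_CLK_UNHALTED.THREAD.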
I think that the right hand side of each eqn can be figured out from the umask/cmask info in http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/lin/ug_docs/reference/s...
I believe the issue is that all of the IDQ*CYCLE* events only count while uops are being retired.
So the AVG.uops.per.cycle equation (in SDM Optimization manual section B.3.7.1) has to be adjusted.
If you compute (as in section B.3.7.1)
%Retiring = 100 * (UOPS_RETIRED.RETIRE_SLOTS / (CPU_CLK_UNHALTED.THREAD * 4))
Adj.AVG.uops.per.cycle = %Retiring * AVG.uops.per.cycle / 100
then I think you'll find that
Adj.AVG.uops.per.cycle = UOPS_RETIRED.ANY/CPU_CLK_UNHALTED.THREAD
I'll explain this more tomorrow, but it is very late now and I have to go to bed.
Sorry to confuse you.
I probably shouldn't have mentioned the "event only counting when uops are being retired" idea.
That was just late night speculation.
Section B.3.2 talks about %retiring and says it is:
%Retiring = 100 * (UOPS_RETIRED.RETIRE_SLOTS / N); where N = 4
Here is what I see:
Currently Section B.3.7.1 defines:
AVG.uops.per.cycle = (4 * (%FE.DELIVERING) + 3 * (%FE.DELIVER.3UOPS) + 2 * (%FE.DELIVER.2UOPS) + (%FE.DELIVER.1UOPS ) ) / 100
I found that AVG.uops.per.cycle was sometimes much higher than
uops.per.cycle = UOPS_RETIRED.ALL/CPU_CLK_UNHALTED.THREAD
I also observed that, if I compute an 'adjusted AVG.uops.per.cycle' as in:
adjusted AVG.uops.per.cycle = %Retiring * AVG.uops.per.cycle / 100
then 'adjusted AVG.uops.per.cycle' equals 'uops.per.cycle'.
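The relationship I observed can be sketched numerically. The delivery-bucket percentages and counter values below are invented for illustration; the code just strings together the B.3.7.1 AVG.uops.per.cycle formula, the B.3.2 %Retiring formula, and the adjustment described above.

```python
# Hypothetical front-end delivery breakdown (percent of cycles, made up).
pct_fe_deliver_1  = 10.0  # %FE.DELIVER.1UOPS
pct_fe_deliver_2  = 15.0  # %FE.DELIVER.2UOPS
pct_fe_deliver_3  = 20.0  # %FE.DELIVER.3UOPS
pct_fe_delivering = 30.0  # %FE.DELIVERING (full 4-uop cycles)

# Hypothetical retirement counters (illustrative values only).
clk          = 1_000_000  # CPU_CLK_UNHALTED.THREAD
retire_slots = 2_000_000  # UOPS_RETIRED.RETIRE_SLOTS

# Section B.3.7.1 definition:
avg_uops_per_cycle = (4 * pct_fe_delivering + 3 * pct_fe_deliver_3
                      + 2 * pct_fe_deliver_2 + pct_fe_deliver_1) / 100

# Section B.3.2 definition (4 retire slots per cycle):
pct_retiring = 100 * retire_slots / (clk * 4)

# The extra factor discussed above:
adj_avg_uops_per_cycle = pct_retiring * avg_uops_per_cycle / 100
```

With real counter data, the claim above is that adj_avg_uops_per_cycle lands close to UOPS_RETIRED.ALL/CPU_CLK_UNHALTED.THREAD, whereas the unadjusted average can run much higher.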
At this point I'm not going to speculate why this extra factor is necessary.
If you use the factor, do you get an 'adjusted AVG.uops.per.cycle' that agrees with 'uops.per.cycle'?
If not, I might have a programming error (always a possibility).