Hello perfwise,
Refer to Intel 64 and IA-32 Architectures Optimization Reference Manual, http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimizati...
IQ = instruction queue, see Figure 2-1, page 40 (labeled 'Instr Queue' in the figure)
PMH is Page Miss Handler. See page 73.
MITE = Legacy decode pipeline. This is the top of Figure 2-1, from '32K L1 Instruction Cache' to 'Decoders'. See also the figure in section B.3.7.1 (page 683), where it is called the Legacy decode pipeline.
DSB is Decode Stream Buffer. This is also the 'Decoded Icache' in the section B.3.7.1 figure and '1.5K uOP Cache' in figure 2-1.
PBS... I assume you mean PEBS. PEBS is Precise Event Based Sampling. See section B.2.3.
LBR = Last Branch Record (or Register). See section B.2.3.4
For L2_TRANS.L2_FILL, this will take more digging, but I expect the fills can be due to either code misses (I$) or data misses (D$).
For L2_TRANS.L2_WB, this should be entirely D$ (dcache). WB (writebacks) only occur when a cache line in L2 is modified and needs to be evicted to memory. Code is usually read-only (unless you have self-modifying code), so code should not be getting 'written back' to memory.
Hope this helps,
Pat
MEM_UOPS_RETIRED.ALL_LOADS (PMC 0xD0 umask=0x81)
MEM_UOPS_RETIRED.ALL_STORES (PMC 0xD0 umask=0x82)
MEM_LOAD_UOPS_RETIRED.L1_HIT (PMC 0xD1 umask=0x1)
L1D.REPLACEMENT (PMC 0x51 umask=0x1)
%loads = 100.0 * MEM_UOPS_RETIRED.ALL_LOADS / (MEM_UOPS_RETIRED.ALL_LOADS + MEM_UOPS_RETIRED.ALL_STORES)
%stores = 100.0 * MEM_UOPS_RETIRED.ALL_STORES / (MEM_UOPS_RETIRED.ALL_LOADS + MEM_UOPS_RETIRED.ALL_STORES)
%L1D_load_hit = 100.0 * MEM_LOAD_UOPS_RETIRED.L1_HIT / MEM_UOPS_RETIRED.ALL_LOADS
%L1D_load_miss = 100.0 * (MEM_UOPS_RETIRED.ALL_LOADS - MEM_LOAD_UOPS_RETIRED.L1_HIT) / MEM_UOPS_RETIRED.ALL_LOADS
%L1D_store_hit = min(100.0, 100.0 * (MEM_UOPS_RETIRED.ALL_LOADS + MEM_UOPS_RETIRED.ALL_STORES - MEM_LOAD_UOPS_RETIRED.L1_HIT - L1D.REPLACEMENT) / MEM_UOPS_RETIRED.ALL_STORES)
%L1D_store_miss = 100.0 * zeroifneg(MEM_LOAD_UOPS_RETIRED.L1_HIT + L1D.REPLACEMENT - MEM_UOPS_RETIRED.ALL_LOADS) / MEM_UOPS_RETIRED.ALL_STORES
The last two equations are capped to keep them between 0 and 100; I've seen them go 'out of bounds' by up to 5%.
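In case it is easier to read as code, here is a minimal Python sketch of the equations above (the arguments are just the raw counts you collect for the four events; 'zeroifneg' becomes a max with 0):

def l1d_breakdown(all_loads, all_stores, l1_hit, replacement):
    # all_loads = MEM_UOPS_RETIRED.ALL_LOADS, all_stores = MEM_UOPS_RETIRED.ALL_STORES,
    # l1_hit = MEM_LOAD_UOPS_RETIRED.L1_HIT, replacement = L1D.REPLACEMENT
    total = all_loads + all_stores
    pct_loads = 100.0 * all_loads / total
    pct_stores = 100.0 * all_stores / total
    pct_load_hit = 100.0 * l1_hit / all_loads
    pct_load_miss = 100.0 * (all_loads - l1_hit) / all_loads
    # the store terms are derived from the other counters, so cap them to stay in [0, 100]
    pct_store_hit = min(100.0, 100.0 * (total - l1_hit - replacement) / all_stores)
    pct_store_miss = max(0.0, 100.0 * (l1_hit + replacement - all_loads) / all_stores)
    return (pct_loads, pct_stores, pct_load_hit, pct_load_miss, pct_store_hit, pct_store_miss)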
Hope this helps,
Pat
UOPS_ISSUED.ANY
UOPS_RETIRED.RETIRE_SLOTS
This lets us break down the stalls in the pipeline.
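If it helps, here is a minimal sketch of the Level-1 breakdown as I read section B.3.2; the other two events the method needs are IDQ_UOPS_NOT_DELIVERED.CORE and INT_MISC.RECOVERY_CYCLES (this is my paraphrase of the manual's formulas, so double-check them against your copy):

def topdown_level1(uops_issued, retire_slots, idq_not_delivered, recovery_cycles, cycles):
    # uops_issued = UOPS_ISSUED.ANY, retire_slots = UOPS_RETIRED.RETIRE_SLOTS,
    # cycles = CPU_CLK_UNHALTED.THREAD
    slots = 4.0 * cycles  # 4 issue slots per cycle
    pct_fe_bound = 100.0 * idq_not_delivered / slots
    pct_bad_spec = 100.0 * (uops_issued - retire_slots + 4.0 * recovery_cycles) / slots
    pct_retiring = 100.0 * retire_slots / slots
    pct_be_bound = 100.0 - pct_fe_bound - pct_bad_spec - pct_retiring  # remainder
    return (pct_fe_bound, pct_bad_spec, pct_retiring, pct_be_bound)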
For instance, for a memory latency test (array size 40 MB), the pipeline should be stalled on the backend waiting for memory. The breakdown shows:
%FE_Bound 0.71%
%Bad_Speculation 0.03%
%Retiring 0.98%
%BE_Bound 98.27%
For a memory latency test with a 4 KB array, the pipeline is still mostly stalled waiting on loads. The latency program is just a big unrolled loop of nothing but dependent linked-list loads.
%FE_Bound 0.091%
%Bad_Speculation 0.002%
%Retiring 6.619%
%BE_Bound 93.288%
If I do a memory read bandwidth test and shorten the array size to fit in L1D (down to 4 KB), then I get the result below. For the read bw test, I just touch each 64-byte cache line. The out-of-order pipeline is able to figure out the next load, so lots of loads are underway at the same time.
%FE_Bound 0.414%
%Bad_Speculation 0.003%
%Retiring 99.539%
%BE_Bound 0.044%
If I do a memory read bandwidth test with an array size of 40 MB, I get the results below. Now the prefetchers can work effectively and bring the data into L1D quickly enough that we still retire (relatively) a lot of uops, compared to the memory latency test where we were 98% BE_bound.
%FE_Bound 0.907%
%Bad_Speculation 0.135%
%Retiring 11.023%
%BE_Bound 87.935%
Pat
cycles_DELIVER.1UOPS = IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE - IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE
cycles_DELIVER.2UOPS = IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE - IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE
cycles_DELIVER.3UOPS = IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE - IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE
cycles_DELIVER.4UOPS = CPU_CLK_UNHALTED.THREAD - IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE
I think the right-hand side of each equation can be figured out from the umask/cmask info in http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/lin/ug_docs/reference/s...
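Putting the four equations together with the section B.3.7.1 definitions, here is a sketch of how the percentages and AVG.uops.per.cycle fall out (the arguments are the raw 0-uop and LE_N cycle counts plus CPU_CLK_UNHALTED.THREAD):

def fe_delivery(zero_uops, le1, le2, le3, cycles):
    # cycles in which exactly 0, 1, 2, 3, 4 uops were delivered
    exact = [zero_uops, le1 - zero_uops, le2 - le1, le3 - le2, cycles - le3]
    pct = [100.0 * c / cycles for c in exact]  # %FE.DELIVER.0UOPS .. %FE.DELIVER.4UOPS
    # AVG.uops.per.cycle as defined in section B.3.7.1
    avg_uops = (4 * pct[4] + 3 * pct[3] + 2 * pct[2] + pct[1]) / 100.0
    return pct, avg_uops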
I believe the issue is that all of the IDQ*CYCLE* events only count while uops are being retired.
So the AVG.uops.per.cycle equation (in SDM Optimization manual section B.3.7.1) has to be adjusted.
If you compute (as in section B.3.7.1)
%Retiring = 100 * (UOPS_RETIRED.RETIRE_SLOTS / (CPU_CLK_UNHALTED.THREAD * 4))
and compute:
Adj.AVG.uops.per.cycle = %Retiring * AVG.uops.per.cycle / 100
then I think you'll find that
Adj.AVG.uops.per.cycle = UOPS_RETIRED.ANY / CPU_CLK_UNHALTED.THREAD
I'll explain this more tomorrow but it is very late now and I have to go to bed.
Pat
Sorry to confuse you.
I probably shouldn't have mentioned the "events only counting when uops are being retired" idea.
That was just late night speculation.
Section B.3.2 talks about %retiring and says it is:
%Retiring = 100 * (UOPS_RETIRED.RETIRE_SLOTS / (N * CPU_CLK_UNHALTED.THREAD)) ; where N = 4
Here is what I see:
Currently Section B.3.7.1 defines:
AVG.uops.per.cycle = (4 * (%FE.DELIVERING) + 3 * (%FE.DELIVER.3UOPS) + 2 * (%FE.DELIVER.2UOPS) + (%FE.DELIVER.1UOPS ) ) / 100
I found that AVG.uops.per.cycle was sometimes much higher than
uops.per.cycle = UOPS_RETIRED.ALL/CPU_CLK_UNHALTED.THREAD
I also observed that, if I compute an 'adjusted AVG.uops.per.cycle' as in:
adjusted AVG.uops.per.cycle = %Retiring * AVG.uops.per.cycle / 100
then 'adjusted AVG.uops.per.cycle' is approximately equal to 'uops.per.cycle'.
At this point I'm not going to speculate why this extra factor is necessary.
If you use the factor, do you get an 'adjusted AVG.uops.per.cycle' that agrees with 'uops.per.cycle'?
If not, I might have a programming error (always a possibility).
Pat
%FE_Bound 0.210371
%Bad_Speculation 0.267872
%Retiring 70.738993
%BE_Bound 28.782764
%FE.DELIVER.0UOPS 0.110400
%FE.DELIVER.1UOPS 0.061143
%FE.DELIVER.2UOPS 0.010504
%FE.DELIVER.3UOPS 0.061832
%FE.DELIVER.4UOPS 99.756121
AVG.uops.per.cycle 3.992921
adj.AVG.uops.per.cycle 2.824552
uops.per.cycle 2.830101
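As a quick check of the adjustment with the numbers above: 70.738993 * 3.992921 / 100 = 2.824552, which is exactly adj.AVG.uops.per.cycle and within about 0.2% of uops.per.cycle (2.830101).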