Hello perfwise,
Refer to the Intel 64 and IA-32 Architectures Optimization Reference Manual: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
IQ = instruction queue, see Figure 2-1, page 40 (labeled 'Instr Queue' in the figure)
PMH is the Page Miss Handler (page 73).
MITE = Legacy decode pipeline. This is the top of Figure 2-1, from '32K L1 Instruction Cache' to 'Decoders'. See also the figure in section B.3.7.1 (page 683), where it is called the Legacy decode pipeline.
DSB is Decode Stream Buffer. This is also the 'Decoded Icache' in the section B.3.7.1 figure and '1.5K uOP Cache' in figure 2-1.
PBS... I assume you mean PEBS. PEBS is Precise Event Based Sampling. See section B.2.3.
LBR = Last Branch Record (or Register). See section B.2.3.4
For L2_TRANS.L2_FILL, this will take more digging, but I expect the fills can be due to either code (I$) misses or data (D$) misses.
For L2_TRANS.L2_WB, this should be entirely D$ (dcache). WB (writebacks) only occur when a cache line in L2 is modified and needs to be evicted to memory. Code is usually read-only (unless you have self-modifying code), so code should not be getting 'written back' to memory.
Hope this helps,
Pat
On Sandy Bridge, the L2 can be characterized as 'non-inclusive, non-exclusive'.
See Table 2-5 of the Optimization guide (URL in previous reply) for the cache policies and characteristics by cache level.
You asked "Data in the L3 is guaranteed to be in "either" the L1 or the L2, correct (because it's inclusive). Correct?"
Yes, sort of...
The cache line can be in L1 and L2 and L3, or
the line can be in L1 and not in L2, but always in L3, or
the line can be in L2 and not in L1, but always in L3, or
the line can be only in L3 (not in L1 nor in L2).
A modified line in the L1 will be written back to the L2 if the L2 has a copy of the line or, if the line isn't in the L2, the line can be written back directly to the L3.
If the modified line is written back to the L2, then the line won't be written back to the L3 unless the line is evicted from the L2 or the line is requested by another core.
The L3 keeps track of which core has the line. When the L3 gets a request for that line it checks that core's L1 for the line and then it checks the L2 for that line. If neither the L1 nor L2 have the line then the L3 copy is the most current.
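To restate those combinations compactly (purely illustrative, not an API):

```python
# Purely illustrative: line-presence combinations allowed by the inclusive L3
# described above. Each tuple is (in_L1, in_L2, in_L3).
allowed_states = [
    (True,  True,  True),   # in L1, L2, and L3
    (True,  False, True),   # in L1, not in L2, always in L3
    (False, True,  True),   # in L2, not in L1, always in L3
    (False, False, True),   # only in L3
]
# Not allowed: present in L1 or L2 while absent from the inclusive L3.
```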
Pat
I'm looking at the L1D hit and miss rates, or at least trying to determine what they are. There's very little documentation about how to measure them, but I've found some documentation on the internet stating:
PMC 43 MASK 1 = ALL L1D req
PMC 43 MASK 2 = ALL L1D req - cacheable
PMC 51 MASK 1 = ALL L1D fills
PMC 51 MASK 2 = ALL L1D fills in modified state (accesses which modify the line requested?)
PMC 51 MASK 8 = ALL L1D evictions of modified data
Is there a PMC to measure miss rates from the L1D cache? I see that PMC 48 MASK 2 measures something related to L1D misses, but I'm confused by the documentation here:
http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/lin/ug_docs/reference/snb/events/l1d_pend_miss.html
What does this last PMC measure specifically? Once a miss to the L1D occurs, an FB allocation occurs and this PMC is "incremented" by the number of allocations currently outstanding? Is this what this PMC measures? If so, it's for both cacheable demand requests as well as HW prefetch requests.
Any help in clarifying how to measure L1D requests, L1D hits, L1D misses, and L1D writebacks (PMC 28 MASK F) would be very helpful.
Thanks
Following up, the cache protocol used is MESI. All "Invalid" requests are actually misses to the cache, correct? If so, then to tabulate the misses to the L1D you would only need to add:
L1D LD's in I state: PMC 40 MASK 1
+
L1D ST's in I state: PMC 41 MASK 1
+
L1D WriteBacks in I state: PMC 28 MASK 1
Is this correct? Also, could you clarify what an L1D writeback in I state is? I can envision the load and store, which in MESI are misses to a cache which doesn't have the request allocated. What is the latter: a writeback of data from the L1 to the L2 that's invalid?
Thanks
Event 0x43 is not in the SDM. Usually when an event is not in the SDM it means that an issue was found with the event or the event was not tested.
So I can't comment on event 0x43.
You've got event 0x51 pretty well characterized.
Usually there are 2 main ways to characterize misses.
We can use HitPerUop (hits per uop (micro-op)), which also permits comparing different sections of code.
Hit ratio = hits / accesses, which tells you the percentage of accesses that hit the cache.
Misses = total accesses - total hits.
For the L1D Load Hit Ratio, you can use:
MEM_LOAD_UOPS_RETIRED.L1_HIT/MEM_UOPS_RETIRED.ALL_LOADS
The L1D Load Miss Ratio is:
(MEM_UOPS_RETIRED.ALL_LOADS - MEM_LOAD_UOPS_RETIRED.L1_HIT)/MEM_UOPS_RETIRED.ALL_LOADS
I don't see a MEM_STORE_UOPS_RETIRED.L1_HIT event so we apparently can't compute a L1D store hit ratio on Sandy Bridge.
I'm pretty sure the PMC event 0x51 (L1D.*) counts lines fetched due to prefetchers.
I ran a simple test looping over a 64 KB array (so I miss every cache line) and MEM_UOPS_RETIRED.ALL_LOADS was very close to L1D.REPLACEMENT. Prefetchers were enabled, so I conclude that L1D.REPLACEMENT counts lines fetched into the L1 due to demand and/or prefetchers.
You can compute a 'L1D load Hits per Uop' with:
MEM_LOAD_UOPS_RETIRED.L1_HIT / UOPS_RETIRED.ALL.
The SDM only has 2 subevents for PMC 28 (L2_L1D_WB_RQSTS.*), umasks 0x4 and 0x8. I'm not sure these will be helpful for measuring L1 hits/misses.
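As a rough sketch (the function and variable names are just placeholders for whatever your collection tool reports; how you read the raw counts is up to you), the load hit/miss ratios and hits-per-uop above reduce to:

```python
# Sketch: L1D load hit/miss ratios and load-hits-per-uop from raw counts.
def l1d_load_ratios(mem_uops_retired_all_loads,
                    mem_load_uops_retired_l1_hit,
                    uops_retired_all):
    loads = float(mem_uops_retired_all_loads)
    hits = float(mem_load_uops_retired_l1_hit)
    load_hit_ratio = hits / loads                 # hits / accesses
    load_miss_ratio = (loads - hits) / loads      # misses = accesses - hits
    load_hits_per_uop = hits / float(uops_retired_all)
    return load_hit_ratio, load_miss_ratio, load_hits_per_uop
```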
Hope this helps,
Pat
PMC 40 mask 1, PMC 41 mask 1, and PMC 28 mask 1 are not in the SDM.
I think you can calculate lots of info for the L1D with the 4 events:
MEM_UOPS_RETIRED.ALL_LOADS (PMC 0xD0 umask=0x81)
MEM_UOPS_RETIRED.ALL_STORES (PMC 0xD0 umask=0x82)
MEM_LOAD_UOPS_RETIRED.L1_HIT (PMC 0xD1 umask=0x1)
L1D.REPLACEMENT (pmc 0x51 umask=0x1)
%loads= 100.0*MEM_UOPS_RETIRED.ALL_LOADS/(MEM_UOPS_RETIRED.ALL_LOADS + MEM_UOPS_RETIRED.ALL_STORES)
%stores= 100.0*MEM_UOPS_RETIRED.ALL_STORES/(MEM_UOPS_RETIRED.ALL_LOADS + MEM_UOPS_RETIRED.ALL_STORES)
%L1D_load_hit = 100.0*(MEM_LOAD_UOPS_RETIRED.L1_HIT)/MEM_UOPS_RETIRED.ALL_LOADS
%L1D_load_miss = 100.0*(MEM_UOPS_RETIRED.ALL_LOADS - MEM_LOAD_UOPS_RETIRED.L1_HIT)/MEM_UOPS_RETIRED.ALL_LOADS
%L1D_store_hit= min(100.0, 100.0*(MEM_UOPS_RETIRED.ALL_LOADS + MEM_UOPS_RETIRED.ALL_STORES - MEM_LOAD_UOPS_RETIRED.L1_HIT - L1D.REPLACEMENT)/MEM_UOPS_RETIRED.ALL_STORES)
%L1D_store_miss= 100.0*zeroifneg(-MEM_UOPS_RETIRED.ALL_LOADS + MEM_LOAD_UOPS_RETIRED.L1_HIT + L1D.REPLACEMENT)/MEM_UOPS_RETIRED.ALL_STORES
The last 2 equations have caps to keep them between 0 and 100. I've seen the last 2 equations be 'out of bounds' by up to 5%.
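As a sketch of those equations in code (the counter values are assumed to come from whatever tool you use to program the PMCs):

```python
# Sketch of the L1D breakdown above. The store-side numbers are inferred:
# L1D.REPLACEMENT fills not explained by load misses are attributed to store
# misses, and the min()/zeroifneg() caps keep the results in [0, 100].
def zeroifneg(x):
    return x if x > 0.0 else 0.0

def l1d_breakdown(all_loads, all_stores, l1_hit_loads, l1d_replacement):
    total = float(all_loads + all_stores)
    pct_loads = 100.0 * all_loads / total
    pct_stores = 100.0 * all_stores / total
    pct_load_hit = 100.0 * l1_hit_loads / all_loads
    pct_load_miss = 100.0 * (all_loads - l1_hit_loads) / all_loads
    pct_store_hit = min(100.0, 100.0 * (all_loads + all_stores - l1_hit_loads
                                        - l1d_replacement) / all_stores)
    pct_store_miss = 100.0 * zeroifneg(-all_loads + l1_hit_loads
                                       + l1d_replacement) / all_stores
    return (pct_loads, pct_stores, pct_load_hit, pct_load_miss,
            pct_store_hit, pct_store_miss)
```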
Hope this helps,
Pat
Sorry to take so long to reply. I had to do some other work.
Can you try the technique in Section B.3.2 of the optimization guide, "Locating Stalls in the Microarchitecture Pipeline"?
The technique uses the 3 Sandy Bridge events:
IDQ_UOPS_NOT_DELIVERED.CORE
UOPS_ISSUED.ANY
UOPS_RETIRED.RETIRE_SLOTS
This lets us break down the stalls in the pipeline.
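As a sketch, the breakdown is computed roughly like this (note: the B.3.2 formulas also use CPU_CLK_UNHALTED.THREAD for the slot count and INT_MISC.RECOVERY_CYCLES in the bad-speculation term, so assume those are collected over the same interval too):

```python
# Sketch of the B.3.2 pipeline-slot breakdown (4 issue slots per cycle).
def pipeline_breakdown(cpu_clk_unhalted_thread,
                       idq_uops_not_delivered_core,
                       uops_issued_any,
                       uops_retired_retire_slots,
                       int_misc_recovery_cycles):
    slots = 4.0 * cpu_clk_unhalted_thread
    fe_bound = idq_uops_not_delivered_core / slots
    bad_spec = (uops_issued_any - uops_retired_retire_slots
                + 4.0 * int_misc_recovery_cycles) / slots
    retiring = uops_retired_retire_slots / slots
    be_bound = 1.0 - (fe_bound + bad_spec + retiring)
    return {"%FE_Bound": 100.0 * fe_bound,
            "%Bad_Speculation": 100.0 * bad_spec,
            "%Retiring": 100.0 * retiring,
            "%BE_Bound": 100.0 * be_bound}
```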
For instance, for a memory latency test (array size 40 MB), the pipeline should be stalled on the backend waiting for memory. The breakdown shows:
%FE_Bound 0.71%
%Bad_Speculation 0.03%
%Retiring 0.98%
%BE_Bound 98.27%
For a memory latency test with an array size of 4 KB, the pipeline is still mostly stalled waiting on loads. The latency program is just a big unrolled loop of nothing but dependent linked-list loads.
%FE_Bound 0.091%
%Bad_Speculation 0.002%
%Retiring 6.619%
%BE_Bound 93.288%
If I do a memory read bandwidth test and shorten the array size to fit in L1D (down to 4 KB), then I get the result below. For the read bw test, I just touch each 64-byte cache line. The out-of-order pipeline is able to figure out the next load, so lots of loads are underway at the same time.
%FE_Bound 0.414%
%Bad_Speculation 0.003%
%Retiring 99.539%
%BE_Bound 0.044%
If I do a memory read bandwidth test with an array size of 40 MB, I get the results below. Now the prefetchers can work effectively and bring the data into L1D quickly enough that we still retire (relatively) a lot of uops (compared to the memory latency test where we were 98% BE_bound).
%FE_Bound 0.907%
%Bad_Speculation 0.135%
%Retiring 11.023%
%BE_Bound 87.935%
Pat
You can look here (http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/lin/ug_docs/reference/snb/events/idq_uops_not_delivered.html or google IDQ_UOPS_NOT_DELIVERED and then take the software.intel.com link) for a description of the event.
A cmask of 4 has the definition: Cycles per thread when 4 or more uops are not delivered to Resource Allocation Table (RAT).
So yes, if I've got all my negations correct then your question "when I use a CNT mask of 4, I am measuring the # of cycles where 0 uops were delivered" is correct.
The next question (cmask=3 means 1 or less uops delivered) is correct as well.
If cmask is 0 then you just get "Uops not delivered to Resource Allocation Table (RAT) per thread".
I think your "# clocks where ..." logic is ok as long as you are measuring event over the same interval.
So I'd expect that (assuming you are measuring everything over same # cycles):
EQN_1: #cycles * (count_of_1_retiring + 2 * count_of_2_retiring + 3 * count_of_3_retiring + 4 * count_of_4_retiring) = retired_uops.all
where
count_of_4_retiring = IDQ_UOPS_NOT_DELIVERED with umask=0x1, cmask=0x1, invert bit=0x1
count_of_3_retiring = IDQ_UOPS_NOT_DELIVERED w/CMASK=3 - IDQ_UOPS_NOT_DELIVERED w/CMASK=4
count_of_2_retiring = IDQ_UOPS_NOT_DELIVERED w/CMASK=2 - IDQ_UOPS_NOT_DELIVERED w/CMASK=3
count_of_1_retiring = IDQ_UOPS_NOT_DELIVERED w/CMASK=1 - IDQ_UOPS_NOT_DELIVERED w/CMASK=2
How close is EQN_1 to being equal?
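As a sketch of that check, assuming the cmask variants are raw cycle counts collected over the same interval, and using the "exactly N delivered = cmask(4-N) minus cmask(5-N)" mapping (the same one the B.3.7.1 equations in the next reply use):

```python
# Sketch: estimate total delivered uops from the IDQ_UOPS_NOT_DELIVERED cmask
# variants, then compare against UOPS_RETIRED.ALL over the same interval.
def delivered_uops_estimate(cmask1, cmask2, cmask3, cmask4, cmask1_inv):
    deliver_1 = cmask3 - cmask4   # cycles with exactly 1 uop delivered
    deliver_2 = cmask2 - cmask3   # cycles with exactly 2 uops delivered
    deliver_3 = cmask1 - cmask2   # cycles with exactly 3 uops delivered
    deliver_4 = cmask1_inv        # cycles with all 4 uops delivered
    return 1 * deliver_1 + 2 * deliver_2 + 3 * deliver_3 + 4 * deliver_4
```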
B.3.7.1 does it slightly differently:
cycles_DELIVER.1UOPS = IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE - IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE
cycles_DELIVER.2UOPS = IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE - IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE
cycles_DELIVER.3UOPS = IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE - IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE
cycles_DELIVER.4UOPS = CPU_CLK_UNHALTED.THREAD - IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE
I think that the right hand side of each eqn can be figured out from the umask/cmask info in http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/lin/ug_docs/reference/snb/events/idq_uops_not_delivered.html
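As a sketch, and treating the cmask assignments as an assumption to verify against that event page (all variants are event 0x9C with umask 0x01; CYCLES_0_UOPS_DELIV.CORE is cmask=4, CYCLES_LE_1_UOP is cmask=3, CYCLES_LE_2_UOP is cmask=2, CYCLES_LE_3_UOP is cmask=1), the four equations reduce to simple differences:

```python
# Sketch: cycles in which exactly N uops were delivered, from the B.3.7.1
# equations above. 'counts' maps event names to raw cycle counts.
def cycles_deliver(counts, cpu_clk_unhalted_thread):
    le1 = counts["IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE"]
    le2 = counts["IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE"]
    le3 = counts["IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE"]
    zero = counts["IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE"]
    return {"1UOPS": le1 - zero,
            "2UOPS": le2 - le1,
            "3UOPS": le3 - le2,
            "4UOPS": cpu_clk_unhalted_thread - le3}
```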
I believe the issue is that all of the IDQ*CYCLE* events only count while uops are being retired.
So the AVG.uops.per.cycle equation (in SDM Optimization manual section B.3.7.1) has to be adjusted.
If you compute (as in section B.3.7.1)
%Retiring = 100 * (UOPS_RETIRED.RETIRE_SLOTS / (CPU_CLK_UNHALTED.THREAD * 4))
and compute:
Adj.AVG.uops.per.cycle = %Retiring * AVG.uops.per.cycle / 100
then I think you'll find that
Adj.AVG.uops.per.cycle = UOPS_RETIRED.ANY / CPU_CLK_UNHALTED.THREAD
I'll explain this more tomorrow but it is very late now and I have to go to bed.
Pat
Sorry to confuse you.
I probably shouldn't have mentioned the "event only counting when uops are being retired" idea.
That was just late night speculation.
Section B.3.2 talks about %retiring and says it is:
%Retiring = 100 * (UOPS_RETIRED.RETIRE_SLOTS / N), where N = 4 * CPU_CLK_UNHALTED.THREAD
Here is what I see:
Currently Section B.3.7.1 defines:
AVG.uops.per.cycle = (4 * (%FE.DELIVERING) + 3 * (%FE.DELIVER.3UOPS) + 2 * (%FE.DELIVER.2UOPS) + (%FE.DELIVER.1UOPS ) ) / 100
I found that AVG.uops.per.cycle was sometimes much higher than
uops.per.cycle = UOPS_RETIRED.ALL/CPU_CLK_UNHALTED.THREAD
I also observed that, if I compute an 'adjusted AVG.uops.per.cycle' as in:
adjusted AVG.uops.per.cycle = %Retiring * AVG.uops.per.cycle / 100
then 'adjusted AVG.uops.per.cycle' equals uops.per.cycle.
At this point I'm not going to speculate why this extra factor is necessary.
If you use the factor, do you get an 'adjusted AVG.uops.per.cycle' that agrees with 'uops.per.cycle'?
If not, I might have a programming error (always a possibility).
Pat
I can't tell what the effective difference is between UOPS_RETIRED.RETIRE_SLOTS and UOPS_RETIRED.ALL. I think the number that the 2 counters return will be the same. But they do count different things.
UOPS_RETIRED.ALL counts simply what it says.
UOPS_RETIRED.RETIRE_SLOTS counts, for each cycle, the number of retirement slots used.
The 2 quantities should be the same, and in my measurements they are the same to 4 significant digits.
I'm saying the equation for AVG.uops.per.cycle doesn't work as expected.
If, for example, the %FE.DELIVER.0UOPS component of AVG.uops.per.cycle is not counting correctly, then AVG.uops.per.cycle will be too high.
Just looking at another case below.
The %Retiring is 70%, while the %DELIVER.4UOPS is 99%.
Does it make sense that 4 uops are delivered 99% of the time but retirement slots are only used 70% of the time?
So 30% of the time you aren't retiring uops.
It seems %FE.DELIVER.0UOPS and/or %DELIVER.1UOPS and/or %DELIVER.2UOPS and/or %DELIVER.3UOPS may be undercounting.
There are many possible explanations that will require time to check. An event could be coded wrong, an event could be broken, or my utility could have an error.
I have asked the guy who wrote that section of the SDM to help figure this out but he is very busy and it may be a week before he gets back to me.
In the meantime I'll try some other tools for collecting the events.
Please be patient and we'll figure this out.
Pat
%FE_Bound 0.210371
%Bad_Speculation 0.267872
%Retiring 70.738993
%BE_Bound 28.782764
%FE.DELIVER.0UOPS 0.110400
%DELIVER.1UOPS 0.061143
%DELIVER.2UOPS 0.010504
%DELIVER.3UOPS 0.061832
%DELIVER.4UOPS 99.756121
AVG.uops.per.cycle 3.992921
adj.AVG.uops.per.cycle 2.824552
uops.per.cycle 2.830101
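As a quick arithmetic check using the numbers above: 0.70739 * 3.992921 ≈ 2.8246, which reproduces adj.AVG.uops.per.cycle (2.824552) and is within about 0.2% of uops.per.cycle (2.830101).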