I am using an Intel(R) Xeon(R) CPU E5630 @ 2.53GHz. Using the instrumentation method, I have programmed four PMCs and measured the values below. Now I want to calculate the cost of L1 and L2 misses for this CPU. Are these details enough to do that? If so, can you please guide me through the steps/methods to calculate the CPU (instruction) miss cost?
FFC0 Total instr retired: 11660644
FFC1 Total core cyc: 15823490
FFC2 Total ref cyc: 15891004
PMC0 L1I_HITS: 5044792
PMC1 L1I_MISS: 447433
PMC2 L2I_HITS: 38494
PMC3 L2I_MISS: 46373
Total paks processed: 1000
I have read some documents, and each of them follows a different approach, which confuses me even more. Please advise on the simplest way to get this; later I can follow more complex approaches to get in-depth information, if needed.
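Before looking at miss cost, the raw counts above can be reduced to a few derived ratios. The following is a minimal Python sketch that does only arithmetic on the posted numbers. The dictionary keys and the function name are made up for illustration, and treating L1I_HITS + L1I_MISS as the total number of L1I accesses is an assumption about what those two events count.

```python
# Counter values copied from the post above.
counters = {
    "instr_retired": 11_660_644,  # FFC0
    "core_cycles": 15_823_490,    # FFC1
    "ref_cycles": 15_891_004,     # FFC2
    "l1i_hits": 5_044_792,        # PMC0
    "l1i_miss": 447_433,          # PMC1
    "packets": 1000,
}

def derived_metrics(c):
    """Return simple ratios; the names are illustrative, not Intel terminology."""
    l1i_accesses = c["l1i_hits"] + c["l1i_miss"]  # assumption: hits + misses = accesses
    return {
        "ipc": c["instr_retired"] / c["core_cycles"],
        "cpi": c["core_cycles"] / c["instr_retired"],
        "l1i_miss_ratio": c["l1i_miss"] / l1i_accesses,
        "l1i_misses_per_kilo_instr": 1000 * c["l1i_miss"] / c["instr_retired"],
        "core_cycles_per_packet": c["core_cycles"] / c["packets"],
        "l1i_misses_per_packet": c["l1i_miss"] / c["packets"],
    }

metrics = derived_metrics(counters)
for name, value in metrics.items():
    print(f"{name:28s} {value:12.4f}")
```

With these numbers the run retires roughly 0.74 instructions per core cycle and misses the L1I on about 8% of the counted fetches, but none of these ratios is a cost on its own.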
I think it would be best to carefully craft a test program that forces the misses in a controlled manner:
L1 all hits
L1 misses, vast majority L2 hits
L1 misses, L2 misses, vast majority L3 hits
L1 misses, L2 misses, L3 misses
Then use the performance counters only to assess the degree to which you attain your goals.
John McCalpin "Dr. Bandwidth" may have such a diagnostic.
Note that the above should include NUMA node distances when you have a multi-socket system.
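One common way to build such a test is a pointer chase: each slot stores the index of the next one, so every load depends on the previous load, and a random order gives the hardware prefetchers little to work with. Below is a minimal Python sketch of the setup step only. The cache sizes are assumed values for the E5630 (32 KB L1D, 256 KB L2 per core, 12 MB shared L3, 64-byte lines) and should be verified with CPUID; the working-set sizes and all names are illustrative. The timed chase loop itself would need to be written in C or assembly, since interpreter overhead in Python would swamp the cache latency.

```python
import random

LINE_BYTES = 64  # assumed cache-line size

# Assumed E5630 cache sizes in bytes; verify with CPUID on the real machine.
L1D, L2, L3 = 32 * 1024, 256 * 1024, 12 * 1024 * 1024

# Illustrative working-set size for each regime in the list above.
WORKING_SETS = {
    "L1 all hits": L1D // 2,
    "L1 misses, mostly L2 hits": 3 * L2 // 4,
    "L2 misses, mostly L3 hits": L3 // 3,
    "L3 misses": 8 * L3,
}

def sattolo_chain(n, seed=0):
    """Return next_index[] forming one cycle through all n slots (Sattolo's algorithm)."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i, which is what guarantees a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def cycle_length(nxt):
    """Walk the chain from slot 0 and count steps until it returns to slot 0."""
    steps, pos = 1, nxt[0]
    while pos != 0:
        pos = nxt[pos]
        steps += 1
    return steps

for regime, size in WORKING_SETS.items():
    print(f"{regime:28s} {size // 1024:7d} KB  {size // LINE_BYTES:8d} lines")

# Build the smallest chain and check that one lap visits every line exactly once.
lines = WORKING_SETS["L1 all hits"] // LINE_BYTES
chain = sattolo_chain(lines)
print("lines:", lines, "cycle length:", cycle_length(chain))
```

The sizes are starting points only; the hit and miss counters are what tell you whether each run actually landed in the intended regime.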
There are many fundamental problems with trying to compute a "cost" for cache misses, especially if all you have is performance counter data.
The complexity of instruction fetch and decode in the Intel processors makes it very hard to understand what the performance counters are supposed to be counting.
I don't know if this performance counter event works as advertised, but Section 19.8 of Volume 3 of the SWDM describes an event "L1I.CYCLES_STALLED" (Event 0x80, Umask 0x04), which is supposed to count cycles in which instruction fetch is stalled due to an L1I miss, an ITLB miss, or an ITLB fault. This sounds like it is closely related to what you are looking for, and if it works it allows all the complexity to be ignored....
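If the counters are programmed by writing the event-select MSRs directly, the event and umask quoted above have to be packed into an IA32_PERFEVTSELx value. Here is a minimal Python sketch of that packing, following the architectural field layout (event select in bits 7:0, umask in bits 15:8, USR in bit 16, OS in bit 17, EN in bit 22, CMASK in bits 31:24). The function name and its defaults are hypothetical, and the sketch does not check whether the event counts as documented on this processor.

```python
def perfevtsel(event, umask, usr=True, kernel=True, enable=True, cmask=0):
    """Pack fields into an IA32_PERFEVTSELx value (architectural layout)."""
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF and 0 <= cmask <= 0xFF):
        raise ValueError("event, umask and cmask are 8-bit fields")
    value = event | (umask << 8) | (cmask << 24)
    if usr:
        value |= 1 << 16  # USR: count while running in user mode
    if kernel:
        value |= 1 << 17  # OS: count while running in kernel mode
    if enable:
        value |= 1 << 22  # EN: enable the counter
    return value

# L1I.CYCLES_STALLED as quoted above: event 0x80, umask 0x04.
print(hex(perfevtsel(0x80, 0x04)))  # 0x430480
```

The resulting value would be written to IA32_PERFEVTSEL0 (MSR 0x186) and the count read back from IA32_PMC0 (MSR 0xC1), using whatever MSR access mechanism the instrumentation already has.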
Thanks Jim, John.
As you pointed out, I have measured the event L1I.CYCLES_STALLED along with L1I.MISSES. By dividing the stalled cycles by the L1I misses, we expect to get the CPU cost in cycles. The result always lies between 3 and 5 cycles. Is it constant if the request is served by the L2 cache on this platform?
cpu cost(approx) = L1I.CYCLES_STALLED/ L1I.MISSES --> 3~5
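As a rough cross-check, the 3~5 cycle average can be turned back into a share of total run time using the counts from the first post. This is a Python sketch doing only arithmetic. The per-miss values 3 and 5 are the bounds quoted above, not measured stall counts, and the function name is made up. It also assumes every stalled fetch cycle costs a full core cycle, which may overstate the cost if part of the stall is hidden by instructions already queued.

```python
L1I_MISSES = 447_433      # PMC1 from the first post
CORE_CYCLES = 15_823_490  # FFC1 from the first post
PACKETS = 1000

def stall_share(cycles_per_miss, misses=L1I_MISSES, total_cycles=CORE_CYCLES):
    """Return (stalled cycles, fraction of core cycles) for an average stall per miss."""
    stalled = cycles_per_miss * misses
    return stalled, stalled / total_cycles

for cpm in (3, 5):  # bounds of the observed L1I.CYCLES_STALLED / L1I.MISSES ratio
    stalled, share = stall_share(cpm)
    print(f"{cpm} cycles/miss: {stalled:>9d} stalled cycles, "
          f"{100 * share:5.1f}% of core cycles, {stalled / PACKETS:7.1f} per packet")
```

Under these assumptions the instruction-fetch stalls would account for somewhere around 8% to 14% of the core cycles in this run.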
This architecture does not have an event such as L2I_CYCLES_STALLED (I understand it is unnecessary, since those stalls would be counted as part of the L1I stall itself), so I would like to confirm the behavior by calculating the L2-L3 relationship.
We are mainly interested in the data cache rather than the instruction cache, since that is what our application mostly exercises. Is there any indirect way to do that? Also, can you please explain core vs. uncore events (meaning, can uncore events be configured and measured the same way as core events in PMC0-3)?
It is hard to tell how to interpret any of the instruction cache miss results without better documentation of what they are intended to count.
The L1I.CYCLES_STALLED event will include stalls due to instruction fetches that miss in the L2 (since these are a subset of L1I misses).
Trying to interpret the 3-5 cycle average is not trivial. The descriptions of the events talk about "instruction fetches", but there are some missing details....
For the data caches, the complexity of the hardware is significantly higher, but there is somewhat more documentation. (Not enough, but definitely more.) The big problems with using hardware counters to understand "cost" of data cache misses are:
For processors newer than Westmere, there is a new performance counter event 0xA3 CYCLE_ACTIVITY.STALLS_L*_PENDING which can be used to count cycles in which there is both a "dispatch stall" (no uop issued to any of the execution ports) and a demand load miss pending from various levels of cache (determined by the Umask value). This counter does not guarantee that there is any causality between the execution stall and the load miss, but for L2 misses the latency is often high enough that the out-of-order execution mechanisms cannot fully hide the stall, and for L3 misses the latency is (for practical purposes) always high enough that the core will stall for a significant fraction of the cycles required to service the miss.