I am using an Intel(R) Xeon(R) CPU E5630 @ 2.53GHz. Using an instrumentation method I have programmed 4 PMCs and measured the details below. Now I want to calculate the cost of L1 and L2 misses on this CPU. Are the provided details enough to do that? If so, can you please guide me through the steps/methods to calculate the CPU (instruction) miss cost?
FFC0 Total instr retired: 11660644
FFC1 Total core cyc: 15823490
FFC2 Total ref cyc: 15891004
PMC0 L1I_HITS: 5044792
PMC1 L1I_MISS: 447433
PMC2 L2I_HITS: 38494
PMC3 L2I_MISS: 46373
Total packets processed: 1000
I have read some documents, but each follows a different approach, which makes me more confused. Please advise the simplest way to get this; later I can follow more complex approaches for in-depth information if needed.
I think it would be best to carefully craft a test program that forces the misses in a controlled manner:
- L1 all hits
- L1 misses, vast majority L2 hits
- L1 misses, L2 misses, vast majority L3 hits
- L1 misses, L2 misses, L3 misses
And then use the performance counters only to assess the degree to which you attain your goals.
John McCalpin "Dr. Bandwidth" may have such a diagnostic.
Note, the above should include NUMA node distances when you have a multi-socket system.
There are many fundamental problems with trying to compute a "cost" for cache misses, especially if all you have is performance counter data.
The complexity of instruction fetch and decode in the Intel processors makes it very hard to understand what the performance counters are supposed to be counting.
I don't know if this performance counter event works as advertised, but Section 19.8 of Volume 3 of the Intel Software Developer's Manual describes an event "L1I.CYCLES_STALLED" (Event 0x80, Umask 0x04), which is supposed to count cycles in which instruction fetch is stalled due to an L1I miss, an ITLB miss, or an ITLB fault. This sounds like it is closely related to what you are looking for, and if it works it allows all the complexity to be ignored....
Thanks Jim, John.
As you pointed out, I have measured the L1I.CYCLES_STALLED event along with L1I.MISSES. By dividing the stalled cycles by the L1I misses, we expect to get the CPU cost in cycles. The result always lies between 3~5 cycles. Is it constant if the request is served by the L2 cache on this platform?
CPU cost (approx.) = L1I.CYCLES_STALLED / L1I.MISSES --> 3~5
As this architecture does not have an event for L2I_CYCLES_STALLED (I understand it is unnecessary, since those cycles are counted as part of the L1 stall itself), I would like to confirm the behavior by calculating the L2-L3 relationship.
We are basically interested in the data cache rather than the instruction cache, since that is where our application spends most of its time. Is there any indirect way to do that? Also, can you please explain core vs. uncore events (i.e., can uncore events be configured and measured in PMC0-3 the same way as core events)?
It is hard to tell how to interpret any of the instruction cache miss results without better documentation of what they are intended to count.
The L1I.CYCLES_STALLED event will include stalls due to instruction fetches that miss in the L2 (since these are a subset of L1I misses).
Trying to interpret the 3-5 cycle average is not trivial. The descriptions of the events talk about "instruction fetches", but there are some missing details....
- From the Optimization Manual, we know that the Instruction Fetch unit can fetch "up to" 16 (aligned) Bytes/cycle. So as a baseline we can assume that when the counters talk about "instruction fetch", they are talking about these "up to 16 Byte" fetches, not about fetching individual instructions.
- x86 instructions can occupy anywhere from 1 Byte to 15 Bytes.
- It is not clear what circumstances will cause the "instruction fetch" to be less than 16 Bytes.
- If four consecutive 16 Byte instruction fetches can be satisfied from a single cache line, does the L1I.MISSES counter count this as one L1I cache line miss, or as four L1I instruction fetch misses?
- This gets much more confusing if the fetches are not all 16 Bytes....
For the data caches, the complexity of the hardware is significantly higher, but there is somewhat more documentation. (Not enough, but definitely more.) The big problems with using hardware counters to understand "cost" of data cache misses are:
- The out-of-order execution of the processor is intended to overlap cache miss latencies with other work. Sometimes this works well, sometimes not well, and it is hard to tell the difference using just performance counters.
- The hardware prefetchers in the L1 and L2 are intended to move data closer to the processor so that the effective latency is smaller (and easier to overlap using the out-of-order execution mechanisms). Sometimes this works well, sometimes not well, and it is hard to tell the difference using just performance counters.
- The hardware is capable of supporting multiple outstanding L1D cache misses (probably 10 on this core). The effective latency is strongly dependent on how well the cache misses overlap with each other. For L2 misses, some related information can be obtained using performance counter event 0x60 OFFCORE_REQUESTS_OUTSTANDING.*, but this counts both demand misses (which are more likely to cause core stalls) and L2 hardware prefetch misses (which are less likely to cause core stalls), so the results are unlikely to be definitive.
For processors newer than Westmere, there is a new performance counter event 0xA3 CYCLE_ACTIVITY.STALLS_L*_PENDING which can be used to count cycles in which there is both a "dispatch stall" (no uop issued to any of the execution ports) and a demand load miss pending from various levels of cache (determined by the Umask value). This counter does not guarantee that there is any causality between the execution stall and the load miss, but for L2 misses the latency is often high enough that the out-of-order execution mechanisms cannot fully hide the stall, and for L3 misses the latency is (for practical purposes) always high enough that the core will stall for a significant fraction of the cycles required to service the miss.