The following definitions are cited from a lecture at people.cs.vt.edu/~cameron/cs5504/lecture8.pdf:
- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss rateL2)
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss rateL1 x Miss rateL2)
For a particular application on a 2-level cache hierarchy:
- 1000 memory references
- 40 misses in L1
- 20 misses in L2
Calculate local and global miss rates
- Miss rateL1 = 40/1000 = 4% (global and local)
- Global miss rateL2 = 20/1000 = 2%
- Local Miss rateL2 = 20/40 = 50%
For a 32 KByte first-level cache, when increasing the second-level cache:
- An L2 smaller than L1 is impractical
- The global miss rate approaches the single-level cache miss rate, provided L2 >> L1
- The local miss rate is not a good measure for the secondary cache.
So I want to instrument the global and local L2 miss rates.
What is your opinion?
Sigehere S. wrote:
Generally speaking, these are the steps to optimize your program's use of the L1/L2 caches:
1. Use events such as MEM_LOAD_UOPS_RETIRED.L2_HIT (which implies an L1 miss) and MEM_LOAD_UOPS_RETIRED.L2_MISS for event-based sampling data collection, to find the places in your code with high L1/L2 cache misses (you can also use the predefined Memory Access Analysis directly if you don't want to define your own analysis type).
2. Investigate how the code areas with high L1/L2 cache misses access memory (loads and stores); usually there is a loop, or the function is called from another function that has a loop.
3. Investigate the data structures used in the loop, and understand their memory layout.
4. Ensure that your algorithm accesses memory within 256 KB, keeping in mind that the cache line size is 64 bytes. Concentrate data accesses in a specific range of linear addresses. For example, use a "structure of arrays" instead of an "array of structures" if you access p->a, p->b, etc.
5. Don't use a big stride to access data in a loop; it is better to access memory within 64 bytes.
6. Pad your data structures if they are not 64-bit aligned on a 64-bit OS.
7. Adjust your algorithm to use invariant data in loops where possible (reducing load operations).
8. If threads of a multithreaded application share data, use a lock and avoid false sharing.
9. Other ideas I may have missed, which you can append.
Again, you need to examine the hot code areas together with their memory layout, then find ways to optimize them, and use VTune(TM) Amplifier to verify. Hope it helps. Thanks, Peter
Can you elaborate on how I can use the CPU cache in my program?
At the OS level, I know the cache is maintained automatically, based on which memory addresses are accessed frequently,
but if we could forcefully place a specific part of my program in the CPU cache, it would help optimize my code.
Please give me a proper solution for using the cache in my program.
Peter Wang (Intel) wrote:
Thanks, Peter.
It's good programming style to think about memory layout. This is not tuning for one specific processor; a more advanced processor (or the compiler's optimization switches) may overcome a poor layout, but thinking about it does no harm.