I am a new user of Intel VTune. I want to measure the L1 and L2 cache miss rates on an Intel Core 2 Quad Q6600 processor. Are the following formulas the right way to compute the L1 and L2 miss rates?
L1: L1D_CACHE_LD.I_STATE / L1D_CACHE_LD.MESI
L2: L2D_CACHE_LD.I_STATE / L2D_CACHE_LD.MESI
By the way, I have another question about measuring a multithreaded application. I run two threads on core 0 and core 1 of the Q6600, which share the L2 cache. One is the main thread and the other is a prefetch thread. How can I measure the impact of the prefetch thread on the main thread? In other words, how do I evaluate the benefit of the prefetch thread?
Hoping for your response!
The Q6600 is an Intel Core 2 processor. Your main thread and prefetch thread can access data in the shared L2$. How to evaluate the benefit of the prefetch thread? You can use VTune Analyzer to measure L2$ misses in the main thread and compare two situations: 1) with the prefetch thread; 2) without the prefetch thread.
To measure L2$ misses, modify the sampling activity: "Configure Sampling" -> Ratios -> add "L2 Cache Miss Rate" to the "Selected Ratios:" list.
You can verify in the "Selected events:" list that the event L2_LINES_IN.SELF.ANY was added. The sampling result will display the "L2$ Miss Rate" data.
L2 Cache Miss Rate = L2_LINES_IN.SELF.ANY / INST_RETIRED.ANY; you can find this information in the help file.
Hope it helps.
Regards, Peter
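[Editor's note] To make the with/without comparison concrete, here is a minimal sketch of the kind of prefetch-thread setup being discussed, assuming pthreads and the _mm_prefetch intrinsic; the array size, the run-ahead distance, and the program structure are made up for illustration (in a real measurement the two threads would also be pinned to core 0 and core 1).

```c
#include <pthread.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#include <stdio.h>

#define N (8 * 1024 * 1024)          /* illustrative working set: 64 MB of doubles */

static double data[N];
static volatile long progress = 0;    /* index the main thread has reached */

/* Helper thread: runs ahead of the main thread and touches cache lines so
 * they are already in the shared L2 when the main thread needs them. */
static void *prefetch_thread(void *arg)
{
    (void)arg;
    for (long i = 0; i < N; i += 8) {            /* 8 doubles = one 64-byte line */
        while (i - progress > 4096) {            /* don't run too far ahead */
            /* spin (illustrative only) */
        }
        _mm_prefetch((const char *)&data[i], _MM_HINT_T0);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    double sum = 0.0;

    pthread_create(&tid, NULL, prefetch_thread, NULL);

    /* Main thread: the computation whose L2$ misses would be sampled in
     * VTune, once with the helper running and once without it. */
    for (long i = 0; i < N; i++) {
        sum += data[i];
        progress = i;
    }

    pthread_join(tid, NULL);
    printf("sum = %f\n", sum);
    return 0;
}
```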
Hi, Peter,
Thanks for your response. I don't know why the L2 cache miss rate in the VTune manual is different from the definition in the textbook. The global L2 miss rate is the number of L2 misses divided by the number of memory references, and the local L2 miss rate is the number of L2 misses divided by the number of L2 references. What do you think about these definitions? How can I calculate the L2 cache miss rate according to these formulas? That is, which events should I select to measure them?
I'm not sure if I understand you correctly - there is no concept of "global" and "local" L2 misses here.
L2_LINES_IN indicates all L2 misses, including instruction-prefetch misses.
MEM_LOAD_RETIRED.L2_LINE_MISS indicates all L2 misses, excluding instruction-prefetch misses.
The miss rates for both of the above events are calculated by the VTune Analyzer automatically.
Hope that I have answered your questions.
Regards, Peter
Hi, Peter,
The following definitions are cited from a lecture at people.cs.vt.edu/~cameron/cs5504/lecture8.pdf.
Please refer to them.
Definitions:
- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss rate_L2)
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss rate_L1 x Miss rate_L2)
For a particular application on a 2-level cache hierarchy:
- 1000 memory references
- 40 misses in L1
- 20 misses in L2
Calculate the local and global miss rates:
- Miss rate_L1 = 40/1000 = 4% (global and local)
- Global miss rate_L2 = 20/1000 = 2%
- Local miss rate_L2 = 20/40 = 50%
For a 32 KByte first-level cache with an increasing second-level cache:
- An L2 smaller than L1 is impractical.
- The global miss rate is similar to the single-level cache miss rate, provided L2 >> L1.
- The local miss rate is not a good measure for a secondary cache.
(cited from people.cs.vt.edu/~cameron/cs5504/lecture8.pdf)
So I want to instrument the global and local L2 miss rates.
What is your opinion?
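[Editor's note] The worked example from the lecture translates directly into code; this is just a sketch that reproduces the 4% / 2% / 50% numbers quoted above, with the counts hard-coded.

```c
#include <stdio.h>

int main(void)
{
    /* Counts from the lecture example. */
    double mem_refs  = 1000.0;   /* memory references generated by the CPU */
    double l1_misses = 40.0;     /* misses in L1 (these become L2 accesses) */
    double l2_misses = 20.0;     /* misses in L2 */

    double l1_miss_rate        = l1_misses / mem_refs;   /* 4%, global and local */
    double global_l2_miss_rate = l2_misses / mem_refs;   /* 2%  */
    double local_l2_miss_rate  = l2_misses / l1_misses;  /* 50% */

    printf("L1 miss rate        = %.1f%%\n", 100.0 * l1_miss_rate);
    printf("Global L2 miss rate = %.1f%%\n", 100.0 * global_l2_miss_rate);
    printf("Local L2 miss rate  = %.1f%%\n", 100.0 * local_l2_miss_rate);
    return 0;
}
```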
Hi,
Finally I understand what you meant :-) Actually, "local miss rate" and "global miss rate" are NOT part of the VTune Analyzer's terminology.
Note that a "$ miss rate" can also be defined by the user: you divided by "memory references", but the VTune Analyzer divides by "instructions retired".
According to your requirements, I suggest defining the G-miss rate and L-miss rate as:
G-miss rate = MEM_LOAD_RETIRED.L2_LINE_MISS / INST_RETIRED.ANY
L-miss rate = MEM_LOAD_RETIRED.L2_LINE_MISS / MEM_LOAD_RETIRED.L1D_MISS
Again, it is your decision how to define an event ratio - the VTune Analyzer provides typical event ratios, but the user can redefine them as they like.
Regards, Peter
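[Editor's note] Since these are user-defined ratios rather than ones the tool reports, one way to obtain them is to take the raw event counts from the sampling run and compute the ratios outside the tool. A minimal sketch, with made-up counts standing in for the exported values:

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical event counts taken from a sampling run
     * (sample counts multiplied by their SAV values). */
    double inst_retired_any              = 2.0e9;  /* INST_RETIRED.ANY */
    double mem_load_retired_l1d_miss     = 5.0e7;  /* MEM_LOAD_RETIRED.L1D_MISS */
    double mem_load_retired_l2_line_miss = 4.0e6;  /* MEM_LOAD_RETIRED.L2_LINE_MISS */

    /* G-miss rate: L2 line misses per retired instruction. */
    double g_miss_rate = mem_load_retired_l2_line_miss / inst_retired_any;

    /* L-miss rate: fraction of L1D misses that also miss in L2. */
    double l_miss_rate = mem_load_retired_l2_line_miss / mem_load_retired_l1d_miss;

    printf("G-miss rate = %g\n", g_miss_rate);
    printf("L-miss rate = %g\n", l_miss_rate);
    return 0;
}
```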
Thanks very much.
This article: http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-events-ratios-optimizing-applications/
shows MEM_LOAD_RETIRED.L2_LINE_MISS. Is it more helpful than L2_LINES_IN?
Also, do I need to calculate the ratio myself, or can I enter the formula into VTune?
MEM_LOAD_RETIRED.L2_LINE_MISS measures L2 data cache misses only.
L2_LINES_IN covers both L2 data cache misses and L2 instruction cache misses. If you have a lot of "branchy" code, L2_LINES_IN is helpful.
You can use the VTune Analyzer's default definition:
L2 Cache Miss Rate = L2_LINES_IN.SELF.ANY / INST_RETIRED.ANY
This result is displayed in the VTune Analyzer's report; no action is required from the user.
Or you can use your own definition, for example if you don't care about L2 misses caused by instruction prefetching:
L2 Cache Miss Rate = MEM_LOAD_RETIRED.L2_LINE_MISS / INST_RETIRED.ANY
This result will NOT be displayed in the VTune Analyzer's report, and the user cannot enter this formula into the report.
Regards, Peter
My application's measurement results are:
L2 cache miss rate (MEM_LOAD_RETIRED.L2_LINE_MISS): about 30-50%.
L1 cache misses: MEM_LOAD_RETIRED.L1 is 700 and INST_RETIRED.ANY = 0. Does that mean the L1 cache miss rate is infinite?
Is there an ideal result for the cache miss rate?
I don't know why you have "INST_RETIRED.ANY = 0"; I guess that number is the sample count, not the event count.
The VTune Performance Analyzer has a default SAV (Sample After Value) setting for each selected event. A sample count of zero means your application ran too briefly (the event count was less than the SAV). You can increase the workload or change the default SAV value (by modifying your VTune activity).
By the way, the penalty of an L1 cache miss is low; usually you can ignore it.
Regards, Peter
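[Editor's note] A rough way to think about this, as a sketch: with event-based sampling, one sample is taken after roughly SAV events, so the reported event count is approximately the sample count times the SAV, and zero samples only means the true count never reached one SAV interval. The SAV value below is made up for illustration.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical SAV (Sample After Value) for INST_RETIRED.ANY. */
    long long sav          = 1000000;
    long long sample_count = 0;       /* what the report showed */

    /* Estimated event count = samples * SAV. Zero samples only means
     * the true count was somewhere below one SAV interval. */
    long long estimated_events = sample_count * sav;

    printf("estimated INST_RETIRED.ANY >= %lld and < %lld\n",
           estimated_events, estimated_events + sav);
    return 0;
}
```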
Sigehere S. wrote:
Generally speaking, these are the steps to optimize your program's use of the L1/L2 cache:
1. Use events such as MEM_LOAD_UOPS_RETIRED.L2_HIT (which implies an L1 miss) and MEM_LOAD_UOPS_RETIRED.L2_MISS to do event-based sampling data collection, to find the code areas with high L1/L2 cache misses (you can also use the predefined Memory Access Analysis directly if you don't want to define your own analysis type).
2. Investigate how the code area with high L1/L2 cache misses accesses memory (loads and stores) - usually there is a loop, or the function is called by another function that has a loop.
3. Investigate the associated data structures used in the loop and understand their memory layout.
4. Ensure that your algorithm accesses memory within 256 KB, and note that the cache line size is 64 bytes. Concentrate data accesses in a specific range of linear addresses. For example, use a "structure of arrays" instead of an "array of structures" - assuming you use p->a[], p->b[], etc.
5. Don't use a big "stride" to access data in a loop; keeping accesses within 64 bytes is better.
6. Use padding in your data structure if it is not 64-bit aligned on a 64-bit OS.
7. Adjust your algorithm to use "invariant" data in the loop where possible (reducing load operations).
8. If you use shared data across threads in a multithreaded application, protect it with locks and avoid false sharing.
9. Other ideas I missed, you can append.
Again, you need to examine the code area together with its memory layout, then find a way to optimize it - and use VTune(TM) Amplifier to verify. Hope it helps. Thanks, Peter

Can you elaborate on how I would use the CPU cache in my program?
At the OS level I know that the cache is maintained automatically, based on which memory addresses are accessed frequently.
But if we could deliberately keep a specific part of my program's data in the CPU cache, that would help optimize my code.
Please give me a proper solution for using the cache in my program.
Please, please!!
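[Editor's note] As a minimal sketch of two items from the list quoted above - structure-of-arrays layout instead of array-of-structures, and padding per-thread data to a cache line to avoid false sharing - the struct names and sizes below are made up for illustration; a 64-byte cache line is assumed.

```c
#include <stdio.h>

#define N 10000

/* Array of structures: iterating over only 'a' strides across whole
 * structs and wastes most of each 64-byte cache line. */
struct point_aos { double a, b, c, d; };
static struct point_aos aos[N];

/* Structure of arrays: iterating over 'a' touches contiguous memory,
 * so every cache line fetched is fully used. */
struct point_soa {
    double a[N];
    double b[N];
    double c[N];
    double d[N];
};
static struct point_soa soa;

static double sum_a_aos(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += aos[i].a;            /* 32-byte stride over memory */
    return s;
}

static double sum_a_soa(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += soa.a[i];            /* sequential 8-byte accesses */
    return s;
}

/* Padding per-thread counters to a full cache line so that two threads
 * updating their own counter never share a line (no false sharing). */
struct padded_counter {
    long value;
    char pad[64 - sizeof(long)];  /* assume a 64-byte cache line */
};
static struct padded_counter counters[2];

int main(void)
{
    printf("%f %f %ld\n", sum_a_aos(), sum_a_soa(), counters[0].value);
    return 0;
}
```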
Peter Wang (Intel) wrote: It's good programming style to think about memory layout - not for a specific processor; a more advanced processor (or the compiler's optimization switches) may overcome this, but it is not harmful.
Thanks, Peter.