I am trying to find the L1, L2 and LLC instruction and data cache misses of the application. I am using an Intel(R) Xeon(R) CPU E5-2420 0 @ 1.90GHz. Can you please tell me how I can find instruction and data misses using Intel VTune?
Secondly, if the machine has hyper-threading active, would it be a good idea to turn it off to characterize the application?
Thanks for your reply MrAnderson
Yes, the processor is from the SandyBridge-EN family.
I have checked the presentation "using intel vtune amplifier xe to tune software on the intel xeon processor E5 family". It gives formulas for LLC misses and L2 misses (quoted below).
% of cycles spent on memory access (LLC misses):
(MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS * 210) / CPU_CLK_UNHALTED.THREAD
% of cycles spent on last level cache access (2nd level misses that hit in LLC):
((MEM_LOAD_RETIRED.L3_HIT_PS * 40) + (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS * 88) + (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS * 99)) / CPU_CLK_UNHALTED.THREAD
Can you please tell me what would be the formula for the L1 data and instruction misses?
Secondly, what are the keywords in Intel VTune for memory accesses (loads and stores) on my particular processor?
The latest releases of the VTune Amplifier XE 2013 include a "General Exploration" analysis type, which collects all relevant metrics and presents them in a hierarchical display. I suggest you start with that and see if your questions aren't answered. And, again, per another thread, L1 misses are much less costly and typically not what you want to tune for. Tuning for L2/L3 will give you the most "bang for the buck."
>>>Secondly, if machine has hyper-threading active, should it be a good idea to turn it off to characterize the application>>>
It depends on whether your application can profit from Hyper-Threading.
For example, two threads with a large amount of floating-point code probably will not benefit, because the two logical processors share the same execution units.
Yes, ilyapolak, good point. Two thoughts to consider:
1. In general, your app will benefit from hyper-threading. The processor was designed that way! ;)
2. The only way to truly know is to test your application performance with and without hyper-threading enabled.
As for your profiling efforts, if you plan to deploy on hyper-threaded systems, you should profile on a hyper-threaded system.
I think you are probably reacting to the stories of the "old" hyper-threading (Pentium® 4 processor days). The Intel Core microarchitecture processors have a redesigned, and very good, hyper-threading. Again, in general, your code will benefit from it.
Thanks for explaining the Hyper-Threading concept, but I am still confused about the cache miss formulas.
I have used VTune Amplifier XE 2013 and run the "General Exploration" analysis type. One parameter is ICACHE.MISSES.
How can I use this parameter to find the L1 instruction cache miss rate?
Moreover, what would be the formulas for the L2 instruction cache and data cache miss rates? The tutorial "using-intel-vtune-amplifier-xe-on-xeon-e5-family-1.0.pdf" shows the % of cycles spent on memory access.
I was thinking about one (mainly hypothetical) scenario where two OS threads can fully benefit from HT. That can be achieved when two threads are scheduled to run at the same time and one thread contains mostly floating-point code while the other contains mainly integer code. In this case the CPU scheduler can route the two different streams of machine-code instructions to different execution ports. A stall can occur when similar uops (such as cmp and conditional jmp) from both threads are about to be executed at the same time.
and this link:
Read also this document about the Core i7 performance analysis.
The memory cycles are the "cost" of your cache misses. If that cost is not significant, you do not need to focus on it, and if you identified a poor cache miss rate in "cold" code, it would not be worth optimizing.
The General Exploration analysis type will try to highlight issues with "hot" code. It also supports our "Top-Down" approach, documented in all recent tuning guides. One of the guide authors offers this explanation:
With the new Top-Down approach there is a focus on determining where “pipeline slots” are stalled representing a real performance issue in the application as opposed to some of the old cycle accounting metrics. For example, the new metrics try and show “how often were you stalled while waiting for data from the LLC” as opposed to “how often did you hit in the LLC” since the latter may not be a performance issue if all the time waiting for LLC was hidden by real work from other instructions.
However, having said all of that, here is a post from one of our experts documenting formulas you can use to get the information you are requesting. It will require you to create custom analysis types. See the documentation for how to do that.
I have checked the links you guys referred to. This is what I gather; please correct me if I am wrong:
Calculate the impact of L1, L2 and LLC misses in terms of cycles spent servicing them, using these formulas:
LLC cache miss impact:
(180 * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS) / CPU_CLK_UNHALTED.THREAD
LLC cache hit impact (i.e. misses from L2 that hit in LLC):
((26 * MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS) + (43 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS) + (60 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS)) / CPU_CLK_UNHALTED.THREAD
L2 cache hit impact (i.e. misses from L1 that hit in L2):
(12 * MEM_LOAD_UOPS_RETIRED.L2_HIT) / CPU_CLK_UNHALTED.THREAD
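To make the arithmetic concrete, here is a minimal Python sketch of how the three impact formulas above could be evaluated once the raw event counts have been exported from VTune. The function name `cache_impact` and all sample counts are hypothetical, made up purely for illustration; the latency weights (180, 26, 43, 60, 12) are simply the ones quoted in this post and have not been independently verified.

```python
# Latency weights in cycles, copied from the formulas quoted in this post.
LLC_MISS_WEIGHT = 180
LLC_HIT_WEIGHTS = {
    "MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS": 26,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS": 43,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS": 60,
}
L2_HIT_WEIGHT = 12


def cache_impact(counts):
    """Estimate the fraction of unhalted cycles attributed to each cache level."""
    cycles = counts["CPU_CLK_UNHALTED.THREAD"]
    llc_miss = LLC_MISS_WEIGHT * counts["MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS"] / cycles
    llc_hit = sum(w * counts[event] for event, w in LLC_HIT_WEIGHTS.items()) / cycles
    l2_hit = L2_HIT_WEIGHT * counts["MEM_LOAD_UOPS_RETIRED.L2_HIT"] / cycles
    return {"llc_miss": llc_miss, "llc_hit": llc_hit, "l2_hit": l2_hit}


# Hypothetical event counts, for illustration only (not measured data).
sample = {
    "CPU_CLK_UNHALTED.THREAD": 1_000_000,
    "MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS": 1_000,
    "MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS": 2_000,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS": 100,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS": 50,
    "MEM_LOAD_UOPS_RETIRED.L2_HIT": 10_000,
}

impact = cache_impact(sample)
for level, fraction in impact.items():
    print(f"{level}: {fraction:.1%} of cycles")
```

A large value for one level suggests that level is worth a closer look; a small value suggests the misses there are cheap enough to ignore.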
Calculate the cache miss rate:
L1 data cache miss rate: L1D.REPLACEMENT / INST_RETIRED.ANY
L1 instruction cache miss rate: ICACHE.MISSES / INST_RETIRED.ANY
L2 data cache miss rate: L2_LINES_IN.ANY / INST_RETIRED.ANY
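Note that dividing by INST_RETIRED.ANY gives misses per retired instruction rather than misses per cache access. A small Python sketch of that calculation follows, also scaled to misses per thousand instructions (MPKI), which is often easier to read. The helper name `misses_per_instruction` and the counts are hypothetical, chosen only to show the arithmetic.

```python
def misses_per_instruction(miss_count, instructions_retired):
    """Return misses per retired instruction (0.0 if nothing retired)."""
    if instructions_retired == 0:
        return 0.0
    return miss_count / instructions_retired


def mpki(miss_count, instructions_retired):
    """Return misses per thousand retired instructions."""
    if instructions_retired == 0:
        return 0.0
    return 1000 * miss_count / instructions_retired


# Hypothetical counts, for illustration only (not measured data).
instructions = 2_000_000   # INST_RETIRED.ANY
l1d_misses = 40_000        # L1 data cache line replacements
l1i_misses = 4_000         # L1 instruction cache misses
l2_lines_in = 10_000       # L2_LINES_IN.ANY

rates = {
    "L1D": misses_per_instruction(l1d_misses, instructions),
    "L1I": misses_per_instruction(l1i_misses, instructions),
    "L2": misses_per_instruction(l2_lines_in, instructions),
}
for name, rate in rates.items():
    print(f"{name}: {rate:.4f} misses per instruction")
print(f"L1D MPKI: {mpki(l1d_misses, instructions):.1f}")
```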
However, there is another set of formulas to calculate the demand data miss rates:
Demand Data L1 Miss Rate => cannot calculate.
Demand Data L2 Miss Rate =>
(sum of all types of L2 demand data misses) / (sum of L2 demanded data requests) =>
(MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS + MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS) / (L2_RQSTS.ALL_DEMAND_DATA_RD)
Demand Data L3 Miss Rate =>
L3 demand data misses / (sum of all types of demand data L3 requests) =>
MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS / (MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS + MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS)
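The two demand data ratios above share the same numerator terms, which the following Python sketch tries to make explicit. It assumes the event counts have already been collected and exported; the function name `demand_data_miss_rates` and the sample counts are hypothetical and only illustrate the arithmetic.

```python
def demand_data_miss_rates(counts):
    """Return (L2 miss rate, L3 miss rate) for demand data loads."""
    llc_hit = counts["MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS"]
    xsnp_hit = counts["MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS"]
    xsnp_hitm = counts["MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS"]
    llc_miss = counts["MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS"]

    # Every load that missed L2 either hit in the LLC (three flavours) or missed it.
    l2_demand_misses = llc_hit + xsnp_hit + xsnp_hitm + llc_miss

    l2_rate = l2_demand_misses / counts["L2_RQSTS.ALL_DEMAND_DATA_RD"]
    # The L2 misses are exactly the requests that reach the L3 (LLC).
    l3_rate = llc_miss / l2_demand_misses
    return l2_rate, l3_rate


# Hypothetical event counts, for illustration only (not measured data).
sample = {
    "MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS": 700,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS": 50,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS": 50,
    "MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS": 200,
    "L2_RQSTS.ALL_DEMAND_DATA_RD": 10_000,
}

l2_rate, l3_rate = demand_data_miss_rates(sample)
print(f"Demand data L2 miss rate: {l2_rate:.1%}")
print(f"Demand data L3 miss rate: {l3_rate:.1%}")
```

The sketch does not guard against zero denominators, so a run with no L2 demand misses would need a check before dividing.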