Ayam · ‎03-05-2014

Hi,
I am trying to find the L1, L2 and LLC instruction and data cache misses of my application. I am using an Intel(R) Xeon(R) CPU E5-2420 0 @ 1.90GHz. Can you please tell me how I can find instruction and data misses using Intel VTune?
Secondly, if the machine has hyper-threading active, would it be a good idea to turn it off to characterize the application?

Regards,

David_A_Intel1 · ‎03-05-2014

Hi Maria:

Please check out the tuning guides for detailed information. It looks like that processor is in the SandyBridge-EN family.

Ayam · ‎03-05-2014

Thanks for your reply MrAnderson

Yes, the processor is from the SandyBridge-EN family.
I have checked the presentation on "using intel vtune amplifier xe to tune software on the intel xeon processor E5 family". It gives formulas for LLC misses and L2 misses (listed below).

Formulas:
% of cycles spent on memory access (LLC misses):
(MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS * 210) / CPU_CLK_UNHALTED.THREAD

% of cycles spent on last level cache access (2nd level misses that hit in LLC):
((MEM_LOAD_RETIRED.L3_HIT_PS * 40) + (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS * 88) + (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS * 99)) / CPU_CLK_UNHALTED.THREAD
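
For anyone who wants to evaluate the two formulas above outside the tool, here is a minimal Python sketch. It assumes the raw event counts have already been collected and copied into a dictionary keyed by event name. The function names and the sample counts are made up for illustration, the latency weights (210, 40, 88, 99) are the ones quoted above, and the factor of 100 is only there to express the ratio as a percentage.

```python
# Sketch only: evaluates the two quoted cycle-cost formulas from raw event counts.
# The counts below are hypothetical; real values would come from a VTune collection.

def pct_cycles_llc_miss(counts):
    """Estimated % of cycles spent on loads that missed the LLC (weight 210)."""
    cycles = counts["CPU_CLK_UNHALTED.THREAD"]
    if cycles <= 0:
        raise ValueError("CPU_CLK_UNHALTED.THREAD must be positive")
    return 100.0 * counts["MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS"] * 210 / cycles


def pct_cycles_llc_hit(counts):
    """Estimated % of cycles spent on L2 misses that hit in the LLC."""
    cycles = counts["CPU_CLK_UNHALTED.THREAD"]
    if cycles <= 0:
        raise ValueError("CPU_CLK_UNHALTED.THREAD must be positive")
    weighted = (
        counts["MEM_LOAD_RETIRED.L3_HIT_PS"] * 40
        + counts["MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS"] * 88
        + counts["MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS"] * 99
    )
    return 100.0 * weighted / cycles


# Hypothetical sample counts, chosen only to exercise the arithmetic.
sample_counts = {
    "CPU_CLK_UNHALTED.THREAD": 1_000_000_000,
    "MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS": 500_000,
    "MEM_LOAD_RETIRED.L3_HIT_PS": 2_000_000,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS": 100_000,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS": 50_000,
}

if __name__ == "__main__":
    print(f"LLC miss cost: {pct_cycles_llc_miss(sample_counts):.3f}% of cycles")
    print(f"LLC hit cost:  {pct_cycles_llc_hit(sample_counts):.3f}% of cycles")
```

The weights are approximate per-access latencies, so the results are estimates of the cycle cost rather than exact measurements.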

Can you please tell me what the formulas would be for the L1 data and instruction misses?
Secondly, what are the keywords in Intel VTune for memory accesses (loads and stores) on my particular processor?

Regards,

David_A_Intel1 · ‎03-05-2014

The latest releases of VTune Amplifier XE 2013 include a "General Exploration" analysis type, which collects all relevant metrics and presents them in a hierarchical display. I suggest you start with that and see whether it answers your questions. And, again, per another thread, L1 misses are much less costly and typically not what you want to tune for. Tuning for L2/L3 will give you the most "bang for the buck."

Bernard · ‎03-06-2014

>>>Secondly, if machine has hyper-threading active, should it be a good idea to turn it off to characterize the application>>>

It depends on whether your application can profit from Hyperthreading.

For example, two threads with a large amount of floating-point code probably will not benefit, because the execution units are shared by the two logical processors.

David_A_Intel1 · ‎03-06-2014

Yes, ilyapolak, good point. Two thoughts to consider:

1. In general, your app will benefit from hyper-threading. The processor was designed that way! ;)
2. The only way to truly know is to test your application performance with and without hyper-threading enabled.

As far as your profiling efforts, if you plan to deploy on hyper-threaded systems, you should profile on a hyper-threaded system.

I think you are probably reacting to the stories of the "old" hyper-threading (Pentium® 4 processor days). The Intel Core microarchitecture processors have a redesigned, and very good, hyper-threading. Again, in general, your code will benefit from it.

Ayam · ‎03-06-2014

Thanks for explaining the hyper-threading concept, but I am still confused about the cache miss formulas.
I have used VTune Amplifier XE 2013 and run the "General Exploration" analysis type. One parameter is ICACHE.MISSES.
How can I use this parameter to find the L1 instruction cache miss rate?
Moreover, what would be the formulas for the L2 instruction cache and data cache miss rates? The tutorial "using-intel-vtune-amplifier-xe-on-xeon-e5-family-1.0.pdf" only shows the % of cycles spent on memory access.

Bernard · ‎03-06-2014

I was thinking about one (mainly hypothetical) scenario where two OS threads can fully benefit from HT. That can be achieved when two threads are scheduled to run at the same time, and one of them contains mostly floating-point code while the other contains mainly integer code. In this case the CPU scheduler can route the two different streams of machine instructions to different execution ports. A stall can occur when similar uops (like cmp, jmp:condition) from both threads are about to be executed at the same time.

Bernard · ‎03-06-2014

@Maria

Try the link below. Bear in mind that the post is related to Nehalem.

http://software.intel.com/en-us/forums/topic/283851

Bernard · ‎03-06-2014

@Maria

and this link:

http://software.intel.com/en-us/forums/topic/295116

http://software.intel.com/en-us/forums/topic/292819

Read also this document about the Core i7 performance analysis.

http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf

David_A_Intel1 · ‎03-06-2014

@Maria:

The memory cycles metric is the "cost" of your cache misses. If it is not significant, you do not need to focus on it, and if you identified a poor cache miss rate in "cold" code, it would not be worth optimizing.

The General Exploration analysis type will try to highlight issues with "hot" code. It also supports our "Top-Down" approach, documented in all recent tuning guides. One of the guide authors offers this explanation:

With the new Top-Down approach there is a focus on determining where “pipeline slots” are stalled representing a real performance issue in the application as opposed to some of the old cycle accounting metrics. For example, the new metrics try and show “how often were you stalled while waiting for data from the LLC” as opposed to “how often did you hit in the LLC” since the latter may not be a performance issue if all the time waiting for LLC was hidden by real work from other instructions.

However, having said all of that, here is a post from one of our experts documenting formulas you can use to get the information you are requesting. It will require you to create custom analysis types. See the documentation for how to do that.

Ayam · ‎03-10-2014

I have checked the links you referred me to. This is what I gathered; please correct me if I am wrong:

Calculate the impact of L1, L2, LLC misses in terms of cycles spent servicing them; use formula:

LLC cache miss impact:
(180 * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS) / CPU_CLK_UNHALTED.THREAD

LLC cache hit impact (i.e. misses from L2 that hit in LLC):
((26 * MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS) + (43 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS) + (60 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS)) / CPU_CLK_UNHALTED.THREAD

L2 cache hit impact (i.e. misses from L1 that hit in L2):
(12 * MEM_LOAD_UOPS_RETIRED.L2_HIT) / CPU_CLK_UNHALTED.THREAD

Calculate the cache miss rate:

L1 data cache miss rate: L1D_REPLACEMENT / INST_RETIRED.ANY
L1 instruction cache miss rate: L1I_MISSES / INST_RETIRED.ANY
L2 data cache miss rate: L2_LINES_IN.ANY / INST_RETIRED.ANY
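
As a quick check of the three ratios above, here is a small Python sketch. Note that dividing by INST_RETIRED.ANY gives misses per retired instruction rather than misses per cache access, so the sketch also reports the common "misses per thousand instructions" (MPKI) form. The helper names and the sample counts are hypothetical, and the event names are simply the ones listed above.

```python
# Sketch only: per-instruction miss ratios from raw event counts.
# All counts are hypothetical placeholders for values exported from a profiler run.

def misses_per_instruction(miss_count, instructions_retired):
    """Ratio of miss events to retired instructions."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return miss_count / instructions_retired


def mpki(miss_count, instructions_retired):
    """The same ratio scaled to misses per 1000 retired instructions."""
    return 1000.0 * misses_per_instruction(miss_count, instructions_retired)


# Hypothetical sample counts keyed by the event names used in the post.
sample_counts = {
    "INST_RETIRED.ANY": 2_000_000_000,
    "L1D_REPLACEMENT": 4_000_000,
    "L1I_MISSES": 250_000,
    "L2_LINES_IN.ANY": 1_200_000,
}

if __name__ == "__main__":
    instructions = sample_counts["INST_RETIRED.ANY"]
    for event in ("L1D_REPLACEMENT", "L1I_MISSES", "L2_LINES_IN.ANY"):
        ratio = misses_per_instruction(sample_counts[event], instructions)
        print(f"{event}: {ratio:.6f} per instruction, "
              f"{mpki(sample_counts[event], instructions):.3f} MPKI")
```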

However, there is another set of formulas to calculate the demand data miss rates:

Demand Data L1 Miss Rate => cannot calculate.

Demand Data L2 Miss Rate =>
(sum of all types of L2 demand data misses) / (sum of L2 demanded data requests) =>
(MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS + MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS) / (L2_RQSTS.ALL_DEMAND_DATA_RD)

Demand Data L3 Miss Rate =>
L3 demand data misses / (sum of all types of demand data L3 requests) =>
MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS / (MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS + MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS)
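
To make the two demand data formulas above concrete, here is a minimal Python sketch. It assumes the event counts are available in a dictionary keyed by event name. The function name and the sample counts are invented for illustration. The sum of the four load events is used both as the L2 demand miss count and as the L3 demand request count, exactly as in the formulas.

```python
# Sketch only: demand data miss rates for L2 and L3 from raw event counts.
# The sample counts are hypothetical and serve only to exercise the arithmetic.

LLC_HIT = "MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS"
XSNP_HIT = "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS"
XSNP_HITM = "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS"
LLC_MISS = "MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS"
L2_DEMAND_RD = "L2_RQSTS.ALL_DEMAND_DATA_RD"


def demand_data_miss_rates(counts):
    """Return (L2 demand data miss rate, L3 demand data miss rate)."""
    # Loads that missed L2 are the loads that went on to hit or miss in the LLC.
    l2_demand_misses = (
        counts[LLC_HIT] + counts[XSNP_HIT] + counts[XSNP_HITM] + counts[LLC_MISS]
    )
    l2_requests = counts[L2_DEMAND_RD]
    if l2_requests <= 0 or l2_demand_misses <= 0:
        raise ValueError("request counts must be positive")
    l2_rate = l2_demand_misses / l2_requests
    # Every L2 demand miss becomes an L3 demand request.
    l3_rate = counts[LLC_MISS] / l2_demand_misses
    return l2_rate, l3_rate


# Hypothetical sample counts.
sample_counts = {
    LLC_HIT: 2_000_000,
    XSNP_HIT: 100_000,
    XSNP_HITM: 50_000,
    LLC_MISS: 350_000,
    L2_DEMAND_RD: 10_000_000,
}

if __name__ == "__main__":
    l2_rate, l3_rate = demand_data_miss_rates(sample_counts)
    print(f"Demand data L2 miss rate: {l2_rate:.2%}")
    print(f"Demand data L3 miss rate: {l3_rate:.2%}")
```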

L1, L2 and LLC instruction and data cache misses on E5-2420 using intel vtune