With VTune 9.1, it's possible to estimate the percentage of cycles spent on long-latency data accesses, such as LLC misses and MLC misses.
Besides that, how can I measure the miss ratio of each cache level (L1/L2/L3) on Nehalem? What is the formula?
We can estimate the percentage of cycles due to long-latency data access:
For 3rd-level misses: ((MEM_LOAD_RETIRED.LLC_MISS * 180) / CPU_CLK_UNHALTED.THREAD) * 100
For 2nd-level misses: (((MEM_LOAD_RETIRED.LLC_UNSHARED_HIT * 35) + (MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM * 74)) / CPU_CLK_UNHALTED.THREAD) * 100
If the percentage is significant (> 20%), consider reducing misses.
Use VTune Analyzer to drill down to the source line, investigate the cause, and change your code.
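As a rough illustration, the two formulas above can be written as small Python helpers. This is only a sketch: the function names are made up for this example, the latency weights (180, 35, 74 cycles) are simply the constants quoted in the formulas, and the counts in the usage section are invented.

```python
# Latency weights in cycles, copied from the formulas quoted above.
LLC_MISS_CYCLES = 180
LLC_UNSHARED_HIT_CYCLES = 35
OTHER_CORE_L2_HIT_HITM_CYCLES = 74


def llc_miss_cycle_pct(llc_miss, clk_unhalted_thread):
    """Estimated % of cycles lost to 3rd-level (LLC) misses."""
    return llc_miss * LLC_MISS_CYCLES / clk_unhalted_thread * 100


def mlc_miss_cycle_pct(llc_unshared_hit, other_core_l2_hit_hitm, clk_unhalted_thread):
    """Estimated % of cycles lost to 2nd-level (MLC) misses."""
    stall_cycles = (llc_unshared_hit * LLC_UNSHARED_HIT_CYCLES
                    + other_core_l2_hit_hitm * OTHER_CORE_L2_HIT_HITM_CYCLES)
    return stall_cycles / clk_unhalted_thread * 100


if __name__ == "__main__":
    # Invented counts, for illustration only.
    clk = 1_000_000_000
    print(f"LLC miss cost: {llc_miss_cycle_pct(1_000_000, clk):.1f}%")
    print(f"MLC miss cost: {mlc_miss_cycle_pct(2_000_000, 1_000_000, clk):.1f}%")
```

A value above the 20% threshold mentioned above would be the cue to drill down further.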
Thanks for your reply.
However, I'd like to know the formula for calculating the L1/L2/L3 cache miss ratio.
The Intel manuals give formulas for Itanium and Core 2 processors, but not for Core i7 processors.
Maybe the estimation of the percentage of cycles due to L2/L3 cache miss is a better metric than simple cache miss ratio?
These formulas (the ones I posted last time) indicate how cache misses affect the application's overall run time.
If you want to know the cache miss ratio for each level, here are examples (for memory loads):
1. L1: L1D_CACHE_LD.I_STATE / L1D_CACHE_LD.MESI
2. L2: (MEM_LOAD_RETIRED.LLC_UNSHARED_HIT + MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM) / L2_RQSTS.LOADS
3. L3: MEM_LOAD_RETIRED.LLC_MISS / (MEM_LOAD_RETIRED.LLC_UNSHARED_HIT + MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM)
Users can also define their own event ratios, e.g. for instruction cache misses.
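The three ratios listed above could be computed from raw counts roughly as follows. This is a minimal sketch: the function names and the dictionary layout are assumptions made for this example, and the sample counts are invented.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def load_miss_ratios(counts):
    """Apply the L1/L2/L3 load miss-ratio formulas quoted above.

    `counts` maps event names (written with a dot, as in the formulas)
    to raw counter values.
    """
    # Loads that missed L2: served by the LLC or by another core's L2.
    l2_misses = (counts["MEM_LOAD_RETIRED.LLC_UNSHARED_HIT"]
                 + counts["MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM"])
    return {
        "L1": safe_ratio(counts["L1D_CACHE_LD.I_STATE"], counts["L1D_CACHE_LD.MESI"]),
        "L2": safe_ratio(l2_misses, counts["L2_RQSTS.LOADS"]),
        "L3": safe_ratio(counts["MEM_LOAD_RETIRED.LLC_MISS"], l2_misses),
    }


if __name__ == "__main__":
    # Invented counts, for illustration only.
    sample = {
        "L1D_CACHE_LD.I_STATE": 50,
        "L1D_CACHE_LD.MESI": 1000,
        "MEM_LOAD_RETIRED.LLC_UNSHARED_HIT": 20,
        "MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM": 5,
        "L2_RQSTS.LOADS": 100,
        "MEM_LOAD_RETIRED.LLC_MISS": 5,
    }
    for level, ratio in load_miss_ratios(sample).items():
        print(level, ratio)
```

Returning None for a zero denominator is a design choice of this sketch, so that a missing or empty counter is not silently reported as a 0% miss ratio.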
I have tested the load-latency formulas you pointed out.
However, something is strange.
My platform is Intel Nehalem (Core i7) / Linux.
Here are some event counts collected with pfmon-3.9/perfmon2.
Index Description Counter Value
1 L1D_CACHE_LD:I_STATE (description not available)................. 1792152169
2 L1D_CACHE_LD:MESI (description not available).................... 3601420667
3 MEM_LOAD_RETIRED:LLC_MISS (description not available)............ 3203586
4 MEM_LOAD_RETIRED:LLC_UNSHARED_HIT (description not available).... 331743878
5 MEM_LOAD_RETIRED:OTHER_CORE_L2_HIT_HITM (description not available) 0
6 L2_RQSTS:LOADS (description not available)....................... 718837824
7 CPU_CLK_UNHALTED:THREAD (description not available).............. 7310483484
8 FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION (description not available). 1633902124
9 FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION (description not available). 0
Note that if we calculate the MLC miss cost as
(((MEM_LOAD_RETIRED.LLC_UNSHARED_HIT * 35) + (MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM * 74)) / CPU_CLK_UNHALTED.THREAD) * 100, the resulting percentage is about 158.827%. This is intuitively wrong, since the total overhead should be smaller than the total run time.
Measurements with perfsuite-1.0.0/perfctr give similar results.
So I'm wondering whether this calculation method is applicable only to VTune, since VTune samples rather than counts PMU events (as pfmon and perfsuite do).
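For reference, the arithmetic behind the ~158.8% figure can be reproduced from the counts listed above. The sketch below assumes the counter lines look exactly as pasted in this post (event name first, value last). The parser is a hypothetical helper written for this example, not part of pfmon.

```python
# Counter lines copied from the listing above.
PFMON_LINES = """\
MEM_LOAD_RETIRED:LLC_MISS (description not available)............ 3203586
MEM_LOAD_RETIRED:LLC_UNSHARED_HIT (description not available).... 331743878
MEM_LOAD_RETIRED:OTHER_CORE_L2_HIT_HITM (description not available) 0
CPU_CLK_UNHALTED:THREAD (description not available).............. 7310483484
"""


def parse_counts(text):
    """Map EVENT.UMASK -> count for lines that end in an integer value."""
    counts = {}
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 2 or not tokens[-1].isdigit():
            continue
        # The event name is the first token containing a colon.
        name = next((t for t in tokens if ":" in t), None)
        if name is not None:
            counts[name.replace(":", ".")] = int(tokens[-1])
    return counts


def mlc_miss_cycle_pct(counts):
    """MLC miss cost formula quoted above, as a percentage of unhalted cycles."""
    stall_cycles = (counts["MEM_LOAD_RETIRED.LLC_UNSHARED_HIT"] * 35
                    + counts["MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM"] * 74)
    return stall_cycles / counts["CPU_CLK_UNHALTED.THREAD"] * 100


if __name__ == "__main__":
    counts = parse_counts(PFMON_LINES)
    print(f"MLC miss cost: {mlc_miss_cycle_pct(counts):.3f}%")
```

With these inputs the formula does yield a value above 100%, so the odd result comes from the counts and weights themselves, not from an arithmetic slip.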
This is a problem when event-based sampling runs alongside a perfmon2-like program, because both use the PMU in the processor.
Perhaps VTune Analyzer and perfmon2 are conflicting over shared PMU resources, so the result is incorrect.
Perhaps my previous description is not very clear and causes misunderstanding.
VTune works by sampling and call graph; it provides no PMU counting mode.
The sampled values are statistically imprecise.
When I care about the absolute value of an event count, counting mode is a better choice, and pfmon-like tools provide it. However, when I use pfmon to count PMU events (ALONE, not together with VTune, so there are NO PMU resource-sharing issues) and apply the formulas you gave (which also appear in the Intel manuals), the MLC cache miss penalty (about 150%) confuses me. How can such a penalty be larger than the total run time of the program?
So I'm wondering if the above formula applies to VTune sampling results only (since MEM_LOAD_RETIRED.LLC_UNSHARED_HIT is a PEBS event)?
Or does it also apply to the counts from pfmon, and pfmon simply isn't working correctly here? But another tool, perfsuite, gives similar results.
What do you think?
Sampling data collection absolutely relies on the PMU's counters, so we can't trust the sampling results when pfmon, which also uses the PMU, is in use.
Observe your results -
MEM_LOAD_RETIRED:LLC_UNSHARED_HIT (description not available).... 331743878 -> this is much higher, and unreasonable
MEM_LOAD_RETIRED:OTHER_CORE_L2_HIT_HITM (description not available) 0
CPU_CLK_UNHALTED:THREAD (description not available).............. 7310483484
So please don't use sampling together with an application that uses the PMU.
It seems that LLC_MISS was high!
LLC miss/hit ratio = 90318400 / (4810432 + 404736), which is about 17. That means roughly 17 LLC misses for every LLC hit, on average.
There is no need to normalize PMU counts if you use Intel VTune Performance Analyzer.
Could this be a special memory test in the c_lu benchmark? Can you test other, ordinary applications and compare the results?
I was also trying to find out the MPKI. I looked through the documentation, but it was not much help.
The problem is that we need INSTR_RETIRED for the period in which the sampling takes place, but there is no such counter.
We have a counter that gives INSTR_RETIRED for the entire program, not only during sampling.
Also, is there any way to disable sampling in this case? I want to measure the miss events throughout the program.
The user can add or remove the event INST_RETIRED.ANY by modifying the sampling activity.
Why do you want to disable sampling data collection? Do you want to collect performance data inside your program? If so, VTune Analyzer can't interpret that data, but you can still use the formulas we discussed above.
Sorry, I was not clear in my question.
I want to measure MPKI (misses per kilo-instruction). Using sampling events, I got the number of L2 miss events.
Now, to get MPKI, I need the number of instructions retired during this sampling. Is there a counter for that?
I want to disable sampling because there is a slight variation (~2%) when I re-run the same program. To eliminate this, I want to collect the miss events throughout the program.
Now I think I understand what you need.
You have to use the INST_RETIRED.ANY event in sampling data collection to get the total instructions executed for the process or module of interest. Meanwhile, you may disable the collection of L2 miss events in your program and remove the other events from the sampling configuration.
In the next session, run without sampling data collection so that you can collect the miss events throughout the program.
Then you can compute MPKI = L2 misses * 1000 / INST_RETIRED.ANY
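The MPKI formula above is simple enough to check by hand, but a small helper makes the units explicit. This is a sketch only: the function name is made up for this example and the counts in the usage line are invented.

```python
def mpki(misses, instructions_retired):
    """Misses per kilo-instruction: misses * 1000 / INST_RETIRED.ANY."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return misses * 1000 / instructions_retired


if __name__ == "__main__":
    # Invented counts: 2.5 million L2 misses over 1 billion retired instructions.
    print(f"L2 MPKI: {mpki(2_500_000, 1_000_000_000):.2f}")
```

Both counts should cover the same process or module and the same run interval. Otherwise the ratio mixes unrelated measurements.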