How to get the L1, L2 cache misses of an Intel i5 (Sandy Bridge)

eandy
Beginner
1,003 Views
Hello,
I am analysing the SPEC CPU2006 benchmarks on different CPUs with Intel VTune, and it has worked fine so far.
I have a problem counting the cache misses of the L1, L2 and L3 caches: I cannot find the event names to do that.
The CPU is an Intel i5-2400 (Sandy Bridge).
On all the other CPUs it was easy to find the names of the events.
Can you please help me?
Regards,
Andre
0 Kudos
16 Replies
Kirill_R_Intel
Employee
1,003 Views

Some useful info about Sandy Bridge events can be found here:
http://software.intel.com/en-us/articles/using-intel-vtune-amplifier-xe-to-tune-software-on-the-2nd-generation-intel-core-processor-family/
http://software.intel.com/en-us/articles/two-part-webinar-and-two-videos-posted-all-covering-sandy-bridge-performance-tuning/

The cache miss formulas should look this way:

L3 cache miss
(180 * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS) / CPU_CLK_UNHALTED.THREAD

L2 cache miss
((26 * MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS) + (43 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS) + (60 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS)) / CPU_CLK_UNHALTED.THREAD

L1 cache miss
((12 * MEM_LOAD_UOPS_RETIRED.L2_HIT) + (26 * MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS) + (43 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS) + (60 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS) + (180 * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS)) / CPU_CLK_UNHALTED.THREAD

0 Kudos
eandy
Beginner
1,003 Views
Thank you for your help!
But are these really the formulas to get the miss rate of the L1, L2 and L3 caches?
With these formulas I get the result that the miss rate of the L3 is bigger than that of the L2, which in turn is bigger than that of the L1. Shouldn't it be the other way around?
Or am I thinking about this the wrong way?
Regards,
Andre
0 Kudos
eandy
Beginner
1,003 Views
Hello,
It would be enough if I could get the total number of L1, L2 and L3 cache misses.
But for Sandy Bridge I cannot find the events or the formulas.
Can you help me?
Andre
0 Kudos
Shannon_C_Intel
Employee
1,004 Views
Hi Andre,
The formulas Kirill gives above are to calculate the impacts of L1, L2, and L3 misses in terms of cycles spent servicing them. These impacts can be measured at the function or whole application level, depending on which values you plug into them.
However, I use the following formulas for impact, which are mutually exclusive:

LLC cache miss impact:

(180 * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS) / CPU_CLK_UNHALTED.THREAD

LLC cache hit impact (i.e. misses from L2 that hit in LLC):

((26 * MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS) + (43 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS) + (60 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS)) / CPU_CLK_UNHALTED.THREAD

L2 cache hit impact (i.e. misses from L1 that hit in L2):
(12 * MEM_LOAD_UOPS_RETIRED.L2_HIT) / CPU_CLK_UNHALTED.THREAD

These can all be added to see the total impact of everything that missed the L1.
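As a rough illustration, the three exclusive impact formulas can be evaluated from raw event counts along the lines of the Python sketch below. The function name, the dictionary layout and the sample counts are hypothetical and chosen only for this example; the cycle weights (12, 26, 43, 60, 180) are the ones quoted in the formulas in this post.

```python
# Cycle weights as quoted in the impact formulas of this post (Sandy Bridge).
L2_HIT_CYCLES = 12
LLC_HIT_CYCLES = 26
XSNP_HIT_CYCLES = 43
XSNP_HITM_CYCLES = 60
LLC_MISS_CYCLES = 180


def cache_impacts(counts):
    """Return the fraction of unhalted cycles attributed to each category.

    `counts` maps event names to hardware event counts (not sample counts).
    """
    clk = counts["CPU_CLK_UNHALTED.THREAD"]
    llc_miss = LLC_MISS_CYCLES * counts["MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS"] / clk
    llc_hit = (
        LLC_HIT_CYCLES * counts["MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS"]
        + XSNP_HIT_CYCLES * counts["MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS"]
        + XSNP_HITM_CYCLES * counts["MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS"]
    ) / clk
    l2_hit = L2_HIT_CYCLES * counts["MEM_LOAD_UOPS_RETIRED.L2_HIT"] / clk
    return {
        "llc_miss": llc_miss,
        "llc_hit": llc_hit,
        "l2_hit": l2_hit,
        # The categories are exclusive, so their sum is the total impact
        # of every load that missed the L1.
        "total_l1_miss": llc_miss + llc_hit + l2_hit,
    }


# Made-up counts, purely for illustration.
example_counts = {
    "CPU_CLK_UNHALTED.THREAD": 1_000_000,
    "MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS": 1_000,
    "MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS": 2_000,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS": 100,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS": 50,
    "MEM_LOAD_UOPS_RETIRED.L2_HIT": 10_000,
}

for name, value in cache_impacts(example_counts).items():
    print(f"{name}: {value:.2%}")
```

The counts can be taken for a single function or for the whole application, depending on the level at which the impact is wanted.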

However, you asked for miss rate formulas, which I interpret as (misses / total requests) for a given cache level. This is a bit hard to calculate because there are several types of requests: demand requests that come from the application vs. prefetch requests generated by the hardware, code requests vs. data requests, and so on. There are no events that count all of the different combinations, but I can give you the following, which cover demand data (not prefetches or instructions) and apply to SINGLE-SOCKET processors based on Intel Microarchitecture Codename Sandy Bridge:

Demand Data L1 Miss Rate => cannot calculate.

Demand Data L2 Miss Rate =>
(sum of all types of L2 demand data misses) / (sum of L2 demanded data requests) =>
(MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS + MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS) / (L2_RQSTS.ALL_DEMAND_DATA_RD)

Demand Data L3 Miss Rate =>
L3 demand data misses / (sum of all types of demand data L3 requests) =>
MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS / (MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS + MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS)
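A minimal Python sketch of the two demand-data miss rate formulas follows, assuming the event counts have already been exported into a dictionary. The function name, the dictionary layout and the sample numbers are hypothetical; only the event names and the formulas come from this post.

```python
def demand_data_miss_rates(counts):
    """Compute the demand data L2 and L3 miss rates from event counts.

    Returns None for a rate whose denominator is zero.
    """
    llc_hit = counts["MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS"]
    xsnp_hit = counts["MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS"]
    xsnp_hitm = counts["MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS"]
    llc_miss = counts["MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS"]
    l2_requests = counts["L2_RQSTS.ALL_DEMAND_DATA_RD"]

    # Every demand data load that missed the L2 either hit the LLC
    # (in one of three ways) or missed it; the same sum is the number
    # of demand data requests that reached the L3.
    l2_misses = llc_hit + xsnp_hit + xsnp_hitm + llc_miss

    l2_rate = l2_misses / l2_requests if l2_requests else None
    l3_rate = llc_miss / l2_misses if l2_misses else None
    return {"l2": l2_rate, "l3": l3_rate}


# Made-up counts, purely for illustration.
example_counts = {
    "MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS": 2_000,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS": 100,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS": 50,
    "MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS": 1_000,
    "L2_RQSTS.ALL_DEMAND_DATA_RD": 20_000,
}

rates = demand_data_miss_rates(example_counts)
print(rates)
```

Unlike the impact formulas, these ratios carry no cycle weights, so each one is a plain fraction of requests at that cache level.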

To collect all of these events, you would want to create a new custom analysis type (small button at the top left of the analysis type pane in the standalone GUI), choose "New Hardware Event-Based Sampling Analysis", and then add all of the above events. Use the default sample after values, and hit OK when you are done adding them. Then run the analysis, and view the data as "Hardware Event Counts" (which will be the default). Do not use the results as "Hardware Event Sample Counts".
Hope this helps!

0 Kudos
eandy
Beginner
1,003 Views
Thank you for your Help!
0 Kudos
Divino_C_
New Contributor I
1,003 Views

Hello Shannon, and everyone.

Where do the constants that you use in your formulas come from? For instance, the 26, 43 and 60 in this one:

((26 * MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS) + (43 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS) + (60 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS)) / CPU_CLK_UNHALTED.THREAD

I guess those are the cycles needed to service a hit in the LLC in each circumstance. Am I correct? I need to apply this analysis to Ivy Bridge. Do you know where I can find these constants for it?

Thanks!

0 Kudos
Bernard
Valued Contributor I
1,003 Views

@Divino

You asked an interesting question. I agree with you that those constants probably represent the cycles needed to service the events that are part of the formulas.

0 Kudos
Divino_C_
New Contributor I
1,003 Views

Hi Iliyapolak,

In section 2.2.5.1 of this document [1] there is a table showing the best-case latency for cache accesses. However, I believe this information is for Sandy Bridge.

[1] http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

0 Kudos
sun_s_
Beginner
1,003 Views

@ Kirill Rogozhin (Intel)

I encountered two problems when I tried to use your formulas to calculate the cache misses.

1. When I calculate the L3 miss ratio, I get 90%. But my test application is just one line calling printf, so it cannot be that big. And when I calculate the L2 miss ratio, the result is even bigger than 1, which is obviously not correct.

2. When I use the hardware event MEM_LOAD_RETIRED.LLC_HIT_PS, it is reported as an invalid event. But on a Sandy Bridge platform I think this event should be valid, so I have no idea what is happening.

Any help would be appreciated.

Sun.

0 Kudos
Bernard
Valued Contributor I
1,003 Views

Divino C. wrote:

Hi Iliyapolak,

In section 2.2.5.1 of this document [1] there is a table showing the best-case latency for cache accesses. However, I believe this information is for Sandy Bridge.

[1] http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

It is for the Core microarchitecture, not Sandy Bridge (Hebrew: Gesher), which is the successor to Nehalem.

0 Kudos
Kirill_R_Intel
Employee
1,003 Views

sun s., as Shannon stated above, the formulas I provided are not cache miss rates; they represent cache miss impact. For cache miss rates, refer to the formulas she specified: Demand Data L2 Miss Rate and Demand Data L3 Miss Rate.

0 Kudos
sun_s_
Beginner
1,003 Views

@Kirill Rogozhin (Intel)

Thanks. Now I see.

0 Kudos
Linan_H_
Beginner
1,003 Views

Hi Shannon,

The formulas that you provided above are for Sandy Bridge, right?

I am working on an i3-3220 (Ivy Bridge) with Intel VTune Amplifier and I want to measure the L1, L2 and L3 cache miss impact, but I cannot find the relevant formulas. Can I use the above formulas to profile programs running on machines with the Ivy Bridge architecture? Could you please give me some help?

0 Kudos
Peter_W_Intel
Employee
1,003 Views

@ Linan H

Please go to this site to get the Intel(R) VTune(TM) Amplifier tuning guide article for your specific processor; the article should have the proper L2/LLC formulas.

0 Kudos
Linan_H_
Beginner
1,003 Views

Peter Wang (Intel) wrote:

@ Linan H

Please go to this site to get the Intel(R) VTune(TM) Amplifier tuning guide article for your specific processor; the article should have the proper L2/LLC formulas.

Hi Peter,

    Thanks for your help!

I've found the L2/LLC cache miss impact formulas in the document "Using_Intel_VTune_Amplifier_XE_on_3rd_Generation_Intel_Core_Processors_1.0", but what about the L1 cache miss impact formula? If there is an L1 cache miss impact formula for Sandy Bridge, I believe there should be a corresponding formula for Ivy Bridge. Could you please help me figure it out?

0 Kudos
Peter_W_Intel
Employee
1,003 Views

 >...but how about the L1 cache miss impact formula?

The reason it is not listed is that the penalty of an L1 miss is low, approximately 6 cycles (in my experience). If you really need this, you can configure it yourself:

For example:

Formula: % of cycles spent on L1 Misses

(6 * MEM_LOAD_UOPS_RETIRED.L1_MISS_PS) / CPU_CLK_UNHALTED.THREAD

Thresholds: Investigate if  "% of cycles spent on L1 Misses" > 0.2 (20%)
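The formula and threshold above could be applied to collected counts roughly as in this Python sketch. The function names and the sample counts are hypothetical; the 6-cycle penalty is the approximate figure quoted above, so treat the result as an estimate.

```python
# Approximate L1 miss penalty in cycles, as quoted above (an estimate).
L1_MISS_PENALTY_CYCLES = 6
# Investigate when more than 20% of cycles are attributed to L1 misses.
INVESTIGATE_THRESHOLD = 0.2


def l1_miss_impact(l1_miss_count, unhalted_cycles, penalty=L1_MISS_PENALTY_CYCLES):
    """Fraction of cycles attributed to L1 misses.

    l1_miss_count   -- count of MEM_LOAD_UOPS_RETIRED.L1_MISS_PS
    unhalted_cycles -- count of CPU_CLK_UNHALTED.THREAD
    """
    return penalty * l1_miss_count / unhalted_cycles


def needs_investigation(impact, threshold=INVESTIGATE_THRESHOLD):
    """True when the impact exceeds the suggested threshold."""
    return impact > threshold


# Made-up counts, purely for illustration.
impact = l1_miss_impact(l1_miss_count=50_000, unhalted_cycles=1_000_000)
print(f"L1 miss impact: {impact:.1%}, investigate: {needs_investigation(impact)}")
```

Keeping the penalty as a parameter makes it easy to substitute a different latency estimate for another microarchitecture.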

 

0 Kudos