The formulas Kirill gives above are to calculate the impacts of L1, L2, and L3 misses in terms of cycles spent servicing them. These impacts can be measured at the function or whole application level, depending on which values you plug into them.

However, I use the following formulas for impact, which are mutually exclusive:

LLC cache miss impact:

(180 * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS) / CPU_CLK_UNHALTED.THREAD

LLC cache hit impact (i.e., misses from L2 *THAT HIT IN LLC*):

((26 * MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS) + (43 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS) + (60 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS)) / CPU_CLK_UNHALTED.THREAD

L2 cache hit impact (i.e., misses from L1 *THAT HIT IN L2*):

(12 * MEM_LOAD_UOPS_RETIRED.L2_HIT) / CPU_CLK_UNHALTED.THREAD

These can all be added to see the total impact of everything that missed the L1.
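As a rough illustration of how the three exclusive impact terms add up, here is a minimal Python sketch. The function name `miss_impacts`, the `WEIGHTS` table layout, and the event counts are all made up for illustration; only the latency weights (12, 26, 43, 60, 180) and the event names come from the formulas above.

```python
# Approximate latencies in cycles, copied from the impact formulas above.
WEIGHTS = {
    "llc_miss": {"MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS": 180},
    "llc_hit": {
        "MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS": 26,
        "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS": 43,
        "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS": 60,
    },
    "l2_hit": {"MEM_LOAD_UOPS_RETIRED.L2_HIT": 12},
}


def miss_impacts(counts):
    """Return each impact as a fraction of unhalted cycles, plus their sum."""
    clk = counts["CPU_CLK_UNHALTED.THREAD"]
    impacts = {
        name: sum(cycles * counts[event] for event, cycles in events.items()) / clk
        for name, events in WEIGHTS.items()
    }
    # The terms are exclusive, so the sum covers everything that missed the L1.
    impacts["total"] = sum(impacts.values())
    return impacts


# Hypothetical event counts, for illustration only (not measured data).
example_counts = {
    "CPU_CLK_UNHALTED.THREAD": 1_000_000,
    "MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS": 1_000,
    "MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS": 2_000,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS": 100,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS": 50,
    "MEM_LOAD_UOPS_RETIRED.L2_HIT": 5_000,
}

for name, value in miss_impacts(example_counts).items():
    print(f"{name}: {value:.2%}")
```

Plug in the "Hardware Event Counts" for a single function or for the whole application, depending on the level you want to look at.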

However, you asked for Miss Rate formulas, which I interpret as (misses / total requests) for a given cache level. This is a bit hard to calculate because there are several types of requests: demand requests that come from the application vs. prefetch requests generated by the hardware, code requests vs. data requests, etc. There are no events to count all of the different combinations, but I can give you these, which are for demand data (not prefetches or instructions) and are applicable to *SINGLE SOCKET* processors based on Intel Microarchitecture Codename Sandy Bridge:

*Demand Data* L1 Miss Rate => cannot calculate.

*Demand Data* L2 Miss Rate =>

(sum of all types of L2 demand data misses) / (sum of L2 demand data requests) =>

(MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS + MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS) / (L2_RQSTS.ALL_DEMAND_DATA_RD)

*Demand Data* L3 Miss Rate =>

L3 demand data misses / (sum of all types of demand data L3 requests) =>

MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS / (MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS + MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS)
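The two miss rate formulas can be sketched in Python as follows. This is only an illustration: the function name `demand_data_miss_rates` and the counts are hypothetical, and returning `None` for a zero denominator is my own choice, not something the tool does.

```python
def demand_data_miss_rates(counts):
    """Return (L2 miss rate, L3 miss rate) for demand data loads.

    A rate is None when its denominator is zero.
    """
    llc_miss = counts["MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS"]
    # Every demand data load that missed L2 either hit in the LLC
    # (three flavours) or missed it, so this sum is both the L2 miss
    # count and the L3 request count.
    l2_misses = (
        counts["MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS"]
        + counts["MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS"]
        + counts["MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS"]
        + llc_miss
    )
    l2_requests = counts["L2_RQSTS.ALL_DEMAND_DATA_RD"]

    l2_rate = l2_misses / l2_requests if l2_requests else None
    l3_rate = llc_miss / l2_misses if l2_misses else None
    return l2_rate, l3_rate


# Hypothetical event counts, for illustration only (not measured data).
rate_counts = {
    "MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS": 2_000,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS": 100,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS": 50,
    "MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS": 1_000,
    "L2_RQSTS.ALL_DEMAND_DATA_RD": 10_000,
}

l2_rate, l3_rate = demand_data_miss_rates(rate_counts)
print(f"L2 demand data miss rate: {l2_rate:.3f}")
print(f"L3 demand data miss rate: {l3_rate:.3f}")
```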

To collect all of these events, you would want to create a new custom analysis type (small button at the top left of the analysis type pane in the standalone GUI), choose "New Hardware Event-Based Sampling Analysis", and then add all of the above events. Use the default sample after values, and hit OK when you are done adding them. Then run the analysis, and view the data as "Hardware Event Counts" (which will be the default). Do not use the results as "Hardware Event Sample Counts".

Hope this helps!

Some useful info about Sandy Bridge events can be found here:

http://software.intel.com/en-us/articles/using-intel-vtune-amplifier-xe-to-tune-software-on-the-2nd-generation-intel-core-processor-family/

http://software.intel.com/en-us/articles/two-part-webinar-and-two-videos-posted-all-covering-sandy-bridge-performance-tuning/

The cache miss formulas should look like this:

L3 cache miss

(180 * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS) / CPU_CLK_UNHALTED.THREAD

L2 cache miss

((26 * MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS) + (43 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS) + (60 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS)) / CPU_CLK_UNHALTED.THREAD

L1 cache miss

((12 * MEM_LOAD_UOPS_RETIRED.L2_HIT) + (26 * MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS) + (43 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS) + (60 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS) + (180 * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS)) / CPU_CLK_UNHALTED.THREAD

Hello Shannon and everyone,

Where do the constants in your formulas come from? For instance, the 26, 43, and 60 in this one:

((26 * MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS) + (43 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS) + (60 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS)) / CPU_CLK_UNHALTED.THREAD

I guess those are the cycles needed to service a hit in the LLC in each circumstance. Am I correct? I need to apply this analysis to Ivy Bridge. Do you know where I can find these constants for it?

Thanks!

@Divino

You asked an interesting question. I agree with you that those constants can represent the cycles needed to service the events that are part of the formulas.

Hi Iliyapolak,

In section 2.2.5.1 of this document [1] there is a table showing the best-case latency for cache accesses. However, I believe this information is for Sandy Bridge.

I encountered two problems when I tried to use your formulas to calculate the cache misses.

1. When I calculate the L3 miss ratio, I get 90%. But my test application is just one line that calls printf, so it can't be that big. And when I calculate the L2 miss ratio, the result is even bigger than 1, which is obviously not correct.

2. When I use the hardware event MEM_LOAD_RETIRED.LLC_HIT_PS, it is reported as an invalid event. But on a Sandy Bridge platform I think this event should be valid, so I have no idea what's happening.

Any help would be appreciated.

Sun.

Divino C. wrote:

>Hi Iliyapolak,
>
>In section 2.2.5.1 of this document [1] there is a table showing the best-case latency for cache accesses. However, I believe this information is for Sandy Bridge.

It is for the Core microarchitecture, not Sandy Bridge (Hebrew: Gesher), which is the successor to Nehalem.

sun s., as Shannon stated above, the formulas I provided are not cache miss rates; they represent cache miss impact. For cache miss rates, refer to the formulas she specified: *Demand Data* L2 Miss Rate and *Demand Data* L3 Miss Rate.

Shannon Cepeda (Intel) wrote:

>Hi Andre,
>
>The formulas Kirill gives above are to calculate the impacts of L1, L2, and L3 misses in terms of cycles spent servicing them...

Hi Shannon,

The formulas that you provided above are for Sandy Bridge, right?

I am working on an i3-3220 (Ivy Bridge) with Intel VTune Amplifier and I want to measure the L1, L2, and L3 cache miss impact, but I cannot find the related formulas. Can I use the above formulas to profile programs running on machines with the Ivy Bridge architecture? Could you please give me some help?

@ Linan H

Please go to this site to get the Intel(R) VTune(TM) Amplifier tuning guide article for your specific processor. The article should have the proper L2/LLC formulas.

Peter Wang (Intel) wrote:

>@ Linan H
>
>Please go to this site to get the Intel(R) VTune(TM) Amplifier tuning guide article for your specific processor. The article should have the proper L2/LLC formulas.

Hi Peter,

Thanks for your help!

I've found the L2/LLC cache miss impact formulas in the document "Using_Intel_VTune_Amplifier_XE_on_3rd_Generation_Intel_Core_Processors_1.0", but what about the L1 cache miss impact formula? If there is an L1 cache miss impact formula for Sandy Bridge, I believe there should be a corresponding formula for Ivy Bridge. Could you please help me figure it out?

>...but how about the L1 cache miss impact formula?

The reason is that the penalty of an L1 miss is low, approximately 6 cycles (in my experience). If you really need this, you can configure it yourself.

For example:

Formula: % of cycles spent on L1 Misses

(6 * MEM_LOAD_UOPS_RETIRED.L1_MISS_PS) / CPU_CLK_UNHALTED.THREAD

Threshold: Investigate if "% of cycles spent on L1 Misses" > 0.2 (20%)
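A minimal Python sketch of the formula and threshold check above. The function names and the event counts are hypothetical, and the 6-cycle penalty is only the rough estimate given in this post, not a documented constant.

```python
L1_MISS_PENALTY_CYCLES = 6   # rough estimate from this post, not a documented value
INVESTIGATE_THRESHOLD = 0.2  # 20% of unhalted cycles


def l1_miss_cycle_fraction(l1_miss_count, unhalted_cycles):
    """Fraction of cycles attributed to L1 misses.

    l1_miss_count   -> MEM_LOAD_UOPS_RETIRED.L1_MISS_PS
    unhalted_cycles -> CPU_CLK_UNHALTED.THREAD
    """
    return L1_MISS_PENALTY_CYCLES * l1_miss_count / unhalted_cycles


def needs_investigation(fraction, threshold=INVESTIGATE_THRESHOLD):
    """True when the fraction exceeds the suggested threshold."""
    return fraction > threshold


# Hypothetical event counts, for illustration only (not measured data).
fraction = l1_miss_cycle_fraction(l1_miss_count=50_000, unhalted_cycles=1_000_000)
print(f"{fraction:.1%} of cycles spent on L1 misses; "
      f"investigate: {needs_investigation(fraction)}")
```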
