JoaoAlves95

Novice

08-07-2020
05:07 AM

L1, L2 and LLC miss-rate analysis on Xeon E7-4830

Good Afternoon,

I'm currently trying to measure the miss rate of cache levels L1, L2 and LLC (L3) on a Xeon E7-4830 using perf raw event counts. I would prefer not to install any additional programs. I've already read the Intel® 64 and IA-32 Architectures Performance Monitoring Events documentation (Westmere EP-DP??), but I still have no clue which events relate to the misses for each of these cache levels.

I'd appreciate any help with measuring the data cache miss rate for all of these levels, or just generic miss-rate measurements if that is not possible.

Thanks in advance for any insight,

João Alves

1 Solution

HadiBrais

New Contributor III

08-08-2020
12:17 PM

It's important to precisely define what a cache miss means. I assume you're asking about demand data load requests. A combination of the following Westmere events can be used:

- MEM_INST_RETIRED.LOADS (Event=0x0B, UMask=0x01)
- MEM_LOAD_RETIRED.L1D_HIT (Event=0xCB, UMask=0x01)
- MEM_LOAD_RETIRED.L2_HIT (Event=0xCB, UMask=0x02)
- MEM_LOAD_RETIRED.LLC_UNSHARED_HIT (Event=0xCB, UMask=0x04)
- MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM (Event=0xCB, UMask=0x08)
- MEM_LOAD_RETIRED.LLC_MISS (Event=0xCB, UMask=0x10)
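Since the question is about perf raw event counts, here is a minimal sketch of how the Event/UMask pairs listed above map to perf's raw event syntax, which is `r` followed by the hexadecimal value of (UMask << 8 | Event). The helper name `perf_raw_code` and the `EVENTS` table are illustrative, not part of perf.

```python
# Westmere event encodings from the list above: name -> (event, umask).
EVENTS = {
    "MEM_INST_RETIRED.LOADS": (0x0B, 0x01),
    "MEM_LOAD_RETIRED.L1D_HIT": (0xCB, 0x01),
    "MEM_LOAD_RETIRED.L2_HIT": (0xCB, 0x02),
    "MEM_LOAD_RETIRED.LLC_UNSHARED_HIT": (0xCB, 0x04),
    "MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM": (0xCB, 0x08),
    "MEM_LOAD_RETIRED.LLC_MISS": (0xCB, 0x10),
}


def perf_raw_code(event: int, umask: int) -> str:
    """Build perf's raw event string: 'r' + hex of (umask << 8 | event)."""
    return f"r{(umask << 8) | event:04x}"


# Print one raw code per event, e.g. for use with `perf stat -e <code>`.
for name, (event, umask) in EVENTS.items():
    print(f"{name:40s} {perf_raw_code(event, umask)}")
```

For example, MEM_INST_RETIRED.LOADS becomes `r010b` and MEM_LOAD_RETIRED.LLC_MISS becomes `r10cb`.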

The L1D miss rate can be calculated as follows:

(MEM_INST_RETIRED.LOADS - MEM_LOAD_RETIRED.L1D_HIT) / MEM_INST_RETIRED.LOADS

In this formula, the numerator represents the number of retired load instructions that didn't hit in the L1D, and the denominator represents the total number of retired load instructions. This metric can be calculated using 2 programmable performance counters, so no event multiplexing is required.
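As a rough illustration, the L1D formula can be applied to the counts reported by perf like this. The function name and the sample counts are made up for the example; they are not measurements from a real run.

```python
def l1d_miss_rate(loads: int, l1d_hit: int) -> float:
    """(MEM_INST_RETIRED.LOADS - MEM_LOAD_RETIRED.L1D_HIT) / MEM_INST_RETIRED.LOADS."""
    if loads == 0:
        return 0.0  # no retired loads were counted, so there is nothing to report
    return (loads - l1d_hit) / loads


# Hypothetical counts, for illustration only.
rate = l1d_miss_rate(loads=1_000_000, l1d_hit=950_000)
print(f"L1D miss rate: {rate:.2%}")
```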

The L2 data load miss rate can be calculated as follows:

(MEM_LOAD_RETIRED.LLC_UNSHARED_HIT + MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM + MEM_LOAD_RETIRED.LLC_MISS) / (MEM_LOAD_RETIRED.L2_HIT + MEM_LOAD_RETIRED.LLC_UNSHARED_HIT + MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM + MEM_LOAD_RETIRED.LLC_MISS)

The numerator represents the total number of demand data load requests that missed the L2 and the denominator represents the total number of demand data load requests to the L2 cache. This metric can be calculated using 4 programmable performance counters, so no event multiplexing is required.
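The L2 formula above could be computed from the four counts along these lines. This is only a sketch: the function name and the sample counts are hypothetical.

```python
def l2_miss_rate(l2_hit: int, llc_unshared_hit: int,
                 other_core_hit_hitm: int, llc_miss: int) -> float:
    """Fraction of demand data loads reaching the L2 that missed it."""
    # Loads that missed the L2 were served by the L3, another core, or beyond.
    l2_misses = llc_unshared_hit + other_core_hit_hitm + llc_miss
    l2_accesses = l2_hit + l2_misses
    if l2_accesses == 0:
        return 0.0  # no loads reached the L2
    return l2_misses / l2_accesses


# Hypothetical counts, for illustration only.
rate = l2_miss_rate(l2_hit=30_000, llc_unshared_hit=12_000,
                    other_core_hit_hitm=3_000, llc_miss=5_000)
print(f"L2 miss rate: {rate:.2%}")
```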

The L3 miss rate (with respect to a single logical core) can be calculated as follows:

MEM_LOAD_RETIRED.LLC_MISS / (MEM_LOAD_RETIRED.LLC_UNSHARED_HIT + MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM + MEM_LOAD_RETIRED.LLC_MISS)

The numerator represents the total number of demand data load requests from the logical core that missed the L3 and the denominator represents the total number of demand data load requests from the logical core to the L3. The requests that miss the L3 cache may be sourced from the local memory or from another socket. This metric can be calculated using 3 programmable performance counters, so no event multiplexing is required.
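Similarly, a small sketch of the L3 formula, again with a hypothetical function name and made-up counts:

```python
def l3_miss_rate(llc_unshared_hit: int, other_core_hit_hitm: int,
                 llc_miss: int) -> float:
    """Fraction of this logical core's demand data loads to the L3 that missed it."""
    l3_accesses = llc_unshared_hit + other_core_hit_hitm + llc_miss
    if l3_accesses == 0:
        return 0.0  # no loads reached the L3
    return llc_miss / l3_accesses


# Hypothetical counts, for illustration only.
rate = l3_miss_rate(llc_unshared_hit=12_000, other_core_hit_hitm=3_000,
                    llc_miss=5_000)
print(f"L3 miss rate: {rate:.2%}")
```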

These formulas work well in general.

If you want to measure all of the metrics together in a single run of an application, a total of 6 events need to be measured, but there are only 4 programmable counters, so event multiplexing is necessary. In this case it is better to use the event grouping feature of perf, so that the events used to compute the same metric are measured together. The L2 and L3 miss rates require a total of 4 events, which can form their own event group. The 2 events used to measure the L1 miss rate can then be put in a second event group. Linux perf will then alternate between the two event groups. For more information on event groups in perf, refer to the "EVENT GROUPS" section of the following page: https://man7.org/linux/man-pages/man1/perf-list.1.html.
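To make the grouping concrete, here is a sketch that assembles a `perf stat` command line with the two event groups described above, using perf's `{...}` group syntax and raw event codes derived from the Event/UMask values listed earlier. The program name `./my_app` is a placeholder, and the helper `perf_stat_command` is illustrative; check the resulting command against your own perf version before relying on it.

```python
import shlex

# Raw codes are 'r' + hex of (umask << 8 | event), from the event list above.
L1_GROUP = ["r010b", "r01cb"]                      # LOADS, L1D_HIT
L2_L3_GROUP = ["r02cb", "r04cb", "r08cb", "r10cb"]  # L2_HIT, LLC_UNSHARED_HIT,
                                                    # OTHER_CORE_HIT_HITM, LLC_MISS


def perf_stat_command(program):
    """Return an argv list for `perf stat` with one {...} group per metric set."""
    groups = [L1_GROUP, L2_L3_GROUP]
    event_arg = ",".join("{" + ",".join(group) + "}" for group in groups)
    return ["perf", "stat", "-e", event_arg, "--", *program]


# './my_app' is a placeholder for the application being measured.
cmd = perf_stat_command(["./my_app"])
print(shlex.join(cmd))
```

Each group fits within the 4 programmable counters, so the events inside a group are always counted over the same time slices, which keeps the ratios within each formula consistent.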

4 Replies

SergioS_Intel

Moderator

08-07-2020
08:52 PM

Hello JoaoAlves95,

Thank you for contacting Intel Customer Support.

Please allow us some time to check on your question and we will get back to you as soon as possible.

Best regards,

Sergio S.

Intel Customer Support Technician

JoaoAlves95

Novice

08-09-2020
01:12 AM

Thank you for such a detailed and well-structured answer. This was exactly what I was looking for!

Meanwhile, another question arose... If I would like to measure the same cache behaviour for a parallel program, could I do it using the same counters?

HadiBrais

New Contributor III

08-09-2020
09:01 AM

Meanwhile, another question arose... If I would like to measure the same cache behaviour for a parallel program, could I do it using the same counters?

Yes, these events are supported per thread and perf, by default, will measure them per software thread.
