Solved: L1, L2 and LLC miss-rate analysis on Xeon E7-4830

JoaoAlves95 · ‎08-07-2020

Good Afternoon,

I'm currently trying to measure the miss rate of cache levels L1, L2 and LLC (L3) on a Xeon E7-4830 using perf raw event counts. I'd rather not install any other program. I've already read the Intel® 64 and IA-32 Architectures Performance Monitoring Events document (Westmere EP-DP??) but still have no clue which events relate to the misses for each of these cache levels.

I'd appreciate any help with measuring the data cache miss rate for all of these levels, or just the generic miss rate if a data-only measurement is not possible.

Thanks in advance for any insight,

João Alves

HadiBrais · ‎08-08-2020

It's important to precisely define what a cache miss means. I assume you're asking about demand data load requests. A combination of the following Westmere events can be used:

MEM_INST_RETIRED.LOADS (Event=0x0B, UMask=0x01)
MEM_LOAD_RETIRED.L1D_HIT (Event=0xCB, UMask=0x01)
MEM_LOAD_RETIRED.L2_HIT (Event=0xCB, UMask=0x02)
MEM_LOAD_RETIRED.LLC_UNSHARED_HIT (Event=0xCB, UMask=0x04)
MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM (Event=0xCB, UMask=0x08)
MEM_LOAD_RETIRED.LLC_MISS (Event=0xCB, UMask=0x10)
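Since the question is about perf raw event counts, here is a minimal sketch of how the Event/UMask pairs above map to perf's rNNN raw descriptors, assuming the usual x86 layout (unit mask in bits 8-15, event select in bits 0-7). The helper name perf_raw_code is made up for illustration.

```python
# Westmere events from the list above: name -> (event select, unit mask).
EVENTS = {
    "MEM_INST_RETIRED.LOADS": (0x0B, 0x01),
    "MEM_LOAD_RETIRED.L1D_HIT": (0xCB, 0x01),
    "MEM_LOAD_RETIRED.L2_HIT": (0xCB, 0x02),
    "MEM_LOAD_RETIRED.LLC_UNSHARED_HIT": (0xCB, 0x04),
    "MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM": (0xCB, 0x08),
    "MEM_LOAD_RETIRED.LLC_MISS": (0xCB, 0x10),
}


def perf_raw_code(event, umask):
    """Hypothetical helper: build the rNNN string for `perf stat -e`.

    Assumes the x86 raw encoding: umask in bits 8-15, event select in bits 0-7.
    """
    return f"r{(umask << 8) | event:04x}"


for name, (event, umask) in EVENTS.items():
    print(f"{perf_raw_code(event, umask)}  {name}")
```

For example, MEM_INST_RETIRED.LOADS becomes r010b and MEM_LOAD_RETIRED.LLC_MISS becomes r10cb.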

The L1D miss rate can be calculated as follows:

(MEM_INST_RETIRED.LOADS - MEM_LOAD_RETIRED.L1D_HIT) / MEM_INST_RETIRED.LOADS

In this formula, the numerator represents the number of retired load instructions that didn't hit in the L1D and the denominator represents the total number of retired load instructions. This metric needs only 2 events (MEM_INST_RETIRED.LOADS and MEM_LOAD_RETIRED.L1D_HIT), so it fits within the programmable performance counters and no event multiplexing is required.
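The L1D formula can be sketched as a small function that takes the two raw counts reported by perf. The function name and the sample counts are made up for illustration; they are not measurements.

```python
def l1d_miss_rate(loads, l1d_hit):
    """Fraction of retired loads that did not hit in the L1D.

    loads   -> MEM_INST_RETIRED.LOADS
    l1d_hit -> MEM_LOAD_RETIRED.L1D_HIT
    """
    if loads == 0:
        return 0.0  # no loads retired, so there is nothing to miss
    return (loads - l1d_hit) / loads


# Hypothetical counts, for illustration only.
rate = l1d_miss_rate(loads=1_000_000, l1d_hit=950_000)
print(f"L1D miss rate: {rate:.2%}")
```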

The L2 data load miss rate can be calculated as follows:

(MEM_LOAD_RETIRED.LLC_UNSHARED_HIT + MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM + MEM_LOAD_RETIRED.LLC_MISS) / (MEM_LOAD_RETIRED.L2_HIT + MEM_LOAD_RETIRED.LLC_UNSHARED_HIT + MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM + MEM_LOAD_RETIRED.LLC_MISS)

The numerator represents the total number of demand data load requests that missed the L2 and the denominator represents the total number of demand data load requests to the L2 cache. This metric can be calculated using 4 programmable performance counters, so no event multiplexing is required.
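As a sketch of the L2 formula: loads that reach the L2 are the ones counted by the four MEM_LOAD_RETIRED events below, and the L2 misses are those served by the L3, by another core, or from beyond the L3. The function name and the counts are hypothetical.

```python
def l2_miss_rate(l2_hit, llc_unshared_hit, other_core_hit_hitm, llc_miss):
    """Fraction of demand data loads reaching the L2 that missed it.

    Arguments are the counts of the MEM_LOAD_RETIRED.* events with the
    matching names (L2_HIT, LLC_UNSHARED_HIT, OTHER_CORE_HIT_HITM, LLC_MISS).
    """
    l2_misses = llc_unshared_hit + other_core_hit_hitm + llc_miss
    l2_accesses = l2_hit + l2_misses
    if l2_accesses == 0:
        return 0.0
    return l2_misses / l2_accesses


# Hypothetical counts, for illustration only.
rate = l2_miss_rate(l2_hit=40_000, llc_unshared_hit=6_000,
                    other_core_hit_hitm=1_000, llc_miss=3_000)
print(f"L2 miss rate: {rate:.2%}")
```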

The L3 miss rate (with respect to a single logical core) can be calculated as follows:

MEM_LOAD_RETIRED.LLC_MISS / (MEM_LOAD_RETIRED.LLC_UNSHARED_HIT + MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM + MEM_LOAD_RETIRED.LLC_MISS)

The numerator represents the total number of demand data load requests from the logical core that missed the L3 and the denominator represents the total number of demand data load requests from the logical core to the L3. The requests that miss the L3 cache may be sourced from the local memory or from another socket. This metric can be calculated using 3 programmable performance counters, so no event multiplexing is required.
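The L3 formula reuses three of the same events; its denominator is exactly the set of loads that missed the L2. A minimal sketch with a hypothetical function name and made-up counts:

```python
def l3_miss_rate(llc_unshared_hit, other_core_hit_hitm, llc_miss):
    """Fraction of this logical core's demand data loads reaching the L3
    that missed it.

    Arguments are the counts of MEM_LOAD_RETIRED.LLC_UNSHARED_HIT,
    MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM and MEM_LOAD_RETIRED.LLC_MISS.
    """
    l3_accesses = llc_unshared_hit + other_core_hit_hitm + llc_miss
    if l3_accesses == 0:
        return 0.0
    return llc_miss / l3_accesses


# Hypothetical counts, for illustration only.
rate = l3_miss_rate(llc_unshared_hit=6_000, other_core_hit_hitm=1_000,
                    llc_miss=3_000)
print(f"L3 miss rate: {rate:.2%}")
```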

These formulas work well in general.

If you want to measure all of the metrics together in a single run of an application, a total of 6 events need to be measured, but there are only 4 programmable counters, so event multiplexing is necessary. In this case it is better to use perf's event grouping feature so that the events used for the same metric are measured together. The L2 and L3 miss rates together require 4 events, which can form their own event group. The 2 events used for the L1D miss rate can then be put in a second event group. Linux perf will then alternate between the two event groups. For more information on event groups in perf, refer to the "EVENT GROUPS" section of the following page: https://man7.org/linux/man-pages/man1/perf-list.1.html.
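The two-group setup can be sketched as follows. This only builds and prints a perf stat command line using the brace syntax for event groups from the perf-list man page. The raw codes assume the rNNN encoding (umask in the high byte, event select in the low byte), and ./my_app is a placeholder for the application being measured.

```python
import shlex

# Group 1: the four events shared by the L2 and L3 miss-rate formulas.
L2_L3_GROUP = ["r02cb", "r04cb", "r08cb", "r10cb"]
# Group 2: the two events used by the L1D miss-rate formula.
L1D_GROUP = ["r010b", "r01cb"]


def group(events):
    """Wrap a list of events in braces so perf schedules them together."""
    return "{" + ",".join(events) + "}"


event_spec = ",".join([group(L2_L3_GROUP), group(L1D_GROUP)])

# "./my_app" is a hypothetical program name.
argv = ["perf", "stat", "-e", event_spec, "--", "./my_app"]

# shlex.join quotes the braces so the shell does not expand them.
print(shlex.join(argv))
```

Because each group is scheduled as a unit, the counts inside one formula always come from the same time slices, which keeps the ratios meaningful even though perf multiplexes between the two groups.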

SergioS_Intel · ‎08-07-2020

Hello JoaoAlves95,

Thank you for contacting Intel Customer Support.

Please allow us some time to check on your question and we will get back to you as soon as possible.

Best regards,

Sergio S.

Intel Customer Support Technician

JoaoAlves95 · ‎08-09-2020

Thank you for such a detailed and well structured answer. This was exactly what I was looking for!

Meanwhile, another question arose: if I want to measure the same cache behaviour for a parallel program, can I do it using the same counters?

HadiBrais · ‎08-09-2020

"Meanwhile, another question arose: if I want to measure the same cache behaviour for a parallel program, can I do it using the same counters?"

Yes, these events are supported per thread and perf, by default, will measure them per software thread.
