Fine-Grained L1 Data Cache Miss Counting

Urs_M_ · ‎10-27-2016

Hey guys, I am trying to accurately count L1 data cache misses in my assembly/C code using hardware performance counters. My goal is as follows:

1) Access some memory location.
2) Read the number of L1d cache misses and see if it increased (indicating that a cache miss occurred).
3) Repeat.

I have a Linux box with an Intel Sandy Bridge CPU. I've already tried using perf_event.h, but the read(...) system call seems to introduce quite a bit of noise. I think the right way is (probably) to access the counters directly by writing some assembly code. John McCalpin has given the formula to compute L1d misses in this post (see "Best Answer"): https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/610581?language=en

According to the Intel SDM Vol. 3 Section 19.6 we have the following for Sandy Bridge:

Event Number    Umask Value    Mnemonic
D1H             02H            MEM_LOAD_UOPS_RETIRED.L2_HIT
D1H             04H            MEM_LOAD_UOPS_RETIRED.L3_HIT
D1H             20H            MEM_LOAD_UOPS_RETIRED.L3_MISS

To read these counters I can use the RDPMC instruction, which requires ECX to be loaded with the correct value. My questions are as follows:

1) Are these performance counters accurate enough to measure as described above?
2) Are these events monitored automatically, or do I somehow have to activate them?
3) What value do I need to load into ECX to get the above counters?

I have read lots of documentation and other forum posts, but I am still confused. It would be great if someone could direct me to a resource that explains this well, or maybe provide a short assembly snippet (I could not find anything online).

Urs_M_ · ‎10-27-2016

From a different post I got the following code:

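unsigned long rdpmc_test(unsigned idx)
{
    unsigned a, d;

    /* RDPMC reads the counter selected by ECX into EDX:EAX. */
    asm volatile
    (
        "rdpmc"
        : "=a" (a), "=d" (d)
        : "c" (idx)
    );

    /* Combine the high and low 32-bit halves into one 64-bit value. */
    return (((unsigned long) d) << 32) | ((unsigned long) a);
}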

I can run this code in a Linux kernel module.

My problem now is that I do not know how to choose 'idx'. According to a different post, I can use 0 <= idx < 8. This means that I somehow need to "register" the events, e.g., register MEM_LOAD_UOPS_RETIRED.L2_HIT as idx = 0.

How can I register such an event?

McCalpinJohn · ‎10-27-2016

There are problems with these events on the Sandy Bridge processors. My systems are all Sandy Bridge Xeon E5 systems -- some of the problems apply to any processor with a Sandy Bridge core and some apply only to the processors with the "server" uncore.

The biggest problem is that 256-bit AVX loads will only increment the L1_HIT and HIT_LFB sub-events and will never increment the L2_HIT or L3_HIT sub-events. This probably applies to both server and client parts.
The MEM_LOAD_UOPS_RETIRED.LLC_MISS and LLC_HIT counters can undercount (by nearly 100% in my tests). There are workarounds described in the Xeon E5 processor specification update, and there is a simple implementation to enable and disable these workarounds at https://github.com/andikleen/pmu-tools/blob/master/latego.py. (This is probably a "server-only" bug.)
The MEM_LOAD_UOPS_RETIRED events can miscount badly if HyperThreading is enabled. This probably applies to both server and client parts.

The good news is that you don't need to use these events -- there are lots of alternatives. I have had good luck with the L1D.REPLACEMENT event to count L1 Data Cache misses. If I recall correctly, the MEM_UOPS_RETIRED.ALL_LOADS seems accurate as well. You can use these to compute hit/miss rates, but to understand the hit/miss rates you need to know the size(s) of the loads being used, and you need to pay attention to the HIT_LFB values.
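
To illustrate why the load size matters when interpreting these counts, here is a rough sketch. The helper names are made up for illustration, and the 64-byte line size is the Sandy Bridge L1D line size. It compares a measured ratio (L1D.REPLACEMENT divided by MEM_UOPS_RETIRED.ALL_LOADS) against the ratio you would expect from a purely sequential read, where one line fill serves every load that falls within the same line. It assumes no hardware prefetch effects, no stores, and no HIT_LFB corrections, so treat it as a sanity check rather than an exact model.

```c
#include <stdint.h>

/* Sandy Bridge L1D cache line size in bytes. */
#define CACHE_LINE_BYTES 64.0

/* Hypothetical helper: ratio of L1D line fills to retired loads.
 * Returns 0.0 when no loads were counted to avoid dividing by zero. */
static double measured_miss_ratio(uint64_t l1d_replacements, uint64_t loads)
{
    return loads ? (double)l1d_replacements / (double)loads : 0.0;
}

/* Hypothetical helper: expected ratio for a sequential read of contiguous
 * data, where each cache line is filled once and then reused by every
 * following load that lands in the same line. */
static double streaming_miss_ratio(unsigned load_bytes)
{
    return (double)load_bytes / CACHE_LINE_BYTES;
}
```

For example, 8-byte loads over a contiguous array should give roughly one line fill per 8 loads (a ratio of 0.125), while 32-byte loads would give roughly one per 2 loads (0.5), even though the access pattern is the same.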

All of these events require programming a variety of MSRs, including the "global" counter controls and the IA32_PERFEVTSELx MSRs that program the individual counters. The procedure is described in Chapter 18 of Volume 3 of the Intel Architecture SW Developer's Manual. Execution of the RDPMC instruction in user-mode code requires that the CR4.PCE configuration bit is set. If your kernel sets this by default, then everything is good. If the kernel clears this bit instead, you are probably out of luck.
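
As a minimal sketch of what "programming the PERFEVTSEL MSRs" means in practice, the value written to each IA32_PERFEVTSELx register packs the event number into bits 7:0 and the umask into bits 15:8, plus the USR (bit 16), OS (bit 17) and EN (bit 22) control bits. The helper names below are hypothetical, and the event codes in the usage note (0x51/0x01 for L1D.REPLACEMENT, 0xD0/0x81 for MEM_UOPS_RETIRED.ALL_LOADS) are my reading of the Sandy Bridge event tables, so please check them against the SDM for your part. The code only builds the values; actually writing them needs WRMSR in ring 0 or something like the Linux msr driver.

```c
#include <stdint.h>

/* Control bits of IA32_PERFEVTSELx (Intel SDM Vol. 3). */
#define PERFEVTSEL_USR (1ULL << 16)  /* count while running in user mode */
#define PERFEVTSEL_OS  (1ULL << 17)  /* count while running in kernel mode */
#define PERFEVTSEL_EN  (1ULL << 22)  /* enable this counter */

/* Hypothetical helper: build the value to write to IA32_PERFEVTSELx.
 * The event number goes in bits 7:0 and the umask in bits 15:8. */
static uint64_t perfevtsel_encode(uint8_t event, uint8_t umask, int usr, int os)
{
    uint64_t v = (uint64_t)event | ((uint64_t)umask << 8) | PERFEVTSEL_EN;
    if (usr)
        v |= PERFEVTSEL_USR;
    if (os)
        v |= PERFEVTSEL_OS;
    return v;
}

/* Hypothetical helper: MSR address of IA32_PERFEVTSELn.
 * Counter 0 is at 0x186 and the others follow consecutively. */
static uint32_t perfevtsel_msr(unsigned n)
{
    return 0x186u + n;
}
```

With these, perfevtsel_encode(0x51, 0x01, 1, 1) gives 0x430151, which would go into MSR 0x186 to count L1D.REPLACEMENT on counter 0 in both user and kernel mode. The counter also has to be enabled in the global control register (IA32_PERF_GLOBAL_CTRL, MSR 0x38F, bit n for counter n) before it starts counting.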

The value that you put in %ecx before executing the RDPMC instruction is the number of the performance counter that you want to read. Depending on the system configuration this will be 0-3 or 0-7. Counter 0 is the one controlled by MSR 0x186 IA32_PERFEVTSEL0, etc.
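
A small sketch of the bookkeeping around RDPMC, with hypothetical helper names: the first helper returns the %ecx value for a general-purpose counter (just its number), the second sets bit 30 to select one of the fixed-function counters instead, and the third computes the difference between two readings. The 48-bit counter width used in the usage note is an assumption for Sandy Bridge; the actual width is reported by CPUID leaf 0AH.

```c
#include <stdint.h>

/* Hypothetical helper: ECX value selecting general-purpose counter n. */
static uint32_t rdpmc_select_general(unsigned n)
{
    return n;
}

/* Hypothetical helper: ECX value selecting fixed-function counter n.
 * Bit 30 of ECX switches RDPMC to the fixed-function counters. */
static uint32_t rdpmc_select_fixed(unsigned n)
{
    return (1u << 30) | n;
}

/* Hypothetical helper: events counted between two RDPMC readings.
 * Masking to the counter width keeps the result correct if the
 * counter wrapped around once between the two reads. */
static uint64_t pmc_delta(uint64_t before, uint64_t after, unsigned width_bits)
{
    uint64_t mask = (width_bits >= 64) ? ~0ULL : ((1ULL << width_bits) - 1);
    return (after - before) & mask;
}
```

For the access/measure/repeat loop described in the question, you would read the counter before and after the memory access and pass both readings to pmc_delta(before, after, 48). A nonzero result on a counter programmed for L1D.REPLACEMENT suggests that at least one line was brought into the L1D in between.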

Urs_M_ · ‎10-31-2016

Thank you very much for your great answer. These counters are way more complex than I expected (I'm very new to this stuff).

Zirak · ‎11-20-2016

Hi,

There are alternatives, such as the PAPI library, for counting cache misses, including L1, L2, and L3.

Thanks

ASing3 · ‎04-14-2017

Hi Zirak,

I am using following code sequence to measure L1 and L2 cache misses.

int Events[NUM_EVENTS] = {PAPI_L1_LDM, PAPI_L2_LDM};

void thread_func()
{
    retval = PAPI_start(EventSet);

    /* call a function() */

    retval = PAPI_read(EventSet, values);
    printf("%llu ,%llu \t", values[0], values[1]);
}

All other initialization has been done in main(). I am getting only 0's. Can you tell me what the problem is?

My objective is to identify whether calling a function that uses a data variable caused an L1, L2, and/or L3 cache miss.

thanks

Ajit