Event Number   Umask Value   Mnemonic
D1H            02H           MEM_LOAD_UOPS_RETIRED.L2_HIT
D1H            04H           MEM_LOAD_UOPS_RETIRED.L3_HIT
D1H            20H           MEM_LOAD_UOPS_RETIRED.L3_MISS

To read these counters I can use the RDPMC instruction, which requires ECX to be loaded with the correct value. My questions are as follows:
1) Are these performance counters accurate enough to measure as described above?
2) Are these events monitored automatically, or do I somehow have to activate them?
3) What value do I need to load into ECX to get the above counters?
I have read lots of documentation and other forum posts but I am still confused. It would be great if someone could direct me to a resource that explains this well, or maybe provide a short assembly snippet (I could not find anything online).
From a different post I got the following code:
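unsigned long rdpmc_test(unsigned idx)
{
    unsigned a, d;
    /* RDPMC reads the counter selected by ECX and returns it in EDX:EAX. */
    asm volatile ("rdpmc" : "=a" (a), "=d" (d) : "c" (idx));
    /* Combine the high (EDX) and low (EAX) halves into one 64-bit value. */
    return (((unsigned long) d) << 32) | ((unsigned long) a);
}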
I can run this code in a Linux kernel module.
My problem now is that I do not know how to choose 'idx'. From a different post I gather that I can use 0 <= idx < 8. This means that I somehow need to "register" the events, e.g., register MEM_LOAD_UOPS_RETIRED.L2_HIT as idx = 0.
How can I register such an event?
There are problems with these events on the Sandy Bridge processors. My systems are all Sandy Bridge Xeon E5 systems -- some of the problems apply to any processor with a Sandy Bridge core and some apply only to the processors with the "server" uncore.
- The biggest problem is that 256-bit AVX loads will only increment the L1_HIT and HIT_LFB sub-events and will never increment the L2_HIT or L3_HIT sub-events. This probably applies to both server and client parts.
- The MEM_LOAD_UOPS_RETIRED.LLC_MISS and LLC_HIT counters can undercount (by nearly 100% in my tests). There are workarounds described in the Xeon E5 processor specification update, and there is a simple implementation to enable and disable these workarounds at https://github.com/andikleen/pmu-tools/blob/master/latego.py (This is probably a "server-only" bug.)
- The MEM_LOAD_UOPS_RETIRED events can miscount badly if HyperThreading is enabled. This probably applies to both server and client parts.
The good news is that you don't need to use these events -- there are lots of alternatives. I have had good luck with the L1D.REPLACEMENT event to count L1 Data Cache misses. If I recall correctly, the MEM_UOPS_RETIRED.ALL_LOADS seems accurate as well. You can use these to compute hit/miss rates, but to understand the hit/miss rates you need to know the size(s) of the loads being used, and you need to pay attention to the HIT_LFB values.
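As a rough illustration of the ratio suggested above, the sketch below divides an L1D.REPLACEMENT count by a MEM_UOPS_RETIRED.ALL_LOADS count. The function name `l1d_miss_ratio` is hypothetical, and the result is only an estimate: it ignores load sizes and HIT_LFB, which the post says must be taken into account, and replacements may also be triggered by accesses other than retired loads.

```c
#include <stdint.h>

/*
 * Hypothetical helper: estimate L1D misses per retired load.
 * l1d_replacement: count from the L1D.REPLACEMENT event
 * all_loads:       count from MEM_UOPS_RETIRED.ALL_LOADS
 * Both counts are assumed to cover the same measurement interval.
 */
double l1d_miss_ratio(uint64_t l1d_replacement, uint64_t all_loads)
{
    if (all_loads == 0)
        return 0.0;  /* avoid division by zero when no loads retired */
    return (double)l1d_replacement / (double)all_loads;
}
```

For example, 25 replacements over 100 retired loads would give a ratio of 0.25.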
All of these events require programming a variety of MSRs, including the "global" counter controls and the IA32_PERFEVTSELx MSRs that program the individual counters. The procedure is described in Chapter 18 of Volume 3 of the Intel Architecture SW Developer's Manual. Execution of the RDPMC instruction in user-mode code requires that the CR4.PCE configuration bit is set. If your kernel sets this by default, then everything is good. If the kernel clears this bit instead, you are probably out of luck.
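To make the "programming" step more concrete, here is a minimal sketch of how an event number and umask (such as D1H / 02H from the question) are packed into an event-select value. It assumes the architectural IA32_PERFEVTSELx layout (event in bits 0-7, umask in bits 8-15, USR bit 16, OS bit 17, EN bit 22); the helper name `make_evtsel` is hypothetical, and the sketch only computes the value, it does not write any MSR.

```c
#include <stdint.h>

/* Bit positions in IA32_PERFEVTSELx (architectural layout). */
#define EVTSEL_USR (1ULL << 16)  /* count while running in user mode   */
#define EVTSEL_OS  (1ULL << 17)  /* count while running in kernel mode */
#define EVTSEL_EN  (1ULL << 22)  /* enable this counter                */

/* Hypothetical helper: build the value for one event-select MSR. */
uint64_t make_evtsel(uint8_t event, uint8_t umask, int user, int kernel)
{
    uint64_t v = (uint64_t)event | ((uint64_t)umask << 8) | EVTSEL_EN;
    if (user)
        v |= EVTSEL_USR;
    if (kernel)
        v |= EVTSEL_OS;
    return v;
}
```

With these assumptions, `make_evtsel(0xD1, 0x02, 1, 1)` yields 0x4302D1 for MEM_LOAD_UOPS_RETIRED.L2_HIT. In a kernel module that value would then be written to the chosen event-select MSR, and the corresponding counter enabled in the global control MSR, as the manual chapter describes.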
The value that you put in %ecx before executing the RDPMC instruction is the number of the performance counter that you want to read. Depending on the system configuration this will be 0-3 or 0-7. Counter 0 is the one controlled by MSR 0x186 IA32_PERFEVTSEL0, etc.
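The mapping between a counter index, its MSRs, and the RDPMC argument can be sketched as follows. The address 0x186 for IA32_PERFEVTSEL0 comes from the post; 0xC1 for IA32_PMC0 and the consecutive numbering of the remaining counters are stated here as assumptions to check against the manual, and the function names are hypothetical.

```c
#include <stdint.h>

#define MSR_IA32_PERFEVTSEL0 0x186u  /* event select for counter 0    */
#define MSR_IA32_PMC0        0x0C1u  /* counter 0 itself (assumption) */

/* MSR that selects which event counter idx counts. */
uint32_t evtsel_msr(unsigned idx) { return MSR_IA32_PERFEVTSEL0 + idx; }

/* MSR that holds the count for counter idx. */
uint32_t pmc_msr(unsigned idx) { return MSR_IA32_PMC0 + idx; }

/* Value to place in ECX before RDPMC to read counter idx. */
uint32_t rdpmc_ecx(unsigned idx) { return idx; }
```

So if MEM_LOAD_UOPS_RETIRED.L2_HIT were programmed through MSR 0x186, it would be read back with ECX = 0; an event programmed through 0x187 would be read with ECX = 1, and so on.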
Thank you very much for your great answer. These counters are way more complex than I expected (I'm very new to this stuff).
Hi,
There are alternatives, such as the PAPI library, for counting cache misses, including L1, L2 and L3.
Thanks
Hi Zirak,
I am using the following code sequence to measure L1 and L2 cache misses.
int Events[NUM_EVENTS] = {PAPI_L1_LDM, PAPI_L2_LDM};
void thread_func()
{
    retval = PAPI_start(EventSet);
    call_a_function();  /* the function being measured */
    retval = PAPI_read(EventSet, values);
    /* PAPI counter values are long long, so print them with %lld */
    printf("%lld ,%lld \t", values[0], values[1]);
}
All other initialization has been done in main(). I am getting only 0's. Can you tell me what the problem is?
My objective is to identify whether calling a function using a data variable caused an L1 and/or L2 and/or L3 cache miss.
thanks
Ajit
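Whichever interface supplies the raw counts, checking whether a particular call caused cache misses comes down to reading the counter before and after the call and subtracting. A minimal sketch of that subtraction is shown below; the helper name `counter_delta` is hypothetical, and the 48-bit counter width used in the example is an assumption (the actual width is reported by CPUID leaf 0AH).

```c
#include <stdint.h>

/*
 * Hypothetical helper: difference between two raw counter reads,
 * allowing for one wraparound of a counter that is width_bits wide.
 */
uint64_t counter_delta(uint64_t before, uint64_t after, unsigned width_bits)
{
    uint64_t mask = (width_bits >= 64) ? ~0ULL
                                       : ((1ULL << width_bits) - 1);
    /* Unsigned subtraction wraps modulo 2^64; the mask reduces it
       to the counter's own width. */
    return (after - before) & mask;
}
```

A nonzero delta for a miss event around the call indicates that misses occurred somewhere in that interval, although other activity on the same core during the interval is counted as well.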
