
m_faulkner · ‎12-20-2007

Hey all,

I'm trying to record the total number of L1 cache hits / misses. It's obvious with regards to the L1 Instruction cache, the two events are

L1I_READS: (counter: all)
L1I_MISSES: (counter: all)

However, I've not managed to find suitable events for the L1 data cache, namely an L1 data cache read and an L1 data cache miss.

I require the absolute numbers, not a ratio. Any help would be very much appreciated.

I'm trying to collect these stats on a Core 2 machine.

Thanks

Thomas_W_Intel · ‎01-08-2008

Hi,

you can analyze the L1D cache very exactly using the events L1D_CACHE_LD.*, L1D_CACHE_ST.*, and L1D_CACHE_LOCK.*. For example, you can get the number of cache misses using:

L1D_CACHE_LD.I_STATE + L1D_CACHE_ST.I_STATE + L1D_CACHE_LOCK.I_STATE

The event L1D_ALL_CACHE_REF might also be handy, which counts the sum of L1D_CACHE_LD.*, L1D_CACHE_ST.*, and L1D_CACHE_LOCK.*.
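
As a rough illustration of the arithmetic described above (a sketch, not VTune output): assuming the four counts have already been collected, misses are the sum of the three I_STATE events and hits are whatever remains of L1D_ALL_CACHE_REF. The function name and the sample numbers are made up for illustration.

```python
def l1d_hits_and_misses(ld_i_state, st_i_state, lock_i_state, all_cache_ref):
    """Derive absolute L1D hit and miss counts from raw event counts.

    Assumes (per the reply above) that misses are the sum of the three
    *.I_STATE events and that L1D_ALL_CACHE_REF counts every L1D reference.
    """
    misses = ld_i_state + st_i_state + lock_i_state
    hits = all_cache_ref - misses
    return hits, misses


# Hypothetical counts, not measured on real hardware.
hits, misses = l1d_hits_and_misses(
    ld_i_state=12_000, st_i_state=3_000, lock_i_state=50, all_cache_ref=1_000_000
)
print(f"L1D hits: {hits}, misses: {misses}")
```

If the counts come from separate runs, the subtraction is only approximate.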

I hope this helps

Thomas

kamal444 · ‎01-12-2008

Hi,

I wanted to know if the same could be done for L2 Cache.

Basically, I am only interested in finding out the L2 Cache Data Misses and then calculate the miss ratio for an application.

Any help would be appreciated.

Regards,
Kamal

m_faulkner · ‎01-21-2008

Hey Thomas,

Thanks for your help. However, I'm still having a few problems

I tried using L1D_ALL_CACHE_REF, but when I use this event type nothing gets recorded. When I use the events (for cache hits), numbers do seem to get recorded. However, from my (basic) understanding of MESI, I want to make sure that all cache misses get counted. When a cache line is not in the cache but does exist in main memory, does the hardware set a cache line to invalid before fetching the line in, or does it simply overwrite the line?

Thanks

Matt

Thomas_W_Intel · ‎03-14-2008

Matt

"invalid" actually means that the cache line is not present in the cache :)

Kind regards

Thomas

joshua98 · ‎03-17-2008

Hi..

This is Joshua...

I'm using a VTune on Windows...

Do you have any idea of doing what you say above on Windows..?

I want to count the L1 cache misses...

And, that is not easy for me...^^

Thank you...

Thomas_W_Intel · ‎03-17-2008

Joshua,

the events do not depend on the OS but on the architecture, e.g. if you have a Core 2 Duo system, you can select the events that I listed above.

The GUI is slightly different on Windows, but the methodology is the same. If you have Windows-specific questions, I suggest that you ask in the forum about VTune for Windows.

Best regards

Thomas

SPYRIDON_T_ · ‎08-02-2014

I am trying to measure the performance of my Intel processor, but it seems that L1D_ALL_CACHE_REF is not equal to the sum of L1D_CACHE_LD.MESI, L1D_CACHE_ST.MESI, and L1D_CACHE_LOCK.MESI. Do you have any idea about this issue?

I am using perf tool of Linux

thank you very much

McCalpinJohn · ‎08-02-2014

Some random comments:

It is often difficult to tell whether an event that refers to "Invalid" state is referring to a cache miss or a cache hit on an invalid copy of the line or both. To make it worse, engineers working on different parts of the processor may adopt different conventions and these different conventions may make it into the documentation as inconsistencies in the nomenclature. The primary difference in behavior between missing in the cache and hitting on an invalid copy of a line is, of course, the need for the former case to choose a victim, while the latter case is able to use the cache entry occupied by the invalid copy of the line. (This is not required, but it is a sensible implementation choice.) There can also be differences in the coherence protocol for these two cases -- IBM's POWER6 processor uses different "invalid" states to help optimize coherence transactions. For the performance counters on Intel processors, it appears that "invalid" usually refers to both cache misses and cache hits on lines in the "invalid" state, but I don't know that the usage is 100% consistent.

It does not look like the Core 2 processors have enough performance counters to count all the L1 events mentioned above in a single run, so even though the documentation says that L1D_CACHE_LD.MESI + L1D_CACHE_ST.MESI + L1D_CACHE_LOCK.MESI should be exactly the same as L1D_ALL_CACHE_REF.MESI, there will almost always be some small variation due to run-to-run variations. As with any performance counter measurements, it is best to repeat the measurements several times and accept that there will be some variability across runs and that some performance monitor counts will be accrued by the measurement software itself.

General advice for those seeking help from the forums: Providing *specific* information will significantly improve your chances of getting useful advice. As mentioned above, each processor family/model has different performance monitoring capabilities, so we need to know the model name/number of the processor in question. Questions about consistency or inconsistency of counts also need to be specific. In the example above, it would be good to know which MESI bits were set in the L1D_CACHE_* measurements, whether the measurements were on the whole program (e.g., Linux "perf stat"), whether the measurements were multiplexed (e.g., with Linux "perf stat" when more events are requested than there are physical counters), whether the comparison is based on multiple runs of the program, whether the measurements were made by a sampling infrastructure (e.g., VTune), and (finally) specific numbers that show both the magnitude of the baseline counts and the amount of inconsistency.
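
To put a number on the run-to-run variability mentioned above, one option is to record the same event over several runs and report the mean together with the relative spread. This is only a sketch: the function name and the sample counts are hypothetical, and the spread computed here is the plain sample standard deviation over the mean, which is not necessarily what any particular tool prints.

```python
import statistics


def summarize_runs(counts):
    """Return (mean, relative sample std dev in percent) for repeated runs."""
    mean = statistics.mean(counts)
    rel_spread = 100.0 * statistics.stdev(counts) / mean
    return mean, rel_spread


# Hypothetical counts of one event from five runs of the same program.
runs = [100, 102, 98, 101, 99]
mean, rel = summarize_runs(runs)
print(f"mean = {mean:.1f}, spread = {rel:.2f}%")
```

Two events whose difference is smaller than this spread cannot really be told apart from measurements taken in separate runs.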

SPYRIDON_T_ · ‎08-02-2014

Performance counter stats for 'java Harness HelloWorld' (10 runs):

           2214724 r0f40                                                         ( +- 0,74% )
         609492941 r0f41                                                         ( +- 0,37% )
           3677638 r0f42                                                         ( +- 5,27% )
        1760947374 r0243                                                         ( +- 0,30% )

0,952851801 seconds time elapsed ( +- 0,42% )

Running the Hello World program (repeatedly 10 times in a benchmark suite) you can see that the r0243 (L1D_ALL_CACHE_REF) are significantly higher than the sum of r0f40(L1D_LOAD_MESI), r0f41 (L1D_STORE_MESI) and r0f42 (L1D_LOCK_MESI), in contrast to what is written here: http://www.jaist.ac.jp/iscenter-new/mpc/altix/altixdata/opt/intel/vtune/doc/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/pmm/events/l1d_all_cache_ref.html
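
For reference, the size of the discrepancy can be checked with a few lines of Python. The counts are copied from the perf output above; nothing else is assumed.

```python
# Counts copied from the perf output above (averages over 10 runs).
counts = {
    "r0f40": 2_214_724,
    "r0f41": 609_492_941,
    "r0f42": 3_677_638,
    "r0243": 1_760_947_374,
}

component_sum = counts["r0f40"] + counts["r0f41"] + counts["r0f42"]
ratio = counts["r0243"] / component_sum

print(f"sum of r0f40 + r0f41 + r0f42 = {component_sum}")
print(f"r0243 / sum = {ratio:.2f}")
```

The gap is far larger than the reported run-to-run variation of a few percent.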

My processor is an Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz

I still can't explain why, but the L1D_*_MESI counts seem more plausible to me than r0243. Can L1D_ALL_CACHE_REF be wrong?

McCalpinJohn · ‎08-03-2014

It looks like this may be another victim of Intel's ambiguous processor naming scheme....

The discussions above started with a reference to a "Core 2" system (back in December 2007 when this thread started). The events L1D_CACHE_LD.MESI, L1D_CACHE_ST.MESI, L1D_CACHE_LOCK.MESI, and L1D_ALL_CACHE_REF.MESI are only documented for the Intel Xeon 3000, 3200, 5100, 5300, and Intel Core 2 Duo processors -- see Table 19-17 in Section 19.8 of Volume 3 of the Intel SW Developers Guide (document 325384-049).

The corresponding performance counter event select values (0x40, 0x41, 0x42, 0x43) are also described for a few other processor models -- using different names but similar descriptions:

Section 19.5, Table 19-11 for the Nehalem-based Core i7 processors
Section 19.6, Table 19-14 for the Westmere-based processors
Section 19.11, Table 19-20 for the Intel Core Solo and Core Duo processors

The performance monitoring events for the Intel Core i7-3517U are described in Table 19-5 of Section 19.3 of Volume 3 of the SW Developer's Guide. This table does not include descriptions for EventSelect codes 0x40, 0x41, 0x42, or 0x43.

The link above to the documentation at JAIST is for an SGI Altix system, which appears to be an Altix UV1000, using the Westmere EX processors. So the descriptions of section 19.6 should apply to that system, but not to your Ivy-Bridge-based Core i7-3517U processor.

The performance counter events for the Ivy-Bridge-based Core i7 that are most similar to the L1D_CACHE_*.MESI events discussed here are probably:

MEM_UOPS_RETIRED.ALL_LOADS, Event 0xD0, Umask 0x81
MEM_UOPS_RETIRED.ALL_STORES, Event 0xD0, Umask 0x82
MEM_UOPS_RETIRED.LOCK_LOADS, Event 0xD0, Umask 0x21

Only one of these is directly documented in Section 19.3, Table 19-5 of Volume 3 of the SW Developer's Guide, but the other two are used by the Intel VTune product -- I found them in the file "ivybridge_db.txt" in the Intel VTune distribution. The use of the events by VTune is encouraging, but the absence of documentation in the SW Developer's Guide should make one somewhat cautious in using these events. They probably work, but there might be cases for which they count incorrectly. I found one mention of the MEM_UOP_RETIRED events (Event 0xD0) in the "Desktop 3rd Generation Intel Core Processor Family Specification Update" publication (document 326766-016, January 2014), that says that these events can count in the wrong thread (logical processor) when HyperThreading is enabled. There may be additional errors that are not publicly documented.
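
The event/umask pairs listed above can be turned into the raw event strings that "perf stat -e" accepts (the same form as r0f40 and r0243 earlier in the thread). A small sketch, assuming the usual x86 layout where the low byte of the raw value is the event select and the next byte is the unit mask; the helper name is made up.

```python
def perf_raw_event(event_select, umask):
    """Format an event select / unit mask pair as a perf raw event string.

    Assumes the common x86 layout: low byte = event select,
    next byte = unit mask. Other config bits are left at zero.
    """
    if not (0 <= event_select <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event select and umask must each fit in one byte")
    config = (umask << 8) | event_select
    return f"r{config:04x}"


events = {
    "MEM_UOPS_RETIRED.ALL_LOADS": (0xD0, 0x81),
    "MEM_UOPS_RETIRED.ALL_STORES": (0xD0, 0x82),
    "MEM_UOPS_RETIRED.LOCK_LOADS": (0xD0, 0x21),
}

for name, (event_select, umask) in events.items():
    print(f"{name}: {perf_raw_event(event_select, umask)}")
```

Whether a given raw event actually counts anything still depends on the processor model and on the kernel.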

SPYRIDON_T_ · ‎08-03-2014

Thank you for your answer. The events MEM_UOPS_RETIRED.ALL_LOADS, MEM_UOPS_RETIRED.ALL_STORES, and MEM_UOPS_RETIRED.LOCK_LOADS are not supported by perf, so I think that I should find another program to measure them.

However, in order to measure L1I.READS on the same processor (Intel Core i7-3517U) I used umask value 03H with event 80H, as described in Table 19-12 of the same manual (Vol. 3B 19-53), "Non-Architectural Performance Events In the Processor Core for Intel Core i7 Processor and Intel Xeon Processor Series (Contd)". Is that right? The number I got from this one looks plausible.

In addition, to measure L2 and LLC performance I used the events from Table 19-6 (Section 19.4) and Table 19-1 (Section 19.1) of manual 325384-051. The descriptions of these tables mention the Ivy Bridge architecture.

Thank you very much for your help and your time!


McCalpinJohn · ‎08-04-2014

I don't know how "perf" does its name to event translation. The names I used in the last example were taken from the VTune "ivybridge_db.txt" file, so they don't quite match the names in Chapter 19 of Volume 3 of the SW Developer's Guide. You should be able to program them as "raw" events using "perf stat" if the system allows you to access any of the performance counters. On my Xeon E5-2680 system (Sandy Bridge EP), the following simple test works as expected:

$ perf stat -e r81d0 -e r82d0 -e r21d0 /bin/ls >/dev/null

Performance counter stats for '/bin/ls':

           669,204 r81d0
           404,241 r82d0
            12,108 r21d0

       0.001972686 seconds time elapsed

I don't know what the correct answer should have been for this case, but these numbers look reasonable.

For the Intel Core i7-3517U processor, Table 19-5 in Section 19.3 ("Performance Monitoring Events for 3rd Generation Intel Core Processors") describes the supported events. Lots of Intel processors have similar names, so it is easy to get confused. The command "cat /proc/cpuinfo" should return a bunch of lines starting with:

processor   : 0
vendor_id   : GenuineIntel
cpu family   : 6
model       : 62

To determine which table to use, convert the "model" value to hex and append it to the "cpu family" value. The sample output above (model 62) gives "06_3EH"; a Core i7-3517U should report model 58, which gives "06_3AH" (the final "H" is a reminder that this is a hexadecimal value). Intel calls this the "DisplayFamily_DisplayModel" encoding or sometimes just the "CPUID signature".
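
The conversion is easy to script. The sketch below builds the signature string from /proc/cpuinfo-style text; the function name is hypothetical, and it assumes the "cpu family" and "model" fields are printed in decimal, as in the listings in this thread.

```python
def display_family_model(cpuinfo_text):
    """Build a "06_3AH"-style string from the first processor entry."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        # "model name" is a different key and is deliberately ignored.
        if key in ("cpu family", "model") and key not in fields:
            fields[key] = int(value.strip())
    return f"{fields['cpu family']:02X}_{fields['model']:02X}H"


sample = """\
processor   : 0
vendor_id   : GenuineIntel
cpu family   : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz
"""
print(display_family_model(sample))
```

On a live Linux system the input would be the contents of /proc/cpuinfo instead of the sample string.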

In some cases all you need to do is search for the string "06_3AH" in Volume 3 of the SW Developer's Guide to find the sections that refer to this specific processor model. Unfortunately you will find that PDFs are not particularly good for searching, so in this case you also need to read the surrounding text to determine the other names that are used for this processor model. You should find section 19.3, but other sections that apply to this processor (such as 35.9) refer to this as the "3rd generation Intel Core processor family (based on Ivy Bridge microarchitecture)".

I don't know why the Intel documentation quit describing Event 0x80, Umask 0x01 "L1I.HITS" after the Westmere processor. The "L1I.MISSES" event is unchanged in the newer processors (Event 0x80, Umask 0x02, now called "ICACHE.MISSES"), but the corresponding instruction cache hit event is not listed in either Volume 3 of the SW Developer's Guide or in the Intel VTune event definition files for processors newer than Westmere. I would assume that the omission of the event means that it is either not reliable or that the instruction fetch process is so complex in the recent processors that it is not usefully informative. With the various instruction buffers in the new processors the idea of an instruction cache miss still makes sense, but there should be lots of cases for which there are no instruction cache accesses, so the concept of instruction cache "hit" is more difficult.

SPYRIDON_T_ · ‎08-06-2014

I don't know why, but these raw events (r81d0, r82d0, r21d0) are still not supported by my perf.

McCalpinJohn · ‎08-06-2014

Hmmm.... That is confusing -- "perf" does not do a lot of checking on arguments with "raw" event selections and the hardware does not do a lot of checking either.

So you are saying that the same system that allowed you to run the performance counters on 2014-08-02 for the "java Harness HelloWorld" does not allow you to specify the events that I listed? What sort of error message is delivered?

SPYRIDON_T_ · ‎08-06-2014

spyros@spyros-LIFEBOOK-UH572:~/SpyDacaPo-Linux$ perf stat -e r80d0,r81d0,r21d0,r01d0,r02d0 java Harness HelloWorld
HelloWorld

===== DaCapo Clojure Benchmarking 1.0 HelloWorld starting =====
Hello, world!
===== DaCapo Clojure Benchmarking 1.0 HelloWorld PASSED in 0 msec =====

Performance counter stats for 'java Harness HelloWorld':

   <not supported> r80d0
   <not supported> r81d0
   <not supported> r21d0
   <not supported> r01d0
   <not supported> r02d0

1.038309525 seconds time elapsed

Unfortunately none of these counters seems to be working.
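
When comparing many events it can help to parse this kind of output mechanically and flag the events that were not counted. A rough sketch, assuming the plain-text "perf stat" layout shown in this thread (value first, then a raw event name); the function name is made up and other perf output formats are not handled.

```python
import re

# A count or a "<not supported>" / "<not counted>" marker, then a raw event name.
LINE_RE = re.compile(
    r"^\s*(<not supported>|<not counted>|[\d.,]+)\s+(r[0-9a-fA-F]+)\b"
)


def parse_perf_stat(text):
    """Map raw event name -> int count, or None if perf did not count it."""
    results = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        value, event = match.groups()
        if value.startswith("<"):
            results[event] = None
        else:
            # Assumes "," and "." appear only as thousands separators in counts.
            results[event] = int(re.sub(r"[.,]", "", value))
    return results


sample = """\
   <not supported> r80d0
           669,204 r81d0
       0.001972686 seconds time elapsed
"""
print(parse_perf_stat(sample))
```

Events that map to None were reported as not supported or not counted on the system in question.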

processor   : 3
vendor_id   : GenuineIntel
cpu family   : 6
model       : 58
model name   : Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz
stepping   : 9
microcode   : 0x12
cpu MHz       : 799.000
cache size   : 4096 KB

McCalpinJohn · ‎08-06-2014

Maybe the perf_events subsystem in your OS is blocking these events? They work fine on my system with Xeon E5-2660 v2 (Ivy Bridge EP) processors. These are DisplayFamily_DisplayModel 06_3EH, while yours are 06_3AH, but both are covered by section 19.4 of Vol 3 of the SW Developer's Guide.

processor   : 0
vendor_id   : GenuineIntel
cpu family   : 6
model       : 62
model name   : Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
stepping   : 4

My OS is actually old enough that it does not fully support this processor, but it does not attempt to block these performance counter events....

c3-501$ uname -a
Linux c3-501.discovery.tacc.utexas.edu 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux

Chaitali_C_ · ‎06-18-2015

Hello,

I want to know which events on Intel Sandy Bridge give counts, not rates, for L1D, L2 and L3 misses.

amplxe-runss -event-list gives a list of events, but there is no direct event of that kind.

There are events for calculating the miss rate, which are also mentioned in the forums, but I can't find information on counts for Sandy Bridge.

Any comments on this?

Thanks,

Chaitali

McCalpinJohn · ‎06-18-2015

The hardware performance counters only count events, not rates. If software presents the results as a rate it is because the software has divided the count by the elapsed time. If the elapsed time is provided you can just multiply the rate by the time to get back to the original counts (at least to the accuracy of the values that the software has provided).
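
As a trivial sketch of that conversion (the function name and the numbers are hypothetical, and the result is only as precise as the rate and time the tool reports):

```python
def count_from_rate(events_per_second, elapsed_seconds):
    """Recover an approximate event count from a reported rate."""
    return events_per_second * elapsed_seconds


# Hypothetical values: 1.5 million misses per second over a 2-second run.
approx_count = count_from_rate(1.5e6, 2.0)
print(round(approx_count))
```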

Although the events you are asking about sound straightforward, there are a remarkable number of complexities. Considering L1 cache misses only, there are many different types of transactions that you might want to count or not count:

L1 Instruction Cache
1. Demand Misses
2. Hardware Prefetch Misses
L1 Data Cache
1. Demand Load Misses
2. Demand Store Misses
3. Data Cache "Streaming Prefetcher" Load or RFO misses
4. Data Cache "IP-based Stride Prefetcher" Load or RFO misses

These events can be counted by the L1 cache (as misses) or by the L2 cache (as accesses). It is often the case that the types of transactions that can be counted on the two sides is different, and even when trying to count the same types of transactions it is often the case that the counts on the two sides are different. The latter will occur, for example, if the L1 Data Cache counts every time it tries to access the L2 cache (even if the request is rejected), while the L2 counts only accesses that it accepts. (There are many other possible sources and types of inconsistencies -- too many to try to list.) The performance counters might also have bugs, and those bugs might be different for the L1 side and the L2 side, and workarounds (if any) might apply to only one side, etc.

At the level of the L2 cache, you have all of the access types above, plus additional transaction types from the two L2 hardware prefetchers ("spatial prefetcher" and "streaming prefetcher"), plus accesses from the Page Miss Handler (Intel's name for the hardware Page Table Walker). (Aside: Page Table Entries are not typically stored in the L1 Data Cache, but are typically cached in the L2 and LLC caches. Translation entries at higher levels in the hierarchical address translation (PDE, PDPTE, and PML4) may or may not be cacheable, depending on the system configuration.) L2 cache misses can (in principle) be counted by the L2 (as misses), by the System Agent (as accesses), or by the L3 cache(s) (as accesses). Again, not all events can be counted in all of these places, and each counter is likely to have different idiosyncrasies and/or bugs.

At the level of the L3 cache, you have all of the complexity of the L2 cache, plus several additional complexities. Because the L3 cache is "inclusive" in most recent Intel processors, the L3 cache is the one that is snooped on IO accesses and the L3 cache is snooped on cache misses from any other processor(s) in the system. In addition, some (many?) Intel processors support the "Direct Cache Access" (DCA) facility, which (if I understand correctly) can cause writes from IO devices to be written directly into the L3 cache.

Intel's Amplifier XE can read the performance counts and include this information in the reports. Understanding how these counts map back to the program execution can be tricky, since the counts are associated with whatever instruction is executing when Amplifier XE decides to take a sample. On the plus side, Amplifier XE knows which counters to use and knows how to apply the workarounds that are available for some counter bugs. This is very likely the most effective approach for high-level analysis. I prefer to manually instrument the code and read the counters before and after sections of interest, but this is a lot more work and requires a lot of testing to understand exactly what the counters are counting.

Which events record absolute Number of L1 Data Cache Hits/Misses?