You can analyze the L1D cache very precisely using the events L1D_CACHE_LD.*, L1D_CACHE_ST.*, and L1D_CACHE_LOCK.*. For example, you can get the number of cache misses using:
L1D_CACHE_LD.I_STATE + L1D_CACHE_ST.I_STATE + L1D_CACHE_LOCK.I_STATE
The event L1D_ALL_CACHE_REF might also be handy, which counts the sum of L1D_CACHE_LD.*, L1D_CACHE_ST.*, and L1D_CACHE_LOCK.*.
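As a minimal sketch of the formula above, the miss count is just the sum of the three I_STATE events, and dividing by L1D_ALL_CACHE_REF gives a miss ratio. The counts below are made-up placeholders, not measurements, and the helper names are hypothetical; this only applies to processors that document these events (for example Core 2).

```python
# Placeholder counts for illustration only; real values come from the profiler.
counts = {
    "L1D_CACHE_LD.I_STATE": 1200,
    "L1D_CACHE_ST.I_STATE": 300,
    "L1D_CACHE_LOCK.I_STATE": 5,
    "L1D_ALL_CACHE_REF": 100000,
}

MISS_EVENTS = (
    "L1D_CACHE_LD.I_STATE",
    "L1D_CACHE_ST.I_STATE",
    "L1D_CACHE_LOCK.I_STATE",
)


def l1d_misses(counts):
    """Sum the load, store, and lock accesses that found the line invalid."""
    return sum(counts[name] for name in MISS_EVENTS)


def l1d_miss_ratio(counts):
    """Misses divided by all L1D references."""
    return l1d_misses(counts) / counts["L1D_ALL_CACHE_REF"]


if __name__ == "__main__":
    print(l1d_misses(counts), l1d_miss_ratio(counts))
```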
I hope this helps.
This is Joshua...
I'm using VTune on Windows...
Do you have any idea how to do what you describe above on Windows?
I want to count the L1 cache misses...
And, that is not easy for me...^^
The events do not depend on the OS but on the architecture, e.g. if you have a Core 2 Duo system, you can select the events that I listed above.
The GUI is slightly different on Windows, but the methodology is the same. If you have Windows-specific questions, I suggest that you ask in the forum about VTune for Windows.
I am trying to measure the performance of my Intel processor, but it seems that L1D_ALL_CACHE_REF is not equal to the sum of L1D_CACHE_LD.MESI, L1D_CACHE_ST.MESI, and L1D_CACHE_LOCK.MESI. Do you have any idea about this issue?
I am using the Linux perf tool.
Thank you very much.
Some random comments:
Performance counter stats for 'java Harness HelloWorld' (10 runs):
2214724 r0f40 ( +- 0,74% )
609492941 r0f41 ( +- 0,37% )
3677638 r0f42 ( +- 5,27% )
1760947374 r0243 ( +- 0,30% )
0,952851801 seconds time elapsed ( +- 0,42% )
Running the Hello World program (repeated 10 times in a benchmark suite), you can see that the r0243 (L1D_ALL_CACHE_REF) count is significantly higher than the sum of r0f40 (L1D_CACHE_LD.MESI), r0f41 (L1D_CACHE_ST.MESI), and r0f42 (L1D_CACHE_LOCK.MESI), in contrast to what is written here: http://www.jaist.ac.jp/iscenter-new/mpc/altix/altixdata/opt/intel/vtune/doc/users_guide/mergedProjec...
My processor is an Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz.
I still can't explain why, but the L1D_*.MESI counts seem more plausible to me than r0243. Can L1D_ALL_CACHE_REF be wrong?
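To put a number on the mismatch, here is a small check using the mean counts from the perf output above. The variable names are mine; the values are copied from that output.

```python
# Mean counts copied from the perf stat output above (10 runs).
counts = {
    "r0f40": 2214724,     # load event, umask 0x0f
    "r0f41": 609492941,   # store event, umask 0x0f
    "r0f42": 3677638,     # lock event, umask 0x0f
    "r0243": 1760947374,  # the event I expected to equal the sum
}

mesi_sum = counts["r0f40"] + counts["r0f41"] + counts["r0f42"]
ratio = counts["r0243"] / mesi_sum

print(mesi_sum)          # 615385303
print(f"{ratio:.2f}")    # 2.86
```

So r0243 is almost three times the sum of the other three, far beyond the run-to-run variation that perf reports.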
It looks like this may be another victim of Intel's ambiguous processor naming scheme....
The discussions above started with a reference to a "Core 2" system (back in December 2007 when this thread started). The events L1D_CACHE_LD.MESI, L1D_CACHE_ST.MESI, L1D_CACHE_LOCK.MESI, and L1D_ALL_CACHE_REF.MESI are only documented for the Intel Xeon 3000, 3200, 5100, 5300, and Intel Core 2 Duo processors -- see Table 19-17 in Section 19.8 of Volume 3 of the Intel SW Developer's Guide (document 325384-049).
The corresponding performance counter event select values (0x40, 0x41, 0x42, 0x43) are also described for a few other processor models -- using different names but similar descriptions:
The performance monitoring events for the Intel Core i7-3517U are described in Table 19-5 of Section 19.3 of Volume 3 of the SW Developer's Guide. This table does not include descriptions for EventSelect codes 0x40, 0x41, 0x42, or 0x43.
The link above to the documentation at JAIST is for an SGI Altix system, which appears to be an Altix UV1000, using the Westmere EX processors. So the descriptions of section 19.6 should apply to that system, but not to your Ivy-Bridge-based Core i7-3517U processor.
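The performance counter events for the Ivy-Bridge-based Core i7 that are most similar to the L1D_CACHE_*.MESI events discussed here are probably MEM_UOPS_RETIRED.ALL_LOADS, MEM_UOPS_RETIRED.ALL_STORES, and MEM_UOPS_RETIRED.LOCK_LOADS (Event 0xD0 with Umasks 0x81, 0x82, and 0x21, respectively).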
Only one of these is directly documented in Section 19.3, Table 19-5 of Volume 3 of the SW Developer's Guide, but the other two are used by the Intel VTune product -- I found them in the file "ivybridge_db.txt" in the Intel VTune distribution. The use of the events by VTune is encouraging, but the absence of documentation in the SW Developer's Guide should make one somewhat cautious in using these events. They probably work, but there might be cases for which they count incorrectly. I found one mention of the MEM_UOPS_RETIRED events (Event 0xD0) in the "Desktop 3rd Generation Intel Core Processor Family Specification Update" publication (document 326766-016, January 2014), which says that these events can count in the wrong thread (logical processor) when HyperThreading is enabled. There may be additional errors that are not publicly documented.
Thank you for your answer. The events MEM_UOPS_RETIRED.ALL_LOADS, MEM_UOPS_RETIRED.ALL_STORES, and MEM_UOPS_RETIRED.LOCK_LOADS are not supported by perf, so I think I should find another program to measure them.
However, to measure L1I.READS on the same processor (Intel Core i7-3517U), I used Event 80H with Umask 03H, as described in Table 19-12 (Vol. 3B, page 19-53) of the same manual, "Non-Architectural Performance Events In the Processor Core for Intel Core i7 Processor and Intel Xeon Processor Series (Contd)". Is that right? The values I get from it look plausible.
In addition, to measure L2 and LLC performance I used the events from manual 325384-051 listed in Tables 19-6 (Section 19.4) and 19-1 (Section 19.1). The descriptions of these tables say that they cover the Ivy Bridge architecture.
Thank you very much for your help and your time!
I don't know how "perf" does its name to event translation. The names I used in the last example were taken from the VTune "ivybridge_db.txt" file, so they don't quite match the names in Chapter 19 of Volume 3 of the SW Developer's Guide. You should be able to program them as "raw" events using "perf stat" if the system allows you to access any of the performance counters. On my Xeon E5-2680 system (Sandy Bridge EP), the following simple test works as expected:
$ perf stat -e r81d0 -e r82d0 -e r21d0 /bin/ls >/dev/null
Performance counter stats for '/bin/ls':
0.001972686 seconds time elapsed
I don't know what the correct answer should have been for this case, but these numbers look reasonable.
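For reference, the raw codes in the perf command above pack the unit mask and the event select into one hex string: "r81d0" is Umask 0x81 with Event 0xD0, and "r0243" from earlier in the thread is Umask 0x02 with Event 0x43. A minimal sketch of that encoding follows; the helper name is hypothetical, and it covers only these two fields, not the other bits of the event select register.

```python
def perf_raw_event(event_select: int, umask: int) -> str:
    """Build a perf raw event string of the form r<umask><event>, in hex."""
    if not (0 <= event_select <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event select and umask must each fit in 8 bits")
    return f"r{umask:02x}{event_select:02x}"


if __name__ == "__main__":
    # Event 0xD0 with the three unit masks used in the command above.
    for umask in (0x81, 0x82, 0x21):
        print(perf_raw_event(0xD0, umask))
```

The resulting strings can be passed to "perf stat -e" exactly as in the command above.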
For the Intel Core i7-3517U processor, Table 19-5 in Section 19.3 ("Performance Monitoring Events for 3rd Generation Intel Core Processors") describes the supported events. Lots of Intel processors have similar names, so it is easy to get confused. The command "cat /proc/cpuinfo" should return a bunch of lines starting with:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 58
To determine which table to use, convert the "model" value to hex and append it to the "cpu family" value to get "06_3AH" (the final "H" is a reminder that this is a hexadecimal value). Intel calls this the "DisplayFamily_DisplayModel" encoding or sometimes just the "CPUID signature".
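The conversion can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and it assumes the "cpu family" and "model" fields appear in /proc/cpuinfo in the format shown in this thread.

```python
def cpuid_signature(cpuinfo_text: str) -> str:
    """Return the DisplayFamily_DisplayModel string, e.g. '06_3AH'."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        # Keep only the first logical processor's entry for each key.
        if key not in fields:
            fields[key] = value.strip()
    family = int(fields["cpu family"])
    model = int(fields["model"])  # decimal in /proc/cpuinfo
    return f"{family:02X}_{model:02X}H"


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        print(cpuid_signature(f.read()))
```

For family 6 and model 58 this gives "06_3AH"; for family 6 and model 62 it gives "06_3EH".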
In some cases all you need to do is search for the string "06_3AH" in Volume 3 of the SW Developer's Guide to find the sections that refer to this specific processor model. Unfortunately you will find that PDFs are not particularly good for searching, so in this case you also need to read the surrounding text to determine the other names that are used for this processor model. You should find section 19.3, but other sections that apply to this processor (such as 35.9) refer to this as the "3rd generation Intel Core processor family (based on Ivy Bridge microarchitecture)".
I don't know why the Intel documentation quit describing Event 0x80, Umask 0x01 "L1I.HITS" after the Westmere processor. The "L1I.MISSES" event is unchanged in the newer processors (Event 0x80, Umask 0x02, now called "ICACHE.MISSES"), but the corresponding instruction cache hit event is not listed in either Volume 3 of the SW Developer's Guide or in the Intel VTune event definition files for processors newer than Westmere. I would assume that the omission of the event means that it is either not reliable or that the instruction fetch process is so complex in the recent processors that it is not usefully informative. With the various instruction buffers in the new processors the idea of an instruction cache miss still makes sense, but there should be lots of cases for which there are no instruction cache accesses, so the concept of instruction cache "hit" is more difficult.
Hmmm.... That is confusing -- "perf" does not do a lot of checking on arguments with "raw" event selections and the hardware does not do a lot of checking either.
So you are saying that the same system that allowed you to run the performance counters on 2014-08-02 for the "java Harness HelloWorld" does not allow you to specify the events that I listed? What sort of error message is delivered?
spyros@spyros-LIFEBOOK-UH572:~/SpyDacaPo-Linux$ perf stat -e r80d0,r81d0,r21d0,r01d0,r02d0 java Harness HelloWorld
===== DaCapo Clojure Benchmarking 1.0 HelloWorld starting =====
===== DaCapo Clojure Benchmarking 1.0 HelloWorld PASSED in 0 msec =====
Performance counter stats for 'java Harness HelloWorld':
<not supported> r80d0
<not supported> r81d0
<not supported> r21d0
<not supported> r01d0
<not supported> r02d0
1.038309525 seconds time elapsed
Unfortunately none of these counters seems to be working.
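When checking many raw events, the "<not supported>" lines can be picked out of the perf stat output automatically. This is a rough sketch, assuming output lines of the form shown in this thread (a count or "<not supported>" followed by the raw event name); the function name is hypothetical, and it does not handle counts printed with thousands separators.

```python
import re

# A plain integer count or the "<not supported>" marker, then a raw event name.
LINE_RE = re.compile(r"\s*(<not supported>|\d+)\s+(r[0-9a-fA-F]+)\b")


def parse_perf_stat(text):
    """Map each raw event to its count, or to None if perf could not count it."""
    results = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            value, event = match.groups()
            results[event] = None if value.startswith("<") else int(value)
    return results


def unsupported_events(text):
    """List the raw events that perf reported as not supported."""
    return [event for event, count in parse_perf_stat(text).items() if count is None]
```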
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 58
model name : Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz
stepping : 9
microcode : 0x12
cpu MHz : 799.000
cache size : 4096 KB
Maybe the perf_events subsystem in your OS is blocking these events? They work fine on my system with Xeon E5-2660 v2 (Ivy Bridge EP) processors. These are DisplayFamily_DisplayModel 06_3EH, while yours are 06_3AH, but both are covered by section 19.4 of Vol 3 of the SW Developer's Guide.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
stepping : 4
My OS is actually old enough that it does not fully support this processor, but it does not attempt to block these performance counter events....
c3-501$ uname -a
Linux c3-501.discovery.tacc.utexas.edu 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 x86_64 x86_64 x86_64 GNU/Linux
I want to know which events on Intel Sandy Bridge give a count, not a rate, for L1D, L2, and L3 misses.
"amplxe-runss -event-list" gives a list of events, but there is no direct event for this.
There are events for calculating the miss rate, which are also mentioned in the forums, but I cannot find information on getting the count for Sandy Bridge....
Any comments on this?
The hardware performance counters only count events, not rates. If software presents the results as a rate it is because the software has divided the count by the elapsed time. If the elapsed time is provided you can just multiply the rate by the time to get back to the original counts (at least to the accuracy of the values that the software has provided).
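As a trivial sketch of that conversion (the function name and the numbers are made up for illustration):

```python
def count_from_rate(rate_per_second: float, elapsed_seconds: float) -> int:
    """Recover an event count from a reported rate and the elapsed time."""
    return round(rate_per_second * elapsed_seconds)


if __name__ == "__main__":
    # Placeholder values: 2.5 million events per second over 0.4 seconds.
    print(count_from_rate(2.5e6, 0.4))
```

The result is only as accurate as the rate and time that the tool printed, so expect small rounding differences from the raw counter value.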
Although the events you are asking about sound straightforward, there are a remarkable number of complexities. Considering L1 cache misses only, there are many different types of transactions that you might want to count or not count:
These events can be counted by the L1 cache (as misses) or by the L2 cache (as accesses). It is often the case that the types of transactions that can be counted on the two sides are different, and even when trying to count the same types of transactions it is often the case that the counts on the two sides are different. The latter will occur, for example, if the L1 Data Cache counts every time it tries to access the L2 cache (even if the request is rejected), while the L2 counts only accesses that it accepts. (There are many other possible sources and types of inconsistencies -- too many to try to list.) The performance counters might also have bugs, and those bugs might be different for the L1 side and the L2 side, and workarounds (if any) might apply to only one side, etc.
At the level of the L2 cache, you have all of the access types above, plus additional transaction types from the two L2 hardware prefetchers ("spatial prefetcher" and "streaming prefetcher"), plus accesses from the Page Miss Handler (Intel's name for the hardware Page Table Walker). (Aside: Page Table Entries are not typically stored in the L1 Data Cache, but are typically cached in the L2 and LLC caches. Translation entries at higher levels in the hierarchical address translation (PDE, PDPTE, and PML4) may or may not be cacheable, depending on the system configuration.) L2 cache misses can (in principle) be counted by the L2 (as misses), by the System Agent (as accesses), or by the L3 cache(s) (as accesses). Again, not all events can be counted in all of these places, and each counter is likely to have different idiosyncrasies and/or bugs.
At the level of the L3 cache, you have all of the complexity of the L2 cache, plus several additional complexities. Because the L3 cache is "inclusive" in most recent Intel processors, the L3 cache is the one that is snooped on IO accesses and the L3 cache is snooped on cache misses from any other processor(s) in the system. In addition, some (many?) Intel processors support the "Direct Cache Access" (DCA) facility, which (if I understand correctly) can cause writes from IO devices to be written directly into the L3 cache.
Intel's Amplifier XE can read the performance counts and include this information in the reports. Understanding how these counts map back to the program execution can be tricky, since the counts are associated with whatever instruction is executing when Amplifier XE decides to take a sample. On the plus side, Amplifier XE knows which counters to use and knows how to apply the workarounds that are available for some counter bugs. This is very likely the most effective approach for high-level analysis. I prefer to manually instrument the code and read the counters before and after sections of interest, but this is a lot more work and requires a lot of testing to understand exactly what the counters are counting.
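The manual approach boils down to reading a counter before and after the section of interest and taking the difference. Here is a rough sketch of that pattern only: "read_counter" is a hypothetical callable standing in for whatever mechanism actually reads the hardware counter, and the 48-bit counter width is an assumption that should be checked for the processor in use.

```python
COUNTER_BITS = 48  # assumed counter width; check the actual width on your processor


def counter_delta(before: int, after: int, bits: int = COUNTER_BITS) -> int:
    """Difference between two raw counter reads, allowing for one wraparound."""
    return (after - before) % (1 << bits)


def measure(read_counter, section):
    """Run section() and return (its result, events counted while it ran).

    read_counter is a hypothetical callable that returns the raw counter value.
    """
    before = read_counter()
    result = section()
    after = read_counter()
    return result, counter_delta(before, after)
```

The delta includes whatever the reads themselves and any interrupts contribute, which is part of why this approach needs careful testing.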