I am running a blocked matrix multiplication (MM) code on a Haswell server.
Performance counter stats for 'taskset -c 0 binaries/matmul/matmul_tiled_sse_128.12.1536':
17,354,925,885 cycles [25.03%]
47,163,342,001 instructions # 2.72 insns per cycle [30.03%]
15,629,057,840 L1-dcache-loads [30.03%]
2,125,485,010 L1-dcache-load-misses # 13.60% of all L1-dcache hits [30.03%]
1,179,936,318 r53e124 [30.03%]
22,469,151 r532124 [30.03%]
49,919,592 r504f2e [19.99%]
4,875,407 r50412e [20.19%]
189,994,023 LLC-prefetches [20.17%]
Specifically, I want to see how blocking affects L1, L2, and L3 references/misses.
I used perf list and selected the following events for L2:
Umask-00 : 0x21 : PMU : [DEMAND_DATA_RD_MISS] : None : Demand Data Read requests that miss L2 cache
Umask-01 : 0x41 : PMU : [DEMAND_DATA_RD_HIT] : None : Demand Data Read requests that hit L2 cache
Umask-12 : 0xe1 : PMU : [ALL_DEMAND_DATA_RD] : None : Any data read request to L2 cache
However, the numbers do not make sense to me.
First, the number of L2 reads (r53e124) is lower than L1-dcache-load-misses. I checked L1-icache-misses as well, but the L2 reads still fall short of the sum by a large amount. One reason could be L1 miss coalescing, where the processor quickly sends many L1 miss requests to L2 that all belong to the same cache line. Since this is matrix multiplication code, are such patterns expected? Is this the right way to explain those numbers?
Second, L3 references (r504f2e) are much higher than L2 misses (r532124). I can't think of any reason for this.
Am I thinking in the right direction? Have I chosen the right hardware counters?
In order to interpret these results we need to know exactly which events are programmed with the named events: L1-dcache-loads, L1-dcache-load-misses, and LLC-prefetches.
Unfortunately the structure of the performance counter subsystem of the Linux kernel appears deliberately designed to obfuscate the specific meaning of named events. Sometimes I am able to find the location in the kernel source where these events are defined, but today I am not having any luck.
As a horrible hack, I often resort to reading the values from the performance counter event select registers manually. For example:
    # start a process under "perf stat" that stalls waiting on stdin and put it in the background
    # bind the process to core 1 so I know where to look for the counter programming
    taskset -c 1 perf stat -e L1-dcache-loads cat >/dev/null &
    # Now read the counters manually -- since the process context has changed, the
    # counters will be disabled, but "perf stat" typically does not modify the other bits
    # of the PERF_EVT_SEL MSRs
    rdmsr -p 1 0x186
    rdmsr -p 1 0x187
    rdmsr -p 1 0x188
    rdmsr -p 1 0x189
    rdmsr -p 1 0x18a
    rdmsr -p 1 0x18b
    rdmsr -p 1 0x18c
    rdmsr -p 1 0x18d
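The values that rdmsr returns follow the layout of the IA32_PERFEVTSELx registers: the event select sits in bits 0-7, the umask in bits 8-15, and the control flags above that. As a minimal sketch, the fields can be separated with shell arithmetic. The helper name decode_evtsel is hypothetical, and only some of the flag bits are printed. The raw codes in the question carry the same fields, so the example decodes r53e124:

```shell
#!/bin/sh
# decode_evtsel is a hypothetical helper, not part of perf or msr-tools.
# It splits an IA32_PERFEVTSELx value (or the hex part of a perf raw code)
# into its fields. The edge, pin-control, any-thread and invert bits
# (18, 19, 21, 23) are left out to keep the sketch short.
decode_evtsel() {
    v=$(( $1 ))
    printf 'event=0x%02x umask=0x%02x usr=%d os=%d int=%d en=%d cmask=%d\n' \
        $(( v & 0xff )) \
        $(( (v >> 8) & 0xff )) \
        $(( (v >> 16) & 1 )) \
        $(( (v >> 17) & 1 )) \
        $(( (v >> 20) & 1 )) \
        $(( (v >> 22) & 1 )) \
        $(( (v >> 24) & 0xff ))
}

# r53e124 from the question, written as a hex number
decode_evtsel 0x53e124
```

For 0x53e124 this reports event 0x24 with umask 0xe1, which is the ALL_DEMAND_DATA_RD line in the umask listing from the question.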
Since the definitions of the events are hard-coded into the kernel (and can change from revision to revision) I can't guarantee that the events on your system are the same as the events on my system.
On one of my Haswell systems running RHEL6.6 (2.6.32-504.1.3.el6.x86_64) I found:
- L1-dcache-loads is programmed incorrectly as Event 0xd0, Umask 0xf1
  - The event is OK
  - The umask for counting "all loads" should be 0x81, not 0xf1
  - I don't know if setting the extra Umask bits causes the counter to be incorrect, but it would be better to explicitly program the desired event
  - The value of one load every 3 instructions seems reasonable for a matrix multiplication kernel -- you can check this against the inner loop of the assembly code to see if it looks right for your implementation.
- L1-dcache-load-misses is programmed incorrectly as Event 0x51, Umask 0x01
  - This Event+Umask is L1D.REPLACEMENT, which is the wrong event
  - L1D.REPLACEMENT will include L1 dcache refills due to any cause -- demand load misses, demand store misses, prefetch load misses, etc.
- LLC-prefetches is incorrectly programmed using an OFF_CORE_RESPONSE event.
  - The auxiliary MSR (0x1a6) is programmed to 0x00010030
    - Bit 5 says to count RFO requests (store misses) generated by the L2 prefetchers
    - Bit 6 says to count code reads generated by the L2 prefetchers
    - Bit 16 says to count for any data supplier (local L3 hit, remote cache hit, memory, etc)
  - The bits that are programmed appear to be correct, but the most important bit is missing -- bit 4 for read prefetches generated by the L2 prefetchers!
  - It is not clear whether bits 7, 8, 9 should also be set -- if I understand correctly, these are the bits that cause the counter to increment for L2 hardware prefetches that fetch data into the LLC (but not directly into the L2).
So the "named" events are correct in 0 of 3 cases --- way to go Ingo Molnar!!!!!
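One way to sidestep the named events is to pass raw codes to perf. These take the form r followed by the umask byte and then the event byte, both in hex. A small sketch, assuming a hypothetical helper called raw_event, builds the code for Event 0xd0 with Umask 0x81, the "all loads" encoding mentioned above:

```shell
#!/bin/sh
# raw_event is a hypothetical helper: it turns an event number and a umask
# into the rUUEE raw code that perf accepts with -e.
raw_event() {
    printf 'r%02x%02x\n' $(( $2 )) $(( $1 ))
}

# Event 0xd0 with umask 0x81, as suggested for counting all loads
raw_event 0xd0 0x81
```

The resulting string can be used in place of the name, for example `perf stat -e "$(raw_event 0xd0 0x81)"` followed by the same taskset command line as before.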
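If I recall correctly, in a blocked matrix multiply kernel about half of the L1 misses are load misses and half are store misses. Halving the L1D.REPLACEMENT count (2,125,485,010 / 2, or about 1.06 billion) therefore gives an estimate of your actual L1 data cache read misses that is a close match to the L2 demand read accesses (about 1.18 billion).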
Using the explicitly programmed L2 demand data read and demand data read miss counters it looks like you have a 2% L2 miss rate. This is what I would expect for a matrix multiplication kernel with a block size of 100.
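The 2% figure can be checked directly from the two raw counters in the question. This is only a sketch: pct is a hypothetical helper, and awk is used because plain shell arithmetic is integer-only.

```shell
#!/bin/sh
# pct is a hypothetical helper: prints a/b as a percentage with two decimals.
pct() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", 100 * a / b }'
}

# Counts copied from the perf stat output in the question
l2_demand_rd=1179936318      # r53e124, all L2 demand data reads
l2_demand_rd_miss=22469151   # r532124, L2 demand data read misses

# L2 demand read miss rate in percent
pct "$l2_demand_rd_miss" "$l2_demand_rd"
```

With these counts the result is 1.90, consistent with the roughly 2% quoted above.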
If I recall correctly, the 0x2e events (LONGEST_LAT_CACHE references and misses) also count accesses due to both demand load misses and demand RFO (store) misses, so these need to be considered carefully as well. Dividing the LLC references (r504f2e) by two (assuming 1/2 were due to store misses) brings the number much closer to the L2 demand read miss value. It is still about 11% high, but it is hard to tell what is going on from this counter set. Getting the actual L2 demand RFO miss count would allow you to compare L2 misses with LLC references.
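The "about 11% high" estimate can be reproduced from the r504f2e and r532124 counts. The sketch below uses a hypothetical helper, excess_pct, and keeps the assumption from the text that half of the LLC references come from store (RFO) misses.

```shell
#!/bin/sh
# excess_pct is a hypothetical helper: prints by how many percent x exceeds y.
excess_pct() {
    awk -v x="$1" -v y="$2" 'BEGIN { printf "%.1f\n", 100 * (x / y - 1) }'
}

# Counts copied from the perf stat output in the question
llc_refs=49919592            # r504f2e, LONGEST_LAT_CACHE references
l2_demand_rd_miss=22469151   # r532124, L2 demand data read misses

# Assumption: half of the LLC references are due to store (RFO) misses,
# so only the other half is compared against the L2 demand read misses.
excess_pct $(( llc_refs / 2 )) "$l2_demand_rd_miss"
```

This gives 11.1, so the halved reference count is still about 11% above the L2 demand read misses.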
It is going to take a fair amount of careful directed testing to understand these counters, but it looks like you are making a good start.