Hi,
I am running a blocked MM code on a Haswell server.
Performance counter stats for 'taskset -c 0 binaries/matmul/matmul_tiled_sse_128.12.1536':
17,354,925,885 cycles [25.03%]
47,163,342,001 instructions # 2.72 insns per cycle [30.03%]
15,629,057,840 L1-dcache-loads [30.03%]
2,125,485,010 L1-dcache-load-misses # 13.60% of all L1-dcache hits [30.03%]
1,179,936,318 r53e124 [30.03%]
22,469,151 r532124 [30.03%]
49,919,592 r504f2e [19.99%]
4,875,407 r50412e [20.19%]
189,994,023 LLC-prefetches [20.17%]
Specifically, I want to see how blocking affects L1, L2, and L3 references/misses.
I used perf list and selected the following events for L2:
Umask-00 : 0x21 : PMU : [DEMAND_DATA_RD_MISS] : None : Demand Data Read requests that miss L2 cache
Umask-01 : 0x41 : PMU : [DEMAND_DATA_RD_HIT] : None : Demand Data Read requests that hit L2 cache
Umask-12 : 0xe1 : PMU : [ALL_DEMAND_DATA_RD] : None : Any data read request to L2 cache
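As far as I can tell, these map onto the raw codes in the output above as follows: r53e124 and r532124 are event 0x24 (L2_RQSTS) with umasks 0xE1 and 0x21 from the list, and r504f2e and r50412e are the architectural event 0x2E (LONGEST_LAT_CACHE) with umasks 0x4F (references) and 0x41 (misses). The full run was essentially:
perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses \
    -e r53e124,r532124,r504f2e,r50412e,LLC-prefetches \
    taskset -c 0 binaries/matmul/matmul_tiled_sse_128.12.1536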
However, the numbers do not make sense to me.
First, the number of L2 demand data reads (r53e124) is lower than the number of L1-dcache-load-misses. I checked L1-icache-load-misses as well, but even the sum of the two still exceeds the L2 read count by a large amount. One possible reason is L1 miss coalescing, where the processor issues many L1 miss requests in quick succession that all belong to the same cache line, so they are merged into a single L2 request. Since this is a matrix multiplication code, is that kind of pattern expected? Is that the right way to explain these numbers?
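Putting numbers on it: 2,125,485,010 L1-dcache-load-misses versus 1,179,936,318 L2 demand data reads is a ratio of about 1.8, so on average nearly two L1 miss requests would have to collapse into each L2 demand read for this explanation to work.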
Second, the number of L3 references (r504f2e) is much higher than the number of L2 misses (r532124). I can't think of a reason for that.
Am I thinking in the right direction? Have I chosen the right hardware counters?
In order to interpret these results, we need to know exactly which hardware events are actually programmed for the named events: L1-dcache-loads, L1-dcache-load-misses, and LLC-prefetches.
Unfortunately the structure of the performance counter subsystem of the Linux kernel appears deliberately designed to obfuscate the specific meaning of named events. Sometimes I am able to find the location in the kernel source where these events are defined, but today I am not having any luck.
As a horrible hack, I often resort to reading the values from the performance counter event select registers manually. For example:
# Start a process under "perf stat" that stalls waiting on stdin and put it in the background.
# Bind the process to core 1 so I know where to look for the counter programming.
taskset -c 1 perf stat -e L1-dcache-loads cat >/dev/null &
# Now read the counters manually -- since the process context has changed, the
# counters will be disabled, but "perf stat" typically does not modify the other bits
# of the PERF_EVT_SEL MSRs.
rdmsr -p 1 0x186
rdmsr -p 1 0x187
rdmsr -p 1 0x188
rdmsr -p 1 0x189
rdmsr -p 1 0x18a
rdmsr -p 1 0x18b
rdmsr -p 1 0x18c
rdmsr -p 1 0x18d
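To decode what you read back, the low byte of each PERF_EVT_SEL value is the event code and the next byte is the umask (the higher bits hold the USR/OS/enable and other control flags). A quick shell helper, using a made-up example value:
val=0x5301d1    # example PERF_EVT_SEL value, made up for illustration
printf "event=%#x umask=%#x\n" $(( val & 0xff )) $(( (val >> 8) & 0xff ))
# prints: event=0xd1 umask=0x1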
Since the definitions of the events are hard-coded into the kernel (and can change from revision to revision) I can't guarantee that the events on your system are the same as the events on my system.
On one of my Haswell systems running RHEL 6.6 (kernel 2.6.32-504.1.3.el6.x86_64), I checked the three named events this way and found that the "named" events are correct in 0 of 3 cases --- way to go Ingo Molnar!!!!!
If I recall correctly, a blocked matrix multiply kernel will have roughly 1/2 load misses and 1/2 store misses in the L1 data cache, and the counter reported as L1-dcache-load-misses likely includes both. Taking half of your 2.13 billion L1-dcache-load-misses gives roughly 1.06 billion actual L1 data read misses, which is a reasonably close match to the 1.18 billion L2 demand data read accesses (r53e124).
Using the explicitly programmed L2 demand data read and demand data read miss counters, it looks like you have about a 2% L2 demand read miss rate. That is what I would expect for a matrix multiplication kernel with a block size on the order of 100.
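For the record, the arithmetic from your numbers: 22,469,151 L2 demand data read misses / 1,179,936,318 L2 demand data reads ≈ 0.019, i.e. about a 1.9% demand read miss rate.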
If I recall correctly, the 0x2E events (LONGEST_LAT_CACHE references and misses) count accesses due to both demand load misses and demand RFO (store) misses, so these need to be interpreted carefully as well. Dividing the ~49.9 million LLC references by two (assuming half were due to store misses) gives about 25 million, which is much closer to the 22.5 million L2 demand data read misses. It is still about 11% high, but it is hard to tell exactly what is going on from this counter set. Measuring the actual L2 demand RFO miss count would let you compare L2 misses with LLC references directly.
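If I remember the Haswell event list correctly, the store-side (RFO) analogues of the L2 counters you already use are umask 0xE2 (L2_RQSTS.ALL_RFO) and umask 0x22 (L2_RQSTS.RFO_MISS) on the same event 0x24 -- please double-check those umasks against your own event list before trusting the numbers. Adding them to the run would look something like:
perf stat -e r53e124,r532124,r53e224,r532224 taskset -c 0 binaries/matmul/matmul_tiled_sse_128.12.1536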
It is going to take a fair amount of careful directed testing to understand these counters, but it looks like you are making a good start.