I'm trying to analyze and compare two simple geometric multi-grid kernels using performance counters, to see how much of the data (in bytes) comes from L1, L2, LLC and DRAM for each implementation.
I realize that getting an accurate count is extremely difficult with so many different things going on underneath (prefetching, instructions, cache lines, etc.), so I am trying to get at least a *rough estimate*.
I'm using LIKWID to analyze my code and I was hoping to get what I need using the following counters:
More precisely, I was hoping that A = B + C + D + E, so that I could use these counts to get a rough estimate of the proportion of data coming from L1, L2, L3 and DRAM. For example, if my total bytes accessed were 1 GB, then the L1 share would be roughly 1 GB * B / A, i.e. 1 GB * B / (B + C + D + E), and so on.
Unfortunately, A > B + C + D + E, and I'm not sure why. Maybe B, C, D and E don't count off-core cache accesses? Or maybe I'm just wrong about what these counters are counting.
So basically I have two questions:
1) What are A, B, C, D and E counting exactly? and
2) Is there any way to get a (rough) breakdown of how much of my kernel's required/demanded data is coming from where (L1, L2, L3, or DRAM)?
You are missing an important event -- the name will be something like MEM_UOPS_RETIRED_LFB_HIT. This counts the loads that miss in the L1 cache but merge into a preceding L1 cache miss that is still outstanding (i.e., the line is already being fetched through a line fill buffer). Including this event should help close the gap in the accounting.
If we call the MEM_UOPS_RETIRED_LFB_HIT event F, then you should have A = B + C + D + E + F.
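As a rough illustration of that accounting, here is a minimal Python sketch that takes the six raw counts and reports each bucket as a fraction of A, plus whatever is left unaccounted for. The function name and the example numbers are made up for illustration; they are not LIKWID output. Note that the result is a breakdown of retired load uops by where the data was found, not of bytes moved.

```python
def load_source_breakdown(a, b, c, d, e, f):
    """Split retired loads (A) into the buckets B..F.

    a: all retired loads (A)
    b: L1 hits (B), c: L2 hits (C), d: LLC hits (D),
    e: LLC misses (E), f: LFB hits (F)
    Returns (fractions of A per bucket, unaccounted fraction of A).
    """
    if a <= 0:
        raise ValueError("total load count A must be positive")
    buckets = {
        "L1 hit": b,
        "L2 hit": c,
        "LLC hit": d,
        "LLC miss": e,
        "LFB hit": f,
    }
    fractions = {name: count / a for name, count in buckets.items()}
    # If A = B + C + D + E + F holds, this residual is (close to) zero.
    residual = (a - sum(buckets.values())) / a
    return fractions, residual


# Hypothetical counts, chosen only so that the equation closes exactly.
fractions, residual = load_source_breakdown(
    a=1000, b=700, c=150, d=50, e=20, f=80
)
for name, share in fractions.items():
    print(f"{name:8s} {share:6.1%}")
print(f"unaccounted {residual:6.1%}")
```

A residual far from zero would suggest that some event is still missing or that the counters were not all measured over the same region.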
Unfortunately these events won't tell you how much traffic there is between the various levels of cache, for two reasons:
1. These events are supposed to tell where the data was found when a load micro-op is executed. They don't (and can't) say whether the data was found at that location because it was previously used and is still in the cache, or because it was prefetched into that level of cache. So they are good for identifying long-latency loads that may cause processor stalls, but they are not good for determining either total data traffic or data re-use at each level of the cache.
2. On at least some platforms, events C & D (load hit L2 & load hit LLC) are broken for AVX loads -- they always return zero. The equation will still close, but the AVX loads that actually find their data in the L2 or LLC will be counted in bucket E (LLC misses). Oddly, this is not discussed in either the SW developer's guide (Volume 3) or in the processor specification updates (i.e., errata), but is mentioned in the appendix of the performance optimization manual that discusses performance counters.
Fortunately (?), the performance of multi-grid kernels is usually limited by data movement, so re-compiling for SSE4.2 instead of AVX will restore the functionality of the counters with (typically) minimal impact on performance.
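One way to spot the symptom described in point 2 is to look for L2-hit and LLC-hit counts that are both exactly zero while LLC misses are not. The sketch below is only a heuristic under that assumption, with a hypothetical function name and made-up counts; comparing an AVX build against an SSE4.2 build of the same kernel is the more reliable check.

```python
def hit_counters_look_broken(l2_hits, llc_hits, llc_misses):
    """Heuristic: flag runs where C and D are both zero but E is not.

    That pattern matches what broken L2-hit/LLC-hit events for AVX loads
    would produce, since those loads would all land in the LLC-miss bucket.
    It is not proof; treat a True result as a prompt to re-measure.
    """
    return l2_hits == 0 and llc_hits == 0 and llc_misses > 0


# Hypothetical counts for the same kernel built two ways.
avx_run = {"l2_hits": 0, "llc_hits": 0, "llc_misses": 220}
sse_run = {"l2_hits": 150, "llc_hits": 50, "llc_misses": 20}

print("AVX build suspicious:", hit_counters_look_broken(**avx_run))
print("SSE4.2 build suspicious:", hit_counters_look_broken(**sse_run))
```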