Hardware Counters on KNL

Chronus_Taizen · 03-11-2017

Here's a question for Intel,

On KNL, running "perf stat -d" on any executable tells me how many L1 cache misses I have, but it never tells me what percentage of all cache operations (hits + misses) that is. In fact, there is a counter right underneath labeled something along the lines of "Total L1 Cache Loads" (I forget the exact wording), and it is reported as "Not available". Why would you make a counter needed to calculate the cache miss percentage unavailable in hardware?

I don't know whether the millions of cache misses I am getting are a problem (say, an 80% L1 cache miss rate) or not a problem at all (say, a 1% L1 cache miss rate). Am I missing something?

Thanks.

McCalpinJohn · 03-13-2017

There are not a lot of hardware performance counter events available on the Xeon Phi x200 systems. The KNL performance monitoring infrastructure is described in a two-volume set:

"Intel Xeon Phi Processor Performance Monitoring Reference Manual -- Volume 1: Registers", Intel document 332972
"Intel Xeon Phi Processor Performance Monitoring Reference Manual -- Volume 2: Events", Intel document 334480

As the names suggest, the first volume describes the infrastructure, while the second volume describes the actual events that can be measured. (If I recall correctly, the programming of the auxiliary MSRs for the OFFCORE_RESPONSE counter events is described in the first volume, rather than the second.)

The second volume describes only 22 performance counter events for the core. Looking over these events, you can count memory operations and you can count L2 accesses, so you can define some metrics yourself. The memory operation counter does not distinguish between operations of different sizes, so it can be tricky to understand what the numbers mean. For example, loading contiguous 8-Byte values that come from the L2 cache should give you 1 L1 miss followed by 7 hits for each 64-Byte cache line, or 1 L2 access per 8 memory operations. But if you are loading 64-Byte values, then you expect every memory operation to miss in the L1, so 1 L2 access per memory operation. Instrumentation at a loop level (where you can inspect the memory operation types) makes this easier to interpret.
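
The arithmetic in that example can be written down as a rough sketch. It assumes a 64-Byte cache line, contiguous and aligned loads, data that misses the L1 on the first touch of each line, and no hardware prefetch effects. The function names are hypothetical, not part of perf or any Intel tool, and the counts in the usage lines are made-up placeholders rather than measurements.

```python
CACHE_LINE_BYTES = 64  # assumed cache line size


def expected_l2_accesses_per_memop(access_bytes, line_bytes=CACHE_LINE_BYTES):
    """Expected L2 accesses per memory operation for contiguous, aligned
    loads where the first touch of each cache line misses the L1."""
    if access_bytes <= 0:
        raise ValueError("access_bytes must be positive")
    # Number of loads that fit in one cache line (at least one).
    ops_per_line = max(1, line_bytes // access_bytes)
    # One miss (one L2 access) per line, shared by all loads in that line.
    return 1.0 / ops_per_line


def measured_l2_accesses_per_memop(l2_accesses, mem_ops):
    """Ratio built from two raw counter values read around a loop."""
    if mem_ops <= 0:
        raise ValueError("mem_ops must be positive")
    return l2_accesses / mem_ops


if __name__ == "__main__":
    for size in (8, 64):
        ratio = expected_l2_accesses_per_memop(size)
        print(f"{size}-Byte loads: {ratio} L2 accesses per memory operation")

    # Placeholder counts standing in for values read from the counters.
    print(measured_l2_accesses_per_memop(l2_accesses=1000, mem_ops=8000))
```

Comparing the measured ratio for a loop against the expected ratio for that loop's load size gives a sense of whether the L1 is behaving as well as the access pattern allows.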