I'm using event-based sampling to collect performance metrics (mainly cache misses) for a simple matrix multiplication code. However, the values reported by VTune are drastically different from the values I gathered using PAPI (same code, same matrix dimensions).
I'm running the program on a single core to which I have exclusive access. Is there some detail that I'm missing here?
Example: for a simple matrix multiplication of 1K x 1K elements,
- using VTune: amplxe-cl -collect-with runsa -knob event-config=MEM_LOAD_RETIRED.L1_MISS matmul
Gives me the result:
Hardware Event Type       Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
------------------------  -------------------------  --------------------------------  -----------------
MEM_LOAD_RETIRED.L1_MISS  100003                     1                                 100003
- and the same code using PAPI with the event PAPI_L1_TCM:
Event 0 count = 74
Time taken: 3.822580 seconds
GFLOPS = 0.523207
- and again with the PAPI low-level API, using the event MEM_LOAD_RETIRED.L1_MISS:
Event count 30332
Time taken: 3.077956 seconds
GFLOPS = 0.649782
Some PAPI forum posts mentioned that VTune gathers system-wide metrics, while PAPI measures per-process metrics. Even if this were true, the drastic difference still doesn't make sense to me.
Any help/guidance is appreciated!
The sampling methodology used by VTune gives counts that can only be compared across different lines/regions within the same execution of the same program. It is perhaps best to consider the counts as being scaled by a different random value in each run.
If configured correctly, PAPI counts should be the actual (unscaled) values, which can be directly compared across runs.
It is usually a good idea to include code to generate "expected values" for events that are likely to be predictable -- in this case floating-point operations should match the 2*N^3 nominal work estimate for an N x N matrix multiplication, but it is also a good idea to have the code print out how many loads and stores you expect to see, so you can compare these values to the appropriate performance counter events (MEM_INST_RETIRED.ALL_LOADS and MEM_INST_RETIRED.ALL_STORES).
The MEM_LOAD_RETIRED.L1_MISS event counts where the *load instruction* found the data. This is valuable, but it cannot be used to generate estimates of data motion through the cache hierarchy -- if a hardware prefetch operation moves the data from the L2 to the L1 cache before the load instruction is executed, it will count as an L1 hit, even though the data was moved into the L1 from the L2. The same applies for hits/misses at the L2 and L3.
The observed performance is abysmal, even for a naive implementation.
Thanks for your reply! This is very helpful.
Apologies if this seems off topic, but if one had to estimate the TRUE L1 Misses, is it safe to assume that PAPI's preset event "PAPI_L1_TCM" is more reliable?
For the L1 Data Cache, I typically use the L1D.REPLACEMENT event, which is available in PAPI as the native event L1D:REPLACEMENT. This gives the number of cache lines inserted into the L1 Data Cache. In my testing this matches the expected values for cache lines associated with load misses plus cache lines associated with store misses on all the Intel processors that I have tested.
If you want to limit the count to L1 Data Cache misses caused by loads, I would use the PAPI native event L2_RQSTS:ALL_DEMAND_DATA_RD. For L1 Data Cache misses caused by stores, the event is L2_RQSTS:ALL_RFO.
If you want to know how many load *instructions* missed in the L1 Data Cache, you need the sum of MEM_LOAD_RETIRED:L1_MISS and MEM_LOAD_RETIRED:FB_HIT. E.g., with 64-bit (8-Byte) loads to contiguous addresses, MEM_LOAD_RETIRED:L1_MISS will be incremented for the first load to miss in the L1 and MEM_LOAD_RETIRED:FB_HIT will be incremented for the next seven loads to the same cache line.
If the results are confusing, in many cases disabling the HW prefetchers can help. https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processo...