We are trying to monitor the performance of our application. Our goal is to measure the number of L3 cache misses that are fulfilled by local DRAM, remote DRAM, and remote cache on our 4-socket NUMA machine. The hardware counters that we are currently using are called MEM_LOAD_UOPS_RETIRED.LLC_MISS, MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM, and MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM in the Intel manual. We have written a small benchmark to verify these results. We frequently see zero local DRAM accesses even when running on a single core with the memory pinned to that socket. Furthermore, the amount of memory that is written and read exceeds the size of the L3 cache. When we add MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS to the mix, we find a huge number of L3 misses that are not accounted for by any memory source. We have also tried disabling the prefetchers. What's going on here?
We're currently using four Intel Xeon E5-4620 processors.
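To make the "unaccounted misses" observation concrete, the LLC-miss total can be compared against the sum of the per-source counts. The sketch below assumes the counts have already been collected by some tool and are passed in as plain integers. The function name, the dictionary labels, and the sample numbers are all hypothetical and are not taken from any measurement in this thread.

```python
def unaccounted_llc_misses(llc_miss, sources):
    """Return (unaccounted, fraction) for one measurement.

    llc_miss: count reported by the LLC-miss event.
    sources:  dict mapping a source label to its reported count.
    """
    accounted = sum(sources.values())
    unaccounted = llc_miss - accounted
    # Guard against division by zero when no misses were counted at all.
    fraction = unaccounted / llc_miss if llc_miss else 0.0
    return unaccounted, fraction


# Made-up numbers for illustration only, not measured values.
example = {"local_dram": 0, "remote_dram": 1200, "remote_cache": 300}
gap, frac = unaccounted_llc_misses(10_000, example)
print(f"{gap} of 10000 LLC misses unaccounted ({frac:.0%})")
```

A large gap on a single-socket, memory-pinned run is the symptom described above: the total says the loads missed L3, but none of the per-source events claims them.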
I addressed part of this yesterday in a forum posting at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/744272#comment-1912557
The Xeon E5-4620 (Sandy Bridge) has a variety of bugs in these performance counter events. Some are documented in the "specification update" document. There are workarounds for some of the bugs -- see https://software.intel.com/en-us/articles/performance-monitoring-on-intel-xeon-processor-e5-family for discussion and use "latego.py" from https://github.com/andikleen/pmu-tools to implement the available workarounds.
Oddly, perhaps the most egregious bug in the MEM_LOAD_UOPS_RETIRED counter is not documented in any of the errata, but is mentioned only in Appendix B of the Intel Optimization Reference Manual (document 248966-037, July 2017, page B-47). On Xeon E5 (v1) processors, 32-Byte AVX loads will only increment the MEM_LOAD_UOPS_RETIRED.L1_HIT counter or the MEM_LOAD_UOPS_RETIRED.HIT_LFB counter -- never any of the other sub-events. From my testing, the L1_HIT values are slightly inflated, but not too badly, while all other 32-Byte AVX loads increment the HIT_LFB counter -- no matter where the data was actually found.
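The effect of this behavior on a measurement can be illustrated with a toy model. This is only a sketch of the attribution rule as described above, not a description of the hardware: the function name and the sample load mix are hypothetical, and the slight inflation of L1_HIT is ignored.

```python
from collections import Counter


def reported_subevent(load_bytes, true_source):
    """Toy model: which MEM_LOAD_UOPS_RETIRED sub-event a retired load
    increments on Xeon E5 (v1), given where the data was actually found."""
    # 32-byte AVX loads that did not hit L1 are all reported as HIT_LFB,
    # no matter where the data actually came from.
    if load_bytes == 32 and true_source != "L1_HIT":
        return "HIT_LFB"
    # Narrower loads (and 32-byte L1 hits) are attributed to the true source.
    return true_source


# Hypothetical load mix: (width in bytes, true data source).
loads = (
    [(32, "LLC_MISS")] * 1000
    + [(32, "L1_HIT")] * 200
    + [(8, "LLC_MISS")] * 50
)
counts = Counter(reported_subevent(b, s) for b, s in loads)
print(dict(counts))
```

In this model the 1000 AVX loads that truly missed the LLC never appear under LLC_MISS, so code that a compiler has vectorized with 32-byte loads can show an LLC_MISS count near zero even when its working set is far larger than L3.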
We've enabled the fixes in latego.py and we seem to be getting the correct numbers, but only for sequential code. Is there a reason why MEM_LOAD_UOPS_RETIRED.LLC_MISS would result in a count of 0 for parallel code?