I am trying to profile the memory access of some of our code on an Intel Xeon E5-2690 v2 server (running RHEL 6.5) using PCM 2.10. Being able to profile just a small snippet of our code is very attractive to me. I linked the PCM code into our program and everything compiles fine. I disabled the NMI watchdog, used the experimental BIOS to get QPI access, and have sudo access to the machine, so PCM starts without any issue inside the code. But the numbers don't pass the smell test and don't correlate well with other hardware profiling tools I have used.
1. getLocalMemoryBW and getRemoteMemoryBW always return 0. From the documentation in the code it is unclear whether these are supposed to work with DEFAULT_EVENTS. Is there a particular setting I should use to measure these?
2. If I re-run the same code under the same circumstances, the numbers are fairly consistent, but a variation such as binding the process to a set of CPUs can change the instruction count dramatically, by 20%, which does not make much sense to me. I am also seeing a 20% L3 miss rate, which I don't see with another hardware-counter tool. All the data fits in L3 and I am limiting myself to a single socket, so there should not be any L3 misses beyond the first pass when the data is loaded, yet the number of L3 misses reported far exceeds the memory footprint of the test.
Is there anything I can do to further debug this? Is there any benchmark code I could run to verify whether the right counters are being measured on this server?
Unfortunately, to understand the performance counters you have to dig through the source code to find out exactly which performance-counter unit, and exactly which event(s) in that unit, are being used. Only then is there at least a hope of understanding what the event is supposed to count, and of discovering whether the counts are believed to be accurate.
I build my own performance monitoring tools to avoid the extra layer of translation, so I have not looked to see how easy it is to figure out what PCM is doing...
I think that the Xeon E5 v2 also supports both "early snoop" and "home snoop" modes, and the mode selection will change the types of transactions on the QPI interface. This should not affect data counts, but some of the other transaction types will differ substantially.
The primary performance monitoring reference for the core counters is Chapters 18 and 19 of Volume 3 of the Intel 64 and IA-32 Architectures Software Developer's Manual (document 325384). For the Xeon E5 v2, the Uncore Performance Monitoring Reference Manual is document 329468.
I use the "rdmsr.c" and "wrmsr.c" routines from msr-tools to access the MSRs (this requires root access) and use inline assembly to generate RDPMC instructions to read the core performance counters inline. The uncore performance counters are split between MSRs and PCI configuration space. For PCI configuration space access I use "lspci" and "setpci" as command-line tools, while for inline access I open the device files in the "/proc/bus/pci/" hierarchy (in Linux) and use "pread" and "pwrite" calls that look like the ones in "rdmsr.c" and "wrmsr.c". Sometimes I do an "mmap()" on the PCI configuration space device, which allows me to read and write directly from user space (this also requires root access) for minimum latency and overhead.
My tools don't do any virtualization or sharing -- they just provide very low overhead access to the counters with a maximum of control and a minimum of indirection.