I have run a few NUMA memory bandwidth tests on our 4-socket Intel testbed, using the STREAM benchmark with VTune for hardware counter monitoring. The CPU is an Intel(R) Xeon(R) E5-4620 (Sandy Bridge microarchitecture).
All tests run on the same host with the same settings, except for the different CPU and memory node binding combinations (local vs. remote).
The bandwidth is as expected: local memory access is about twice as fast as remote access. But I find some of the hardware counter readings hard to explain.
The first is LLC_MISSES. The local case has only about half as many LLC misses as the remote case. Is this related to the prefetch mechanisms? Both the remote and local access cases should produce a similar number of cache misses, right?
The second is the LOAD_HIT_PRE.HW_PF reading. The local case has only about one third the prefetch hits of the remote case. That is also the opposite of what we expected.
What could explain these results?
I have not seen any indication that the Linux perf_events subsystem will provide access to either the MSR-based uncore events or the PCI configuration-space uncore events any time soon. Initial perf_events support for the uncore PMUs of processors based on the microarchitectures codenamed Nehalem-EP, Westmere-EP, and Sandy Bridge-EP was added in Linux kernel 3.5. Sandy Bridge-EP, as you know, uses PCI configuration-space register access for its uncore PMUs. Best regards, Roman