Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

LLC misses in Sandy Bridge NUMA hosts

Tan_L_
Beginner

Hi,

I have done a few NUMA memory bandwidth tests on our 4-socket Intel testbed using the STREAM benchmark and VTune for hardware counter monitoring. Our CPU model is the Intel(R) Xeon(R) CPU E5-4620, Sandy Bridge microarchitecture.

All the tests are on the same host using the same settings, except for different memory and CPU node binding combinations (local/remote).

The bandwidth results are as expected: local memory access bandwidth is about two times that of the remote case. But I find some of the hardware counter readings hard to explain.

The first one is the LLC MISSES count. The local case has only about half the LLC misses of the remote case. Is this related to the prefetch mechanisms? Both the remote and local access cases should have a similar number of cache misses, right?

The second is the LOAD_HIT_PRE.HW_PF reading. The local case has only about one third of the prefetch hits of the remote case. That is also the opposite of our expectation.

What is the possible explanation for these results?

Thanks

McCalpinJohn
Honored Contributor III
Cache misses are relatively hard to interpret on recent Intel processors because of the aggressive hardware prefetchers. With STREAM, the number of cache *fills* and cache *writebacks* should be the same for local and remote data, but you have to be careful about whether the counters you are using measure all LLC misses or only LLC misses due to load/store instructions. If I recall correctly, performance counter event 2Eh, mask 41h only counts LLC misses due to loads & stores, not LLC misses due to hardware prefetches.

(1) You might be able to get the desired numbers by properly programming Event B7h (or BBh) along with the correct bits in MSR 01A6h (or MSR 01A7h). The programming of these bits is very confusing, but there are examples in Table 19-5 of Volume 3B of the Intel Architecture Software Developer's Manual. (I am using document 325384-042.) My understanding is that this is supported by the perf_events subsystem of the Linux kernel starting with version 3.3, but I have not confirmed this. Because this is a core performance counter measurement, you would have to duplicate the measurement on all cores to get the total bandwidth.

(2) You can measure L3 misses and L3 writebacks using the L3 CBo uncore counters. The Xeon E5-4620 is similar to the Xeon E5-2600 series, so you should be able to make use of the information in Intel's "Intel Xeon Processor E5-2600 Product Family Uncore Performance Monitoring Guide", document 327043-001. Table 2-15 includes the derived metrics required to obtain local and remote read bandwidth and L3 writeback bandwidth. This is not much easier than using the core counters, since there is one CBo per core and you have to read the counters for all of them to get the total bandwidth, but at least it is available using an MSR interface.

(3) You can also measure DRAM bandwidth directly using the performance counters in the integrated Memory Controller (iMC) in the uncore. There are four memory controllers that you have to query, and the performance monitors are in PCI Configuration Space rather than in MSR space. I have had success setting these using the "setpci" utility on our RHEL 6.2 Linux systems, and have read the results using both "setpci" and "lspci". It requires root access and is labor-intensive in all cases, but it definitely works.

I have not seen any indication that the Linux perf_events subsystem is going to provide access to either the MSR-based uncore events or the PCI configuration-space uncore events any time soon.

A final note -- the large L3 caches in the Intel Xeon E5 processors mean that you have to increase the array sizes in the STREAM benchmark to obtain useful (and compliant) results. The STREAM run rules require that each array be 4x as large as the aggregate of the largest caches used. On my Xeon E5-2680 2-socket systems, the L3 is 20 MB per socket, so the array sizes need to be at least 80 MB (N=10,000,000) each for single-socket runs and 160 MB (N=20,000,000) each for runs using both sockets. The default array size of N=2,000,000 results in L3-contained cases for the "Copy" and "Scale" kernels when run on two (or more) sockets, since each array is only 15.26 MiB. To measure L3 bandwidth you would want slightly smaller array sizes (so that the kernels using all three arrays don't overflow the L3 caches), while to measure DRAM bandwidth you want much larger arrays to eliminate the possibility of significant L3 cache re-use biasing the results.
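If you want to try the "roll your own" MSR approach for option (2), the user-space mechanics are just 8-byte reads (and writes) on /dev/cpu/N/msr. The sketch below only shows the mechanism: it reads the TSC MSR (10h) as a harmless stand-in, and you would substitute the CBo PMON counter addresses from the Uncore Performance Monitoring Guide (and program the corresponding PMON control registers first, which the sketch does not do).

// Minimal sketch of reading a 64-bit MSR from user space via the Linux
// "msr" driver (requires "modprobe msr" and root privileges).
// MSR 0x10 (TSC) is used here only as a placeholder -- replace it with the
// CBo PMON counter addresses from document 327043-001 after programming
// the matching PMON control registers.
#include <cstdio>
#include <cstdint>
#include <cstdlib>
#include <fcntl.h>
#include <unistd.h>

static uint64_t read_msr(int cpu, uint32_t msr)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/N/msr"); exit(1); }
    uint64_t value = 0;
    if (pread(fd, &value, sizeof(value), msr) != (ssize_t)sizeof(value)) {
        perror("pread"); exit(1);
    }
    close(fd);
    return value;
}

int main()
{
    const uint32_t MSR_ADDR = 0x10;  // placeholder (TSC); use the CBo counter MSR here
    const int cpu = 0;               // uncore counters: one CPU per socket is enough

    uint64_t before = read_msr(cpu, MSR_ADDR);
    sleep(1);                        // measurement interval
    uint64_t after  = read_msr(cpu, MSR_ADDR);
    printf("counter delta over ~1 second: %llu\n",
           (unsigned long long)(after - before));
    return 0;
}

Since there is one CBo per core, the same read would be repeated for each CBo's counter address and the results summed; the iMC counters need the analogous treatment through PCI configuration space instead (for example, pread on /sys/bus/pci/devices/<BDF>/config, or setpci).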
Roman_D_Intel
Employee
John wrote: "I have not seen any indication that the Linux perf_events subsystem is going to provide access to either the MSR-based uncore events or the PCI configuration-space uncore events any time soon."
Initial perf_events support for the uncore PMUs of processors based on the microarchitectures code-named Nehalem-EP, Westmere-EP, and SandyBridge-EP was added in Linux kernel 3.5. SandyBridge-EP, as you know, uses PCI configuration-space register access for its uncore PMUs. Best regards, Roman
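For reference, once such a kernel is running, the uncore PMUs appear as dynamic PMU types under /sys/bus/event_source/devices/ and can be driven through perf_event_open(). The sketch below is only an illustration and makes two assumptions that should be verified on the target system: that the first memory-controller PMU is exposed as "uncore_imc_0" with the raw-event format event=config[7:0], umask=config[15:8], and that CAS_COUNT.RD is event 04h, umask 03h as listed in the uncore guide.

// Sketch: count iMC CAS reads on one channel through perf_event_open().
// Assumptions (verify on your system): PMU name "uncore_imc_0", raw-event
// format event=config[7:0] / umask=config[15:8], CAS_COUNT.RD = event 0x04,
// umask 0x03.
#include <cstdio>
#include <cstdint>
#include <cstring>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main()
{
    // The kernel publishes the dynamic PMU type id in sysfs.
    FILE *f = fopen("/sys/bus/event_source/devices/uncore_imc_0/type", "r");
    if (!f) { perror("uncore_imc_0 not found"); return 1; }
    int pmu_type = 0;
    fscanf(f, "%d", &pmu_type);
    fclose(f);

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = pmu_type;
    attr.config = 0x04 | (0x03 << 8);      // CAS_COUNT.RD (assumed encoding)

    // Uncore events are per-socket: pid = -1, cpu = any CPU on that socket.
    int fd = perf_event_open(&attr, -1, 0, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    sleep(1);                              // measurement interval
    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        perror("read"); return 1;
    }
    // Each CAS read moves one 64-byte cache line.
    printf("channel 0 read bandwidth: ~%.1f MB/s\n", count * 64.0 / 1e6);
    close(fd);
    return 0;
}

The other channels (uncore_imc_1, ...) would be opened the same way, once per socket, and the counts summed.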
Roman_D_Intel
Employee
You can monitor memory controller bandwidth on the Intel(R) Xeon(R) E5-4620 out of the box using the pcm.x utility from the Intel Performance Counter Monitor.
McCalpinJohn
Honored Contributor III
Hi Roman, Thanks for the note on the Linux 3.5 kernel support for the Sandy Bridge EP uncore counters. I have a great deal of trouble finding information about either the current status or future plans of the perf_events subsystem, and my hardware-centric brain finds the source code to be nearly incomprehensible. In any case, we are unlikely to be running a 3.5 kernel in production for several years, so we will continue with the "roll your own" approach.

You have reminded me that I need to continue investigating the possibilities for using the "Intel Performance Counter Monitor". Right now I remain a bit confused about which pieces of functionality require root access (which we cannot support in our batch production environment) and which pieces are supported via the loadable kernel module and associated device driver. I spent a bit of time trying to understand the device driver interface, but did not make much progress. Perhaps tracing the interactions between pcm.x and the device driver will help me figure out how the interface works.
Tan_L_
Beginner
Dear John and Roman, Thanks so much for your help! I will definitely read the documents you mentioned and try the Intel Performance Counter Monitor to get the DRAM bandwidth. For the STREAM tests, I used OpenMP for multithreading, and I changed the array size to N=50,000,000 to make the arrays much larger than the LLC. As you mentioned, the number of cache operations should be the same for local and remote data, so if there is a difference in total LLC misses it should be related to the hardware prefetch mechanisms. Am I right? Thanks again for offering various ways to get the memory bandwidth and counter readings.
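As a quick check on that array size (assuming the 16 MB L3 per E5-4620 socket and a 4-socket run):

    per-array footprint:              50,000,000 x 8 B  =  400 MB
    aggregate L3 across 4 sockets:    4 x 16 MB         =   64 MB
    STREAM rule (each array >= 4x):   4 x 64 MB         =  256 MB

so N=50,000,000 (about 1.2 GB across the three arrays) comfortably satisfies the run rule even when all four sockets are used.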
Roman_D_Intel
Employee
John, feel free to look into the "Intel Performance Counter Monitor" code. It is open source and you can reuse the PCM routines. Best regards, Roman
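As a starting point, the heart of pcm.x boils down to a handful of calls into cpucounters.h. The sketch below shows roughly how those routines fit together; the function names (PCM::getInstance, program, getSystemCounterState, getBytesReadFromMC, getBytesWrittenToMC) are my reading of the PCM source and should be checked against the version you download, and it still needs root access or the PCM driver to touch the counters.

// Rough sketch of reusing the PCM routines instead of running pcm.x.
// Compile and link against the PCM sources; verify the exact function
// names against cpucounters.h in the version you have.
#include <cstdio>
#include <unistd.h>
#include "cpucounters.h"   // from the Intel Performance Counter Monitor package

int main()
{
    PCM *m = PCM::getInstance();
    if (m->program() != PCM::Success) {
        fprintf(stderr, "PCM could not program the counters (driver/root needed?)\n");
        return 1;
    }

    SystemCounterState before = m->getSystemCounterState();
    sleep(1);                                   // measurement interval
    SystemCounterState after  = m->getSystemCounterState();

    printf("DRAM read  bandwidth: ~%.1f MB/s\n", getBytesReadFromMC(before, after) / 1e6);
    printf("DRAM write bandwidth: ~%.1f MB/s\n", getBytesWrittenToMC(before, after) / 1e6);

    m->cleanup();
    return 0;
}

pcm.x prints the same kind of numbers per socket as well, so the socket-level counter-state routines are worth a look for the NUMA breakdown.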