Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Way to flush/clear the RAM and cache memories

Konstantinos_I_
Beginner

I want to ask if there is a way to flush/clear all the data in RAM and the cache memories, in order to study hardware counters such as MEM_LOAD_UOPS_RETIRED.L3_MISS more accurately. My system is a Haswell-architecture Xeon E5-2683 v3 running CentOS 7, and I don't have sudo access on it.

McCalpinJohn
Honored Contributor III

It is not possible to do this with complete effectiveness at user level.  Performance counters in the uncore can be used to derive the mapping of physical addresses to L3 slices (CBos) for any address range that the user can allocate and test, but that only tells you which CBo is being used, not which congruence class within that slice is being used.   The size of the L3 slices suggests a straightforward mapping, but I don't know of any demonstrations that confirm the internal mapping.

At the gross level, on Xeon E5 v3 systems, reading an array that is 4x the L3 cache size will clear nearly 100% of the prior data from the L1, L2, and L3 caches.  This only requires process binding (e.g., "taskset" or "numactl --physcpubind" on Linux systems).  
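[Editor's note: a minimal C sketch of such a sweep, assuming the 35 MiB L3 of the Xeon E5-2683 v3 and 64-byte cache lines; the file name and the L3_BYTES constant are illustrative, not from this thread.]

    /* flush.c -- minimal sketch of the cache-clearing sweep described above.
     * Assumes the 35 MiB L3 of the Xeon E5-2683 v3 and 64-byte cache lines;
     * adjust L3_BYTES for other processors. */
    #include <stdio.h>
    #include <stdlib.h>

    #define L3_BYTES    (35UL * 1024 * 1024)   /* L3 capacity of the E5-2683 v3 */
    #define SWEEP_BYTES (4UL * L3_BYTES)       /* 4x the L3, per the advice above */

    int main(void)
    {
        volatile unsigned char *buf = malloc(SWEEP_BYTES);
        unsigned long i, sum = 0;

        if (buf == NULL)
            return 1;
        /* Write one byte per 64-byte line first: untouched anonymous memory can
         * be backed by a shared zero page, so a read-only sweep might not evict
         * anything.  The writes force a distinct physical page behind each line. */
        for (i = 0; i < SWEEP_BYTES; i += 64)
            buf[i] = (unsigned char)i;
        for (i = 0; i < SWEEP_BYTES; i += 64)
            sum += buf[i];
        printf("%lu\n", sum);                  /* defeat dead-code elimination */
        free((void *)buf);
        return 0;
    }

Compile with "gcc -O2 flush.c -o flush" and bind it to the same core as the program under test, e.g. "taskset -c 0 ./flush && taskset -c 0 ./app".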

Konstantinos_I_
Beginner

Thank you for the quick reply.

So I can first run a dummy executable that loads an array big enough to fill all the caches, and then run the executable that I want to profile, making sure that both are executed on the same set of CPUs.

Now, I want to quantify the compute intensity (number of instructions per input byte/element) with a metric. I was thinking of either:

  • COMPUTE_INTENSITY1 = INSTRUCTIONS_RETIRED / MEM_UOPS_RETIRED:ALL_LOADS
  • COMPUTE_INTENSITY2 = INSTRUCTIONS_RETIRED / MEM_LOAD_UOPS_RETIRED:L3_MISS

Which one do you think is better? Or is there another way to estimate the amount of input data read from the memory?

McCalpinJohn
Honored Contributor III

For the Xeon E5 processors, the best way to measure memory traffic is at the memory controllers.  The "uncore" performance counters for the Xeon E5 26xx v3 processors are described in a document titled "Intel Xeon Processor E5 and E7 v3 Family Uncore Performance Monitoring Reference Manual" (document 331051-002, June 2015).  Access to the uncore performance counters depends on both the revision of the operating system and on certain system settings.  On Linux, the uncore performance counters of Xeon E5 v3 processors are supported by CentOS/RHEL 7.x, and the kernel setting "perf_event_paranoid" must be set to zero for non-root users to access them.
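[Editor's note: for concreteness, this is roughly what that looks like with the perf tool, assuming a kernel whose Haswell-EP uncore driver exposes the IMC event aliases below; check /sys/bus/event_source/devices/ on your own system, and "./app" is a placeholder.]

    # unprivileged use requires perf_event_paranoid to be 0 (root must set it)
    cat /proc/sys/kernel/perf_event_paranoid

    # system-wide traffic at memory controller 0; each CAS count is one
    # 64-byte cache line read from or written to DRAM
    perf stat -a -e uncore_imc_0/cas_count_read/,uncore_imc_0/cas_count_write/ ./app

Multiply the CAS counts by 64 to get bytes (recent perf versions may apply a MiB scale factor for you).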

If you can't measure at the memory controllers, the issues become more complex.  The MEM_LOAD_UOPS_RETIRED events increment based on where the load instruction found the data -- not on where the data actually originated.   Consider a sequence of loads to consecutive addresses.  The first load will miss the L1 Data Cache, the L2 Cache, and the L3 Cache (so it will increment the MEM_LOAD_UOPS_RETIRED.L3_MISS event).  Any subsequent loads from the same cache line will combine with the initial cache miss, but will increment the MEM_LOAD_UOPS_RETIRED.HIT_LFB event (LFB=="Line Fill Buffer" -- an entry in the data structure that tracks L1 cache misses).

Loads to subsequent cache lines might also miss in the caches, but while this process is taking place, the various Hardware Prefetchers are tracking the sequence of loads and issuing "hardware prefetches" to move the data from memory into one or more of the caches.  The most important of the hardware prefetchers is the "L2 Streaming Prefetcher" (sometimes called the "MLC Streaming Prefetcher"), which will extrapolate along a sequence of load addresses and fetch (expected) future addresses into the L3 cache or into both the L2 and L3 caches.

The choice of destination for the prefetched data depends on proprietary algorithms and hidden hardware state, but part of the decision is based on how many L2 cache misses the L2 cache is already tracking.  If that buffer is nearly empty, prefetched data will be brought into the L2 (and also placed in the L3, because the L3 is inclusive in the Xeon E5 v3 processors), but if the L2 miss handling buffer starts to get full, the prefetcher will request that data be brought only into the L3 (leaving room in the L2 miss handling buffer for additional demand misses).

There are also L1 Hardware Prefetchers that attempt to load expected lines from the L2 cache in advance of the demand load.  These are typically much less aggressive than the L2 Hardware Prefetchers, but if the data consumption rate is low enough, they can get the data from the L2 into the L1 early enough to prevent an L1 miss.

To sum up: a sequence of loads from consecutive addresses, all of which are initially uncached, can be expected to increment any (or all) of the following events:

  • MEM_LOAD_UOPS_RETIRED.L1_HIT
  • MEM_LOAD_UOPS_RETIRED.L2_HIT
  • MEM_LOAD_UOPS_RETIRED.L3_HIT
  • MEM_LOAD_UOPS_RETIRED.L1_MISS
  • MEM_LOAD_UOPS_RETIRED.L2_MISS
  • MEM_LOAD_UOPS_RETIRED.L3_MISS
  • MEM_LOAD_UOPS_RETIRED.HIT_LFB

Because the hardware prefetch decisions are made dynamically, there is no "right answer" for how the loads will distribute their counts across these event categories.  I typically disable the hardware prefetchers (https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors) before trying to use these counters to understand data motion, but this requires root access.
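[Editor's note: the linked article documents MSR 0x1A4, bits 0 through 3, as the controls for the four prefetchers on this processor family.  With root access and the msr-tools package, toggling them looks roughly like this:]

    sudo modprobe msr          # expose /dev/cpu/*/msr
    sudo wrmsr -a 0x1a4 0xf    # set bits 0-3 on all cores: all four prefetchers off
    sudo rdmsr -p 0 0x1a4      # read back the setting on core 0
    sudo wrmsr -a 0x1a4 0x0    # re-enable the prefetchers when done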

An alternative set of events that may provide enough information is available using the OFFCORE_RESPONSE counters, described in Section 18.11.4 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384-062).  These events use an additional register to provide more bits for specifying the request type, response type, and snoop response for which you want counts.  These performance counter events do not require root access, but they are tricky to program, and the syntax used by the Linux "perf stat" tool to select them is poorly documented.
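[Editor's note: one hedged illustration of that syntax.  OFFCORE_RESPONSE_0 is event 0xB7 with umask 0x01, and kernels that expose an "offcore_rsp" format field (see /sys/bus/event_source/devices/cpu/format) accept the auxiliary MSR value there.  The value below sets bit 0 (DMND_DATA_RD) and bit 16 (ANY_RESPONSE), counting all demand data reads regardless of where they were serviced; "./app" is a placeholder.]

    perf stat -e cpu/event=0xb7,umask=0x01,offcore_rsp=0x10001/ ./app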

I would be remiss if I did not take this opportunity to note that any performance counters can have bugs or "idiosyncrasies".  These bugs may be disclosed or not disclosed, they may be well-known or poorly-known, and they may have workarounds or not have workarounds.   These bugs are sometimes carried from one processor generation to the next, or they may be fixed from one generation to the next, or previously reliable counters may have bugs introduced from one generation to the next.   These complexities are part of the reason why I use the memory controller counters whenever I can -- they are much less complex to design (so bugs are rare) and they are much less complex to understand (so it is easier for a user to generate test cases for validation).
